# Data-Efficient Structured Pruning via Submodular Optimization

Marwa El Halabi (Samsung - SAIT AI Lab, Montreal), Suraj Srinivas (Harvard University), Simon Lacoste-Julien (Mila, Université de Montréal; Samsung - SAIT AI Lab, Montreal; Canada CIFAR AI Chair). Work done partially at MIT, CSAIL, and partially at Idiap Research Institute, Switzerland. 36th Conference on Neural Information Processing Systems (NeurIPS 2022).

**Abstract.** Structured pruning is an effective approach for compressing large pre-trained neural networks without significantly affecting their performance. However, most current structured pruning methods do not provide any performance guarantees, and often require fine-tuning, which makes them inapplicable in the limited-data regime. We propose a principled data-efficient structured pruning method based on submodular optimization. In particular, for a given layer, we select neurons/channels to prune and corresponding new weights for the next layer that minimize the change in the next layer's input induced by pruning. We show that this selection problem is a weakly submodular maximization problem, so it can be provably approximated using an efficient greedy algorithm. Under reasonable assumptions, our method is guaranteed to have an exponentially decreasing error between the original and pruned model outputs w.r.t. the pruned size. It is also one of the few methods in the literature that uses only a limited number of training samples and no labels. Our experimental results demonstrate that our method outperforms state-of-the-art methods in the limited-data regime.

## 1 Introduction

As modern neural networks (NNs) grow increasingly large, with some models reaching billions of parameters [McGuffie and Newhouse, 2020], they require an increasingly large amount of memory, power, hardware, and inference time, which makes it necessary to compress them. This is especially important for models deployed on resource-constrained devices like mobile phones and smart speakers, and for latency-critical applications such as self-driving cars.

Several approaches exist to compress NNs. Some methods approximate model weights using quantization and hashing [Gong et al., 2014, Courbariaux et al., 2015], or low-rank approximation and tensor factorization [Denil et al., 2013, Lebedev et al., 2015, Su et al., 2018]. In another class of methods called knowledge distillation, a small network is trained to mimic a much larger network [Bucila et al., 2006, Hinton et al., 2015]. Other methods employ sparsity and group-sparsity regularisation during training to induce sparse weights [Collins and Kohli, 2014, Voita et al., 2019]. In this work, we follow the network pruning approach, where the redundant units (weights, neurons, or filters/channels) of a pre-trained NN are removed; see [Kuzmin et al., 2019, Blalock et al., 2020, Hoefler et al., 2021] for recent surveys. We also focus on the limited-data regime, where only a few training samples are available and data labels are unavailable. The advantage of pruning approaches is that, unlike weight-approximation-based methods, they preserve the network structure, allowing retraining after compression, and unlike training-based approaches, they do not require training from scratch, which is costly and requires large training data. It is also possible to combine different compression approaches to compound their benefits; see e.g., [Kuzmin et al., 2019, Section 4.3.4].
Existing pruning methods fall into two main categories: unstructured pruning methods, which prune individual weights and lead to irregular sparsity patterns, and structured pruning methods, which prune regular regions of weights, such as neurons, channels, or attention heads. Structured pruning methods are generally preferable, as the resulting pruned models can work with off-the-shelf hardware or kernels, whereas models pruned with unstructured pruning require specialized ones. The majority of existing structured pruning methods are heuristics that do not offer any theoretical guarantees. Moreover, most pruning methods are inapplicable in the limited-data regime, as they rely on fine-tuning with large training data for at least a few epochs to recover some of the accuracy lost with pruning. Mariet and Sra [2015] proposed a "reweighting" procedure applicable to any pruning method, which optimizes the remaining weights of the next layer to minimize the change in the input to the next layer. Their empirical results on pruning single linear layers suggest that reweighting can provide a similar boost to performance as fine-tuning, without the need for data labels.

**Our contributions.** We propose a principled data-efficient structured pruning method based on submodular optimization. In each layer, our method simultaneously selects neurons to prune and new weights for the next layer that minimize the change in the next layer's input induced by pruning. The optimization with respect to the weights, for a fixed selection of neurons, is the same one used for reweighting in [Mariet and Sra, 2015]. The resulting subset selection problem is intractable, but we show that it can be formulated as a weakly submodular maximization problem (see Definition 2.1). We can thus use the standard greedy algorithm to obtain a $(1 - e^{-\gamma})$-approximation to the optimal solution, where $\gamma$ is non-zero if we use sufficiently many training samples. We further adapt our method to prune any regular regions of weights; we focus in particular on pruning channels in convolution layers. To prune multiple layers in the network, we apply our method to each layer independently or sequentially. We show that the error induced on the model output by pruning with our method decays at an $O(e^{-\gamma k})$ rate w.r.t. the number $k$ of neurons/channels kept, under reasonable assumptions. Our method uses only limited training data and no labels. Similar to [Mariet and Sra, 2015], we observe that reweighting provides a significant boost in performance not only to our method, but also to the other baselines we consider. However, unlike [Mariet and Sra, 2015], we only use a small fraction of the training data, around 1% in our experiments. Our experimental results demonstrate that our method outperforms state-of-the-art pruning methods in the limited-data regime, even when reweighting is applied to them too, and it is among the best performing methods in the standard setting.

**Related work.** A large variety of structured pruning approaches has been proposed in the literature, based on different selection schemes and algorithms to solve them. Some works prune neurons/channels individually based on some importance score [He et al., 2014, Li et al., 2017, Liebenwein et al., 2020, Mussay et al., 2020, 2021, Molchanov et al., 2017, Srinivas and Babu, 2015]. Such methods are efficient and easy to implement, but they fail to capture higher-order interactions between the pruned parameters. Most do not provide any performance guarantee.
One exception is the sampling-based methods of [Liebenwein et al., 2020, Mussay et al., 2020, 2021], which show an $O(1/k)$ error rate under some assumptions on the model activations. Closer to our approach are methods that aim to prune neurons/channels so as to minimize the change induced by pruning in the output of the layer being pruned, or in its input to the next layer [Luo et al., 2017, He et al., 2017, Zhuang et al., 2018, Ye et al., 2020b]. These criteria yield an intractable combinatorial problem. Existing methods either use a heuristic greedy algorithm to solve it [Luo et al., 2017, Zhuang et al., 2018], or instead solve its $\ell_1$-relaxation using alternating minimization [He et al., 2017] or a greedy algorithm with Frank-Wolfe-like updates [Ye et al., 2020b]. Among these works, only [Ye et al., 2020b] provides theoretical guarantees, showing an $O(e^{-ck})$ error rate. Their method is more expensive than ours, and it only optimizes the scaling of the next layer's weights instead of the weights themselves. A global variant of this method is proposed in Ye et al. [2020a,b], which aims to prune neurons/channels that directly minimize the loss of the pruned network. A similar greedy algorithm with Frank-Wolfe-like updates is used to solve the $\ell_1$-relaxation of the selection problem. This method has an $O(1/k^2)$ error rate and is very expensive, as it requires a full forward pass through the network at each iteration. See Appendix A for a more detailed comparison of our method with those of [Ye et al., 2020a,b]. Mariet and Sra [2015] depart from the usual strategy of pruning parameters whose removal influences the network the least. They instead select a subset of diverse neurons to keep in each layer by sampling from a Determinantal Point Process, and then apply their reweighting procedure. Their experimental results show that the advantage of their method is mostly due to reweighting (see Figure 4 therein).

## 2 Preliminaries

We begin by introducing our notation and some relevant background from submodular optimization.

**Notation:** Given a ground set $V = \{1, 2, \dots, d\}$ and a set function $F : 2^V \to \mathbb{R}_+$, we denote the marginal gain of adding a set $I \subseteq V$ to another set $S \subseteq V$ by $F(I \mid S) = F(S \cup I) - F(S)$, which quantifies the change in value when adding $I$ to $S$. The cardinality of a set $S$ is written as $|S|$. Given a vector $x \in \mathbb{R}^d$, we denote its support set by $\mathrm{supp}(x) = \{i \in V \mid x_i \neq 0\}$, and its $\ell_2$-norm by $\|x\|_2$. Given a matrix $X \in \mathbb{R}^{d' \times d}$, we denote its $i$-th column by $X_i$, and its Frobenius norm by $\|X\|_F$. Given a set $S \subseteq V$, $X_S$ is the matrix with columns $X_i$ for all $i \in S$, and $0$ otherwise, and $1_S$ is the indicator vector of $S$, with $[1_S]_i = 1$ for all $i \in S$, and $0$ otherwise.

**Algorithm 1** GREEDY
1: Input: ground set $V$, set function $F : 2^V \to \mathbb{R}_+$, budget $k \in \mathbb{N}_+$
2: $S \leftarrow \emptyset$
3: while $|S| < k$ do
4: $\quad i^* \leftarrow \arg\max_{i \in V \setminus S} F(i \mid S)$
5: $\quad S \leftarrow S \cup \{i^*\}$
6: end while
7: Output: $S$

**Weakly submodular maximization:** A set function $F$ is submodular if it has diminishing marginal gains: $F(i \mid S) \geq F(i \mid T)$ for all $S \subseteq T$, $i \in V \setminus T$. We say that $F$ is normalized if $F(\emptyset) = 0$, and non-decreasing if $F(S) \leq F(T)$ for all $S \subseteq T$. Given a non-decreasing submodular function $F$, selecting a set $S \subseteq V$ with cardinality $|S| \leq k$ that maximizes $F(S)$ can be done efficiently using the GREEDY algorithm (Alg. 1). The returned solution is guaranteed to satisfy $F(\hat{S}) \geq (1 - 1/e) \max_{|S| \leq k} F(S)$ [Nemhauser et al., 1978]. In general, though, maximizing a non-submodular function under a cardinality constraint is NP-hard [Natarajan, 1995].
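For concreteness, here is a minimal Python sketch of Algorithm 1 for a generic set function. The names (`greedy`, the toy coverage objective) are ours and not part of the paper's released code, and this naive version re-evaluates $F$ from scratch for every candidate rather than using the incremental updates of Proposition 3.2.

```python
def greedy(F, V, k):
    """GREEDY (Algorithm 1): maximize a set function F over ground set V
    under the cardinality constraint |S| <= k."""
    S = set()
    while len(S) < k:
        # marginal gain F(i | S) = F(S + i) - F(S) of each remaining element
        gains = {i: F(S | {i}) - F(S) for i in V if i not in S}
        best = max(gains, key=gains.get)
        S.add(best)
    return S

# toy example with a (submodular) coverage objective
universe = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"d", "e"}}
F = lambda S: len(set().union(*(universe[i] for i in S))) if S else 0
print(greedy(F, set(universe), k=2))  # {1, 3}, covering 4 elements
```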
However, Das and Kempe [2011] introduced a notion of weak submodularity which is sufficient to obtain a constant-factor approximation with the GREEDY algorithm.

**Definition 2.1.** Given a set function $F : 2^V \to \mathbb{R}$, $U \subseteq V$, $k \in \mathbb{N}_+$, we say that $F$ is $\gamma_{U,k}$-weakly submodular, with $\gamma_{U,k} > 0$, if
$$\gamma_{U,k}\, F(S \mid L) \leq \sum_{i \in S} F(i \mid L)$$
for every two disjoint sets $L, S \subseteq V$ such that $L \subseteq U$, $|S| \leq k$.

The parameter $\gamma_{U,k}$ is called the submodularity ratio of $F$. It characterizes how close a set function is to being submodular. If $F$ is non-decreasing, then $\gamma_{U,k} \in [0, 1]$, and $F$ is submodular if and only if $\gamma_{U,k} = 1$ for all $U \subseteq V$, $k \in \mathbb{N}_+$. Given a non-decreasing $\gamma_{\hat{S},k}$-weakly submodular function $F$, the GREEDY algorithm is guaranteed to return a solution $\hat{S}$ satisfying $F(\hat{S}) \geq (1 - e^{-\gamma_{\hat{S},k}}) \max_{|S| \leq k} F(S)$ [Elenberg et al., 2016, Das and Kempe, 2011]. Hence, the closer $F$ is to being submodular, the better the approximation guarantee.

## 3 Reweighted input change pruning

In this section, we introduce our approach for pruning neurons in a single layer. Given a large pre-trained NN, $n$ training samples, and a layer $\ell$ with $n_\ell$ neurons, our goal is to select a small number $k$ of the $n_\ell$ neurons to keep, and prune the rest, in a way that influences the network the least. One way to achieve this is by minimizing the change in the input to the next layer $\ell + 1$ induced by pruning. However, simply throwing away the activations of the dropped neurons is wasteful. Instead, we optimize the weights of the next layer to reconstruct the inputs from the remaining neurons.

Formally, let $A^\ell \in \mathbb{R}^{n \times n_\ell}$ be the activation matrix of layer $\ell$ with columns $a^\ell_1, \dots, a^\ell_{n_\ell}$, where $a^\ell_i \in \mathbb{R}^n$ is the vector of activations of the $i$-th neuron in layer $\ell$ for each training input, and let $W^{\ell+1} \in \mathbb{R}^{n_\ell \times n_{\ell+1}}$ be the weight matrix of layer $\ell + 1$ with columns $w^{\ell+1}_1, \dots, w^{\ell+1}_{n_{\ell+1}}$, where $w^{\ell+1}_i \in \mathbb{R}^{n_\ell}$ is the vector of weights connecting the $i$-th neuron in layer $\ell + 1$ to the neurons in layer $\ell$. When a neuron is pruned in layer $\ell$, the corresponding column of weights in $W^\ell$ and row in $W^{\ell+1}$ are removed. Pruning $n_\ell - k$ neurons in layer $\ell$ reduces the number of parameters and the computation cost by $(n_\ell - k)/n_\ell$ for both layers $\ell$ and $\ell + 1$.

Let $V = \{1, \dots, n_\ell\}$. Given a set $S \subseteq V$, we denote by $A^\ell_S$ the matrix with columns $a^\ell_i$ for all $i \in S$, and $0$ otherwise. That is, $A^\ell_S$ is the activation matrix of layer $\ell$ after pruning. We choose a set of neurons $S \subseteq V$ to keep and new weights $\tilde{W}^{\ell+1} \in \mathbb{R}^{n_\ell \times n_{\ell+1}}$ that minimize:
$$\min_{|S| \leq k,\; \tilde{W}^{\ell+1} \in \mathbb{R}^{n_\ell \times n_{\ell+1}}} \|A^\ell W^{\ell+1} - A^\ell_S \tilde{W}^{\ell+1}\|_F^2. \quad (1)$$
Note that $A^\ell W^{\ell+1}$ are the original inputs of layer $\ell + 1$, and $A^\ell_S \tilde{W}^{\ell+1}$ are the inputs after pruning and reweighting, i.e., after replacing the weights $W^{\ell+1}$ of layer $\ell + 1$ with the new weights $\tilde{W}^{\ell+1}$.

### 3.1 Greedy selection

Solving Problem (1) exactly is NP-hard [Natarajan, 1995]. However, we show below that it can be formulated as a weakly submodular maximization problem, hence it can be efficiently approximated. Let
$$F(S) = \|A^\ell W^{\ell+1}\|_F^2 - \min_{\tilde{W}^{\ell+1} \in \mathbb{R}^{n_\ell \times n_{\ell+1}}} \|A^\ell W^{\ell+1} - A^\ell_S \tilde{W}^{\ell+1}\|_F^2; \quad (2)$$
then Problem (1) is equivalent to $\max_{|S| \leq k} F(S)$.

**Proposition 3.1.** Given $U \subseteq V$, $k \in \mathbb{N}_+$, $F$ is a normalized non-decreasing $\gamma_{U,k}$-weakly submodular function, with
$$\gamma_{U,k} \geq \frac{\min_{\|z\|_2 = 1,\, \|z\|_0 \leq |U| + k} \|A^\ell z\|_2^2}{\max_{\|z\|_2 = 1,\, \|z\|_0 \leq |U| + 1} \|A^\ell z\|_2^2}.$$

The proof of Proposition 3.1 follows by writing $F$ as a sum of $n_{\ell+1}$ sparse linear regression problems, $F(S) = \sum_{m=1}^{n_{\ell+1}} \big( \|A^\ell w^{\ell+1}_m\|_2^2 - \min_{\mathrm{supp}(\tilde{w}_m) \subseteq S} \|A^\ell w^{\ell+1}_m - A^\ell \tilde{w}_m\|_2^2 \big)$, and from the relation established in [Elenberg et al., 2016, Das and Kempe, 2011] between weak submodularity and sparse eigenvalues of the covariance matrix (see Appendix B.1). We use the GREEDY algorithm to select a set $\hat{S} \subseteq V$ of $k$ neurons to keep in layer $\ell$.
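To make Eq. (2) concrete, the sketch below evaluates $F(S)$ directly with a single least-squares solve. It is a naive reference implementation under our own naming, not the paper's code, and it ignores the incremental updates of Proposition 3.2.

```python
import numpy as np

def objective_F(A, W_next, S):
    """F(S) = ||A W||_F^2 - min_Wt ||A W - A_S Wt||_F^2  (Eq. (2)).

    A:      (n, n_l)    activations of layer l on n samples
    W_next: (n_l, n_l1) weights of layer l+1
    S:      list of kept neuron indices
    """
    target = A @ W_next                       # original inputs to layer l+1
    if len(S) == 0:
        residual = target
    else:
        A_S = A[:, S]                         # activations of the kept neurons
        # best reconstruction of the target from the kept activations
        coef, *_ = np.linalg.lstsq(A_S, target, rcond=None)
        residual = target - A_S @ coef
    return np.linalg.norm(target) ** 2 - np.linalg.norm(residual) ** 2
```

At each step, GREEDY adds the neuron whose inclusion in $S$ increases this objective the most.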
As discussed in Section 2, the returned solution is guaranteed to satisfy
$$F(\hat{S}) \geq (1 - e^{-\gamma_{\hat{S},k}}) \max_{|S| \leq k} F(S). \quad (3)$$
Computing the lower bound on the submodularity ratio $\gamma_{\hat{S},k}$ in Proposition 3.1 is NP-hard [Das and Kempe, 2011]. It is non-zero if any $\min\{2k, n_\ell\}$ columns of $A^\ell$ are linearly independent. If the number of training samples is larger than the number of neurons, i.e., $n > n_\ell$, this is likely to be satisfied. We verify that this is indeed the case in our experiments in Appendix E. We also discuss the tightness of the lower bound in Appendix F. We show in Appendix D that $F$ satisfies an even stronger notion of approximate submodularity than weak submodularity, which implies a better approximation guarantee for GREEDY than the one provided in Eq. (3). This, however, requires a stronger assumption: any $k + 1$ columns of $A^\ell$ should be linearly independent, and all rows of $W^{\ell+1}$ should be linearly independent. In particular, we would need $n_\ell \leq n_{\ell+1}$, which is not always satisfied. In Section 6, we show that the approximation guarantee of GREEDY implies an exponentially decreasing bound on the layerwise error, and on the final output error under a mild assumption.

### 3.2 Reweighting

For a fixed $S \subseteq V$, the reweighted input change $\|A^\ell W^{\ell+1} - A^\ell_S \tilde{W}^{\ell+1}\|_F^2$ is minimized by setting
$$\tilde{W}^{\ell+1} = x_S(A^\ell)\, W^{\ell+1}, \quad (4)$$
where $x_S(A^\ell) \in \mathbb{R}^{n_\ell \times n_\ell}$ is the matrix with columns $x_S(a^\ell_j)$ such that
$$x_S(a^\ell_j) \in \arg\min_{\mathrm{supp}(z) \subseteq S} \|A^\ell z - a^\ell_j\|_2^2 \quad \text{for all } j \in V. \quad (5)$$
Note that the new weights are given by $\tilde{w}^{\ell+1}_{im} = w^{\ell+1}_{im} + \sum_{j \notin S} [x_S(A^\ell)]_{ij}\, w^{\ell+1}_{jm}$ for all $i \in S$, and $\tilde{w}^{\ell+1}_{im} = 0$ for all $i \notin S$, $m \in V_{\ell+1}$. Namely, the new weights merge the weights of the dropped neurons into the kept ones. This is the same reweighting procedure introduced in [Mariet and Sra, 2015]. But instead of applying it only at the end to the selected neurons $\hat{S}$, it is implicitly done at each iteration of our pruning method, as it is required to evaluate $F$. We discuss next how this can be done efficiently.

Each iteration of GREEDY requires $O(n_\ell)$ function evaluations of $F$. Computing $F(S)$ from scratch takes $O(k(n_\ell n_{\ell+1} + n(n_\ell + n_{\ell+1})))$ time, so a naive implementation of GREEDY is too expensive. The following proposition outlines how we can efficiently evaluate $F(S + i)$ given that $F(S)$ was computed in the previous iteration.

**Proposition 3.2.** Given $S \subseteq V$ such that $|S| \leq k$ and $i \notin S$, let $\mathrm{proj}_S(a^\ell_j)$ be the projection of $a^\ell_j$ onto the column space of $A^\ell_S$, $R_S(a^\ell_i)$ the corresponding residual of $a^\ell_i$, and $\mathrm{proj}_{R_S(a^\ell_i)}(a^\ell_j)$ the projection of $a^\ell_j$ onto it. We can express $F(S + i)$ in terms of $F(S)$ and $\mathrm{proj}_{R_S(a^\ell_i)}(A^\ell_{V \setminus S})$, where $\mathrm{proj}_{R_S(a^\ell_i)}(A^\ell_{V \setminus S})$ is the matrix with columns $\mathrm{proj}_{R_S(a^\ell_i)}(a^\ell_j)$ for all $j \notin S$, and $0$ otherwise. Assuming $F(S)$, $\mathrm{proj}_S(a^\ell_j)$, and $x_S(a^\ell_j)$ for all $j \notin S$ were computed in the previous iteration, we can compute $F(S + i)$, $\mathrm{proj}_{S+i}(a^\ell_j)$, and $x_{S+i}(a^\ell_j)$ for all $j \notin (S + i)$ in $O(n_\ell(n_{\ell+1} + n + k))$ time. The optimal weights in Eq. (4) can then be computed in $O(k\, n_\ell\, n_{\ell+1})$ time at the end of GREEDY.

The proof is given in Appendix B.2, and relies on using optimality conditions to construct the least squares solution $x_{S+i}(a^\ell_j)$ from $x_S(a^\ell_j)$. In total, GREEDY's runtime is then $O(k\, (n_\ell)^2\, (n_{\ell+1} + n + k))$. In other words, our pruning method costs as much as $O(k)$ forward passes through layer $\ell + 1$ with a batch of size $n$ (assuming $n_{\ell+1} = O(n_\ell)$). Using a faster variant of GREEDY, called STOCHASTIC-GREEDY [Mirzasoleiman et al., 2015], further reduces the cost to $O(\log(1/\epsilon)\, (n_\ell)^2\, (n_{\ell+1} + n + k))$, or equivalently $O(\log(1/\epsilon))$ forward passes through layer $\ell + 1$ with a batch of size $n$, while maintaining almost the same approximation guarantee, $(1 - e^{-\gamma_{\hat{S},k}})$, in expectation. (Mirzasoleiman et al. [2015] only consider submodular functions, but it is straightforward to extend their result to weakly submodular functions; see Appendix B.3.)
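The following is a minimal sketch of the reweighting step of Eq. (4)-(5) for a selected set $S$, written with our own helper names rather than the authors' released code. It solves one least-squares problem to express every column of $A^\ell$ in terms of the kept columns, then merges the dropped weights into the surviving rows of $W^{\ell+1}$; in practice these least-squares solutions are maintained incrementally, as in Proposition 3.2.

```python
import numpy as np

def reweight(A, W_next, S):
    """Return new weights for layer l+1 after keeping only neurons in S (Eq. (4)-(5)).

    A:      (n, n_l)    activations of layer l
    W_next: (n_l, n_l1) original weights of layer l+1
    S:      sorted list of kept neuron indices
    """
    n_l = A.shape[1]
    # x_S(A): coefficients expressing every column a_j in terms of the kept columns A_S
    coef, *_ = np.linalg.lstsq(A[:, S], A, rcond=None)   # shape (|S|, n_l)
    x_S = np.zeros((n_l, n_l))
    x_S[S, :] = coef                                      # support restricted to S
    W_new = x_S @ W_next                                  # rows outside S are zero
    return W_new[S, :]                                    # keep only the surviving rows
```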
Note also that computing the solutions for all budgets $k' \leq k$ can be done at the cost of one run of GREEDY with budget $k$. Our method is more expensive than methods which prune neurons individually [He et al., 2014, Li et al., 2017, Liebenwein et al., 2020, Mussay et al., 2020, 2021, Molchanov et al., 2017, Srinivas and Babu, 2015], but much less expensive than a loss-based method like [Ye et al., 2020a,b], which requires $O(k)$ forward passes through the full network for each layer.

## 4 Pruning regular regions of neurons

In this section, we discuss how to adapt our approach to pruning regular regions of neurons. This is easily achieved by mapping any set of regular regions to the corresponding set of neurons, then applying the same method as in Section 3. In particular, we focus on pruning channels in CNNs.

Given a layer $\ell$ with $n_\ell$ output channels, let $X^\ell_c \in \mathbb{R}^{np \times r_h r_w}$ be the activations of output channel $c$ for each training input, where $p$ is the number of patches obtained by applying a filter of size $r_h \times r_w$, and let $F^{\ell+1} \in \mathbb{R}^{n_{\ell+1} \times n_\ell r_h r_w}$ be the weights of layer $\ell + 1$, corresponding to $n_\ell$ filters of size $r_h \times r_w$ for each of its output channels. When an output channel is pruned in layer $\ell$, the corresponding weights in $F^\ell$ and $F^{\ell+1}$ are removed. Pruning $n_\ell - k$ output channels in layer $\ell$ reduces the number of parameters and the computation cost by $(n_\ell - k)/n_\ell$ for both layers $\ell$ and $\ell + 1$. If layer $\ell$ is followed by a batch norm layer, the weights therein corresponding to the pruned channels are also removed.

We arrange the activations $X^\ell_c \in \mathbb{R}^{np \times r_h r_w}$ of each channel $c$ into $r_h r_w$ columns of $A^\ell \in \mathbb{R}^{np \times n_\ell r_h r_w}$, i.e., $A^\ell = [X^\ell_1, \dots, X^\ell_{n_\ell}]$. Similarly, we arrange the weights $F^{\ell+1}_c \in \mathbb{R}^{n_{\ell+1} \times r_h r_w}$ of each channel $c$ into $r_h r_w$ rows of $W^{\ell+1} \in \mathbb{R}^{n_\ell r_h r_w \times n_{\ell+1}}$, i.e., $(W^{\ell+1})^\top = [(F^{\ell+1}_1)^\top, \dots, (F^{\ell+1}_{n_\ell})^\top]$. Recall that $V = \{1, \dots, n_\ell\}$, and let $V' = \{1, \dots, r_h r_w n_\ell\}$. We define a function $M : 2^V \to 2^{V'}$ which maps every channel $c$ to its corresponding $r_h r_w$ columns in $A^\ell$. Let $G(S) = F(M(S))$, with $F$ defined in Eq. (2); then minimizing the reweighted input change $\|A^\ell W^{\ell+1} - A^\ell_{M(S)} \tilde{W}^{\ell+1}\|_F^2$ with a budget $k$ is equivalent to $\max_{|S| \leq k} G(S)$. The following proposition shows that this remains a weakly submodular maximization problem.

**Proposition 4.1.** Given $U \subseteq V$, $k \in \mathbb{N}_+$, $G$ is a normalized non-decreasing $\gamma_{U,k}$-weakly submodular function, with
$$\gamma_{U,k} \geq \frac{\min_{\|z\|_2 = 1,\, \|z\|_0 \leq r_h r_w(|U| + k)} \|A^\ell z\|_2^2}{\max_{\|z\|_2 = 1,\, \|z\|_0 \leq r_h r_w(|U| + 1)} \|A^\ell z\|_2^2}.$$

Proof sketch. $G$ is $\gamma_{U,k}$-weakly submodular iff $F$ satisfies $\gamma_{U,k}\, F(M(S) \mid M(L)) \leq \sum_{i \in S} F(M(i) \mid M(L))$ for every two disjoint sets $L, S \subseteq V$ such that $L \subseteq U$, $|S| \leq k$. The proof follows by extending the relation established in [Elenberg et al., 2016, Das and Kempe, 2011] between weak submodularity and sparse eigenvalues of the covariance matrix to this case.

As before, we use the GREEDY algorithm, with function $G$, to select a set $\hat{S} \subseteq V$ of $k$ channels to keep in layer $\ell$. We get the same approximation guarantee $G(\hat{S}) \geq (1 - e^{-\gamma_{\hat{S},k}}) \max_{|S| \leq k} G(S)$. The submodularity ratio $\gamma_{\hat{S},k}$ is non-zero if any $\min\{2k, n_\ell\}\, r_h r_w$ columns of $A^\ell$ are linearly independent.
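The sketch below illustrates one way to build $A^\ell$ and the map $M$ for a convolutional layer using `torch.nn.functional.unfold`; the tensor names and the exact patch extraction are our assumptions, not taken from the paper's code.

```python
import torch
import torch.nn.functional as F_t

def build_A(acts, kernel_size, stride=1, padding=0):
    """Arrange conv activations into A of shape (n*p, n_l*r_h*r_w), so that
    output channel c occupies r_h*r_w consecutive columns.

    acts: (n, n_l, H, W) activations of layer l on n inputs; kernel_size,
    stride, padding are those of layer l+1, whose filters act on these patches.
    """
    patches = F_t.unfold(acts, kernel_size, stride=stride, padding=padding)
    # patches: (n, n_l*r_h*r_w, p)  ->  (n*p, n_l*r_h*r_w)
    return patches.permute(0, 2, 1).reshape(-1, patches.shape[1])

def M(channels, r_h, r_w):
    """Map a set of channel indices to the corresponding column indices of A."""
    return [c * r_h * r_w + j for c in sorted(channels) for j in range(r_h * r_w)]
```

With $A^\ell$ and $M$ in hand, greedy channel selection maximizes $G(S) = F(M(S))$ exactly as in Section 3.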
In our experiments, we observe that in certain layers linear independence only holds for very small $k$, e.g., $k \leq 0.01\, n_\ell$. This is due to the correlation between patches which overlap. To remedy this, we experimented with using only $r_h r_w$ random patches from each image, instead of all patches. This indeed raises the rank of $A^\ell$, but certain layers have a very small feature map size, so that even this small number of random patches has significant overlap, which still results in a very small range where linear independence holds, e.g., $k \leq 0.08\, n_\ell$ (see Appendix E for more details). The results obtained with random patches were worse than the ones with all patches, so we omit them. Note that our lower bounds on $\gamma_{\hat{S},k}$ are not necessarily tight (see Appendix F). Hence, linear dependence does not necessarily imply that $\gamma_{\hat{S},k} = 0$; our method still performs well in these cases. For a fixed $S \subseteq V$, the optimal weights are again given by $\tilde{W}^{\ell+1} = x_{M(S)}(A^\ell)\, W^{\ell+1}$. The cost of running GREEDY and reweighting is the same as before (see Appendix B.2).

## 5 Pruning multiple layers

In this section, we explain how to apply our pruning method to prune multiple layers of a NN.

### 5.1 Reweighted input change pruning variants

We consider three variants of our method: LAYERINCHANGE, SEQINCHANGE, and ASYMINCHANGE. In LAYERINCHANGE, we prune each layer independently, i.e., we apply exactly the method of Section 3 or 4, according to the layer's type. This is the fastest variant; it has the same cost as pruning a single layer, as each layer can be pruned in parallel, and it only requires one forward pass to get the activations of all layers. However, it does not take into account the effect of pruning one layer on subsequent layers. In SEQINCHANGE, we prune each layer sequentially, starting from the earliest layer to the latest one. For each layer $\ell$, we apply our method with $A^\ell$ replaced by the updated activations $B^\ell$ obtained after having pruned the previous layers, i.e., we solve $\min_{|S| \leq k,\, \tilde{W}^{\ell+1} \in \mathbb{R}^{n_\ell \times n_{\ell+1}}} \|B^\ell W^{\ell+1} - B^\ell_S \tilde{W}^{\ell+1}\|_F^2$. In ASYMINCHANGE, we also prune each layer sequentially, but to avoid the accumulation of error, we use an asymmetric formulation of the reweighted input change: instead of approximating the updated input $B^\ell W^{\ell+1}$, we approximate the original input $A^\ell W^{\ell+1}$, i.e., we solve $\min_{|S| \leq k,\, \tilde{W}^{\ell+1} \in \mathbb{R}^{n_\ell \times n_{\ell+1}}} \|A^\ell W^{\ell+1} - B^\ell_S \tilde{W}^{\ell+1}\|_F^2$. This problem is still a weakly submodular maximization problem, with the same submodularity ratio given in Propositions 3.1 and 4.1, with $A^\ell$ replaced by $B^\ell$ therein (see Appendix B.1). Hence, the same approximation guarantee as in the symmetric formulation holds here. Moreover, a better approximation guarantee can again be obtained under stronger assumptions (see Appendix D). The cost of running GREEDY with the asymmetric formulation and reweighting is also the same as before (see Appendix B.2). In Section 6, we show that the sequential variants of our method both have an exponential error rate, which is faster for the asymmetric variant. We evaluate all three variants in our experiments. As expected, ASYMINCHANGE usually performs the best, and LAYERINCHANGE the worst.
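To illustrate the difference between SEQINCHANGE and ASYMINCHANGE, here is a schematic, self-contained sketch of one sequential pruning step; the function names are ours, not the paper's API, and the naive greedy loop below omits the efficient incremental updates of Proposition 3.2.

```python
import numpy as np

def greedy_select(B, target, k):
    """Greedily pick k columns of B that best reconstruct `target`
    in the least-squares sense (the objective of Eq. (2))."""
    S, base = [], np.linalg.norm(target) ** 2
    for _ in range(k):
        best_i, best_gain = None, -np.inf
        for i in range(B.shape[1]):
            if i in S:
                continue
            cols = S + [i]
            coef, *_ = np.linalg.lstsq(B[:, cols], target, rcond=None)
            gain = base - np.linalg.norm(target - B[:, cols] @ coef) ** 2
            if gain > best_gain:
                best_i, best_gain = i, gain
        S.append(best_i)
    return sorted(S)

def prune_layer(A, B, W_next, k, asymmetric=True):
    """One sequential step: SeqInChange (asymmetric=False) approximates the
    already-pruned input B @ W_next; AsymInChange (True) approximates the
    original input A @ W_next.  Returns kept indices and reweighted weights."""
    target = (A if asymmetric else B) @ W_next
    S = greedy_select(B, target, k)
    W_new, *_ = np.linalg.lstsq(B[:, S], target, rcond=None)
    return S, W_new
```

Between layers, the updated activations $B^{\ell+1}$ would be recomputed with a forward pass of the already-pruned network before the next call.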
### 5.2 Per-layer budget selection

Another important design choice is how much to prune in each layer, given a desired global compression ratio (see Appendix I for the effect of this choice on performance). In our experiments, we use the budget selection method introduced in [Kuzmin et al., 2019, Section 3.4.1], which can be applied to any layerwise pruning method, thus enabling a fair comparison. Given a network with $L$ layers to prune, let $c = \frac{\text{original size}}{\text{pruned size}}$ be the desired compression ratio. We want to select, for each layer $\ell$, the number of neurons/channels $k_\ell = \alpha_\ell n_\ell$ to keep, with $\alpha_\ell$ chosen from a fixed set of possible values, e.g., $\alpha_\ell \in \{0.05, 0.1, \dots, 1\}$. We define a layerwise accuracy metric $P_\ell(k_\ell)$ as the accuracy obtained after pruning layer $\ell$ with a budget $k_\ell$, while the other layers are kept intact, evaluated on a verification set. We set aside a subset of the training set to use as a verification set. Let $P_{\text{orig}}$ be the original model accuracy, $C_{\text{orig}}$ the original model size, and $C(k_1, \dots, k_L)$ the pruned model size. We select the per-layer budgets that minimize the per-layer accuracy drop while satisfying the required compression ratio:
$$\min_{k_1, \dots, k_L} \big\{\, \epsilon : \forall \ell \in [L],\; P_\ell(k_\ell) \geq P_{\text{orig}} - \epsilon,\; C(k_1, \dots, k_L) \leq C_{\text{orig}}/c \,\big\}. \quad (6)$$
We can solve the selection problem (6) using binary search if the layerwise accuracy $P_\ell(k_\ell)$ is a non-decreasing function of $k_\ell$. Empirically, this is not always the case: the general trend is non-decreasing, but some fluctuations occur. In such cases, we use interpolation to ensure monotonicity. Alternatively, another simple strategy is to prune each layer until the per-layer error (the reweighted input change in our case) reaches some threshold $\theta$, and to vary $\theta$ to obtain the desired compression ratio, as done in [Zhuang et al., 2018, Ye et al., 2020a].
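The following sketch shows one possible implementation of the selection rule (6) by binary searching over the tolerated per-layer accuracy drop; the data layout and helper names are our assumptions, and the interpolation step that enforces monotonicity is omitted.

```python
def select_budgets(curves, layer_params, orig_acc, orig_size, c, tol=1e-3):
    """Pick per-layer budgets k_l following Eq. (6), by binary search over eps.

    curves[l]:       dict {candidate k_l: layerwise accuracy P_l(k_l)},
                     assumed non-decreasing in k_l (after interpolation)
    layer_params[l]: function k_l -> parameter count of layers l and l+1
    """
    def budgets_for(eps):
        ks = []
        for curve in curves:
            feasible = [k for k, acc in sorted(curve.items()) if acc >= orig_acc - eps]
            if not feasible:
                return None
            ks.append(feasible[0])           # smallest budget within the drop eps
        return ks

    lo, hi, best = 0.0, 1.0, None            # accuracy drop lies in [0, 1]
    while hi - lo > tol:
        eps = (lo + hi) / 2
        ks = budgets_for(eps)
        if ks is not None and sum(p(k) for p, k in zip(layer_params, ks)) <= orig_size / c:
            best, hi = ks, eps               # feasible: try a smaller drop
        else:
            lo = eps                         # infeasible: allow a larger drop
    return best                              # None if no feasible configuration exists
```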
## 6 Error convergence rate

In this section, we provide the error rate of our proposed method. The omitted proofs are given in Appendix C. We first show that the change in the input to the next layer induced by pruning with our method, with both the symmetric and the asymmetric formulation, decays at an exponentially fast rate.

**Proposition 6.1.** Let $\hat{S}$ be the output of the GREEDY algorithm and $\hat{W}^{\ell+1}$ the corresponding optimal weights (Eq. (4)); then
$$\|A^\ell W^{\ell+1} - A^\ell_{\hat{S}} \hat{W}^{\ell+1}\|_F^2 \leq e^{-\gamma_{\hat{S},n_\ell} k/n_\ell}\, \|A^\ell W^{\ell+1}\|_F^2,$$
and, for the asymmetric formulation,
$$\|A^\ell W^{\ell+1} - B^\ell_{\hat{S}} \hat{W}^{\ell+1}\|_F^2 \leq e^{-\gamma_{\hat{S},n_\ell} k/n_\ell}\, \|A^\ell W^{\ell+1}\|_F^2 + (1 - e^{-\gamma_{\hat{S},n_\ell} k/n_\ell}) \min_{\tilde{W}^{\ell+1} \in \mathbb{R}^{n_\ell \times n_{\ell+1}}} \|A^\ell W^{\ell+1} - B^\ell \tilde{W}^{\ell+1}\|_F^2.$$

This follows by extending the approximation guarantee of GREEDY in [Elenberg et al., 2016, Das and Kempe, 2011] to $F(\hat{S}) \geq (1 - e^{-\gamma_{\hat{S},n_\ell} k/n_\ell}) \max_{|S| \leq n_\ell} F(S)$. Note that this bound uses the submodularity ratio $\gamma_{\hat{S},n_\ell}$, for which the lower bound in Proposition 3.1 is non-zero only if all columns of $A^\ell$ are linearly independent, which is more restrictive. Though, as discussed earlier, this bound is not necessarily tight. We can further extend this exponential layerwise error rate to an exponential rate on the final output error, if we assume, as in [Ye et al., 2020b], that the function corresponding to all layers coming after layer $\ell$ is Lipschitz continuous.

**Corollary 6.2.** Let $y$ be the original model output, $y_{\hat{S}}$ the output after layer $\ell$ is pruned using our method, and $H$ the function corresponding to all layers coming after layer $\ell$, i.e., $y = H(A^\ell W^{\ell+1})$, $y_{\hat{S}} = H(A^\ell_{\hat{S}} \hat{W}^{\ell+1})$. If $H$ is Lipschitz continuous with constant $\|H\|_{\mathrm{Lip}}$, then
$$\|y - y_{\hat{S}}\|_2^2 \leq e^{-\gamma_{\hat{S},n_\ell} k/n_\ell}\, \|H\|_{\mathrm{Lip}}^2\, \|A^\ell W^{\ell+1}\|_F^2.$$

Proof. Since $H$ is Lipschitz continuous, we have $\|y - y_{\hat{S}}\|_2 \leq \|H\|_{\mathrm{Lip}}\, \|A^\ell W^{\ell+1} - A^\ell_{\hat{S}} \hat{W}^{\ell+1}\|_F$. The claim then follows from Proposition 6.1.

This matches the exponential convergence rate achieved by the local imitation method in [Ye et al., 2020b, Theorem 1], albeit with a different constant. Under the same assumption, we can show that pruning multiple layers with the sequential variants of our method, SEQINCHANGE and ASYMINCHANGE, also admits an exponential convergence rate:

**Corollary 6.3.** Let $y$ be the original model output, $y_{\hat{S}_\ell}$ and $y_{\bar{S}_\ell}$ the outputs after layers $1$ to $\ell$ are sequentially pruned using SEQINCHANGE and ASYMINCHANGE, respectively, and $H_{\ell'}$ the function corresponding to all (unpruned) layers coming after layer $\ell'$. If every function $H_{\ell'}$ is Lipschitz continuous with constant $\|H_{\ell'}\|_{\mathrm{Lip}}$, then $\|y - y_{\hat{S}_\ell}\|_2^2$ is bounded by an accumulation, over the pruned layers $\ell' \leq \ell$, of terms of the form $e^{-\gamma_{\hat{S}_{\ell'}, n_{\ell'}} k_{\ell'}/n_{\ell'}}\, \|H_{\ell'}\|_{\mathrm{Lip}}^2\, \|A^{\ell'} W^{\ell'+1}\|_F^2$, and $\|y - y_{\bar{S}_\ell}\|_2^2$ by the same terms further attenuated by factors of the form $(1 - e^{-\gamma_{\bar{S}_{\ell'}, n_{\ell'}} k_{\ell'}/n_{\ell'}})$.

The result is obtained by iteratively applying Proposition 6.1 to the error incurred after each layer is pruned. The rate of SEQINCHANGE matches the exponential convergence rate achieved by the local imitation method in [Ye et al., 2020b, Theorem 6]. The bound for ASYMINCHANGE is stronger, confirming that the asymmetric formulation indeed reduces the accumulation of errors.

## 7 Empirical Evaluation

In this section, we examine the performance of our proposed pruning method in the limited-data regime. To that end, we focus on one-shot pruning, in which a pre-trained model is compressed in a single step, without any fine-tuning. We study the effect of fine-tuning with both limited and sufficient data in Appendix H. We compare the three variants of our method, LAYERINCHANGE, SEQINCHANGE, and ASYMINCHANGE, with the following baselines:

- LAYERGREEDYFS [Ye et al., 2020a]: for each layer, first removes all neurons/channels in that layer, then gradually adds back the neuron/channel that yields the largest decrease of the loss, evaluated on one batch of training data. Layers are pruned sequentially from the input to the output layer.
- LAYERSAMPLING [Liebenwein et al., 2020]: samples neurons/channels, in each layer, with probabilities proportional to sensitivities based on (activations × weights), and prunes the rest.
- ACTGRAD [Molchanov et al., 2017]: prunes neurons/channels with the lowest (activations × gradients), averaged over the training data, with layerwise $\ell_2$-normalization (see the sketch after this list).
- LAYERACTGRAD: prunes neurons/channels with the lowest (activations × gradients), averaged over the training data, in each layer. This is the layerwise variant of ACTGRAD.
- LAYERWEIGHTNORM [Li et al., 2017]: prunes neurons/channels with the lowest output weights $\ell_1$-norm, in each layer.
- RANDOM: prunes randomly selected neurons/channels globally across layers in the network.
- LAYERRANDOM: prunes randomly selected neurons/channels in each layer.
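For reference, here is a rough sketch of how the ACTGRAD and LAYERWEIGHTNORM importance scores described above could be computed; this reflects our reading of those criteria, not the baselines' original code.

```python
import torch

def actgrad_scores(acts, grads):
    """ActGrad-style score: |activation * gradient| per unit, averaged over the
    data, with layerwise l2-normalization.  acts, grads: (n, units)."""
    scores = (acts * grads).abs().mean(dim=0)
    return scores / (scores.norm() + 1e-12)

def weightnorm_scores(W_next):
    """LayerWeightNorm-style score: l1-norm of each unit's outgoing weights.
    W_next: (n_l, n_l1); row i holds the weights leaving unit i."""
    return W_next.abs().sum(dim=1)

# In both cases, the units with the lowest scores are the ones pruned.
```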
We also considered the global variant of LAYERWEIGHTNORM proposed in [He et al., 2014], but we exclude it from the plots, as it is always the worst performing method. We evaluate the performance of these methods on the LeNet model [LeCun et al., 1989] on the MNIST dataset [LeCun et al., 1998], and on the ResNet56 [He et al., 2016] and VGG11 [Simonyan and Zisserman, 2015] models on the CIFAR-10 dataset [Krizhevsky et al., 2009]. To ensure a fair comparison, all experiments are based on our own implementation of all the compared methods. To compute the gradients and activations used for pruning in LAYERSAMPLING, ACTGRAD, LAYERACTGRAD, and our method's variants, we use four batches of 128 training images, i.e., $n = 512$, which corresponds to 1% of the training data in MNIST and CIFAR-10. We consider two variants of the method proposed in [Ye et al., 2020a]: a limited-data variant LAYERGREEDYFS, which only uses the same four batches of data used in our method, and a full-data variant LAYERGREEDYFS-fd, with access to the full training data. We report top-1 accuracy results evaluated on the validation set, as we vary the compression ratio (original size / pruned size). Unless otherwise specified, we use the per-layer budget selection method described in Section 5.2 for all the layerwise pruning methods, except for LAYERSAMPLING, for which we use its own budget selection strategy provided in [Liebenwein et al., 2020].

We use a subset of the training set, of the same size as the validation set, as a verification set for the budget selection method. To disentangle the benefit of using our pruning method from the benefit of reweighting (Section 3.2), we report results with reweighting applied to all pruning methods, or to none of them. We focus our analysis, though, on the more interesting results with reweighting, with the plots without reweighting mostly serving as a demonstration of the benefit of reweighting. Results are averaged over five random runs, with standard deviations plotted as error bars. We report the speedup (original number of FLOPs / pruned number of FLOPs) and pruning time values in Appendix J. For additional details on the experimental set-up, see Appendix G. The code for reproducing all experiments is available at https://github.com/marwash25/subpruning.

**LeNet on MNIST.** We pre-train the LeNet model on MNIST, achieving 97.75% top-1 accuracy. We prune all layers except the last classifier layer. Results are presented in Figure 1 (left). All three variants of our method consistently outperform the other baselines, even when reweighting is applied to them, with ASYMINCHANGE doing the best and LAYERINCHANGE the worst. We observe that reweighting significantly improves the performance of all methods except the LAYERGREEDYFS variants.

**ResNet56 on CIFAR-10.** We use the ResNet56 model pre-trained on CIFAR-10 provided in ShrinkBench [Blalock et al., 2020], which achieves 92.27% top-1 accuracy. We prune all layers except the last layer in each residual branch, the last layer before each residual branch, and the last classifier layer. Results are presented in Figure 1 (middle). The sequential variants of our method perform the best. Their performance is closely matched by LAYERWEIGHTNORM and ACTGRAD (with reweighting) for most compression ratios, except very large ones. LAYERINCHANGE performs significantly worse here than the sequential variants of our method. This is likely due to the larger number of layers pruned in ResNet56 compared to LeNet (27 vs 4 layers), which increases the effect of pruning earlier layers on subsequent ones. Here also, reweighting improves the performance of all methods except the LAYERGREEDYFS variants.

**VGG11 on CIFAR-10.** We pre-train the VGG11 model on CIFAR-10, obtaining 90.11% top-1 accuracy. We prune all layers except the last features layer and the last classifier layer. Results are presented in Figure 1 (right). The three variants of our method perform the best. Their performance is matched by ACTGRAD and LAYERWEIGHTNORM (with reweighting). LAYERINCHANGE performs similarly to the sequential variants of our method here, even slightly better at compression ratio 32, probably because the number of layers being pruned is again relatively small (9 layers). As before, reweighting benefits all methods except the LAYERGREEDYFS variants.

**Discussion.** We summarize our observations from the empirical results. Our proposed pruning method outperforms state-of-the-art structured pruning methods in various one-shot pruning settings. As expected, ASYMINCHANGE is the best performing variant of our method, and LAYERINCHANGE the worst, with its performance deteriorating with deeper models. Our results also illustrate the robustness of our method, as it reliably yields the best results in various settings, while other baselines perform well in some settings but not in others. Reweighting significantly improves performance for all methods, except LAYERGREEDYFS and LAYERGREEDYFS-fd.
We suspect that reweighting does not help in this case because this method already scales the next layer's weights, and it takes this scaling into account when selecting neurons/channels to keep, so replacing it with reweighting can hurt performance. The choice of how much to prune in each layer given a global budget can have a drastic effect on performance, as illustrated in Appendix I. Fine-tuning with full training data boosts performance more than reweighting, while fine-tuning with limited data helps less, as illustrated in Appendix H. Reweighting still helps when fine-tuning with limited data, except for the LAYERGREEDYFS variants, but it can actually deteriorate performance when fine-tuning with full data. Our method still outperforms the other baselines after fine-tuning with limited data, and is among the best performing methods even in the full-data setting.

Figure 1: Top-1 accuracy of different pruning methods applied to LeNet on MNIST (left), ResNet56 on CIFAR-10 (middle), and VGG11 on CIFAR-10 (right), for several compression ratios (in log-scale), with (top) and without (bottom) reweighting. We include the three reweighted variants of our method in the bottom plots (faded) for reference.

## 8 Conclusion

We proposed a data-efficient structured pruning method based on submodular optimization. By casting the layerwise subset selection problem as a weakly submodular maximization problem, we are able to use the GREEDY algorithm to provably approximate it. Empirically, our method consistently outperforms existing structured pruning methods on different network architectures and datasets.

## Acknowledgments and Disclosure of Funding

We thank Stefanie Jegelka, Debadeepta Dey, and Jose Gallego-Posada for their helpful discussions, and Yan Zhang and Boris Knyazev for their help with running experiments. We also acknowledge the MIT SuperCloud and Lincoln Laboratory Supercomputing Center (supercloud.mit.edu), Compute Canada (www.computecanada.ca), Calcul Québec (www.calculquebec.ca), WestGrid (www.westgrid.ca), ACENET (ace-net.ca), the Mila IDT team, Idiap Research Institute, and the Machine Learning Research Group at the University of Guelph for providing HPC resources that have contributed to the research results reported within this paper. This research was partially supported by the Canada CIFAR AI Chair Program. Simon Lacoste-Julien is a CIFAR Associate Fellow in the Learning Machines & Brains program.

## References

A. A. Bian, J. M. Buhmann, A. Krause, and S. Tschiatschek. Guarantees for greedy maximization of non-submodular functions with applications. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 498-507. JMLR.org, 2017.

D. Blalock, J. J. G. Ortiz, J. Frankle, and J. Guttag. What is the state of neural network pruning? arXiv preprint arXiv:2003.03033, 2020.

C. Bucila, R. Caruana, and A. Niculescu-Mizil. Model compression. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '06, pages 535-541, New York, NY, USA, 2006. Association for Computing Machinery. ISBN 1595933395. doi: 10.1145/1150402.1150464. URL https://doi.org/10.1145/1150402.1150464.

S. Buschjäger, P.-J. Honysz, and K. Morik. Very fast streaming submodular function maximization, 2020.

M. D. Collins and P. Kohli. Memory bounded deep convolutional networks. arXiv preprint arXiv:1412.1442, 2014.

M. Courbariaux, Y. Bengio, and J.-P. David.
BinaryConnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems, pages 3123-3131, 2015.

A. Das and D. Kempe. Submodular meets spectral: Greedy algorithms for subset selection, sparse approximation and dictionary selection. arXiv preprint arXiv:1102.3975, 2011.

M. Denil, B. Shakibi, L. Dinh, M. A. Ranzato, and N. de Freitas. Predicting parameters in deep learning. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013. URL https://proceedings.neurips.cc/paper/2013/file/7fec306d1e665bc9c748b5d2b99a6e97-Paper.pdf.

M. El Halabi, F. Bach, and V. Cevher. Combinatorial penalties: Structure preserved by convex relaxations. Proceedings of the 21st International Conference on Artificial Intelligence and Statistics, 2018.

E. R. Elenberg, R. Khanna, A. G. Dimakis, and S. Negahban. Restricted strong convexity implies weak submodularity. arXiv preprint arXiv:1612.00804, 2016.

Y. Gong, L. Liu, M. Yang, and L. Bourdev. Compressing deep convolutional networks using vector quantization. arXiv preprint arXiv:1412.6115, 2014.

K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.

T. He, Y. Fan, Y. Qian, T. Tan, and K. Yu. Reshaping deep neural network for fast decoding by node-pruning. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 245-249. IEEE, 2014.

Y. He, X. Zhang, and J. Sun. Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 1389-1397, 2017.

G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. Neural Information Processing Systems (NeurIPS) Workshops, 2015.

T. Hoefler, D. Alistarh, T. Ben-Nun, N. Dryden, and A. Peste. Sparsity in deep learning: Pruning and growth for efficient inference and training in neural networks. arXiv preprint arXiv:2102.00554, 2021.

A. Krizhevsky, G. Hinton, et al. Learning multiple layers of features from tiny images. 2009.

A. Kuzmin, M. Nagel, S. Pitre, S. Pendyam, T. Blankevoort, and M. Welling. Taxonomy and evaluation of structured compression of convolutional neural networks. arXiv preprint arXiv:1912.09802, 2019.

V. Lebedev, Y. Ganin, M. Rakhuba, I. V. Oseledets, and V. S. Lempitsky. Speeding-up convolutional neural networks using fine-tuned CP-decomposition. In International Conference on Learning Representations, 2015.

Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541-551, 1989.

Y. LeCun, C. Cortes, and C. Burges. The MNIST database of handwritten digits, 1998.

B. Lehmann, D. Lehmann, and N. Nisan. Combinatorial auctions with decreasing marginal utilities. Games and Economic Behavior, 55(2):270-296, 2006.

H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf. Pruning filters for efficient convnets.
In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017. URL https://openreview.net/forum?id=rJqFGTslg.

W. Li, M. Feldman, E. Kazemi, and A. Karbasi. Submodular maximization in clean linear time, 2022. URL https://arxiv.org/abs/2006.09327.

L. Liebenwein, C. Baykal, H. Lang, D. Feldman, and D. Rus. Provable filter pruning for efficient neural networks. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=BJxkOlSYDH.

J.-H. Luo, J. Wu, and W. Lin. ThiNet: A filter level pruning method for deep neural network compression. In Proceedings of the IEEE International Conference on Computer Vision, pages 5058-5066, 2017.

Z. Mariet and S. Sra. Diversity networks: Neural network compression using determinantal point processes. arXiv preprint arXiv:1511.05077, 2015.

K. McGuffie and A. Newhouse. The radicalization risks of GPT-3 and advanced neural language models. arXiv preprint arXiv:2009.06807, 2020.

B. Mirzasoleiman, A. Badanidiyuru, A. Karbasi, J. Vondrák, and A. Krause. Lazier than lazy greedy. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 29, 2015.

P. Molchanov, S. Tyree, T. Karras, T. Aila, and J. Kautz. Pruning convolutional neural networks for resource efficient inference. ICLR, 2017.

B. Mussay, M. Osadchy, V. Braverman, S. Zhou, and D. Feldman. Data-independent neural pruning via coresets. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=H1gmHaEKwB.

B. Mussay, D. Feldman, S. Zhou, V. Braverman, and M. Osadchy. Data-independent structured pruning of neural networks via coresets. IEEE Transactions on Neural Networks and Learning Systems, pages 1-13, 2021. doi: 10.1109/TNNLS.2021.3088587.

B. K. Natarajan. Sparse approximate solutions to linear systems. SIAM Journal on Computing, 24(2):227-234, 1995.

G. Nemhauser, L. Wolsey, and M. Fisher. An analysis of approximations for maximizing submodular set functions I. Mathematical Programming, 14(1):265-294, 1978.

A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. 2017.

H. Phan. huyvnphan/pytorch_cifar10, Jan. 2021. URL https://doi.org/10.5281/zenodo.4431043.

K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In Y. Bengio and Y. LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1409.1556.

S. Srinivas and R. V. Babu. Data-free parameter pruning for deep neural networks. In Proceedings of the British Machine Vision Conference (BMVC), pages 31.1-31.12. BMVA Press, September 2015.

J. Su, J. Li, B. Bhattacharjee, and F. Huang. Tensorial neural networks: Generalization of neural networks and application to model compression. arXiv preprint arXiv:1805.10352, 2018.

M. Sviridenko, J. Vondrák, and J. Ward. Optimal approximation for submodular and supermodular optimization with bounded curvature. Mathematics of Operations Research, 42(4):1197-1218, 2017.
E. Voita, D. Talbot, F. Moiseev, R. Sennrich, and I. Titov. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5797-5808, 2019.

M. Ye, C. Gong, L. Nie, D. Zhou, A. Klivans, and Q. Liu. Good subnetworks provably exist: Pruning via greedy forward selection. ICML, 2020a.

M. Ye, L. Wu, and Q. Liu. Greedy optimization provably wins the lottery: Logarithmic number of winning tickets is enough. Advances in Neural Information Processing Systems, 33:16409-16420, 2020b.

Z. Zhuang, M. Tan, B. Zhuang, J. Liu, Y. Guo, Q. Wu, J. Huang, and J. Zhu. Discrimination-aware channel pruning for deep neural networks. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper/2018/file/55a7cf9c71f1c9c495413f934dd1a158-Paper.pdf.