# SPDY: Accurate Pruning with Speedup Guarantees

Elias Frantar 1, Dan Alistarh 1 2

1 IST Austria, 2 Neural Magic. Correspondence to: Elias Frantar, Dan Alistarh. Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

The recent focus on the efficiency of deep neural networks (DNNs) has led to significant work on model compression approaches, of which weight pruning is one of the most popular. At the same time, there is rapidly-growing computational support for efficiently executing the unstructured-sparse models obtained via pruning. Yet, most existing pruning methods minimize just the number of remaining weights, i.e. the size of the model, rather than optimizing for inference time. We address this gap by introducing SPDY, a new compression method which automatically determines layer-wise sparsity targets achieving a desired inference speedup on a given system, while minimizing accuracy loss. SPDY is the composition of two new techniques. The first is an efficient and general dynamic programming algorithm for solving constrained layer-wise compression problems, given a set of layer-wise error scores. The second technique is a local search procedure for automatically determining such scores in an accurate and robust manner. Experiments across popular vision and language models show that SPDY guarantees speedups while recovering higher accuracy relative to existing strategies, both for one-shot and gradual pruning scenarios, and is compatible with most existing pruning approaches. We also extend our approach to the recently-proposed task of pruning with very little data, where we achieve the best known accuracy recovery when pruning to the GPU-supported 2:4 sparsity pattern.

1. Introduction

Increasing the efficiency of deep neural networks (DNNs) has the potential not just to reduce the cost of compute- and energy-hungry models, but also to make them more readily available and privacy-conscious, by allowing model execution on end-devices. There are many approaches for compressing DNNs for increased efficiency (Hoefler et al., 2021; Gholami et al., 2021). The one we focus on in this paper is sparsity, i.e. setting to zero a large fraction of the values in the big parameter matrices of a neural network. Sparsification of neural networks has a long history (LeCun et al., 1990), and current pruning techniques can shrink models by more than an order of magnitude, while largely preserving accuracy (Hoefler et al., 2021).

Pruning methods are usually categorized by the granularity at which they are applied. Structured pruning aims to drop groups of related weights, e.g. filters in a convolutional layer, which directly leads to improved inference speeds, but is usually limited in the compression amount before significant accuracy loss occurs. Unstructured pruning methods have traditionally been focused on reducing model size, as they drop individual weights in often random patterns, which is harder to translate into faster execution. More recent semi-structured methods (Mishra et al., 2021; Zhou et al., 2021; Lagunas et al., 2021) trade off additional structure in the zeroed weights, e.g. small rectangular blocks, with higher accuracy loss relative to unstructured pruning.
On the runtime side, increasingly advanced algorithms have been introduced to provide computational speedups also for unstructured sparse models, whether executed on CPUs (Kurtz et al., 2020; Neural Magic, 2021; Elsen et al., 2020), GPUs (Gale et al., 2020), or specialized hardware (Han et al., 2015; Dave et al., 2021). Currently, unstructured pruning provides some of the best compression-to-accuracy trade-offs among existing approaches (Hoefler et al., 2021), and efficient sparse inference on CPUs, which is our main focus, is particularly interesting in terms of accessibility and cost. Further, several commodity CPUs, e.g. current AMD models, do not support efficient quantized arithmetic, making sparsity their primary means of accelerating inference.

One key practical issue is that state-of-the-art unstructured pruning methods, e.g. (Evci et al., 2020; Singh & Alistarh, 2020; Schwarz et al., 2021; Peste et al., 2021), do not directly take the behavior of acceleration methods into account, while existing speedup-aware structured pruning methods are not straightforward to adapt to the unstructured case. This means that sparsifying a model to reach a certain speed, rather than a certain size, with minimal accuracy loss, is still an extremely laborious process.

Figure 1. Speedup and relative performance-measure drop trade-off after gradual pruning for SPDY and baselines (Uniform, GMP) on various models (ResNet50, YOLOv5-S, YOLOv5-M, BERT SQuAD).

Contribution. We provide a general solution to this problem, called learned efficient Sparsity Profiles via DYnamic programming search (SPDY). SPDY automatically determines how much to prune each individual network layer to achieve a desired speedup on a given inference engine and hardware combination, while minimizing accuracy loss. The underlying optimization problem solved by SPDY is very general, as it also occurs in the context of structured pruning (He et al., 2018), non-uniform layer quantization (Hubara et al., 2021b; Yao et al., 2021), low-rank decomposition (Liebenwein et al., 2021), or even gradient compression (Markov et al., 2021). While we focus on unstructured pruning here, as it is an under-explored area, our approach should extend to these other settings as well.

First, unstructured pruning with a speedup target can be viewed as a constrained optimization problem in terms of layer-wise execution timings and so-called layer-wise error scores, and we propose an essentially exact dynamic-programming solver for this problem. This algorithm is extremely efficient: it has complexity linear in the number of layers times the number of possible sparsity choices per layer, and can easily be scaled to very large models.

Second, we address how to reliably determine layer-wise error scores for unstructured sparsity. We first observe that known metrics, e.g. (normalized) weight magnitudes, do not correlate consistently with superior accuracy of the resulting unstructured sparse models. We then introduce a new approach which learns the layer-wise error scores automatically, based on the network's global pruning behavior on a small set of input data. This relaxes the strictly layer-wise problem and thus makes it possible to account for global cross-layer effects, while still utilizing the advantages of the original formulation.
Specifically, SPDY determines good layer-wise error scores via local search, which assesses the quality of profiles determined through our DP algorithm by how well a reconstructed version of the sparse model behaves on calibration data. For sparse model reconstruction, we leverage a new variant of the Ada Prune one-shot pruning technique (Hubara et al., 2021a). We apply SPDY to determine optimal layer-wise compression levels for a sparsity-aware CPU inference engine (Neural Magic, 2021), both for one-shot and gradual pruning. We first show that SPDY sparsity profiles can be used to compress a larger model, e.g. Res Net101, to match the inference speed of a smaller model, e.g. Res Net50, in a single compression step, without finetuning, and still maintain an accuracy advantage. We then show that the layer-wise sparsity targets found by SPDY result in models with significantly better accuracy-vs-speed trade-offs than existing baselines, also when applying state-of-the-art gradual pruning methods, as shown in Figure 1. Further, SPDY can be used in conjunction with quantization-aware training. As a second application, we consider GPU acceleration, where we propose an enhancement of the Ada Prune method (Hubara et al., 2021a), which we call global Ada Prune (g AP). We apply g AP to generate models with the GPU-supported 2:4 sparsity pattern in the post-training pruning setting of (Hubara et al., 2021a), where only a small amount of data is available for re-calibrating the model. Our extension significantly outperforms the original Ada Prune: the g AP models with 2:4 sparsity have higher accuracy than Ada Prune models imposing the much less stringent 4:8 sparsity pattern. In sum, our work introduces two new techniques for accuracy-aware acceleration of deep learning models, applicable to both CPU and GPU-based inference, which can produce state-of-the-art results in both settings. While we focused on unstructured and semi-structured pruning in our experiments, our approaches should be directly extensible to other settings. We provide efficient implementations of our methods at https://github.com/ IST-DASLab/spdy. 2. Related Work Existing pruning techniques range from simple approaches like gradual magnitude pruning (Hagiwara, 1994; Zhu & Gupta, 2017), which periodically drops the fraction of the weights with lowest magnitude, followed by model finetuning, to dynamic techniques like Soft Threshold Reparametrization (Kusupati et al., 2020), Movement Pruning (Sanh et al., 2020), or Rigging the Lottery (Evci et al., 2020), which adapt mask selection during training itself. Surprisingly, properly-tuned gradual magnitude pruning is often competitive with more complex methods (Singh & Alistarh, 2020; Evci et al., 2020; Frantar et al., 2021; Peste et al., 2021). None of the above pruning techniques specifically optimize for fast inference. However, most can be easily modified to work with layer-wise sparsity targets, and are thus complementary to the methods we introduce; we illustrate this in our experiments by employing a range of different pruning techniques for accuracy recovery. SPDY: Accurate Pruning with Speedup Guarantees While structured pruning is primarily concerned with inference speed, many works focus instead on pruning FLOPs. Yet, Liu et al. (2021) found that this is sometimes less correlated with real speedups than just reducing model size. This highlights the need to operate directly on timing data. 
Related structured pruning works are Constraint-Aware Importance Estimation (CAIE) (Wu et al., 2020) and Knapsack Pruning (KP) (Aflalo et al., 2020). The former greedily ranks filters by considering their contributions towards multiple resource constraints, while the latter directly solves for the filters to select under a single constraint, a 0/1 knapsack problem, via dynamic programming. The optimal solution to this knapsack problem can be approximated with a greedy approach similar to single-constraint CAIE or (Yang et al., 2020). However, in our constrained optimization problem (see Section 3.1) the choices, i.e. the sparsity levels, are more than binary per item /layer, and in practice layer speedups under sparsity are also non-linear, which makes an accurate approximation of the optimal solution more difficult. Furthermore, we show that the weight magnitudes and individual loss impact metrics, used by CAIE and KP, are not reliable for fine-grained unstructured pruning under constraints. Instead, our approach learns the error metrics automatically, allowing it to also inject global information into the layer-wise problem, which we find to be crucial for reliably finding good higher speedup solutions. The idea of learning the layer sensitivities is related to Automated Model Compression (AMC) (He et al., 2018) which uses one-shot pruning performance as a proxy to train a reinforcement learning agent that predicts layer-wise (structured) sparsity targets. The idea of performing layer reconstruction via linear regression, introduced by (He et al., 2017) and further explored by (Evci et al., 2018), is a precursor to the Ada Prune method, upon which our accuracy evaluation is based. However, we directly solve for the speedup constraint with an efficient algorithm, whereas AMC enforces the constraint only implicitly by restricting the action space of the agent. Yet, this and similar reinforcement learning approaches (Ashok et al., 2018) suffer from high tuning complexity, making them difficult to apply to new models, and have not found widespread adoption. In contrast, we demonstrate how SPDY, with a fixed set of hyper-parameters, works reliably across tasks and models. Another related research direction is speedup-aware model quantization, in the form of recent approaches like Ada Quant (Hubara et al., 2021b), HAWQv3 (Yao et al., 2021) and BRECQ (Li et al., 2021). These approaches solve constrained optimization problems to identify optimal perlayer bit-widths. A key simplifying feature is that these problems only have a very small number of possible choices per layer, as there are few default precision levels. Thus, the resulting problem can be solved fast without custom algorithms. This is not the case for unstructured pruning, where there are significantly more pruning choices. In addition, methods for accurate post training quantization , i.e. with little available training data, like Ada Quant and Ada Round (Nagel et al., 2020), inspired the pruning equivalent Ada Prune (Hubara et al., 2021a). These techniques all perform a layer-wise optimization of the compressed weights to produce outputs as close as possible to the original model. We use Ada Prune as a basis for our one-shot pruning approach, but also extend it significantly. 
In general, from the perspective of model compression under a target constraint, SPDY can be seen as a fusion of global-search-based approaches like AMC and layer-wise constraint-solver methods like AdaQuant, combining the respective advantages of both schemes.

3.1. Pruning for Speed: The Abstract Problem

There are two core problems when trying to search for a fast unstructured sparse model: (a) each individual weight (of which there are typically many millions) can either be dropped or kept, resulting in an enormous search space, and (b) the change in execution time from deleting a single weight is practically too small to be measured. The former challenge makes directly extending filter-level optimization approaches (Aflalo et al., 2020) for structured pruning difficult, while the latter requires ranking-based approaches (Wu et al., 2020) to rely on usually-inaccurate FLOP proxies (Liu et al., 2021) instead of actual timings.

However, both of these problems can be circumvented by leveraging the fact that the unstructured sparsity masks produced by established pruning methods have close to random structure (Gale et al., 2020). Thus, unstructured acceleration techniques cannot rely on specific patterns in the layer sparsity, and have similar performance for masks of the same layer with the same sparsity level. Consequently, we can reduce the overall problem of pruning for speed to identifying target sparsity values $s_\ell$ for each layer $1 \le \ell \le L$. Further, we can accurately estimate the runtime of such a sparsity profile by simply imposing the corresponding sparsities with random masks, without actually running any pruning algorithms. Yet, it is critical to explicitly time different layers and sparsity levels, since acceleration rates due to unstructured sparsity are typically non-linear: at low sparsity, sparse execution may be slower than dense execution, while speedup curves tend to flatten at high sparsities.

We wish to solve the problem of finding the best sparsity profile, in the sense of yielding the smallest possible model accuracy drop, while matching a threshold execution time $T$. To make this problem tractable, we require additional approximations. First, we assume that the overall execution time is given by the sum of the individual layer runtimes $t_\ell^{s}$ for the chosen sparsities. This is not always exact, since inference runtimes may perform layer fusion, but it is generally a good estimate, as shown e.g. in (Cai et al., 2018). Next, we assume that pruning a layer $\ell$ to sparsity $s$ ultimately incurs some model error $e_\ell^{s}$, and that those errors are additive. This is a strong assumption, but one which is common in the literature (Yao et al., 2021; Hubara et al., 2021b; Aflalo et al., 2020). Finally, we assume that the set of sparsity choices $S$ is discrete, which is reasonable in practice, as very small sparsity increments usually only lead to negligible performance gains. Under these assumptions, we can state the following constrained optimization problem:

$$\min_{s_1, \dots, s_L \in S} \; \sum_{\ell=1}^{L} e_\ell^{s_\ell} \quad \text{s.t.} \quad \sum_{\ell=1}^{L} t_\ell^{s_\ell} \le T. \tag{1}$$

This problem can be formulated as an integer linear program (ILP) and solved with specialized software (Yao et al., 2021; Hubara et al., 2021b). However, since each additional option per layer adds $L$ variables to the ILP, whose solving time usually increases exponentially in the number of variables, running an off-the-shelf solver would be prohibitively slow for any model of interest, as a relatively fine-grained $S$ is needed.
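The runtime estimation described above only requires imposing random masks of each sparsity level and timing each layer (this corresponds to Algorithm 2 in Appendix 8). The sketch below illustrates one way to do this; it assumes a user-supplied `time_layer` benchmarking callback (e.g. one that exports the masked layer to the target inference engine and measures it), and the function and parameter names are illustrative, not part of the released implementation.

```python
import numpy as np

def random_mask(shape, sparsity, rng):
    """0/1 mask with the requested fraction of zeros at random positions."""
    mask = np.ones(int(np.prod(shape)))
    num_zeros = int(round(sparsity * mask.size))
    zero_idx = rng.choice(mask.size, size=num_zeros, replace=False)
    mask[zero_idx] = 0.0
    return mask.reshape(shape)

def collect_layer_timings(layer_shapes, sparsities, time_layer, seed=0):
    """timings[name][s]: measured runtime of layer `name` at sparsity s.

    `layer_shapes` maps layer names to weight shapes and `time_layer(name, mask)`
    is assumed to benchmark one layer with the given mask on the target
    engine / hardware combination.
    """
    rng = np.random.default_rng(seed)
    timings = {}
    for name, shape in layer_shapes.items():
        timings[name] = {}
        for s in sparsities:
            mask = random_mask(shape, s, rng)
            timings[name][s] = time_layer(name, mask)
    return timings
```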
We note that the execution time target $T$ corresponding to a speedup target $X$ can be calculated as $T = T_{\text{dense}} / X - T_{\text{base}}$, where $T_{\text{dense}}$ and $T_{\text{base}}$ are the original dense runtime and the total runtime of operations that are unaffected by pruning, respectively.

3.2. Efficiently Solving the Optimization Problem

In its most general form, the constrained optimization problem stated in Equation (1) is NP-hard and thus (most probably) requires exponential time to solve. However, if time is integer-valued then, as we will show, the problem is actually solvable efficiently in time $O(|S| \cdot L T)$ with $O(LT)$ memory. In our use-case, we can discretize time into $B$ buckets of width $T/B$, which means that $T = B$. Since we will be interested in large enough discretizations $B$, e.g. $B = 10^4$, the inherent randomness in the individual timings will usually exceed the error incurred due to discretization, thus making this extra approximation negligible.

The key to efficiently solving the discrete version of problem (1) is the following observation: the lowest possible error achievable in the first $\ell$ layers while taking exactly time $t$ to execute and choosing sparsity $s$ at layer $\ell$, denoted by $E_\ell^t(s)$, is the error caused by sparsity $s$, i.e. $e_\ell^s$, plus the minimum error achievable with all $\ell - 1$ previous layers while taking exactly time $t$ minus the time that the choice of sparsity $s$ takes at layer $\ell$. The dependence of $E_\ell^t(s)$ on $s$ can then be eliminated by simply taking the minimum, leading to the following recursion:

$$E_\ell^t = \min_{s \in S} \left[ E_{\ell-1}^{\,t - t_\ell^s} + e_\ell^s \right] \tag{2}$$

$$E_1^t = \begin{cases} \min_{s \in S'} e_1^s & \text{if } S' = \{ s \in S \mid t_1^s = t \} \neq \emptyset \\ \infty & \text{otherwise.} \end{cases} \tag{3}$$

Using dynamic programming (DP), i.e. by caching intermediate values, the final solution $\min_{t \le T} E_L^t$ can be computed efficiently by taking the minimum over $|S|$ options at each of the $LT$ values $E_\ell^t$. This leads to the linear memory and compute costs claimed previously. Appendix 7 shows a complete bottom-up DP implementation, including the bookkeeping needed to reconstruct the optimal sparsity profile.

In practice, this approach is highly efficient. For example, when $T = 10^4$, $|S| = 42$ and $L = 52$ (a ResNet50 configuration), the DP algorithm finds the optimal solution in less than 100 milliseconds on a CPU, and it scales linearly in each parameter individually. Finally, we emphasize that this method could be used to solve any kind of layer-wise optimization problem written as (1). This includes, for instance, non-uniform layer quantization (Hubara et al., 2021b; Yao et al., 2021), where the choices are quantization levels, low-rank decomposition (Liebenwein et al., 2021), where the choices are simply ranks, or gradient compression (Markov et al., 2021), where the choices are the gradient bit-width. The challenge in adapting the procedure to a new setting is to find a robust and accurate layer scoring metric. Next, we derive such a metric for unstructured pruning.
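For concreteness, the recursion in Eqs. (2)-(3) can be implemented in a few lines. The sketch below (NumPy; the function name and data layout are ours) keeps an $L \times (T+1)$ error table plus a parallel table of chosen sparsity indices for backtracking, and vectorizes the inner minimization by shifting the previous row by each choice's runtime, as also noted in Appendix 7. It assumes the time budget is attainable.

```python
import numpy as np

def spdy_dp(errors, times, T):
    """Minimize the summed layer errors subject to a total (discretized) time <= T.

    errors[l][j], times[l][j]: error and integer runtime of layer l under sparsity
    choice j; T: integer time budget (e.g. B = 10**4 buckets).
    Returns (best_error, choices) with one sparsity index per layer.
    """
    L = len(errors)
    D = np.full((L, T + 1), np.inf)        # D[l, t]: best error of layers 0..l taking exactly time t
    P = np.zeros((L, T + 1), dtype=int)    # chosen sparsity index, for backtracking

    for j, (e, t) in enumerate(zip(errors[0], times[0])):
        if t <= T and e < D[0, t]:
            D[0, t], P[0, t] = e, j
    for l in range(1, L):
        for j, (e, t) in enumerate(zip(errors[l], times[l])):
            if t > T:
                continue
            # candidate errors for all total times, obtained by shifting the previous row by t
            cand = D[l - 1, : T + 1 - t] + e
            dest = D[l, t:]
            better = cand < dest
            dest[better] = cand[better]
            P[l, t:][better] = j

    t = int(np.argmin(D[L - 1]))
    best = D[L - 1, t]
    choices = []
    for l in range(L - 1, -1, -1):
        j = int(P[l, t])
        choices.append(j)
        t -= times[l][j]
    return best, choices[::-1]
```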
3.3. Learning the Error Metric

For the optimal solution of optimization problem (1) to be useful, we need meaningful error metric values $e_\ell^s$. However, especially in the context of unstructured pruning, it is quite challenging to define a general such metric which works well across different DNNs. This is illustrated by Table 1, which compares the one-shot accuracy of profiles generated via the DP algorithm in combination with common metrics such as squared weights (Yao et al., 2021), squared weights normalized (Liu et al., 2021) or the layer-wise loss change (Hubara et al., 2021b). More details and additional experiments can be found in Appendix 10.

While designing a robust and precise pruning error metric from first principles is an interesting direction for future work, the alternative we propose is to circumvent this issue by learning an appropriate metric automatically, which we find to consistently outperform manual options, as shown in Table 1. This also has the additional advantage that global information, e.g. that two consecutive layers should not both be heavily pruned, is integrated as well. We emphasize that this approach is only enabled by the high efficiency of our DP algorithm, since in order to reliably learn a metric, we have to solve the constrained optimization problem a large number of times.

Table 1. One-shot accuracy comparison of SPDY search with DP using several common error metrics.

| Method | ResNet34 (2.50×) | ResNet34 (3.50×) | BERT (2.50×) | BERT (3.50×) |
|---|---|---|---|---|
| Uniform | 64.65 | 42.38 | 57.42 | 09.00 |
| DP + squared | 54.18 | 31.59 | 71.20 | 07.88 |
| DP + squared norm. | 56.32 | 09.15 | 58.77 | 06.17 |
| DP + layer-wise loss | 65.92 | 46.11 | 30.59 | 06.13 |
| SPDY | 68.23 | 51.68 | 74.59 | 18.83 |

Generally, the difficulty of pruning a layer depends on how sparse the layer already is and on how much of the remaining weights is to be pruned: pruning 10% of the remaining weights will be easier than pruning 20%, and pruning a layer that is 50% sparse will be a lot easier than if the layer is 90% sparse. This suggests that we should view the error in log-sparsity space, i.e. in steps of pruning $\delta$ percent of the remaining weights, and, further, that the error still increases considerably faster than linearly in this parametrization. For this reason, we suggest the simple quadratic error model for each layer shown in (4), where the scalar coefficient $c_\ell$ controls how quickly the error increases, i.e. it represents the sensitivity of the corresponding layer, and $i \in \{0, \dots, |S| - 1\}$. We choose a quadratic approximation because this is the simplest function with the intuitive properties discussed above. In addition, the sensitivity coefficients can be interpreted as the curvature of the errors.

$$e_\ell^s = c_\ell \left( \frac{i}{|S| - 1} \right)^2, \qquad s = 1 - (1 - \delta)^i. \tag{4}$$

Given layer-wise timings $t_\ell^s$ for all sparsity levels, and using the error definition (4), the DP algorithm will produce a valid profile with the desired speedup for any sensitivity coefficient vector $c = (c_1, \dots, c_L) \in [0, 1]^L$. This means that we have reduced our original constrained problem to the unconstrained optimization problem of finding a $c$ for which the DP algorithm returns a good sparsity profile with the target speedup. We now need an efficient procedure to determine how good a given profile is, which we discuss in Section 3.4. Assuming such a procedure is available, we can then optimize $c$ with some heuristic optimization method, e.g. local search.

It may seem that this reformulation has not led to significant progress: the goal remains to find one choice per layer (now a sensitivity rather than a sparsity) which ultimately yields a good profile. However, we have actually eliminated the speedup constraint, which is automatically taken care of by the embedded DP algorithm. Further, various useful priors are directly enforced in a principled way, e.g. layers with poor acceleration are only pruned strongly for low sensitivity values, and high sparsities are chosen only if they are required to reach the target speedup.
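The reparametrized search therefore only has to propose sensitivity vectors $c$; the corresponding error table for the DP solver follows directly from Eq. (4). A small sketch (our naming; `spdy_dp` refers to the DP sketch shown after Section 3.2):

```python
import numpy as np

def quadratic_errors(c, num_choices):
    """Error model of Eq. (4): e_l^{s_i} = c_l * (i / (|S| - 1))**2.

    c is the per-layer sensitivity vector in [0, 1]^L; the choice index i counts
    steps in the ordered sparsity grid S, so the error grows quadratically with
    how aggressively a layer is pruned, scaled by the layer's sensitivity.
    """
    steps = (np.arange(num_choices) / (num_choices - 1)) ** 2
    return np.outer(np.asarray(c), steps)   # shape (L, |S|)

# Example usage with the earlier DP sketch:
# errors = quadratic_errors(c, num_choices=len(times[0]))
# _, profile = spdy_dp(errors, times, T)
```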
As shown by a comparison with a direct search (without reparametrization) and with a genetic programming method (Guo et al., 2020) (Figure 2), the SPDY reparametrization leads to better solutions with fewer model evaluations and reduced variance. Appendix 11 provides more details and additional analysis.

Figure 2. Comparison of SPDY with a direct search and genetic programming (calibration loss versus number of model evaluations, ResNet50 at 2.50× speedup).

For the heuristic coefficient optimization, we have found the following randomized neighborhood-shrinking local search to work well, and therefore use it in all our experiments.

1. Sample 100 random candidates for $c$ and choose the one with the best resulting profile as the initial $c^*$.
2. Copy $c^*$ while uniformly resampling $k = 0.1\,L$ random elements, and replace the original $c^*$ with the copy if the resulting profile is better. Stop the process if there was no improvement in 100 trials.
3. Decrement $k$ by 1 and repeat step 2. If $k = 0$, return the current $c^*$ as the final solution.

3.4. Quickly Assessing the Quality of a Sparsity Profile

We are now left with the problem of efficiently evaluating the quality of a given sparsity profile, that is, predicting the accuracy of the resulting sparse model after fine-tuning. Intuitively, the main issue here is that, while applying a single pruning step, e.g. removing a fraction of weights by magnitude, may be very fast, pruned model accuracy collapses dramatically even at moderate sparsities if pruning is applied in one shot, and may not correlate well with model accuracy after fine-tuning. Recently, approaches such as second-order pruning (Singh & Alistarh, 2020; Frantar et al., 2021), as well as AdaPrune (AP) (Hubara et al., 2021a), have dramatically improved one-shot pruning performance, mostly by adapting the remaining unpruned weights to reduce the loss increase due to the weights being removed at a step. We will leverage this idea to address our estimation problem, in particular via the AdaPrune approach. Specifically, assuming that the layer $\ell$ output function is $f_\ell(X, W)$, where $X$ are the sample inputs of a small calibration data set and $W$ are the weight values, AP optimizes the sparse weights $W^s$ remaining after pruning to best reconstruct the functionality of the original dense weights $W$, independently for each layer, by minimizing the squared error between the dense and the sparse layer output:

$$\operatorname{argmin}_{W^s} \; \lVert f_\ell(X, W) - f_\ell(X, W^s) \rVert_2^2, \quad \text{for layer } \ell. \tag{5}$$

We leverage AdaPrune to evaluate the quality of a profile by building a layer reconstruction database. This database stores, for each layer $\ell$ and each sparsity $s$, the reconstruction of the remaining weights after sparsity $s$ has been imposed via AdaPrune. Then, we can check the quality of an arbitrary non-uniform sparsity profile by performing two steps. First, we query the database for the corresponding reconstructed weights of each layer, each at its target sparsity. Second, we stitch together the resulting model from the reconstructed weights and evaluate it on a given small validation set. In practice, we use the same data for validation as for AP; similarly to (Hubara et al., 2021a), we do not observe any overfitting. The loss of the model on this calibration data is a proxy for the quality of the sparsity profile.
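To make the evaluation step concrete, the sketch below is a simplified, PyTorch-style stand-in for the per-layer reconstruction of Eq. (5) and for building the reconstruction database, including the iterative bootstrapping discussed in the following paragraphs. It is not the authors' AdaPrune implementation; the magnitude-based mask selection and the hyper-parameters are illustrative.

```python
import torch

def reconstruct_layer(layer, X, sparsity, steps=1000, lr=1e-3):
    """Simplified layer-wise reconstruction in the spirit of Eq. (5):
    magnitude-prune to the target sparsity, then adapt the surviving weights
    so the sparse output matches the dense output on calibration inputs X."""
    with torch.no_grad():
        y_dense = layer(X)                          # dense reference outputs
        W = layer.weight
        k = int(round(sparsity * W.numel()))
        if k > 0:
            thresh = W.abs().flatten().kthvalue(k).values
            mask = (W.abs() > thresh).float()       # drop the k smallest-magnitude weights
        else:
            mask = torch.ones_like(W)
        W.mul_(mask)
    opt = torch.optim.Adam([layer.weight], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((layer(X) - y_dense) ** 2).mean()
        loss.backward()
        opt.step()
        with torch.no_grad():
            layer.weight.mul_(mask)                 # keep pruned weights at zero
    return mask

def build_database(layer, X, sparsities):
    """Reconstruction database for one layer: each sparsity level starts from
    the reconstructed weights of the previous, lower level (iterative variant)."""
    database = {}
    for s in sorted(sparsities):
        reconstruct_layer(layer, X, s)
        database[s] = layer.weight.detach().clone()
    return database
```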
Our results show that this reconstruction database approach provides significant improvements when estimating the quality of a given profile relative to previous approaches, such as measuring the loss directly after one-shot magnitude pruning (He et al., 2018). Yet, we found that we can still enhance its accuracy (see Appendix 5 for experiments), especially when targeting high sparsity values, by performing this estimation iteratively. Specifically, we start from the observation that solving the AP optimization problem in (5) can significantly alter the magnitudes of the unpruned weights. Practically, the weights chosen to be pruned if we applied a single high-sparsity step, e.g. 0% to 60%, could be quite different from those that would be selected if we pruned in several smaller increments, of e.g. 6 steps of 10%, after each of which we update the remaining weights. Following this intuition, our method performs iterative pruning via the Ada Prune approach, where the target sparsity is reached in several smaller pruning steps with AP optimization in between. For the reconstruction database generation, this just means that we bootstrap the pruning process of the next sparsity level with the result of the previous one, which incurs no extra cost. Finally, we illustrate the key role of the reconstruction database for SPDY in Appendix 10. 3.5. The SPDY Method: Overview We now summarize the full SPDY method which is the combination of all the techniques discussed so far. A visual summary is given by Figure 3 and corresponding pseudo code can be found in Appendix 8. Our system takes as input a target execution time T, which is easily calculated from a target speedup X (see Section 3.1), and a set of possible sparsity choices S. We then generate timing data for each layer and each sparsity choice. Additionally, we precompute a reconstruction database for fast and accurate one-shot pruning (see Section 3.4). The output sparsity profile is then determined by a cyclic search procedure which finds sensitivity coefficients that impose a layer-wise error metric (see Section 3.3). Together with the previously computed timing data, these error values are passed to a dynamic programming solver (see Section 3.2) which determines a sparsity profile with target execution time T and minimal total error. The actual quality of this profile is then determined by stitching together layers from the reconstruction database (see Section 3.4) and computing the loss of the composite model on a small calibration set. The search procedure (see Section 3.3) attempts to minimize this loss by adapting the layer-wise sensitivity coefficients and ultimately outputs the sparsity profile determined by the DP algorithm when applied to the best found sensitivities. 3.6. Post Training Pruning via global Ada Prune (g AP) An appealing but challenging setup is post training quantization (Nagel et al., 2020; Hubara et al., 2021b), where acceleration should be done with a small amount of calibration data, without any finetuning. The pruning version of this problem is an interesting target for SPDY, since architectures such as AMD CPUs do not have quantization support, but have good speedups even at moderate sparsities. Hubara et al. (2021a) showed the first post training pruning results with acceptable accuracy drops when applied to the 4:8 semi-structured sparsity pattern using their Ada Prune (AP) approach. However, for higher sparsities, the accuracy drop of AP becomes too large to be directly useful. 
To address this, we introduce an extension of AP which we call global AdaPrune, or gAP for short, which boosts AP accuracy and thereby extends the practicality of post-training pruning. We note that this technique is orthogonal to the SPDY method introduced in the previous sections, but can be very useful as an additional step for post-training applications. The main motivation behind gAP is that standard AP optimizes each layer independently, thus not taking compounding errors into account, and also not considering that a slightly larger error on one layer might allow significantly reducing the error on another one. Thus, we complement AP with an optimization process over the small calibration set which (globally) optimizes the full model, in order to minimize the sum of relative layer-wise errors. We normalize the error of each layer by the squared magnitude of the output of the original dense model. This is important to ensure that all layers influence the objective equally. Specifically, let $f_\ell(X_\ell, W_\ell)$ be the output of layer $\ell$ in the original dense model and let $f_\ell(X_\ell^s, W_\ell^s)$ be the output of layer $\ell$ in the sparse model. Then the gAP loss is written as (6) and can be minimized by gradient-based optimization:

$$\mathcal{L}_{\text{gAP}}(W^s) = \sum_{\ell=1}^{L} \frac{\lVert f_\ell(X_\ell, W_\ell) - f_\ell(X_\ell^s, W_\ell^s) \rVert_2^2}{\lVert f_\ell(X_\ell, W_\ell) \rVert_2^2}. \tag{6}$$
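A minimal sketch of the objective in Eq. (6), assuming the per-layer outputs of the dense and of the stitched sparse model have already been captured (e.g. with forward hooks, which are omitted here). In the actual procedure only the unpruned weights are updated, so the sparsity masks would be re-applied after every optimizer step and batch norms kept frozen; the function name is ours.

```python
import torch

def gap_loss(dense_outputs, sparse_outputs, eps=1e-12):
    """Global AdaPrune loss of Eq. (6): per-layer squared output error,
    normalized by the squared norm of the dense output so that every layer
    contributes on the same scale, summed over all layers."""
    loss = dense_outputs[0].new_zeros(())
    for y_dense, y_sparse in zip(dense_outputs, sparse_outputs):
        num = ((y_sparse - y_dense) ** 2).sum()
        den = (y_dense ** 2).sum().clamp_min(eps)
        loss = loss + num / den
    return loss
```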
4. Experiments

Setup. We now describe our experimental setup. In all our experiments, we use the same set of sparsity targets for each layer, $S = \{0\} \cup \{\,1 - (1 - 0.4)\,\delta^i \mid i = 0, \dots, 40\,\}$ with $\delta = ((1 - 0.99)/(1 - 0.4))^{1/40}$. That is, we either set a layer to dense, or to one of 41 sparsity levels, each of which prunes an additional 10% of the remaining weights. The 0.4 sparsity lower bound is the minimum sparsity at which the inference runtime provides some acceleration over dense execution, while 0.99 is an upper limit to prevent model breakdown. For time discretization, we always use $B = 10^4$ buckets as individual units of time. For experiments on ImageNet (Deng et al., 2009) we follow (Hubara et al., 2021a) by defining the calibration set for AP, gAP and the profile search to contain exactly one randomly-selected training image per class. For other tasks, we select 1000 training samples at random for the calibration set. The reconstruction database generation performs 10 epochs of optimization over this calibration set, using Adam (Kingma & Ba, 2015) with batch size 32 and learning rate $10^{-3}$ per sparsity level, while gAP runs for 100 epochs with learning rate $10^{-5}$ and frozen batch norms. The profile search follows Section 3.3. With these settings, applying SPDY takes, for instance, 16 min (13 min database generation + 3 min search) for ResNet18 or 51 min (29 + 23) for ResNet50. This is executed on a single NVIDIA 3090 GPU, and can be significantly optimized.

Figure 3. A visual overview of the full SPDY method (components: layer-wise timings, layer-wise errors derived from the sensitivities, dynamic programming solver, reconstruction database, calibration data, shrinking-neighborhood local search, and the output sparsity profile for the target time).

We measure speedups and execute inference on the publicly-available DeepSparse v0.9.1 CPU inference engine (Neural Magic, 2021; Kurtz et al., 2020), which is competitive when executing dense models with the standard ONNX and OpenVINO runtimes, but can additionally leverage unstructured sparsity for speedup. DeepSparse is currently the most mature CPU engine with sparsity support; yet, we emphasize that alternatives exist (Elsen et al., 2020; Gale et al., 2020), and that our techniques are independent of the runtime.

We compare against two baseline profiles: the uniform profile of minimal sparsity which matches the target execution time, and the profile determined by pruning with global magnitude pruning (GMP) until the target speedup is reached. Both are direct speedup-aware adaptations of widely-used pruning strategies, which result in reasonable profiles without manual tuning (He et al., 2019; Liu et al., 2021). Additionally, following other works (Elsen et al., 2020; Jayakumar et al., 2021), we always keep the first and the last layer dense, since those typically have disproportionately large effects on the model accuracy while taking only a very small fraction of the total computation time. (The only exception is YOLO, where the 3 output layers are compute-heavy, so we only skip the input.) We emphasize that this helps the baseline profiles (uniform and GMP) significantly. (For instance, Peste et al. (2021) report 73.14% accuracy for a 95% sparse ResNet50 versus 74.16% for one with first and last layer skipped.) This choice has little effect on SPDY, since our method generally detects the compute/sensitivity imbalance automatically and skips those layers. We also experimented with a GMP version that normalizes magnitude scores by the corresponding FLOPs. However, this did not provide improvements without model-specific manual tweaking of layer-wise normalizers; e.g. on ResNet50 it would prune all the early layers to 99% sparsity. To obtain a consistent AP reconstruction database and timing data, we round our baselines to the fixed sparsities in $S$.

There are some unstructured pruning methods which relate their results to speed/efficiency: STR (Kusupati et al., 2020), WoodFisher FLOPs (Singh & Alistarh, 2020) and AMC (He et al., 2018). The comparison in Figure 4 shows that they all perform similarly to or worse than our simpler uniform and GMP baselines. Additionally, some of these methods do not provide mechanisms to control the target speedup. Thus, we focus on comparisons with uniform and GMP profiles.

Figure 4. Comparison with results of speedup-related unstructured pruning methods on ResNet50 and MobileNetV1 in terms of relative accuracy drop.

Layer-wise timings for the AMD system are collected on an Amazon AWS c5a.8xlarge machine with 16 cores, while for Intel CPUs we use a c5.9xlarge server with 18 cores. The listed speedups are for batch size 64, except for BERT (Devlin et al., 2019), which uses batch size 16. The speedups given in the tables below are calculated in terms of the layer-wise timing data, which usually slightly underestimates the real speedups when running the engine with all extra optimizations turned on. This allows comparing different profile generation methods in isolation from highly engine-specific timing inaccuracies, and will also make it easy for future work to directly compare with our results using the published timing data. We provide real speedup information for our most interesting profiles in Appendix 13, which demonstrates that the layer-wise approximation is indeed quite accurate in most cases.
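For reference, the uniform baseline just described can be computed directly from the timing data. A small sketch (our naming) scans the sparsity choices from dense to most sparse and returns the first level whose summed layer timings meet the budget; the GMP baseline is analogous, except that per-layer sparsities come from raising a single global magnitude threshold (rounded to the grid $S$) until the timing budget is met.

```python
def uniform_baseline(times, budget):
    """times[l][j]: runtime of layer l at sparsity choice j (choices ordered
    from dense to most sparse). Returns the lowest uniform choice index whose
    total runtime fits the budget, or the sparsest one if none does."""
    num_layers, num_choices = len(times), len(times[0])
    for j in range(num_choices):
        total = sum(times[l][j] for l in range(num_layers))
        if total <= budget:
            return [j] * num_layers
    return [num_choices - 1] * num_layers
```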
Post Training Pruning. We begin in the post-training pruning setting, and consider the torchvision (Marcel & Rodriguez, 2010) implementation of ResNets (He et al., 2016) running on an AMD CPU. These models are popular, and they can be well accelerated by DeepSparse. We report performance directly after stitching the model from the AP database (labelled AP), as well as after a subsequent run of global AP (labelled +gAP). We consider the use-case of pruning each network to approximately the same inference speed as the next smallest variant available in torchvision (e.g. ResNet101 to ResNet50 speed, see Table 2).

Table 2 clearly shows that ResNets post-training pruned with SPDY and global AdaPrune always exceed the accuracy of the smaller dense variant with approximately the same speed. In several cases, the difference is significant, e.g. 1.7% Top-1 for RN34 or 1.5% for RN18. Further, the SPDY profile almost always performs best. The only exception is the very large ResNet152 with a low speedup target, where GMP performs 0.1% better. Meanwhile, the SPDY profile outperforms the baselines by 1% on RN50 and even by 2.5% on RN50 with doubled width (RN50w2). We get similar results when pruning all models to the same 1.5× speedup (see Appendix 12), but with smaller relative accuracy drops. Note that uniform pruning outperforms GMP at smaller models and higher speedups, while the situation is reversed for larger models and lower speedups.

Table 2. Comparing profiles when post-training pruning ResNet models to the same speed as their next smallest dense counterpart.

| Dense baseline (Top-1) | Pruned model (speedup) | SPDY AP | SPDY +gAP | Uniform AP | Uniform +gAP | GMP AP | GMP +gAP |
|---|---|---|---|---|---|---|---|
| RN18 (69.76) | RN34 (1.80×) | 69.84 | 71.34 | 68.09 | 71.10 | 63.78 | 66.83 |
| RN34 (73.31) | RN50 (1.60×) | 74.39 | 75.01 | 71.60 | 74.05 | 72.41 | 73.88 |
| RN50 (76.13) | RN101 (1.80×) | 76.00 | 76.61 | 73.55 | 75.64 | 75.42 | 76.32 |
| RN101 (77.37) | RN152 (1.45×) | 77.49 | 77.82 | 76.46 | 77.40 | 77.42 | 77.91 |
| RN50 (76.13) | RN50w2 (2.45×) | 75.32 | 77.08 | 67.47 | 74.51 | 69.78 | 73.64 |

Compounding Pruning and Quantization. Quantization is an alternative compression technique, which is popular on both GPUs and end devices (Gholami et al., 2021), as well as on high-end Intel CPUs. The DeepSparse runtime supports both techniques with speedup in complementary fashion. To leverage this, we set a pruning speedup target of 2× for ResNet50, and then run SPDY vs. uniform pruning in a post-training setting, followed by Quantization Aware Training (QAT) (Nagel et al., 2021) for 6 epochs to quantize weights and recover additional accuracy. Quantization provides an additional 2.6× speedup, so the end model is 5.2× faster than the dense full-precision version, on an AWS c5.12xlarge instance. SPDY's accuracy improvement after pruning is clear: we obtain 73.33% vs. 71.94% top-1 accuracy, for the semi-structured pruning pattern with blocks of 4 consecutive zeros required by DeepSparse. This directly leads to improved 75.00% vs. 74.33% accuracy after the quantization-aware finetuning stage, in favor of SPDY.

Post Training N:M Sparsity. To gain additional insight into the accuracy difference between our global layer reconstruction approach and existing methods, we investigate the performance of global AdaPrune in the context of post-training pruning models to full N:M sparsity, a type of semi-structured sparsity that can be well accelerated on newer NVIDIA GPUs. Hubara et al. (2021a) were the first to demonstrate that achieving N:M sparsity in the post-training setting with acceptable accuracy loss is possible, using AdaPrune and batchnorm tuning (A+B). Unfortunately, their results are somewhat theoretical, as they only targeted 4:8 sparsity, which to our knowledge is not yet supported by hardware. We now show that by applying global AP (after standard AP), we can exceed their 4:8 accuracy results with the significantly more challenging 2:4 sparsity pattern that is already well supported by current NVIDIA GPUs. A comparison for the ResNet model family can be found in Table 3.

Table 3. AdaPrune (AP) and global AdaPrune (gAP) for post-training pruning to 2:4 and 4:8 sparsity; A+B denotes AdaPrune plus batchnorm tuning.

| Model | Dense | 2:4 AP | 2:4 +gAP | 4:8 AP | 4:8 A+B | 4:8 +gAP |
|---|---|---|---|---|---|---|
| RN18 | 69.76 | 67.64 | 68.76 | 68.32 | 68.63 | 69.05 |
| RN34 | 73.31 | 71.81 | 72.59 | 72.33 | 72.36 | 72.84 |
| RN50 | 76.13 | 73.61 | 75.00 | 74.44 | 74.75 | 75.43 |
| RN101 | 77.37 | 75.70 | 76.75 | 76.23 | 76.48 | 76.88 |
| RN152 | 78.31 | 76.75 | 77.75 | 77.36 | – | 77.91 |
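For completeness, the mask-selection step for the 2:4 pattern is straightforward; the sketch below (our code, for a 2-D weight matrix such as a linear layer) keeps the two largest-magnitude weights in every group of four consecutive entries, after which AdaPrune / global AdaPrune reconstruction would proceed as in Sections 3.4 and 3.6.

```python
import torch

def two_four_mask(W):
    """Magnitude-based 2:4 mask for a 2-D weight matrix: in every group of
    four consecutive weights along the last dimension, keep the two entries
    of largest absolute value."""
    rows, cols = W.shape
    assert cols % 4 == 0, "input dimension must be divisible by 4"
    groups = W.abs().reshape(rows, cols // 4, 4)
    top2 = groups.topk(2, dim=-1).indices          # two largest-magnitude entries per group
    mask = torch.zeros_like(groups)
    mask.scatter_(-1, top2, 1.0)
    return mask.reshape(rows, cols)
```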
We note that further improvements might be possible through optimized channel permutations (Pool & Yu, 2021). Additionally, in Appendix 12, we present results for post-training 2:4 pruning of YOLOv5 models and BERT.

Gradual Pruning Results. We now present our main results, generating speedup profiles and then performing gradual pruning, using state-of-the-art pruning methods under their default parameters, with these layer-wise sparsity targets. We consider the standard ResNet50 model (He et al., 2016), as well as the hard-to-compress MobileNetV1 architecture (Howard et al., 2017; Kusupati et al., 2020), the Small and the Medium versions of the popular YOLOv5¹ (Jocher, 2022) object detector, and the widely used BERT-base for language modelling (Devlin et al., 2019) on the SQuAD dataset (Rajpurkar et al., 2016). ResNet50 is gradually pruned with AC/DC (Peste et al., 2021) for 100 epochs, MobileNetV1 with M-FAC (Frantar et al., 2021) for 100 epochs, and YOLOv5 and BERT with gradual magnitude pruning (using the setup of (Kurtic et al., 2022), but performing a pruning step every 0.01 epochs) for 240 and 30 epochs, respectively. The results are summarized in Table 4.

The SPDY profiles appear to yield more accurate final weights across all models and speedups, often with a large gap to the next best method. At higher speedups, the advantage of SPDY seems to increase further, with 1.4 points better accuracy for ResNet50 and even > 2 points better mAP@0.5 for the YOLO models. In the case of YOLO, it should also be mentioned that the profile generation optimizes not the relatively complex training loss but simply the squared error of the final output to the one of the original dense model. This suggests that our method can be applied successfully even if the calibration set is unlabelled.

¹ We evaluate using the authors' validation script with default parameters and report the pycocotools mAP@0.5.

Table 4. Comparing accuracy metrics for sparsity profiles after gradually pruning models with respective state-of-the-art methods.

| Model | Dense | Speedup | CPU | SPDY | Uniform | GMP |
|---|---|---|---|---|---|---|
| ResNet50 | 76.13 | 2.00× | AMD | 76.39 | 76.01 | 75.85 |
| ResNet50 | 76.13 | 2.50× | AMD | 75.56 | 75.12 | 74.76 |
| ResNet50 | 76.13 | 3.00× | AMD | 74.75 | 74.02 | 73.44 |
| ResNet50 | 76.13 | 3.50× | AMD | 73.06 | 71.62 | 70.22 |
| MobileNetV1 | 71.95 | 1.50× | Intel | 71.38 | 61.33 | 70.63 |
| YOLOv5s | 56.40 | 1.50× | Intel | 55.90 | 54.70 | 55.00 |
| YOLOv5s | 56.40 | 1.75× | Intel | 53.10 | 50.90 | 47.20 |
| YOLOv5m | 64.20 | 1.75× | Intel | 62.50 | 61.70 | 61.50 |
| YOLOv5m | 64.20 | 2.00× | Intel | 60.70 | 58.30 | 57.20 |
| BERT SQuAD | 88.54 | 3.00× | Intel | 88.53 | 88.22 | 87.98 |
| BERT SQuAD | 88.54 | 3.50× | Intel | 87.56 | 87.23 | 87.22 |
| BERT SQuAD | 88.54 | 4.00× | Intel | 86.44 | 85.63 | 85.13 |
| BERT SQuAD (profiles re-generated from the 3.00× pruned model, see text) | 88.54 | 4.00× | Intel | 87.14 | 86.37 | 86.39 |
In some experiments, the improvement via SPDY is relatively small, e.g. for the lower speedup BERT runs we get only 0.3 higher F1. However, we note that a) the performance improvement of SPDY is larger than the difference between uniform and GMP in those cases and b) this occurs after extensive finetuning, which usually narrows accuracy gaps. The fact that these differences are always consistent can be seen as an indication in favor of our method. Finally, in these experiments all profiles were generated initially, based on the dense model. Potentially, results could be improved further by rerunning SPDY from a model that has already been pruned to an intermediate sparsity. We test this by generating new 4 speedup profiles based on the gradually-pruned 3 BERT models which are also the starting points for subsequent gradual pruning (indicated by BERT SQu AD in the table). Although the absolute gap seems to stay the same, all models are noticeably more accurate, so the relative gap in terms of F1-drop is bigger, suggesting that iterating SPDY can bring additional gains. We also highlight the Res Net50 results, achieved with AC/DC, a method that trains and prunes the model from scratch. This implies that the sparsity profiles found by SPDY are not just fitted to a particular set of dense weights but also possess some architecture-specific generalization properties. In that context, the results of our experiments reinforce the idea that better one-shot performance is a good indicator of performance after finetuning (He et al., 2018). They seem to suggest that good sparsity profiles can transfer also when running gradual pruning, which may be loosely connected to lottery ticket approaches (Frankle & Carbin, 2019). We discuss some of the profiles produced by SPDY in more detail in Appendix 15. In general, SPDY profiles achieve the same speedup with a slightly lower overall sparsity than uniform and GMP, due to prioritizing execution time (see also Appendix 14), which likely contributes to the higher accuracies after extensive finetuning. However, our approach can be adapted to any additive constraint, such as parameters, FLOPs, or energy consumption. 5. Conclusion & Future Work We introduced SPDY, a method for automatically determining sparsity profiles optimized for a particular acceleration setup. The resulting profiles consistently perform better than ones determined by conventional methods, without any model-specific parameter tuning, both in one-shot and gradual-pruning scenarios. An extension of our layer reconstruction method can also provide improvements for post-training 2:4 pruning via global Ada Prune. Our ideas can be extended to other settings: (1) a combination of a dynamic program constraint solver and automated learning of sensitivity scores could also be applied to determine compression profiles for low-rank approximation; (2) it could be interesting to apply our techniques to neural architecture search super-networks as a replacement for our one-shot pruning reconstruction database. In sum, our techniques provide notable improvements relative to current methods, and have significant potential for follow-up work. 6. Acknowledgements We gratefully acknowledge funding from the European Research Council (ERC) under the European Union s Horizon 2020 programme (grant agreement No 805223 Scale ML), as well as computational support from AWS EC2. 
We thank Eldar Kurtic for code and hyper-parameters for BERT pruning, and the Neural Magic Team, notably Michael Goin and Mark Kurtz, for support with their software. SPDY: Accurate Pruning with Speedup Guarantees Aflalo, Y., Noy, A., Lin, M., Friedman, I., and Zelnik, L. Knapsack pruning with inner distillation. ar Xiv preprint ar Xiv:2002.08258, 2020. Ashok, A., Rhinehart, N., Beainy, F., and Kitani, K. M. N2N learning: Network to network compression via policy gradient reinforcement learning. In International Conference on Learning Representations (ICLR), 2018. Cai, H., Zhu, L., and Han, S. Proxyless NAS: Direct neural architecture search on target task and hardware. In International Conference on Learning Representations (ICLR), 2018. Cai, H., Gan, C., Wang, T., Zhang, Z., and Han, S. Oncefor-All: Train one network and specialize it for efficient deployment. In International Conference on Learning Representations (ICLR), 2019. Dave, S., Baghdadi, R., Nowatzki, T., Avancha, S., Shrivastava, A., and Li, B. Hardware acceleration of sparse and irregular tensor computations of ML models: A survey and insights. Proceedings of the IEEE, 109(10):1706 1752, 2021. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Image Net: A large-scale hierarchical image database. In Conference on Computer Vision and Pattern Recognition (CVPR), 2009. Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In North American Chapter of the Association for Computational Linguistics (NAACL), 2019. Elsen, E., Dukhan, M., Gale, T., and Simonyan, K. Fast sparse convnets. In Conference on Computer Vision and Pattern Recognition (CVPR), 2020. Evci, U., Le Roux, N., Castro, P., and Bottou, L. Mean replacement pruning. 2018. Evci, U., Gale, T., Menick, J., Castro, P. S., and Elsen, E. Rigging the lottery: Making all tickets winners. In International Conference on Machine Learning (ICML). PMLR, 2020. Frankle, J. and Carbin, M. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations (ICLR), 2019. Frantar, E., Kurtic, E., and Alistarh, D. M-FAC: Efficient matrix-free approximations of second-order information. In Conference on Neural Information Processing Systems (Neur IPS), 2021. Gale, T., Zaharia, M., Young, C., and Elsen, E. Sparse GPU kernels for deep learning. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2020. Gholami, A., Kim, S., Dong, Z., Yao, Z., Mahoney, M. W., and Keutzer, K. A survey of quantization methods for efficient neural network inference. ar Xiv preprint ar Xiv:2103.13630, 2021. Guo, Z., Zhang, X., Mu, H., Heng, W., Liu, Z., Wei, Y., and Sun, J. Single path one-shot neural architecture search with uniform sampling. In European Conference on Computer Vision (ECCV), 2020. Hagiwara, M. A simple and effective method for removal of hidden units and weights. Neurocomputing, 6(2):207 218, 1994. ISSN 0925-2312. Backpropagation, Part IV. Han, S., Pool, J., Tran, J., and Dally, W. J. Learning both weights and connections for efficient neural networks. In Conference on Neural Information Processing Systems (Neur IPS), 2015. He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016. He, Y., Zhang, X., and Sun, J. Channel pruning for accelerating very deep neural networks. 
In International Conference on Computer Vision (ICCV), 2017. He, Y., Lin, J., Liu, Z., Wang, H., Li, L.-J., and Han, S. AMC: Auto ML for model compression and acceleration on mobile devices. In European Conference on Computer Vision (ECCV), 2018. He, Y., Liu, P., Wang, Z., Hu, Z., and Yang, Y. Filter pruning via geometric median for deep convolutional neural networks acceleration. In Conference on Computer Vision and Pattern Recognition (CVPR), 2019. Hoefler, T., Alistarh, D., Ben-Nun, T., Dryden, N., and Peste, A. Sparsity in deep learning: Pruning and growth for efficient inference and training in neural networks. ar Xiv preprint ar Xiv:2102.00554, 2021. Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. ar Xiv preprint ar Xiv:1704.04861, 2017. Hubara, I., Chmiel, B., Island, M., Banner, R., Naor, S., and Soudry, D. Accelerated sparse neural training: A provable and efficient method to find N:M transposable masks. In Conference on Neural Information Processing Systems (Neur IPS), 2021a. SPDY: Accurate Pruning with Speedup Guarantees Hubara, I., Nahshan, Y., Hanani, Y., Banner, R., and Soudry, D. Accurate post training quantization with small calibration sets. In International Conference on Machine Learning (ICML). PMLR, 2021b. Jayakumar, S. M., Pascanu, R., Rae, J. W., Osindero, S., and Elsen, E. Top-KAST: Top-K always sparse training. ar Xiv preprint ar Xiv:2106.03517, 2021. Jocher, G. YOLOv5, 2022. URL https://github. com/ultralytics/yolov5. Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015. Kurtic, E., Campos, D., Nguyen, T., Frantar, E., Kurtz, M., Fineran, B., Goin, M., and Alistarh, D. The Optimal BERT Surgeon: Scalable and accurate secondorder pruning for large language models. ar Xiv preprint ar Xiv:2203.07259, 2022. Kurtz, M., Kopinsky, J., Gelashvili, R., Matveev, A., Carr, J., Goin, M., Leiserson, W., Moore, S., Nell, B., Shavit, N., and Alistarh, D. Inducing and exploiting activation sparsity for fast inference on deep neural networks. In International Conference on Machine Learning (ICML), 2020. Kusupati, A., Ramanujan, V., Somani, R., Wortsman, M., Jain, P., Kakade, S., and Farhadi, A. Soft threshold weight reparameterization for learnable sparsity. In International Conference on Machine Learning (ICML), 2020. Lagunas, F., Charlaix, E., Sanh, V., and Rush, A. Block pruning for faster transformers. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021. Le Cun, Y., Denker, J. S., and Solla, S. A. Optimal brain damage. In Conference on Neural Information Processing Systems (Neur IPS), pp. 598 605, 1990. Li, H., Kadav, A., Durdanovic, I., Samet, H., and Graf, H. P. Pruning filters for efficient convnets. ar Xiv preprint ar Xiv:1608.08710, 2016. Li, Y., Gong, R., Tan, X., Yang, Y., Hu, P., Zhang, Q., Yu, F., Wang, W., and Gu, S. BRECQ: Pushing the limit of post-training quantization by block reconstruction. In International Conference on Learning Representations (ICLR), 2021. Liebenwein, L., Maalouf, A., Feldman, D., and Rus, D. Compressing neural networks: Towards determining the optimal layer-wise decomposition. Conference on Neural Information Processing Systems (Neur IPS), 2021. Liu, L., Zhang, S., Kuang, Z., Zhou, A., Xue, J.-H., Wang, X., Chen, Y., Yang, W., Liao, Q., and Zhang, W. 
Group fisher pruning for practical network compression. In International Conference on Machine Learning (ICML), 2021. Marcel, S. and Rodriguez, Y. Torchvision the machinevision package of torch. In ACM International Conference on Multimedia, 2010. Markov, I., Ramezani, H., and Alistarh, D. Project CGX: Scalable deep learning on commodity GPUs. ar Xiv preprint ar Xiv:2111.08617, 2021. Mishra, A., Latorre, J. A., Pool, J., Stosic, D., Stosic, D., Venkatesh, G., Yu, C., and Micikevicius, P. Accelerating sparse deep neural networks. ar Xiv preprint ar Xiv:2104.08378, 2021. Nagel, M., Amjad, R. A., Van Baalen, M., Louizos, C., and Blankevoort, T. Up or down? adaptive rounding for post-training quantization. In International Conference on Machine Learning (ICML), 2020. Nagel, M., Fournarakis, M., Amjad, R. A., Bondarenko, Y., van Baalen, M., and Blankevoort, T. A white paper on neural network quantization. ar Xiv preprint ar Xiv:2106.08295, 2021. Neural Magic. Deep Sparse, 2021. URL https:// github.com/neuralmagic/deepsparse. Peste, A., Iofinova, E., Vladu, A., and Alistarh, D. Ac/dc: Alternating compressed/decompressed training of deep neural networks. In Conference on Neural Information Processing Systems (Neur IPS), 2021. Pool, J. and Yu, C. Channel permutations for N:M sparsity. In Conference on Neural Information Processing Systems (Neur IPS), 2021. Rajpurkar, P., Zhang, J., Lopyrev, K., and Liang, P. SQu AD: 100,000+ questions for machine comprehension of text. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2016. Sanh, V., Wolf, T., and Rush, A. M. Movement pruning: Adaptive sparsity by fine-tuning. ar Xiv preprint ar Xiv:2005.07683, 2020. Schwarz, J., Jayakumar, S., Pascanu, R., Latham, P., and Teh, Y. Powerpropagation: A sparsity inducing weight reparameterisation. In Conference on Neural Information Processing Systems (Neur IPS), 2021. Singh, S. P. and Alistarh, D. Woodfisher: Efficient secondorder approximation for neural network compression. In Conference on Neural Information Processing Systems (Neur IPS), 2020. SPDY: Accurate Pruning with Speedup Guarantees Wu, Y.-C., Liu, C.-T., Chen, B.-Y., and Chien, S.-Y. Constraint-aware importance estimation for global filter pruning under multiple resource constraints. In Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2020. Yang, H., Gui, S., Zhu, Y., and Liu, J. Automatic neural network compression by sparsity-quantization joint learning: A constrained optimization-based approach. In Conference on Computer Vision and Pattern Recognition (CVPR), 2020. Yang, T.-J., Liao, Y.-L., and Sze, V. Net Adapt V2: Efficient neural architecture search with fast super-network training and architecture optimization. In Conference on Computer Vision and Pattern Recognition (CVPR), 2021. Yao, Z., Dong, Z., Zheng, Z., Gholami, A., Yu, J., Tan, E., Wang, L., Huang, Q., Wang, Y., Mahoney, M., et al. HAWQ-v3: Dyadic neural network quantization. In International Conference on Machine Learning (ICML), 2021. Zhou, A., Ma, Y., Zhu, J., Liu, J., Zhang, Z., Yuan, K., Sun, W., and Li, H. Learning N:M fine-grained structured sparse neural networks from scratch. In International Conference on Learning Representations (ICLR), 2021. Zhu, M. and Gupta, S. To prune, or not to prune: exploring the efficacy of pruning for model compression. ar Xiv preprint ar Xiv:1710.01878, 2017. SPDY: Accurate Pruning with Speedup Guarantees 7. 
8. SPDY Pseudocode

This section provides additional details about the overall SPDY framework described in Section 3.5 in the form of pseudocode. Specifically, Algorithm 2 and Algorithm 3 illustrate the initial timing collection and the iterative AdaPrune reconstruction database generation, while Algorithm 4 presents the SPDY search for the optimal coefficients c*.

Algorithm 2 Collect layer-wise timings t^s_ℓ.

  for ℓ = 1, ..., L do
    for s ∈ S do
      W^s_ℓ ← randomly prune W_ℓ to sparsity s
      t^s_ℓ ← collect timing for layer ℓ using weights W^s_ℓ
    end for
  end for

Algorithm 3 Generate reconstruction database entries W^s_ℓ.

  s ← sorted vector of sparsities in S
  for ℓ = 1, ..., L do
    W^{s_0}_ℓ ← dense weights W_ℓ
    for i = 1, ..., |S| do
      W^{s_i}_ℓ ← AdaPrune for target s_i with weights W^{s_{i−1}}_ℓ
    end for
  end for

Algorithm 4 SPDY search for optimal sensitivity values c*. We use k = 100 and δ = 0.1 in our experiments.

  function eval(c)
    e^s_ℓ ← compute by formula (4) using c, for all ℓ
    s_ℓ ← run DP algorithm with the e^s_ℓ, for all ℓ
    M ← stitch model for the s_ℓ from the database
    return calibration loss of M
  end function

  c* ← sample uniform vector in [0, 1]^L
  for k times do
    c ← sample uniform vector in [0, 1]^L
    if eval(c) < eval(c*) then c* ← c end if
  end for
  for d = ⌊δ · L⌋, ..., 1 do
    for k times do
      c ← c* with d randomly chosen entries resampled uniformly in [0, 1]
      if eval(c) < eval(c*) then c* ← c end if
    end for
  end for
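The search in Algorithm 4 is equally simple to express in code. Below is a compact Python sketch under our own naming; evaluate(c) is assumed to be supplied externally and to perform the eval step above (compute the e^s_ℓ from c via formula (4), run the DP, stitch the model, and return its calibration loss). Unlike the pseudocode, the sketch caches the loss of the current best candidate instead of re-evaluating it.

import random

def spdy_search(L, evaluate, k=100, delta=0.1, seed=0):
    """Random initialization followed by local resampling (cf. Algorithm 4)."""
    rng = random.Random(seed)

    # Phase 1: pure random search for a good starting point c*.
    c_best = [rng.random() for _ in range(L)]
    loss_best = evaluate(c_best)
    for _ in range(k):
        c = [rng.random() for _ in range(L)]
        loss = evaluate(c)
        if loss < loss_best:
            c_best, loss_best = c, loss

    # Phase 2: local search, shrinking the number of resampled coordinates.
    for d in range(int(delta * L), 0, -1):
        for _ in range(k):
            c = list(c_best)
            for i in rng.sample(range(L), d):   # resample d random entries of c*
                c[i] = rng.random()
            loss = evaluate(c)
            if loss < loss_best:
                c_best, loss_best = c, loss
    return c_best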
9. One-Shot Comparison

Figure 5 demonstrates how iterative AdaPrune significantly improves over standard AdaPrune and GMP.

[Figure 5. Comparison of magnitude pruning, AdaPrune and iterative AdaPrune in terms of one-shot performance. Plot for uniform ResNet50 profiles; x-axis: log-sparsity (0.400 to 0.990); curves: Magnitude, AdaPrune, Iterative AdaPrune.]

Method                                     ResNet34         ResNet50         BERT SQuAD
                                           2.00x   3.00x    2.00x   3.00x    2.50x   3.50x
DP + squared weights                       54.18   31.59    47.44   03.82    71.20   07.88
DP + squared weights normalized            56.32   09.15    35.42   01.20    58.77   06.17
DP + custom norm based                     66.64   49.17    66.82   11.11    72.34   13.64
DP + layer-wise loss / no reconstruction   54.51   10.32    66.49   08.85    53.98   06.85
DP + layer-wise loss / iterative AP        65.92   46.11    69.26   20.22    30.59   06.13
SPDY / no reconstruction                   53.38   22.18    64.59   00.67    62.41   07.58
SPDY                                       68.23   51.68    71.17   26.81    74.59   18.83
Uniform                                    64.65   42.38    64.65   05.84    57.42   09.00
Global magnitude                           39.41   22.80    62.89   01.76    58.06   06.85

Table 5. A comparison of one-shot accuracies for profiles generated by DP in combination with various metrics, SPDY, and uniform / global magnitude pruning.

10. Ablation Studies

In Section 3.3, we briefly discussed the difficulty of designing a reliable error metric for use in conjunction with our DP algorithm. We now present a more detailed investigation of this topic, including additional experimental results. Concretely, we consider the following 5 error metrics:

Squared Weights. Perhaps the most obvious error score candidate is a natural extension of the popular magnitude pruning criterion (Zhu & Gupta, 2017): the sum of the squared magnitudes of all pruned weights. It can also be interpreted as the squared norm of the weight perturbation, which has been used in the context of quantization (Yao et al., 2021).

Squared Weights Normalized. A potential problem of the previous metric is that smaller layers, which are often quite sensitive, might be pruned too strongly, as their sums are inherently smaller than those of bigger layers. A simple way to address this is to normalize each sum by the number of elements in the corresponding layer. Such normalization is for example used for structured pruning in (Liu et al., 2021).

Custom Norm Based. Another possible shortcoming of both metrics mentioned so far is that the sums for high sparsity choices generally differ only in a small number of elements and can thus be quite similar. This means that the fact that pruning typically becomes a lot more difficult at high sparsities may not be well reflected in those error metrics. Hence, we developed a custom norm-based metric to address this. Let max_{w ∈ W^s} |w| be the magnitude of the largest dropped weight when pruning a layer to sparsity s; then the error is defined as

  e^s = max_{w ∈ W^s} |w| / (1 − s).

The numerator ensures that the influence of the weight magnitudes does not decrease for high sparsities, and the division by the remaining density ensures that the error always increases strongly as a layer gets very sparse (a short code sketch of these weight-based scores is given after the metric descriptions).

Layer-wise Loss / No Reconstruction. One more obvious option for a layer-wise error is simply the change in loss on a small calibration set that results from pruning exactly this layer to a certain sparsity. This metric measures the loss after standard magnitude pruning, without any layer-wise reconstruction, similar to (Li et al., 2016).

Layer-wise Loss / Iterative AP. Finally, we combine the layer-wise loss with the techniques we introduce in Section 3.4, i.e. using iterative AdaPrune to reconstruct the pruned layer before measuring the loss. This is similar in spirit to the AdaQuant loss used as an error metric for quantization (Hubara et al., 2021b).
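As a reference for the three weight-based metrics above, the following PyTorch-style sketch computes all of them for a single weight tensor under magnitude pruning. The helper name and exact details are ours, and the custom metric follows the formula e^s = max_{w ∈ W^s} |w| / (1 − s) as reconstructed above.

import torch

def weight_based_errors(W, sparsities):
    """Hand-crafted layer-wise error scores e^s for one weight tensor W,
    assuming the s * numel(W) smallest-magnitude weights are pruned."""
    w_sorted, _ = torch.sort(W.abs().flatten())   # ascending magnitudes
    n = w_sorted.numel()
    sq, sq_norm, custom = {}, {}, {}
    for s in sparsities:                          # assumes 0 <= s < 1
        k = int(round(s * n))                     # number of pruned weights
        dropped = w_sorted[:k]
        sq[s] = (dropped ** 2).sum().item()       # squared weights
        sq_norm[s] = sq[s] / n                    # normalized by layer size
        max_dropped = dropped.max().item() if k > 0 else 0.0
        custom[s] = max_dropped / (1.0 - s)       # custom norm-based metric
    return sq, sq_norm, custom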
We now generate profiles using DP in combination with each of the 5 described metrics, for 3 different models at 2 speedups each: a lower one where AP pruning still gives reasonable accuracies, and a higher one which could be used as the starting point for extended gradual pruning as in Section 4. We consider two ResNets, to study the consistency of metrics for models of the same family, and a very different BERT model. For computational reasons, we consider the accuracy after one-shot pruning by stitching together layers from the reconstruction database described in Section 3.4, which (He et al., 2018) as well as our main experimental results show to be a decent proxy for post-finetuning / gradual pruning accuracy. We compare with SPDY as well as the standard uniform and global magnitude strategies (see Section 4); input and output layers are always skipped. All results are summarized in Table 5.

First, we can see that the squared-weights-based metrics fail to beat the best non-DP profile in all but one case. The same is true for the layer-wise loss without any reconstruction. Clearly, those metrics are unsuitable for generating effective sparsity profiles. Meanwhile, the layer-wise loss with iterative AP reconstruction seems to perform quite well relative to the non-DP results on the ResNet models, but performs poorly on BERT, where the losses are apparently so misleading that it is beaten even by the no-reconstruction version. A closer look reveals that the layer-wise losses do not all increase strongly with higher sparsities on this large model, which leads to very aggressive pruning of several layers, compounding into large accuracy drops (that are not properly reflected in the single-layer losses). Additionally, it should be noted that the layer-wise loss works better than the custom metric on ResNet50 but worse on ResNet34. This means that the best-performing metric does not even seem to be consistent within a single model family.

In general, the custom metric appears to perform reasonably in all 6 scenarios, outperforming the non-DP profiles, sometimes even by sizable margins. However, there is still a significant gap to the SPDY results in several cases. In fact, we found that the custom-metric profiles often only lead to minimal 0.1–0.2 improvements over uniform ones when running extensive state-of-the-art gradual pruning, meaning that the extra improvements of SPDY are crucial for achieving the notable gaps we observe in Section 4. Overall, these experiments demonstrate that manually designing a reliable layer-wise error metric for unstructured pruning is indeed a quite challenging problem, as the automatically learned error metric of SPDY consistently beats hand-crafted metrics of varying complexity.

Finally, we also briefly study the importance of the enhanced one-shot pruning performance provided by our reconstruction database. For that purpose, we run SPDY with one-shot magnitude pruning without any reconstruction to find profiles (the reported accuracy of course still stitches the resulting profile from the reconstruction database, for comparability). As Table 5 shows, this dramatically worsens results, especially at high sparsities where simple one-shot pruning typically completely crashes the model, making it essentially impossible for SPDY to find useful sensitivity coefficients. In summary, the experiments covered in this section strongly suggest that both the search procedure and the reconstruction database are key components of our system and are not easily replaced by simpler means.

11. Search Procedure Discussion

In the main text, we briefly mentioned how the integration of the DP algorithm makes the search significantly easier. We now provide additional discussion and more details on our experimental setup. While our search only ever considers solutions that lie directly on the constraint boundary, other approaches (Cai et al., 2019; Yang et al., 2021; Li et al., 2021) typically have to evaluate many candidates for which the constraint is either violated or not tight (in which case the solution cannot be optimal) in order to make progress. Additionally, the DP solving always ensures proper utilization of the layer-wise speedups. For illustration, a layer with poor acceleration will not be pruned much for most sensitivity values.
Similarly, due to the quadratic nature of our errors, most layers will only be pruned to very high sparsities, where speedup curves typically flatten, if this is really necessary to reach the overall speedup target. As a consequence, non-key layers will often have large ranges of acceptable sensitivity scores that all lead to a very similar sparsity choice. This makes finding a good profile in terms of c easier than by direct optimization in the huge space of all |S|^L possible layer-wise sparsity configurations.²

² For example, the architecture search space in (Cai et al., 2019) is 10^19, whereas ours is 10^80 for the ResNet50 model.

We now demonstrate these claims empirically by comparing the average over 5 runs of our reparametrized unconstrained search using the DP algorithm with a direct (constrained) search in the original sparsity space. For comparability, we use exactly the same local search method detailed in Section 3.3 for the direct search, just with resampling (uniformly in S) of changes to the current solution until one is found that satisfies the speedup constraint. Additionally, we also test a genetic algorithm, implemented exactly as described in (Guo et al., 2020) but using 2.5x more iterations for a fair comparison. Figure 6 shows the evolution of the objective value relative to the dominating runtime factor, the number of stitched model evaluations (calibration loss computations), for a 2.50x ResNet50 profile.

[Figure 6. Comparison of our reparametrized unconstrained search and a constrained direct search. Plot for ResNet50 at 2.50x; x-axis: number of evaluations (0 to 2500); y-axis: calibration loss; curves: Genetic algorithm, Direct local search, Reparametrized local search (SPDY).]

The plot clearly shows how the reparametrized search finds profiles with lower calibration loss, does so with much fewer evaluations (about 10x here), and exhibits significantly reduced variance in solution quality over multiple runs. Overall, this much increased efficiency suggests that the reparametrized DP search could probably also be used in conjunction with better but more expensive evaluation strategies. Interestingly, the genetic algorithm seems to perform noticeably worse than the simple local search. We suspect this is due to the fact that, in our huge search space, getting stuck in local minima seems to be less of a problem than actually reaching such a minimum quickly. Thus, focusing mutation evaluations on just the current best solution leads to better results faster than distributing them across several candidates, as is done by genetic programming.
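For completeness, the direct (constrained) baseline can be sketched as follows. This is again our own illustrative rendering, with evaluate_profile standing for the stitched-model calibration-loss computation and with the speedup constraint expressed as a budget on the summed layer timings; it assumes that at least one feasible profile exists.

import random

def direct_profile_search(timings, budget, evaluate_profile, k=100, delta=0.1, seed=0):
    """Local search directly over sparsity profiles; infeasible candidates
    (summed timings above the budget) are resampled until a feasible one is found."""
    rng = random.Random(seed)
    L, S = len(timings), len(timings[0])

    def feasible(p):
        return sum(timings[l][p[l]] for l in range(L)) <= budget

    def sample_feasible(base=None, d=None):
        while True:   # rejection sampling of candidates w.r.t. the speedup constraint
            if base is None:
                p = [rng.randrange(S) for _ in range(L)]
            else:
                p = list(base)
                for i in rng.sample(range(L), d):
                    p[i] = rng.randrange(S)
            if feasible(p):
                return p

    best = sample_feasible()
    best_loss = evaluate_profile(best)
    for _ in range(k):                         # random phase
        p = sample_feasible()
        loss = evaluate_profile(p)
        if loss < best_loss:
            best, best_loss = p, loss
    for d in range(int(delta * L), 0, -1):     # local phase
        for _ in range(k):
            p = sample_feasible(best, d)
            loss = evaluate_profile(p)
            if loss < best_loss:
                best, best_loss = p, loss
    return best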
12. Additional Experiments

To investigate the one-shot performance of the different profile generation methods more closely, we prune ResNet50 to various speedup targets in steps of 0.25x and plot the results in Figure 7.

[Figure 7. Accuracy after one-shot pruning vs. speedup for all profile generation methods on ResNet50. For improved visibility, as the y-axes are very different, the plot is split in two, showing 1x to 2x speedup on the left ("ResNet50 - Low Speedups") and 2x to 3x on the right ("ResNet50 - Medium Speedups"); curves for the AP and +gAP variants of SPDY, Uniform and GMP.]

First, this plot shows that the graphs of uniform and GMP cross somewhere between 1.5x and 1.75x speedup. We can also observe how the relative order of the accuracies directly after AP stitching is preserved even through the big accuracy increases obtained by running global AP. Finally, we can clearly see how the gap between SPDY (after AP, but also after gAP) and the other methods rapidly grows as the speedup target is increased, up to over 10% accuracy at 3x speedup. Overall, it can be said that SPDY appears to work well in the challenging one-shot setting.

12.1. Global AdaPrune 2:4

Additionally, we now present results for post-training 2:4 pruning of various YOLOv5 models and of BERT finetuned on SQuAD in Table 6. Although the mAP drops for YOLO are quite a bit bigger than the accuracy drops for ResNet, the results are still several points above the next smaller and roughly 2x faster model (a similar acceleration is promised by 2:4 sparsity) and could thus be relevant in practice.

Model         Dense   2:4 AP   +gAP
YOLOv5n       46.20   13.40    37.10
YOLOv5s       56.40   40.70    51.60
YOLOv5m       64.20   54.00    61.20
YOLOv5l       67.40   59.80    65.40
BERT SQuAD    88.54   84.75    87.41

Table 6. Global AdaPrune performance for 2:4 pruning of the YOLOv5 model family as well as BERT-base on SQuAD. All input and output layers are skipped.

13. Real Timings

Model         Target   CPU     Time     Real
ResNet50      1.00x    AMD     0.478s   1.00x
ResNet50      2.00x    AMD     0.234s   2.04x
ResNet50      2.50x    AMD     0.178s   2.69x
ResNet50      3.00x    AMD     0.143s   3.34x
ResNet50      3.50x    AMD     0.121s   3.95x
MobileNetV1   1.00x    Intel   0.045s   1.00x
MobileNetV1   1.50x    Intel   0.031s   1.45x
YOLOv5s       1.00x    Intel   0.641s   1.00x
YOLOv5s       1.50x    Intel   0.449s   1.43x
YOLOv5s       1.75x    Intel   0.380s   1.68x
YOLOv5m       1.00x    Intel   1.459s   1.00x
YOLOv5m       1.75x    Intel   0.848s   1.72x
YOLOv5m       2.00x    Intel   0.725s   2.01x
BERT SQuAD    1.00x    Intel   0.969s   1.00x
BERT SQuAD    3.00x    Intel   0.320s   3.03x
BERT SQuAD    3.50x    Intel   0.271s   3.58x
BERT SQuAD    4.00x    Intel   0.223s   4.35x

Table 7. Real timings of the SPDY profiles in Table 4 using the DeepSparse v0.9.1 engine; "Target" is the targeted speedup and "Real" the measured one. All timings are for batch size 64, except for BERT, which uses batch size 16.

Table 7 shows the real timings of the final pruned models resulting from the SPDY profiles in our gradual pruning experiments in Section 4. We can see that the speedup predictions of the additive model are in most cases very accurate, especially considering typical performance fluctuations with varying operating temperature and the fact that even different CPUs of the same type can have small performance differences. Only for the higher-speedup ResNet50 models and the 4.00x BERT model are the true speedups noticeably underestimated. This is most likely due to special engine optimizations (we run with all of them turned on here) that are not active in the layer-wise timing mode. With sufficient knowledge about the engine internals, it would probably be possible to account for such effects in the DP algorithm. However, optimizing our methods for one particular inference engine was not the goal of this work.
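The additive model referenced here simply predicts a network's runtime as the sum of its layer-wise timings collected in Algorithm 2; a minimal sketch of the corresponding speedup prediction follows (names are ours, not part of any engine API).

def predicted_speedup(profile, timings, dense_idx=0):
    """Additive timing model: predicted runtime = sum of per-layer timings.

    timings[l][j]: measured time of layer l at its j-th sparsity choice;
    profile[l]: the choice selected for layer l; dense_idx: index of the
    fully dense choice, which serves as the 1.00x baseline.
    """
    dense_time = sum(t[dense_idx] for t in timings)
    sparse_time = sum(timings[l][profile[l]] for l in range(len(profile)))
    return dense_time / sparse_time

Real end-to-end measurements can exceed such a prediction when whole-model engine optimizations are only active outside the layer-wise timing mode, which is consistent with the underestimation discussed above.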
14. Overall Sparsities

As our focus is on real model speedups, the results in the main paper are all given with respect to those rather than parameter counts. Nevertheless, for completeness, we now provide the overall sparsity values (with respect to all pruned layers) corresponding to our main gradual pruning results in Table 4.

Model         Speedup   CPU     SPDY    Uniform   GMP
ResNet50      2.00x     AMD     70.31   80.54     85.77
ResNet50      2.50x     AMD     86.15   88.33     92.30
ResNet50      3.00x     AMD     91.78   93.01     96.24
ResNet50      3.50x     AMD     95.22   96.58     98.26
MobileNetV1   1.50x     Intel   49.13   65.88     77.16
YOLOv5s       1.50x     Intel   59.83   70.69     70.84
YOLOv5s       1.75x     Intel   85.16   85.68     91.60
YOLOv5m       1.75x     Intel   79.81   82.42     88.20
YOLOv5m       2.00x     Intel   90.42   91.42     94.29
BERT SQuAD    3.00x     Intel   82.24   84.14     83.98
BERT SQuAD    3.50x     Intel   88.89   90.49     90.18
BERT SQuAD    4.00x     Intel   94.15   94.86     95.29

Table 8. Overall sparsities of the profiles corresponding to Table 4, calculated with respect to all pruned layers.

As can be seen in Table 8, SPDY profiles generally achieve the same speedup with a lower overall sparsity. Similarly, the relative difference in the number of remaining weights tends to increase for higher speedups. This is expected, as our method directly takes execution information into account and can thus, for example, prioritize pruning layers that provide good speedups while leaving more weights in other (perhaps bigger) layers with worse acceleration behavior. However, we note that our techniques are quite general and can easily be adapted (by replacing the timings in the DP problem with the numbers of remaining parameters) to optimize for overall sparsity instead.

15. Profile Visualizations

Figure 8 displays visualizations of some of the SPDY profiles from Section 4, which we now analyze briefly. Starting with ResNet50, we can see that the last convolution in a residual block is typically pruned less than the others, with this effect being most pronounced in the early layers. Further, we can observe that the first conv is pruned more than the second one early on, with the roles seemingly switching in the later layers. Next, for MobileNetV1, we can see that SPDY keeps all but the very last depth-wise convolution dense, since those allow almost no acceleration while at the same time being very sensitive. For the standard convolutions, SPDY seems to do the most pruning in the middle layers. YOLOv5s is a quite complex model and correspondingly features a complex profile. We can see that the first conv of the pathway leading to the eventual output maps, mconv1, is typically pruned less than the other layers, while the convolution following it, mconv2, is typically amongst the most strongly pruned ones. Additionally, convolutions not in a residual block, conv, are also pruned quite heavily in most cases. Lastly, one can notice that the 3 output layers are pruned to relatively high levels, verifying that those should not be skipped as for other models. In the BERT profile, the first projection after attention, attention.output.dense, is generally not pruned very strongly, and the query and value matrices typically hover around the middle in terms of the assigned sparsities. Meanwhile, the fully-connected layers intermediate.dense and output.dense are usually amongst the most strongly pruned parts, which is to be expected as those make up a big portion of the overall runtime. All in all, the profiles found by SPDY exhibit various interesting patterns and carefully balance the layer-wise speed-sensitivity trade-offs in ways that seem very difficult to derive manually.
[Figure 8. Sample visualizations of SPDY profiles. Per-layer sparsity assignments, grouped by layer type, for ResNet50 at 3.00x (downsample, conv1, conv2, conv3), MobileNetV1 at 1.50x (depthwise, conv), YOLOv5s at 1.75x (conv, cv1, cv2, cv3, mconv, mcv1, mcv2), and BERT SQuAD at 3.00x (attention.self.query, attention.self.key, attention.self.value, attention.output.dense, intermediate.dense, output.dense); x-axis: layer index.]