# Pruning vs Quantization: Which is Better?

Andrey Kuzmin, Markus Nagel, Mart van Baalen, Arash Behboodi, Tijmen Blankevoort
Qualcomm AI Research*, Amsterdam, The Netherlands
{akuzmin, markusn, mart, behboodi, tijmen}@qti.qualcomm.com

Neural network pruning and quantization techniques are almost as old as neural networks themselves. However, to date, only ad-hoc comparisons between the two have been published. In this paper, we set out to answer the question of which is better: neural network quantization or pruning? By answering this question, we hope to inform design decisions made on neural network hardware going forward. We provide an extensive comparison between the two techniques for compressing deep neural networks. First, we give an analytical comparison of expected quantization and pruning error for general data distributions. Then, we provide lower bounds for the per-layer pruning and quantization error in trained networks, and compare these to empirical error after optimization. Finally, we provide an extensive experimental comparison for training 9 large-scale models on 4 tasks. Our results show that in most cases quantization outperforms pruning. Only in some scenarios with a very high compression ratio might pruning be beneficial from an accuracy standpoint.¹

*Qualcomm AI Research is an initiative of Qualcomm Technologies, Inc.
¹Code is available at https://github.com/Qualcomm-AI-research/pruning-vs-quantization

37th Conference on Neural Information Processing Systems (NeurIPS 2023).

## 1 Introduction

Recent advances in deep learning have led to exceeding human-level performance in many tasks, including computer vision, machine translation, voice recognition, and language understanding. Real-world applications of DNNs rely heavily on their efficiency. Both mobile and cloud platforms benefit greatly from the reduced latency and improved energy efficiency achieved by some form of model compression. In this work, we consider two mainstream techniques used in practice: pruning and quantization.

Pruning methods remove individual weights [70, 25], or sometimes groups of weights [28, 47]. This procedure can reduce the memory footprint. Furthermore, not having to perform the computations with weights that are zeroed out can make network inference more efficient. Quantization, on the other hand, reduces the bit-width used for both the weights and the computation in networks, leading to both predictable memory savings and reductions in the necessary compute. In both scenarios, the hardware that exploits these optimization schemes needs to take them into account.

Depending on the availability of training data and the computing budget, most methods for pruning and quantization fall into one of two families. The first family includes fine-tuning approaches, namely quantization-aware training (QAT) and fine-tuning with pruning in the loop. The second family includes post-training approaches such as post-training quantization (PTQ). Previously, pruning techniques relied primarily on fine-tuning; however, some post-training pruning methods have appeared recently, as fine-tuning is not desirable for large language models [18].

Despite the importance of model efficiency and the plethora of approaches for pruning and quantization, the two fields are mostly disjoint. The literature presents little insight into which of the two techniques is more accurate. In practice, there is only limited time to compress a network and limited energy to spend on making deep learning inference hardware. For this reason, we ask the question: should one focus on quantization or pruning for compression?

Figure 1: Comparison for a standard normal distribution. (left) Distributions after pruning and quantization for INT4 and 75% pruning. (middle) The squared error weighted by probability. (right) SNR for different compression ratios.
We present an extensive study comparing pruning and quantization in equal settings. First, we consider different data distributions and analyze the conditions under which each method is preferable. We match our findings with real weight tensors from pre-trained models. Second, we consider a post-training scenario and evaluate single-layer output errors for both methods. Because the comparison might depend on the specific choice of optimization method, we compare the two using theoretical bounds that apply regardless of the optimization method. Finally, we provide a full-model comparison for the most common scenario of fine-tuning networks after either pruning or quantization.

In our comparison, we intentionally avoid considering the hardware aspects of pruning and quantization. Instead, we focus solely on the accuracy of both methods, given similar theoretical compression ratios. A coarse discussion of the hardware necessary for both methods can be found in section 6.

## 2 Assumptions

In our work, we assume FP16 as the basic data type and measure any gains in compression with respect to it. Using FP16 for inference generally does not lead to a loss in accuracy. Neural networks are also very commonly trained with FP16, making it a common baseline. Thus, we compare 50% pruning sparsity to INT8 quantization, 75% sparsity to INT4 quantization, and so forth. We also assume no overhead for storing the sparsity mask for pruning and relegate such hardware-specific implementations to section 6.

For the pruning experiments, we consider magnitude pruning. It is common to do fine-tuning after or during pruning [70]. Several works have independently shown that, despite its simplicity, it is tough to improve upon magnitude pruning and fine-tuning [19, 4]. To our knowledge, no pruning algorithm exists that consistently outperforms this method.

For the quantization experiments, we use symmetric uniform quantization, which is defined by just the quantization scale factor and the bit-width. The scale is represented as a floating-point number and is used to map floating-point values to the integer grid. Further details on symmetric uniform quantization can be found in [49]. Uniform quantization is the standard in the quantization literature, and symmetric quantization is mostly employed for the weights. In all our experiments, we use a quantization range estimator that minimizes the mean-squared error on the weights by grid search [49].

## 3 Comparison on statistical distributions

Before diving into comparison results, we first describe theoretically what the quantization error and the pruning error are. Looking at this through a theoretical lens helps with understanding the later experimental differences between the two methods. We start off by describing and analyzing both methods on simple data distributions. In order to compare the error of pruning and quantization, we will frequently use the signal-to-noise ratio measure defined on the log scale:

$$\mathrm{SNR}_{\mathrm{dB}} = 10 \log_{10} \frac{\mathbb{E}\left[W^2\right]}{\mathbb{E}\left[(W - F(W))^2\right]},$$

where F(W) is the quantization or pruning function.
This measure is the same as a scaled logarithm of an MSE measure. Both are often employed to analyze the sensitivity of neural network layers to quantization, and they are theoretically well-founded to correlate with network performance [41, 48].

### 3.1 Quantization error

For quantization, we consider symmetric uniform quantization, which is also called integer quantization. Given a bit-width b and the scale δ, the grid nodes are defined as $q_i = \delta i$, $i \in \{-2^{b-1}, \dots, 0, \dots, 2^{b-1} - 1\}$. The rounding-to-nearest quantization operation Q(w) and the corresponding quantization error R(w) are defined as:

$$Q(w) = q_{i^*}, \quad i^* = \arg\min_i |w - q_i|, \quad R(w) = Q(w) - w. \quad (1)$$

Following [36], we model neural network weights as a random variable $W \sim p(w)$. The expected value of the quantization MSE can be expressed as follows:

$$\mathbb{E}\left[(Q(W) - W)^2\right] = \int_{q_{\min}}^{q_{\max}} R^2(w)\, p(w)\, dw + \int_{-\infty}^{q_{\min}} (w - q_{\min})^2 p(w)\, dw + \int_{q_{\max}}^{\infty} (q_{\max} - w)^2 p(w)\, dw, \quad (2)$$

where $q_{\min} = \min_i q_i$ and $q_{\max} = \max_i q_i$ are the quantization range limits. The first term corresponds to the rounding error, and the two remaining terms correspond to the clipping error. We use this analytic formulation for our distribution results below; the details are given in appendix A.

### 3.2 Pruning error

We consider magnitude pruning $T(x) = x \cdot \mathbb{1}[|x| > t]$, which simply sets the values closest to zero to actual zero. Given this, the expected error of pruning is expressed as follows:

$$\mathbb{E}\left[(W - T(W))^2\right] = \int_{-t}^{t} w^2 p(w)\, dw, \quad (3)$$

where t is the threshold value that controls how much is pruned. Given the compression ratio $c \in (0, 1)$, we find the threshold value which satisfies $P(-t \le W \le t) = c$. In the case of a symmetric zero-mean distribution, the threshold can be expressed as $t = F_W^{-1}\!\left(\frac{1+c}{2}\right)$, where $F_W(w) = P(W \le w)$ is the CDF and $F_W^{-1}(p)$ is its inverse. The expected pruning error in equation 3 is similar to the clipping error for quantization (see the second and the third term in equation 2), and can also be computed analytically. We also use this formulation for our results below.

### 3.3 Analytical comparison

Standard normal distribution. Let us first look at a standard normal distribution. As many weights in neural networks are roughly Gaussian-shaped, this distribution is useful for our understanding of the comparison. As we can see from figure 1 (middle), the errors of the two methods behave very differently. The quantization error oscillates between the quantization nodes and has a moderate range. The pruning error effectively corresponds to rounding many weights to zero and is thus higher. As we can see in figure 1 (right), this results in a higher SNR for quantization, e.g., 19.1 dB for INT4 quantization versus only 5.6 dB for 75% pruning. We see similar results for different compression ratios. For this distribution, quantization achieves a much higher signal-to-noise ratio.
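To make the standard-normal comparison concrete, the following minimal NumPy sketch (our own illustration, not the paper's released code; all helper names are ours) estimates both SNRs on samples from a standard normal, using round-to-nearest symmetric quantization with an MSE-based grid search over the clipping range as in section 2. It should land close to the roughly 19 dB (INT4) and 5.6 dB (75% pruning) values reported for figure 1.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(1_000_000)  # stand-in for a roughly Gaussian weight tensor

def snr_db(w, w_hat):
    # SNR_dB = 10 log10( E[W^2] / E[(W - F(W))^2] )
    return 10 * np.log10(np.mean(w ** 2) / np.mean((w - w_hat) ** 2))

def quantize_symmetric(w, bits, n_grid=100):
    # symmetric uniform quantizer; the scale is chosen by a small grid search
    # minimizing the weight MSE, in the spirit of the range estimator of section 2
    qmin, qmax = -2 ** (bits - 1), 2 ** (bits - 1) - 1
    best_mse, best_w_hat = np.inf, None
    for clip in np.linspace(0.1, 1.0, n_grid) * np.abs(w).max():
        scale = clip / qmax
        w_hat = np.clip(np.round(w / scale), qmin, qmax) * scale
        mse = np.mean((w - w_hat) ** 2)
        if mse < best_mse:
            best_mse, best_w_hat = mse, w_hat
    return best_w_hat

def prune_magnitude(w, sparsity):
    # zero out the fraction `sparsity` of weights with the smallest magnitude
    t = np.quantile(np.abs(w), sparsity)
    return np.where(np.abs(w) >= t, w, 0.0)

print(f"INT4 quantization SNR: {snr_db(w, quantize_symmetric(w, bits=4)):.1f} dB")
print(f"75% pruning SNR:       {snr_db(w, prune_magnitude(w, 0.75)):.1f} dB")
```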
Distributions with heavy tails. The trade-off is expected to change when more significant outliers are introduced. The quantization grid is expected to be affected strongly by outliers, as they increase the size of the quantization grid, whereas the pruning method is expected to be hardly affected by outliers, as it only acts on weights around zero. We thus analyze both quantization and pruning errors in the presence of many outliers. To simulate a distribution with outliers, we use a truncated Student's t-distribution with ν = 2 and a symmetric range (−r, r) (the PDF is defined in appendix B). This distribution is convenient as it gives non-trivial weight to the tail ends of the distribution close to r. The wider the range r is, the heavier the tails of the distribution are. In order to introduce a quantitative measure of the number of outliers, we use the distribution's kurtosis, given by

$$\mathrm{Kurt}[X] = \frac{\mathbb{E}\left[(X - \mu)^4\right]}{\left(\mathbb{E}\left[(X - \mu)^2\right]\right)^2},$$

where μ is the mean. We will see later that this kurtosis measure is predictive of quantization and pruning performance for real layers. To increase the number of outliers, we increase the range r. The results are given in figure 2. The kurtosis range is chosen so that it includes most of the weights from the model zoo. We see that despite the significant outliers and high kurtosis, quantization still has a higher SNR in most of the cases for moderate compression. Pruning is better, however, in the region of a high clipping range and very high compression rates, e.g., 2-3 bits per value (see figure 2, right).

Figure 2: Comparing the error of pruning and quantization for a Student-t distribution, simulating the presence of significant outliers. We plot the results for different magnitudes of the outliers, as per the kurtosis on the x-axis. (left) The pruning error, which does not change under the presence of more severe outliers. (middle) The quantization SNR, which is reduced greatly when outliers increase. (right) The trade-off regions where quantization and pruning are better.

### 3.4 Experiments on real weight tensors

The previous discussion was mostly theoretical. We set out to see what happens when we do a similar analysis on real neural network weights. In order to investigate this, we compare the pruning and quantization SNR on the weight tensors of all the pre-trained models from the PyTorch model zoo² (46 models in total; the details are given in appendix E), combined with weight tensors from 3 large language models, namely Bloom-3b [3], Llama-3b [20], and OPT-2.7b [67]. Each tensor is quantized using an integer grid with bit-widths from 2 to 8. The results are shown in figure 3 (left). We see a similar trend to our previous discussion: pruning becomes more beneficial for lower bit-widths/higher sparsity ratios.

²https://pytorch.org/serve/model_zoo.html

Figure 3: Comparison on all the weights from the PyTorch model zoo (46 models) combined with 3 large language models (Bloom-3b, Llama-3b, OPT-2.7b). (left) Pruning SNR versus quantization SNR for every tensor. (right) Pruning is preferable at high compression ratios for tensors with high sample kurtosis values.

In order to match the analytical results from figure 2, we consider the sample kurtosis of every weight tensor, given by

$$k = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^4}{\left(\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2\right)^2}.$$

See figure 3 (right). We consider a range of kurtosis values for every quantization bit-width. Using a kernel density estimator, we compute the probability density of encountering a tensor for which pruning has a higher SNR than quantization. We compare this PDF to that for quantization and thus determine the region where each method is preferable. The results are given in figure 3 on the right. We see that the results from the previous theoretical section (figure 2 on the right) hold very nicely. We can also see that, as predicted, the kurtosis is indeed a good metric for predicting whether a tensor should be quantized or pruned for optimal accuracy.

## 4 Per-layer comparison

Most PTQ methods compress the model layer by layer. Given one layer, we use the mean-squared error of the output activations as an objective for optimization. As [48] shows, minimizing the per-layer MSE on the output activations of each layer is a computationally affordable second-order approximation of the loss function.
The local MSE objective correlates well with the task loss and is often used in practice in the DNN compression and quantization literature [32, 40, 68]. Our experiments in appendix D confirm this. For the experiments in this section, we will use SNR, as it represents a normalized version of the MSE. As opposed to section 3, where we used SNR on weights, in this section we will use SNR on the output activations instead.

The goal of a PTQ method is to minimize the error in the output activations of the compressed layer by optimizing over the quantized weights subject to integer range constraints. Similarly, for pruning, the weights are optimized subject to a sparsity constraint. As the underlying combinatorial optimization problem for both methods is NP-hard [56, 14], in practice each method relies on some form of heuristic that provides a reasonably good solution given a realistic compute budget. This means that any practical comparison between pruning and quantization would depend on the choice of method for both and would be open to debate about the optimality of the algorithms. In order to eliminate this dependence, we provide a tight lower bound on the output errors for quantization. For pruning, we provide a way to solve the problem exactly for moderate dimensionalities. This way, we can provide a comparison that holds regardless of the algorithm used for each method.

### 4.1 Post-training quantization

We set out to formulate a way by which we can get relatively tight bounds for the comparison when quantizing a single layer with the MSE as the objective. The upper bound is simple to obtain from the solution of a heuristic quantization algorithm, but for the lower bound, we have to reformulate the problem. The mean-squared error of the output activations of a quantized layer can be expressed as:

$$\min_{w} E(w) = \|X \delta w - X w_{\mathrm{orig}}\|_2^2 \quad \text{s.t. } w \in \mathbb{Z}^n,\ w_{\min} \le w_i \le w_{\max}, \quad (4)$$

where X is the input data in an unfolded form and $w_{\mathrm{orig}}$ are the floating-point weights. The quantized weights are computed as the product of the quantization scale δ and the integer weights w; $w_{\min}$ and $w_{\max}$ are the integer limits. We ignore the averaging operation to simplify the notation, as it is not important for the optimization. We also note that this problem can be solved independently for each output channel of a convolution or each row of a fully-connected layer weight. This problem is an instance of a mixed-integer quadratic program:

$$\min_{w} \tfrac{1}{2} w^T P w - q^T w \quad \text{s.t. } w \in \mathbb{Z}^n,\ w_{\min} \le w_i \le w_{\max}, \quad (5)$$

where $P = 2\delta^2 X^T X$ and $q = 2\delta\, X^T X w_{\mathrm{orig}}$. To simplify the objective, we omit the constant term $c = \|X w_{\mathrm{orig}}\|_2^2$, which is irrelevant for the optimization, i.e., $\tilde{E}(w) = E(w) - c$. In order to find the lower bound of the objective, we follow [55] and relax the integer constraint to $w_i(w_i - 1) \ge 0$, which is satisfied by every integer and only excludes values strictly between 0 and 1. In order to obtain the lower bound, we consider the dual of the relaxed problem,

$$\mathcal{L}(\lambda) = \max \gamma, \quad (6)$$

subject to a semidefinite (linear matrix inequality) constraint coupling $P$, $\mathrm{diag}(\lambda)$, $q$, and $\gamma$ (we refer to [55] for its explicit form), where $\lambda \in \mathbb{R}^n$ and $\gamma \in \mathbb{R}$. The dual problem is convex, and its solution can be used as a lower bound on the solution of the original problem, i.e., $\tilde{E}(w) \ge \mathcal{L}(\lambda)$. The semidefinite constraint can be handled by a semidefinite programming (SDP) solver with $O(n^3)$ complexity; in our work, we used the CVX solver [21]. As discussed in [55], this bound is a computationally efficient alternative to branch-and-bound approaches, while its tightness is better than that of the alternative methods introduced in [5].
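Before moving on, here is a quick numerical sanity check of the quadratic form above. It is a sketch of our own (random data and an arbitrary INT4 scale, chosen only for illustration), not the paper's code: it builds P, q, and c from the definitions used in Eq. (5) and verifies that they reproduce the output MSE of a feasible round-to-nearest solution.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((512, 36))     # unfolded input data (samples x features)
w_orig = rng.standard_normal(36)       # floating-point weights of one output channel
qmin, qmax = -8, 7                     # INT4 integer limits
delta = np.abs(w_orig).max() / qmax    # an arbitrary example scale

# quadratic-form ingredients from Eq. (5)
P = 2 * delta ** 2 * X.T @ X
q = 2 * delta * X.T @ X @ w_orig
c = np.sum((X @ w_orig) ** 2)

def quadratic_objective(w_int):
    # 0.5 w^T P w - q^T w + c  ==  ||X delta w - X w_orig||_2^2
    return 0.5 * w_int @ P @ w_int - q @ w_int + c

# round-to-nearest gives a feasible (upper-bound) integer solution
w_rtn = np.clip(np.round(w_orig / delta), qmin, qmax)
direct = np.sum((X @ (delta * w_rtn) - X @ w_orig) ** 2)
assert np.isclose(quadratic_objective(w_rtn), direct)
```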
We use this SDP-based dual bound to estimate the lower bound on the output-activation MSE for PTQ below.

### 4.2 Post-training pruning

We also need a similar lower bound for pruning for the comparison. We are not aware of a way to provide a tight lower bound for this problem; therefore, we formulate a way to solve the problem exactly for moderate dimensionalities. Similar to quantization, post-training pruning of one layer of the network can mathematically be expressed as solving the following optimization problem:

$$E^* = \min_{\hat{w}} \|X \hat{w} - X w_{\mathrm{orig}}\|_2^2 \quad \text{s.t. } \|\hat{w}\|_0 \le s, \quad (7)$$

where the number of non-zero elements s in the solution is constrained using the L0 norm, which is non-convex and not smooth. In order to solve the problem, we introduce the sparsity mask $m$:

$$E^* = \min_{\hat{w}, m} \|X (m \odot \hat{w}) - X w_{\mathrm{orig}}\|_2^2 \quad \text{s.t. } \|m\|_1 = s,\ -m\, l \le \hat{w} \le m\, u,\ l, u > 0,\ m_i \in \{0, 1\}, \quad (8)$$

where ⊙ is the element-wise product and $l, u \in \mathbb{R}$ are constants chosen such that any solution satisfies the constraint $-m\, l \le \hat{w} \le m\, u$. We solve this problem using the branch-and-bound method implemented in the Gurobi solver [23], which gives the global solution.

### 4.3 Experiments

With these algorithms in hand, we can now compare quantization versus pruning in the post-training setting with theoretical bounds. In each case, we analyze individual layers of several networks. Given a batch of input data, we optimize the pruned or quantized weights to minimize the error between the output activations and the output of the uncompressed layer. We provide a range between two SNR values for each method in each case. The performance of the heuristic method gives the first value, and the second value is given by the error lower bound or the global solution, which translates into an SNR upper bound.

As a heuristic method for pruning, we use magnitude pruning with a fixed sparsity mask m and data-optimized weights given by $\hat{w} = \arg\min_{w} \|X(m \odot w) - X w_{\mathrm{orig}}\|_2^2$. This is a convex problem and has a unique solution. As a heuristic method for quantization, we use the mixed-integer solver introduced in [55]. We clip every sample in order to satisfy the integer quantization range constraint.

We chose a representative set of 10 layers, including 9 convolutional layers (one 3x3 convolutional layer and 8 point-wise convolutions) from MobileNet-V2, EfficientNet-lite, and ResNet-18, and one fully-connected layer from ViT. The full details for reproducing the experiments are given in appendix F. Due to the high computational complexity of the global solution for pruning, the layers had to be split into chunks. A slice of 4 input channels over all output channels was used for the 3x3 convolutions. In the case of linear layers and point-wise convolutions, slices of 36 input features over all the output features were used.

Figure 4: Comparison in the post-training scenario. Each box corresponds to a subset of one of 10 layers from the 4 different models that were used, with 7 different bit-width comparison points. The ranges of the boxes indicate the lower and upper bounds found by the algorithms.

The results are shown in figure 4, grouped by bit-width. The rectangles indicate the full range of the pruning and quantization methods between the heuristic solution and the error lower bound or the global solution. Whenever a rectangle for a chunk intersects the diagonal line, the ranking of the two methods could depend on the optimization method, while in cases below or above the diagonal, the ranking is guaranteed regardless of the optimizer.
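As a side note, the pruning heuristic described above (a fixed magnitude mask with the surviving weights re-fitted on the layer output) reduces to a small least-squares problem. The toy NumPy sketch below is our own illustration of that convex re-fit step, with made-up layer sizes and random data; it is not the experimental code.

```python
import numpy as np

def magnitude_prune_refit(X, w_orig, sparsity):
    # Fix the sparsity mask by weight magnitude, then re-fit the surviving
    # weights by least squares on the layer output, i.e. solve the convex
    # sub-problem argmin_w ||X (m * w) - X w_orig||_2^2 for a fixed mask m.
    n = w_orig.shape[0]
    n_keep = max(1, int(round(n * (1.0 - sparsity))))
    keep = np.argsort(np.abs(w_orig))[-n_keep:]       # indices of the largest |w|
    w_keep, *_ = np.linalg.lstsq(X[:, keep], X @ w_orig, rcond=None)
    w = np.zeros_like(w_orig)
    w[keep] = w_keep
    return w

def output_snr_db(X, w_orig, w):
    y, y_hat = X @ w_orig, X @ w
    return 10 * np.log10(np.sum(y ** 2) / np.sum((y - y_hat) ** 2))

# toy example: a 36-feature slice with 512 calibration samples, 75% sparsity
rng = np.random.default_rng(0)
X = rng.standard_normal((512, 36))
w_orig = rng.standard_normal(36)
w_pruned = magnitude_prune_refit(X, w_orig, sparsity=0.75)
print(f"output SNR of the pruning heuristic: {output_snr_db(X, w_orig, w_pruned):.1f} dB")
```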
We see that quantization mostly outperforms pruning for moderate compression, while the methods become more comparable at higher compression ratios.

## 5 Full-model comparison

Now that we have seen the comparison between the methods in the PTQ setting, we turn to fine-tuning quantized and pruned models. This is the setting in which pruning is most often applied, and it is possible that fine-tuning changes the models significantly enough that the relative performance of the two methods changes. In order to provide a fair comparison of pruning and quantization, we chose the two most commonly used methods, with performance competitive with the state of the art. For quantization-aware training, we used the widely adopted LSQ method suggested in [12, 2]. Following this approach, we jointly learn the weights and quantization scales, keep the batch-norm layers unfolded, and re-estimate the batch-norm statistics after training to avoid wrong running estimates due to oscillations [51]. We use the method suggested in [70] for pruning, which gradually increases the sparsity during fine-tuning and re-estimates the batch-norm statistics after training.

In our experiments, we used a set of 9 models trained on 4 tasks: ResNet-18 and ResNet-50 [27], MobileNet-V2 [58], MobileNet-V3-small [30], EfficientNet-lite [60], and ViT [11] trained on ImageNet classification [57]; DeepLab-V3 [7] with a MobileNet-V2 backbone trained for semantic segmentation on Pascal VOC [13]; EfficientDet [61] trained for object detection on MS COCO [43]; and OPT-350m fine-tuned on WikiText-103. For a fair comparison, we used the same number of epochs of fine-tuning for each method (full details on the hyperparameters are given in appendix G).

The results given in Table 1 suggest that pruning almost never leads to higher accuracy than quantization if an equal compression rate is considered. The differences are sufficiently large that the small purported improvements of some methods [59] will likely not close the gap.

| Model | Orig. | Metric | Method | 8b | 7b | 6b | 5b | 4b | 3b | 2b |
|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-18 | 69.7 | acc. | quant. | 70.5 | 70.5 | 70.6 | 70.3 | 70.0 | 68.9 | 67.3 |
| | | | pruning | 70.3 | 70.1 | 69.9 | 69.5 | 69.3 | 68.3 | 66.8 |
| ResNet-50 | 76.1 | acc. | quant. | 76.4 | 76.4 | 76.4 | 76.3 | 76.2 | 75.5 | 72.3 |
| | | | pruning | 76.6 | 76.4 | 76.2 | 76.1 | 75.9 | 75.4 | 74.3 |
| MobileNet-V2 | 71.7 | acc. | quant. | 71.9 | 72.0 | 71.7 | 71.6 | 70.9 | 68.6 | 59.1 |
| | | | pruning | 68.1 | 65.6 | 61.9 | 56.3 | 48.0 | 34.0 | 21.2 |
| EfficientNet | 75.4 | acc. | quant. | 75.2 | 75.3 | 75.0 | 74.6 | 74.0 | 71.5 | 60.9 |
| | | | pruning | 72.5 | 70.9 | 68.1 | 63.6 | 56.4 | 44.5 | 27.1 |
| MobileNet-V3 | 67.4 | acc. | quant. | 67.7 | 67.6 | 67.1 | 66.3 | 64.7 | 60.8 | 50.5 |
| | | | pruning | 65.6 | 64.4 | 62.4 | 60.2 | 56.1 | 31.7 | 0.0 |
| ViT | 81.3 | acc. | quant. | 81.5 | 81.4 | 81.4 | 81.0 | 80.4 | 78.4 | 72.2 |
| | | | pruning | 76.6 | 76.6 | 76.2 | 73.1 | 72.4 | 71.5 | 69.4 |
| DeepLab-V3 | 72.9 | mIoU | quant. | 72.3 | 72.3 | 72.4 | 71.9 | 70.8 | 63.2 | 17.6 |
| | | | pruning | 65.2 | 62.8 | 56.8 | 47.7 | 32.9 | 18.6 | 10.0 |
| EfficientDet | 40.2 | mAP | quant. | 39.6 | 39.6 | 39.6 | 39.2 | 37.8 | 33.5 | 15.5 |
| | | | pruning | 34.5 | 33.0 | 30.9 | 27.9 | 24.2 | 17.9 | 8.0 |
| OPT-350m | 14.8 | perpl. | quant. | 14.8 | 14.8 | 14.9 | 15.0 | 15.3 | 15.9 | 19.9 |
| | | | pruning | 18.0 | 19.7 | 22.6 | 27.2 | 35.4 | 53.5 | 101.4 |

Table 1: Comparison of QAT and magnitude pruning with fine-tuning, given equal model size and an equal number of epochs of fine-tuning.

To study the effect of training time, we also performed an ablation with 2 times longer fine-tuning on a subset of 3 models (ResNet-50, EfficientNet, and ViT). The results are given in appendix H. We observe that pruned models generally benefit more from fine-tuning, and in particular pruning becomes more beneficial for most compression ratios on ResNet-50.
However, for the other models, quantization is still more beneficial due to a larger gap in performance.

Combining pruning and quantization. Another interesting question is whether pruning is beneficial in combination with quantization. To answer it, we performed an experiment on pruning quantized ResNet-18, MobileNet-V2, and ViT with different pruning ratios. The results are given in figure 5. On the x-axis, we plot the expected bit-width, which is the product of the base bit-width and the density of non-zero weights in the pruned model, including the natural sparsity. The points marked by crosses are quantized models with only natural sparsity and no extra pruning applied. As we can see, mild degrees of pruning are beneficial in the combination. However, we note that no extra overhead was assumed for storing the pruning mask.

Figure 5: Combining pruning and quantization on ImageNet models. The average bit-width shown on the x-axis is computed as the product of the base bit-width and the density of non-zero weight elements. Different pruning ratios are applied to each base bit-width model. Quantized models with only natural sparsity and no extra pruning are marked with crosses.

## 6 Discussion

Other types of pruning. While our comparison focused solely on unstructured pruning, in which individual weights are removed, our results translate to semi-structured and structured pruning. Unstructured pruning has more degrees of freedom and is a strict superset of what can be represented by (semi-)structured pruning. Therefore, unstructured pruning gives an upper bound on the accuracy of all pruning methods. This means that for the cases in which quantization is better than unstructured pruning, quantization will also be better than (semi-)structured pruning. However, we cannot make any claims about (semi-)structured pruning for the few scenarios in which pruning is better than quantization.

Natural sparsity in quantized tensors. In our comparison, we used a theoretical compression ratio for quantization, which depends on the bit-width. However, we also observe that quantized tensors naturally contain many zeros; for example, 8-bit tensors from the PyTorch model zoo have an average sparsity of 13%, while 4-bit tensors are 35% sparse. We give more details on this in appendix C.

Representations learned in the compressed models. To provide insight into the representations learned during pruning or QAT, we studied the evolution of the models during fine-tuning. We found that fine-tuning after pruning tends to recover the original representation, while quantization-aware training leads to learning completely new representations. We provide further details on these experiments in appendix I.

Hardware implications. So far, we have deliberately avoided discussing the hardware implementations of pruning and quantization and focused solely on the accuracy of both methods at the same ideal compression rates. However, in practice, the hardware considerations do matter for the usability of the methods. The analysis above assumed an idealistic case for pruning in terms of memory size and data transfer. Since the pruning is unstructured, in order to achieve memory savings in practice one would need at least 1 bit of information for each weight indicating whether it is pruned or not. On top of 16-bit weights, this gives a 6.25% storage overhead at a minimum. Quantization does not have this overhead, as an INT8 weight simply takes 8 bits instead of 16, and the only storage overhead is a single scaling factor per tensor (or channel).
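A quick way to see the size of this overhead is to count bits per original weight position under the idealized scheme discussed above (a dense 1-bit mask plus the surviving FP16 weights, versus an integer weight plus an amortized per-tensor scale). The sketch below is just that arithmetic, our own illustration rather than a model of any particular hardware format.

```python
# Back-of-the-envelope bits per (original) weight position for the idealized
# schemes above: unstructured pruning with a dense 1-bit mask over FP16
# weights, versus symmetric integer quantization with a per-tensor scale.
def pruned_bits_per_weight(sparsity, weight_bits=16, mask_bits=1):
    # surviving weights stored at weight_bits each, plus one mask bit per position
    return (1.0 - sparsity) * weight_bits + mask_bits

def quantized_bits_per_weight(bit_width):
    # the per-tensor (or per-channel) scale amortizes to ~0 bits for large tensors
    return float(bit_width)

for sparsity, bits in [(0.50, 8), (0.75, 4), (0.875, 2)]:
    print(f"{sparsity:.1%} pruning: {pruned_bits_per_weight(sparsity):.1f} bits/weight | "
          f"INT{bits}: {quantized_bits_per_weight(bits):.1f} bits/weight")
```

Under this accounting, 75% unstructured pruning of FP16 weights costs about 5 bits per weight position versus 4 bits for INT4, so the mask alone eats into the nominal compression ratio used in the comparisons above.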
Also, in terms of the cost of the computations done by the hardware, there is a difference between the two methods. For pruning, any hardware would have to take the compactly stored weights and the mask and either decompress them to the dense format, with all weights and many zeros, or take the pruning into account in the compute itself. No compute benefits are gained in the former case, as the dense calculations are done in the uncompressed number format. In the latter case, dedicated hardware that takes the zeros into account is necessary. The overhead for this is generally non-trivial, leading vendors to implement semi-structured pruning schemes instead [47]. Similarly, it is rare to see unstructured activation compression, for the same reason that this needs to happen algorithmically on-the-fly. In contrast, quantization gives quadratic improvements in the compute: going from INT8 to INT4 theoretically improves the compute performance by a factor of 4, although practical gains depend on the memory overhead (which improves by only a factor of 2) and the existence of other formats in the same hardware compute unit.

Impact. Using pruning or quantization leads to power reductions on many architectures and enables new applications on mobile platforms. We see only a positive impact from this on the whole. In some cases, both pruning and quantization might lead to biased predictions; a further discussion can be found in [29].

Limitations. First, our work has not extensively considered the hardware implications of pruning or quantization. Second, we do not study combinations of pruning and quantization in depth, apart from the experiment in section 5 and the analysis of the natural sparsity in quantized tensors. We leave this for future work. Finally, we consider only uniform quantization and ignore other formats, such as low-precision floating-point or logarithmic quantization, although these are not likely to change the results presented in this paper.

## 7 Related work

Quantization. Integer quantization, or fixed-point quantization, is one of the most widely used techniques for efficient inference, reducing latency and improving energy efficiency. There are two main families of methods for model quantization. The first family includes post-training quantization (PTQ) methods [42, 52, 10, 1, 9, 6, 48, 40], which improve the model accuracy based on per-layer optimization of the quantized weights in a data-optimized fashion. The second family includes quantization-aware training methods [22, 34, 69, 8, 44, 12, 35, 2, 63, 51], which usually fine-tune the model with quantization in the loop, using the straight-through estimator (STE) to compute the gradient of the rounding operations. A more comprehensive overview of quantization methods can be found in [50].

Pruning. Neural network pruning is one of the oldest methods to compress neural networks [37, 26]. A central problem in pruning is how to choose which weights to prune. Approaches published in the literature include: binary gating, in which a binary gate is learned on each individual weight [45, 46, 64]; sensitivity-based methods [39, 38, 66, 17, 18], in which a sensitivity based on a weight's gradient or Hessian diagonal value is used; and magnitude pruning [24, 54, 70, 47, 59]. While conceptually simple, magnitude-based methods have been shown to consistently outperform more intricate methods at scale [19, 4]. Weight re-initialization schemes [15, 16] or mask re-initialization [59] yield additional minor improvements.
While most pruning approaches require fine-tuning and yield unsatisfactory results in post-training scenarios, recent adaptations of Hessian-based sensitivity approaches [37, 26], in which the Hessian of a layer-wise reconstruction loss is used instead of the task-loss Hessian, show good results for post-training pruning of large language models [17, 18].

Combining pruning and quantization. A number of works study combinations of pruning and quantization at different levels of granularity [24, 64, 31, 65, 62].

Comparing pruning and quantization. Despite the large amount of work on pruning, quantization, and combining them, there is little literature comparing the two methods. To the best of our knowledge, the closest work is [33], which performs an empirical comparison of pruning versus non-uniform quantization; it considers only small-scale models and provides no further analysis. Another related study is [53].

## 8 Conclusion

We have seen in this paper that, across several settings, unstructured pruning performs better than quantization only in rare cases. In our theoretical analysis of distributions and on real layer data, pruning is only better than quantization when compressing the network to the equivalent of 2 or 3 bits. This amount of compression comes with such a large drop in performance that it is rarely used in practice. The post-training quantization results are also informative: in the setting without fine-tuning, we have shown with theoretical bounds on many layers in neural networks that quantization is almost always provably better than pruning. Our hypothesis is that quantized layers are more accurate than pruned ones, as shown in the theoretical and PTQ settings, and that the accuracy of a fine-tuned network still depends strongly on this. This is in line with the fine-tuning results, in which, for many networks trained under the same conditions, quantization always has higher performance than pruning.

The conclusion is clear: quantization generally outperforms pruning for neural networks. Taking into account the unfavorable hardware implications of pruning described above, it could be argued that the conclusion holds even more strongly. Based on this research, we recommend quantizing neural networks when efficiency is required, before pruning is explored.

## 9 Acknowledgements

We would like to thank Marios Fournarakis and Yelisei Bondarenko for their help with performing the QAT experiments.

## References

[1] Ron Banner, Yury Nahshan, and Daniel Soudry. Post training 4-bit quantization of convolutional networks for rapid-deployment. In Advances in Neural Information Processing Systems, 2019.
[2] Yash Bhalgat, Jinwon Lee, Markus Nagel, Tijmen Blankevoort, and Nojun Kwak. LSQ+: Improving low-bit quantization through learnable offsets and better initialization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2020.
[3] BigScience Workshop. BLOOM (revision 4ab0472), 2022.
[4] Davis Blalock, Jose Javier Gonzalez Ortiz, Jonathan Frankle, and John Guttag. What is the state of neural network pruning? Proceedings of Machine Learning and Systems, 2:129-146, 2020.
[5] Christoph Buchheim, Ruth Huebner, and Anita Schoebel. Ellipsoid bounds for convex quadratic integer programming. SIAM Journal on Optimization, 25(2):741-769, 2015.
[6] Yaohui Cai, Zhewei Yao, Zhen Dong, Amir Gholami, Michael W. Mahoney, and Kurt Keutzer. ZeroQ: A novel zero shot quantization framework. arXiv preprint arXiv:2001.00281, 2020.
[7] Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation, 2017. [8] Jungwook Choi, Zhuo Wang, Swagath Venkataramani, Pierce I-Jen Chuang, Vijayalakshmi Srinivasan, and Kailash Gopalakrishnan. PACT: parameterized clipping activation for quantized neural networks. ar Xiv preprint arxiv:805.06085, 2018. [9] Yoni Choukroun, Eli Kravchik, and Pavel Kisilev. Low-bit quantization of neural networks for efficient inference. International Conference on Computer Vision (ICCV), 2019. [10] Zhen Dong, Zhewei Yao, Amir Gholami, Michael W Mahoney, and Kurt Keutzer. Hawq: Hessian aware quantization of neural networks with mixed-precision. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 293 302, 2019. [11] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale, 2020. [12] Steven K. Esser, Jeffrey L. Mc Kinstry, Deepika Bablani, Rathinakumar Appuswamy, and Dharmendra S. Modha. Learned step size quantization. In International Conference on Learning Representations (ICLR), 2020. [13] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88(2):303 338, June 2010. [14] Simon Foucart and Holger Rauhut. A Mathematical Introduction to Compressive Sensing. Applied and Numerical Harmonic Analysis. Springer New York, New York, NY, 2013. [15] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. ar Xiv preprint ar Xiv:1803.03635, 2018. [16] Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M Roy, and Michael Carbin. Stabilizing the lottery ticket hypothesis. ar Xiv preprint ar Xiv:1903.01611, 2019. [17] Elias Frantar and Dan Alistarh. Optimal brain compression: A framework for accurate posttraining quantization and pruning. ar Xiv preprint ar Xiv:2208.11580, 2022. [18] Elias Frantar and Dan Alistarh. Sparsegpt: Massive language models can be accurately pruned in one-shot, 2023. [19] Trevor Gale, Erich Elsen, and Sara Hooker. The state of sparsity in deep neural networks. ar Xiv preprint ar Xiv:1902.09574, 2019. [20] Xinyang Geng and Hao Liu. Openllama: An open reproduction of llama, May 2023. [21] Michael Grant and Stephen Boyd. CVX: Matlab software for disciplined convex programming, version 2.1. http://cvxr.com/cvx, March 2014. [22] Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep learning with limited numerical precision. In International Conference on Machine Learning, (ICML), 2015. [23] Gurobi Optimization, LLC. Gurobi Optimizer Reference Manual, 2023. [24] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. ar Xiv preprint ar Xiv:1510.00149, 2015. [25] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. Advances in neural information processing systems, 28, 2015. [26] Babak Hassibi, David G Stork, and Gregory J Wolff. Optimal brain surgeon and general network pruning. In IEEE international conference on neural networks, pages 293 299. IEEE, 1993. [27] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 
Deep residual learning for image recognition. In Conference on Computer Vision and Pattern Recognition, 2016. [28] Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE international conference on computer vision, pages 1389 1397, 2017. [29] Sara Hooker, Nyalleng Moorosi, Gregory Clark, Samy Bengio, and Emily Denton. Characterising bias in compressed models. ar Xiv preprint ar Xiv:2010.03058, 2020. [30] Andrew Howard, Ruoming Pang, Hartwig Adam, Quoc Le, Mark Sandler, Bo Chen, Weijun Wang, Liang-Chieh Chen, Mingxing Tan, Grace Chu, Vijay Vasudevan, and Yukun Zhu. Searching for mobilenetv3. In International Conference on Computer Vision (ICCV), 2019. [31] Peng Hu, Xi Peng, Hongyuan Zhu, Mohamed M Sabry Aly, and Jie Lin. Opq: Compressing deep neural networks with one-shot pruning-quantization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 7780 7788, 2021. [32] Itay Hubara, Yury Nahshan, Yair Hanani, Ron Banner, and Daniel Soudry. Accurate post training quantization with small calibration sets. In International Conference on Machine Learning, pages 4466 4475. PMLR, 2021. [33] Yerlan Idelbayev and Miguel Á Carreira-Perpiñán. An empirical comparison of quantization, pruning and low-rank neural network compression using the lc toolkit. In 2021 International Joint Conference on Neural Networks (IJCNN), pages 1 8. IEEE, 2021. [34] Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference. Conference on Computer Vision and Pattern Recognition (CVPR), 2018. [35] Sambhav R. Jain, Albert Gural, Michael Wu, and Chris Dick. Trained uniform quantization for accurate and efficient neural network inference on fixed-point hardware. arxiv preprint arxiv:1903.08066, 2019. [36] Andrey Kuzmin, Mart Van Baalen, Yuwei Ren, Markus Nagel, Jorn Peters, and Tijmen Blankevoort. Fp8 quantization: The power of the exponent. ar Xiv preprint ar Xiv:2208.09225, 2022. [37] Yann Le Cun, John Denker, and Sara Solla. Optimal brain damage. Advances in neural information processing systems, 2, 1989. [38] Namhoon Lee, Thalaiyasingam Ajanthan, Stephen Gould, and Philip HS Torr. A signal propagation perspective for pruning neural networks at initialization. ar Xiv preprint ar Xiv:1906.06307, 2019. [39] Namhoon Lee, Thalaiyasingam Ajanthan, and Philip HS Torr. Snip: Single-shot network pruning based on connection sensitivity. ar Xiv preprint ar Xiv:1810.02340, 2018. [40] Yuhang Li, Ruihao Gong, Xu Tan, Yang Yang, Peng Hu, Qi Zhang, Fengwei Yu, Wei Wang, and Shi Gu. Brecq: Pushing the limit of post-training quantization by block reconstruction. In International Conference on Learning Representations (ICLR), 2021. [41] Darryl Lin, Sachin Talathi, and Sreekanth Annapureddy. Fixed point quantization of deep convolutional networks. In International conference on machine learning, pages 2849 2858. PMLR, 2016. [42] Darryl D. Lin, Sachin S. Talathi, and V. Sreekanth Annapureddy. Fixed point quantization of deep convolutional networks. In International Conference on Machine Learning, 2016. [43] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft coco: Common objects in context. 
In David Fleet, Tomas Pajdla, Bernt Schiele, and Tinne Tuytelaars, editors, Computer Vision ECCV 2014, pages 740 755, Cham, 2014. Springer International Publishing. [44] Christos Louizos, Matthias Reisser, Tijmen Blankevoort, Efstratios Gavves, and Max Welling. Relaxed quantization for discretized neural networks. In International Conference on Learning Representations (ICLR), 2019. [45] Christos Louizos, Max Welling, and Diederik P Kingma. Learning sparse neural networks through l_0 regularization. ar Xiv preprint ar Xiv:1712.01312, 2017. [46] Christos Louizos, Max Welling, and Diederik P Kingma. Learning sparse neural networks through l0 regularization. International Conference on Learning Representations (ICLR), 2018. [47] Asit Mishra, Jorge Albericio Latorre, Jeff Pool, Darko Stosic, Dusan Stosic, Ganesh Venkatesh, Chong Yu, and Paulius Micikevicius. Accelerating sparse deep neural networks. ar Xiv preprint ar Xiv:2104.08378, 2022. [48] Markus Nagel, Rana Ali Amjad, Mart Van Baalen, Christos Louizos, and Tijmen Blankevoort. Up or down? Adaptive rounding for post-training quantization. In International Conference on Machine Learning (ICML), 2020. [49] Markus Nagel, Marios Fournarakis, Rana Ali Amjad, Yelysei Bondarenko, Mart van Baalen, and Tijmen Blankevoort. A white paper on neural network quantization. Ar Xiv, abs/2106.08295, 2021. [50] Markus Nagel, Marios Fournarakis, Rana Ali Amjad, Yelysei Bondarenko, Mart van Baalen, and Tijmen Blankevoort. A white paper on neural network quantization. Ar Xiv, abs/2106.08295, 2021. [51] Markus Nagel, Marios Fournarakis, Yelysei Bondarenko, and Tijmen Blankevoort. Overcoming oscillations in quantization-aware training. In International Conference on Machine Learning, pages 16318 16330. PMLR, 2022. [52] Markus Nagel, Mart van Baalen, Tijmen Blankevoort, and Max Welling. Data-free quantization through weight equalization and bias correction. In International Conference on Computer Vision (ICCV), 2019. [53] Satya Sai Srinath Namburi, Makesh Sreedhar, Srinath Srinivasan, and Frederic Sala. The cost of compression: Investigating the impact of compression on parametric knowledge in language models. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 5255 5273, 2023. [54] Sharan Narang, Erich Elsen, Gregory Diamos, and Shubho Sengupta. Exploring sparsity in recurrent neural networks. ar Xiv preprint ar Xiv:1704.05119, 2017. [55] Jaehyun Park and Stephen Boyd. A semidefinite programming method for integer convex quadratic minimization. Optimization Letters, 12:499 518, 2018. [56] Alberto Del Pia, Santanu S Dey, and Marco Molinaro. Mixed-integer quadratic programming is in np. Mathematical Programming, 162:225 240, 2017. [57] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. Image Net Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 2015. [58] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Conference on Computer Vision and Pattern Recognition (CVPR), 2018. [59] Suraj Srinivas, Andrey Kuzmin, Markus Nagel, Mart van Baalen, Andrii Skliar, and Tijmen Blankevoort. Cyclical pruning for sparse neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2762 2771, 2022. [60] Mingxing Tan and Quoc Le. 
Efficient Net: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning (ICML), 2019. [61] Mingxing Tan, Ruoming Pang, and Quoc V. Le. Efficientdet: Scalable and efficient object detection, 2020. [62] Frederick Tung and Greg Mori. Deep neural network compression by in-parallel pruningquantization. IEEE transactions on pattern analysis and machine intelligence, 42(3):568 579, 2018. [63] Stefan Uhlich, Lukas Mauch, Fabien Cardinaux, Kazuki Yoshiyama, Javier Alonso Garcia, Stephen Tiedemann, Thomas Kemp, and Akira Nakamura. Mixed precision dnns: All you need is a good parametrization. In International Conference on Learning Representations (ICLR), 2020. [64] Mart van Baalen, Christos Louizos, Markus Nagel, Rana Ali Amjad, Ying Wang, Tijmen Blankevoort, and Max Welling. Bayesian bits: Unifying quantization and pruning. ar Xiv preprint ar Xiv:2005.07093, 2020. [65] Haichuan Yang, Shupeng Gui, Yuhao Zhu, and Ji Liu. Automatic neural network compression by sparsity-quantization joint learning: A constrained optimization-based approach. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2178 2188, 2020. [66] Ruichi Yu, Ang Li, Chun-Fu Chen, Jui-Hsin Lai, Vlad I Morariu, Xintong Han, Mingfei Gao, Ching-Yung Lin, and Larry S Davis. Nisp: Pruning networks using neuron importance score propagation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 9194 9203, 2018. [67] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer language models. ar Xiv preprint ar Xiv:2205.01068, 2022. [68] Xiangyu Zhang, Jianhua Zou, Kaiming He, and Jian Sun. Accelerating very deep convolutional networks for classification and detection. IEEE transactions on pattern analysis and machine intelligence, 38(10):1943 1955, 2015. [69] Shuchang Zhou, Zekun Ni, Xinyu Zhou, He Wen, Yuxin Wu, and Yuheng Zou. Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. ar Xiv preprint ar Xiv:1606.06160, 2016. [70] Michael Zhu and Suyog Gupta. To prune, or not to prune: exploring the efficacy of pruning for model compression. ar Xiv preprint ar Xiv:1710.01878, 2017.