# Automatic Mixed-Precision Quantization Search of BERT

Changsheng Zhao, Ting Hua, Yilin Shen, Qian Lou, Hongxia Jin
Samsung Research America
{changsheng.z, ting.hua, yilin.shen, qian.lou, hongxia.jin}@samsung.com

## Abstract

Pre-trained language models such as BERT have shown remarkable effectiveness in various natural language processing tasks. However, these models usually contain millions of parameters, which prevents them from practical deployment on resource-constrained devices. Knowledge distillation, weight pruning, and quantization are the main directions in model compression. However, compact models obtained through knowledge distillation may suffer from a significant accuracy drop even for a relatively small compression ratio. On the other hand, there are only a few quantization attempts specifically designed for natural language processing tasks, and they suffer from a small compression ratio or a large error rate, since they require manual setting of hyper-parameters and do not support fine-grained subgroup-wise quantization. In this paper, we propose an automatic mixed-precision quantization framework designed for BERT that can conduct quantization and pruning simultaneously at a subgroup-wise level. Specifically, our proposed method leverages differentiable neural architecture search to assign scale and precision for the parameters in each subgroup automatically, while pruning out redundant groups of parameters. Extensive evaluations on BERT downstream tasks show that our proposed method outperforms baselines by providing the same performance with a much smaller model size. We also show the feasibility of obtaining an extremely lightweight model by combining our solution with orthogonal methods such as DistilBERT.

## 1 Introduction

Transformer-based architectures such as BERT [Devlin et al., 2019] have achieved significant performance improvements over traditional models in a variety of natural language processing tasks. Despite their success, these models usually require long inference time and huge model size with millions of parameters. For example, an inference of the BERT-base model involves 110 million parameters and 29G floating-point operations. Due to these limitations, it is impractical to deploy such huge models on resource-constrained devices with tight power budgets.

Knowledge distillation, pruning, and quantization are three promising directions to achieve model compression. Although compression technologies have been applied to a wide range of computer vision tasks, they have not been fully studied for natural language processing tasks. Due to the high computational complexity of pre-trained language models, it is nontrivial to explore the compression of Transformer-based architectures.

Knowledge distillation is the most popular approach in the field of pre-trained language model compression. Most current work in this direction reproduces the behavior of a larger teacher model with a smaller lightweight student model [Sun et al., 2019; Sanh et al., 2019; Jiao et al., 2019]. However, the compression efficiency of these models is still low, and significant performance degradation is observed even at a relatively small compression ratio.

Pruning a neural network means removing some neurons within a group. After the pruning process, these neurons' weights are all zeros, which decreases the memory consumption of the model.
Several approaches have explored weight pruning for Transformers [Gordon et al., 2020; Kovaleva et al., 2019; Michel et al., 2019], mainly focusing on identifying the pruning sensitivity of different parts of the model.

Quantization is a model-agnostic approach that is able to reduce memory usage and improve inference speed at the same time. Compared to knowledge distillation and pruning, there are far fewer quantization-based attempts at Transformer compression. The state-of-the-art quantization method Q-BERT adopts mixed-precision quantization for different layers [Shen et al., 2019]. However, Q-BERT determines the quantization bit-width for each layer by hand-crafted heuristics. As BERT-based models become deeper and more complex, the design space for mixed-precision quantization grows exponentially, which is challenging to handle with heuristic, hand-crafted methods.

In this paper, we propose an automatic mixed-precision quantization approach for BERT compression (AQ-BERT). Beyond layer-level quantization, our solution is a group-wise quantization scheme: within each layer, our method can automatically set different scales and precisions for each neuron sub-group. Unlike Q-BERT, which requires manual settings, we utilize differentiable network architecture search to make the precision assignments without additional human effort. Our contributions can be summarized as follows:

- Proposal of a unified framework to achieve automatic parameter search. Unlike existing approaches, AQ-BERT does not need hand-crafted adjustments for different model size requirements. This is achieved by designing a two-level network that relaxes the precision assignment to be continuous, so it can be optimized by gradient descent.
- Proposal of a novel objective function that can compress the model to a desirable size and enables pruning and quantization simultaneously. Given a targeted model size, AQ-BERT searches for the optimal parameters, regularized by minimizing the computational cost. As the cost is a group Lasso regularizer, this design also enables joint pruning and quantization.
- Efficient solutions to optimize the parameters of the proposed framework. Optimizing the objective of the proposed framework is non-trivial, as the quantization operations are non-differentiable.
- Extensive experimental validation on various NLP tasks. We evaluate the proposed AQ-BERT on four NLP tasks, covering sentiment classification, question answering, natural language inference, and named entity recognition. The results demonstrate that AQ-BERT achieves superior performance compared to the state-of-the-art method Q-BERT. We also show that orthogonal methods based on knowledge distillation can further reduce the model size.

## 2 Related Work

Our focus is model compression for Transformer encoders such as BERT. In this section, we first discuss the main branches of general-purpose compression technologies, then review existing work specifically designed for compressing Transformers. Besides compression technologies, our proposed method is also related to the field of network architecture search.

### 2.1 General Model Compression

A traditional understanding is that a large number of parameters is necessary for training good deep networks [Zhai et al., 2016]. However, it has been shown that many of the parameters in a trained network are redundant [Han et al., 2015].
Many efforts have been made toward model compression, in order to deploy efficient models on resource-constrained hardware devices. Automatic network pruning is one promising direction for model compression, which removes unnecessary network connections to decrease the complexity of the network. This direction can be further divided into pruning through regularization and pruning through network architecture search. Pruning based on regularization usually adds a heuristic regularizer as a penalty on the loss function [Molchanov et al., 2017; Louizos et al., 2018], while pruning through network architecture search aims to discover the important topology structure of a given network [Dong and Yang, 2019; Frankle and Carbin, 2018].

Another common strategy is weight quantization, which constrains weights to a set of discrete values. By representing the weights with fewer bits, quantization approaches can reduce storage space and speed up inference. Most quantization work assigns the same precision to all layers of a network [Rastegari et al., 2016; Choi et al., 2018], and the few attempts at mixed-precision quantization usually operate at the layer level [Zhou et al., 2018; Wu et al., 2018], without support for assignments at a finer level such as sub-groups. Besides, knowledge distillation is also a popular direction for model compression [Hinton et al., 2015], which learns a compact model (the student) by imitating the behavior of a larger model (the teacher).

### 2.2 Transformers Compression

Existing attempts at compressing Transformers are mainly based on knowledge distillation [Sanh et al., 2019; Liu et al., 2019; Jiao et al., 2019], which is orthogonal to our solution. Most of the pruning and quantization technologies mentioned above are deployed on convolutional networks, while only a few works are designed for deep language models such as Transformers. The majority of these works focus on heuristic pruning [Kovaleva et al., 2019; Michel et al., 2019] or study the effects of pruning at different levels [Gordon et al., 2020]. Q-BERT is most related to our work, as it is also a mixed-precision quantization approach designed for BERT [Shen et al., 2019]. However, it requires extensive effort to manually set the hyper-parameters, which is infeasible for practical usage.

### 2.3 Network Architecture Search

The problem of network compression can also be viewed as a type of sparse architecture search. Although most previous research on network architecture search [Zoph and Le, 2016; Hu et al., 2020] can automatically discover the topology structure of deep neural networks, it usually requires huge computational resources. To reduce the computational cost, ENAS [Pham et al., 2018] shares the weights of a super network with its child networks and utilizes reinforcement learning to train a controller that samples better child networks. DARTS [Liu et al., 2018] is also a two-stage NAS that leverages a differentiable loss to update the gradients. We inherit the idea of a super network from these methods and bring it to the field of model compression, by simultaneously conducting quantization and pruning during the architecture search process.

## 3 Methods

Given a large model M, our goal is to obtain a compact model M′ with a desirable size V, by automatically learning the optimal bit-assignment set O and weight set ω. To achieve this goal, we have to solve the following challenges:

1. How to find the best bit assignment automatically?
2. Is it possible to achieve pruning and quantization simultaneously?
3. How to compress the model to a desirable size?
4. Bit assignments are discrete operations; how can back-propagation be achieved under this condition?
5. How to efficiently infer the parameters of the bit-assignment set and the weight set together?

This section discusses the solutions to all these challenges. Our proposed framework makes it possible to automatically search for the best assignment. Through a carefully designed loss function, a large model can be reduced to a compact model of a pre-given desirable size via a compression process that conducts pruning and quantization at the same time. We show how to make the whole process differentiable through the descriptions of quantization and continuous relaxation. Finally, we discuss the optimization process and provide the overall algorithm.

### 3.1 Framework

*Figure 1: Framework. The central part shows the idea of two-stage optimization. The left part illustrates the inner training network, while the right part is an example of the super network that controls the bit assignment. In the left part, each node represents a matrix (a group of neurons), which we call a sub-group in this paper. Each sub-group has its own quantization range in the mixed-precision setting. As shown in the right part, a sub-group has three choices of bit assignment: 0-bit, 2-bit, and 4-bit, and each such assignment is associated with a probability of being selected.*

Figure 1 illustrates our key ideas. Specifically, our framework includes two networks: a weight-training inner network (the left part) and a bit-assignment super network (the right part). The weight-training inner network can be viewed as a regular neural network that optimizes weights, except that each node represents a sub-group of neurons (slices of a dense layer) rather than a single neuron. As shown in the right part of Figure 1, for a sub-group $j$ in layer $i$, there can be $K$ different choices of precision, and the $k$-th choice is denoted as $b^{i,j}_k$ (e.g., 2-bit). For example, in Figure 1, each sub-group has three choices of bit-width: 0-bit, 2-bit, and 4-bit. Correspondingly, the probability of choosing a certain precision is denoted as $p^{i,j}_k$, and the bit assignment is a one-hot variable $O^{i,j}_k$. It is obvious that $\sum_k p^{i,j}_k = 1$ and only one precision is selected at a time.

Remember that our goal is achieved by jointly learning the bit assignments $O$ and the weights $\omega$ within all the mixed operations. The super network updates the bit-assignment set $O$ by calculating the validation loss $L_{val}$, while the inner training network optimizes the weight set $\omega$ through a cross-entropy-based training loss $L_{train}$. We introduce the details of the loss functions in the following section. As can be seen from the above discussion, this two-stage optimization framework enables the automatic search for the bit assignment, which has to be set up manually in Q-BERT.
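To make the structure of the super network concrete, the following is a minimal PyTorch sketch, assuming candidate bit-widths {0, 2, 4, 8} and a column-wise split of each weight matrix into equal sub-groups. The class name `BitAssignmentSuperNet` and the helper `split_into_subgroups` are illustrative, not part of the paper's released implementation; the relaxation from the probabilities to (near) one-hot assignments is described in Section 3.4.

```python
import torch
import torch.nn as nn

class BitAssignmentSuperNet(nn.Module):
    """Super network: one logit vector per (layer, sub-group) over candidate
    bit-widths. softmax(logits) gives the selection probabilities p^{i,j}_k;
    the relaxation to (near) one-hot assignments O^{i,j}_k is in Section 3.4."""

    def __init__(self, num_layers, num_groups, candidate_bits=(0, 2, 4, 8)):
        super().__init__()
        self.candidate_bits = torch.tensor(candidate_bits, dtype=torch.float32)
        # beta^{i,j}_k: unnormalized scores for every sub-group of every layer
        self.logits = nn.Parameter(
            torch.zeros(num_layers, num_groups, len(candidate_bits)))

    def probabilities(self):
        # p^{i,j}_k, summing to 1 over the K candidate bit-widths
        return torch.softmax(self.logits, dim=-1)


def split_into_subgroups(weight, num_groups):
    """Split a (out_features, in_features) weight matrix column-wise into
    equal sub-groups, each quantized with its own scale and precision."""
    return torch.chunk(weight, num_groups, dim=1)


# Example: a BERT-base-sized matrix (768 x 768) with 12 sub-groups per layer
supernet = BitAssignmentSuperNet(num_layers=12, num_groups=12)
groups = split_into_subgroups(torch.randn(768, 768), num_groups=12)
print(len(groups), groups[0].shape)    # 12, torch.Size([768, 64])
print(supernet.probabilities()[0, 0])  # p^{0,0}_k over {0, 2, 4, 8}
```

With 12 groups, a 768×768 BERT-base matrix is split into sub-groups of 64 columns each, matching the group-size arithmetic discussed later in Section 4.2.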
### 3.2 Objective Function

As stated above, we jointly optimize the bit-assignment set $O$ and the weight set $\omega$. Both the validation loss $L_{val}$ and the training loss $L_{train}$ are determined not only by the bit assignment $O$ but also by the weights $\omega$ in the network. The goal of the bit-assignment search is to find the best $O$ that minimizes the validation loss $L_{val}(\omega^*_O, O)$, where the optimal weight set $\omega^*$ associated with the bit assignments is obtained by minimizing the training loss $L_{train}(O, \omega)$. This is a two-level optimization problem in which the bit-assignment set $O$ is the upper-level variable and the weight set $\omega$ is the lower-level variable:

$$\min_{O} \; L_{val}(\omega^*, O) \quad (1)$$
$$\text{s.t.} \quad \omega^* = \arg\min_{\omega} L_{train}(\omega, O) \quad (2)$$

The training loss $L_{train}(O, \omega)$ is a regular cross-entropy loss. The validation loss $L_{val}$ contains both the classification loss and a penalty on the model size:

$$L_{val} = -\log \frac{\exp(\psi_y)}{\sum_{j=1}^{|\psi|} \exp(\psi_j)} + \lambda L_{size} \quad (3)$$

where $\psi$ denotes the output logits of the network, $y$ is the ground-truth class, and $\lambda$ is the weight of the penalty. We can configure the model size through the penalty $L_{size}$, which encourages the computational cost of the network to converge to a desirable size $V$. Specifically, the computational cost $L_{size}$ is calculated as follows:

$$L_{size} = \begin{cases} \log E[C_O] & C_O > (1 + \epsilon)\cdot V \\ 0 & C_O \in [(1-\epsilon)\cdot V,\; (1+\epsilon)\cdot V] \\ -\log E[C_O] & C_O < (1-\epsilon)\cdot V \end{cases} \quad (4)$$
$$C_O = \sum_{i}\sum_{j}\sum_{k} \left\| b^{i,j}_k \, O^{i,j}_k \right\|_2 \quad (5)$$
$$E[C_O] = \sum_{i}\sum_{j}\sum_{k} p^{i,j}_k \left\| b^{i,j}_k \, O^{i,j}_k \right\|_2 \quad (6)$$

$C_O$ is the actual size of the model with bit assignment $O$, formulated as a group Lasso regularizer. For a sub-group $j$ in layer $i$, there is a possibility that its optimal bit assignment is zero. In this case, the bit assignment is equivalent to pruning, removing this sub-group of neurons from the network. The toleration rate $\epsilon \in [0, 1]$ restricts the variation of the model size around the desirable size $V$. $E[C_O]$ is the expectation of the size cost $C_O$, weighted by the bit-assignment probabilities.

The statement and analysis above show that the carefully designed validation loss $L_{val}$ solves the first and the second challenge simultaneously. Specifically, it can configure the model size according to the user-specified value $V$ through the piece-wise cost computation, and it provides the possibility to achieve quantization and pruning together via the group Lasso regularizer.

### 3.3 Quantization Process

Traditionally, all the weights in a neural network are represented by full-precision floating-point numbers (32-bit). Quantization is a process that converts full-precision weights to fixed-point numbers with a lower bit-width, such as 2, 4, or 8 bits. In mixed-precision quantization, different groups of neurons can be represented by different quantization ranges (numbers of bits). If we denote the original floating-point sub-group in the network by a matrix $A$, and the number of bits used for quantization by $b$, then we can calculate its scale factor $q_A \in \mathbb{R}^+$ as follows:

$$q_A = \frac{2^b - 1}{\max(A) - \min(A)}. \quad (7)$$

A floating-point element $a \in A$ can therefore be estimated by the scale factor and its quantizer $Q(a)$ such that $a \approx Q(a)/q_A$. Similar to Q-BERT, the uniform quantization function is used to evenly split the range of the floating-point tensor [Hubara et al., 2017]:

$$Q(a) = \mathrm{round}\big(q_A \cdot [a - \min(A)]\big). \quad (8)$$

The quantization function is non-differentiable; therefore the straight-through estimator (STE) is needed to back-propagate the gradient [Bengio et al., 2013]. It can be viewed as an operator with arbitrary forward and backward operations:

$$\text{Forward:} \quad \hat{\omega}_A = Q(\omega_A)/q_{\omega_A} \quad (9)$$
$$\text{Backward:} \quad \frac{\partial L_{train}}{\partial \omega_A} = \frac{\partial L_{train}}{\partial \hat{\omega}_A} \quad (10)$$

Specifically, the real-valued weights $\omega_A$ are converted into the fake-quantized weights $\hat{\omega}_A$ during the forward pass, calculated via Equations 7 and 8. In the backward pass, we use the gradient with respect to $\hat{\omega}$ to approximate the true gradient of $\omega$ via the STE.
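The quantization path above can be written compactly with the usual detach trick, which realizes the identity backward pass of Equation 10. The following is only a sketch of Equations 7-10 for a single sub-group: the zero-point offset $\min(A)$ is added back in the dequantization step so that the fake-quantized weights approximate the original ones (the paper's Equation 9 writes the dequantization simply as $Q(\omega_A)/q_{\omega_A}$), and a 0-bit assignment is treated as pruning the sub-group.

```python
import torch

def quantize_ste(A: torch.Tensor, num_bits: int) -> torch.Tensor:
    """Fake-quantize one sub-group A to `num_bits` (Eq. 7-9), with a
    straight-through estimator for the backward pass (Eq. 10)."""
    if num_bits == 0:
        # A 0-bit assignment prunes the whole sub-group.
        return torch.zeros_like(A)
    a_min, a_max = A.min(), A.max()
    # Eq. 7: scale factor q_A = (2^b - 1) / (max(A) - min(A))
    q = (2 ** num_bits - 1) / (a_max - a_min).clamp(min=1e-8)
    # Eq. 8 plus dequantization: round to the integer grid, map back to floats
    A_hat = torch.round(q * (A - a_min)) / q + a_min
    # STE: forward returns A_hat, backward treats the operation as identity,
    # so dL/dA is taken to be dL/dA_hat (Eq. 10).
    return A + (A_hat - A).detach()


# Usage: gradients still reach the full-precision weights despite round().
w = torch.randn(768, 64, requires_grad=True)
w_q = quantize_ste(w, num_bits=4)
w_q.sum().backward()
print(w.grad.abs().sum() > 0)  # tensor(True)
```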
### 3.4 Continuous Relaxation

Another challenge is that the mixed-precision assignment operations are discrete variables, which are non-differentiable and therefore cannot be optimized through gradient descent. In this paper, we use the concrete distribution to relax the discrete assignments via the Gumbel-softmax:

$$O^{i,j}_k = \frac{\exp\big((\log \beta^{i,j}_k + g^{i,j}_k)/t\big)}{\sum_{k'} \exp\big((\log \beta^{i,j}_{k'} + g^{i,j}_{k'})/t\big)} \quad (11)$$
$$\text{s.t.} \quad g^{i,j}_k = -\log(-\log(u)), \quad u \sim U(0, 1)$$

Here $t$ is the softmax temperature that controls the Gumbel-softmax samples. As $t \to \infty$, $O^{i,j}_k$ approaches a continuous variable following a uniform distribution, while as $t \to 0$, $O^{i,j}_k$ tends to a one-hot variable following the categorical distribution. In our implementation, an exponentially decaying schedule is used to anneal the temperature:

$$t = t_0 \exp\big(-\eta \cdot (\text{epoch} - N_0)\big), \quad (12)$$

where $t_0$ is the initial temperature, $N_0$ is the number of warm-up epochs, and the current temperature decays exponentially after each epoch. The Gumbel-softmax trick effectively renders our proposed AQ-BERT differentiable.

### 3.5 Optimization Process

The optimization of the two-level variables is non-trivial due to the large amount of computation involved. One common solution is to optimize them alternately: the algorithm infers one set of parameters while fixing the other. Previous work usually trains the two levels of variables separately [Xie et al., 2018], which is computationally expensive. We adopt a faster inference scheme that can simultaneously learn variables at different levels [Liu et al., 2018; Luketina et al., 2016]. In this paper, the validation loss $L_{val}$ is determined by both the lower-level variable (the weights $\omega$) and the upper-level variable (the bit assignments $O$):

$$\nabla_O L_{val}(\omega^*, O) \quad (13)$$
$$\approx \nabla_O L_{val}\big(\omega - \xi \nabla_\omega L_{train}(\omega, O),\; O\big) \quad (14)$$

It is generally believed that the hyper-parameter set $O$ should be kept fixed during the training process of the inner optimization (Equation 2). However, this hypothesis has been shown to be unnecessary: it is possible to change the hyper-parameter set during the training of the inner optimization [Luketina et al., 2016]. Specifically, as shown in Equation 14, the approximation of $\omega^*$ is obtained by adapting a single training step $\omega - \xi \nabla_\omega L_{train}$. If the inner optimization has already reached a local optimum ($\nabla_\omega L_{train} \to 0$), then Equation 14 further reduces to $\nabla_O L_{val}(\omega, O)$. Although convergence is not guaranteed in theory [Luketina et al., 2016], we observe that the optimization process is able to reach a fixed point in practice. The details can be found in the supplementary material.

Algorithm 1: The Procedure of AQ-BERT
  Input: training set $D_{train}$ and validation set $D_{val}$
  1: for epoch = 0, ..., N do
  2:     Get the current temperature via Equation 12
  3:     Calculate $L_{train}$ on $D_{train}$ to update the weights $\omega$
  4:     if epoch > $N_1$ then
  5:         Calculate $L_{val}$ on $D_{val}$ via Equation 14 to update the bit assignments $O$
  6: Derive the final weights based on the learned optimal bit assignments $O$
  Output: optimal bit assignments $O$ and weights $\omega$
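To make the two-level procedure concrete, here is a condensed PyTorch-style sketch of the alternating updates in Algorithm 1. It assumes the `BitAssignmentSuperNet` from the earlier sketch, a placeholder `model(inputs, assignments)` forward that fake-quantizes each sub-group according to the relaxed assignment, and data loaders, optimizers, and per-sub-group parameter counts supplied by the caller. The temperature constants are arbitrary placeholders, the size penalty uses the expected size $E[C_O]$ for both the comparison and the log term (a simplification of Equation 4), and the architecture step uses the first-order variant in which $\omega^*$ is approximated by the current $\omega$.

```python
import math
import torch
import torch.nn.functional as F

CANDIDATE_BITS = torch.tensor([0.0, 2.0, 4.0, 8.0])  # mirrors the earlier sketch

def temperature(epoch, t0=5.0, eta=0.5, warmup=1):
    # Eq. 12, clamped so t never exceeds t0 before the warm-up ends (assumption)
    return t0 * math.exp(-eta * max(epoch - warmup, 0))

def size_penalty(probs, group_params, target_mb, eps=0.1):
    """Piece-wise size penalty (Eq. 4-6); `group_params` holds the parameter
    count of each sub-group, shaped like `probs` without the last dimension."""
    expected_bits = (probs * CANDIDATE_BITS).sum(-1)            # per sub-group
    expected_mb = (expected_bits * group_params).sum() / 8 / 2**20
    if expected_mb > (1 + eps) * target_mb:                     # too large
        return torch.log(expected_mb)
    if expected_mb < (1 - eps) * target_mb:                     # too small
        return -torch.log(expected_mb)
    return expected_mb.new_zeros(())                            # inside the band

def train_aq_bert(model, supernet, train_loader, val_loader, weight_opt, arch_opt,
                  group_params, target_mb, num_epochs=10, n1=1, lam=0.1):
    """Alternating updates of Algorithm 1: weights on L_train, bits on L_val."""
    for epoch in range(num_epochs):
        t = temperature(epoch)
        for (x_tr, y_tr), (x_va, y_va) in zip(train_loader, val_loader):
            # Relaxed, differentiable bit assignment O^{i,j}_k (Eq. 11)
            O = F.gumbel_softmax(supernet.logits, tau=t, hard=False)

            # Line 3 of Algorithm 1: update the weights on the training loss
            weight_opt.zero_grad()
            F.cross_entropy(model(x_tr, O), y_tr).backward()
            weight_opt.step()

            # Lines 4-5: after the warm-up epochs, update the bit assignments
            if epoch > n1:
                arch_opt.zero_grad()
                O = F.gumbel_softmax(supernet.logits, tau=t, hard=False)
                loss_val = F.cross_entropy(model(x_va, O), y_va)
                loss_val = loss_val + lam * size_penalty(
                    supernet.probabilities(), group_params, target_mb)
                loss_val.backward()
                arch_opt.step()
    # Line 6: the argmax bit-width per sub-group is the final assignment;
    # the inner network is then re-initialized and trained from scratch.
    return supernet.logits.argmax(dim=-1)
```

The design choice to share one mini-batch-level loop for both updates reflects the single-step approximation of Equation 14: the bit assignments are refined against weights that have just taken one gradient step, rather than against fully converged inner weights.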
### 3.6 Overall Algorithm

Based on the statements above, Algorithm 1 summarizes the overall process of our proposed method.

1. As shown in line 2 of Algorithm 1, at the beginning of each epoch the bit assignment is relaxed to continuous variables via Equation 11, where the temperature is calculated through Equation 12. After this step, both the weights and the bit assignments are differentiable.
2. Then, $L_{train}$ is minimized on the training set to optimize the weights (line 3).
3. To ensure the weights are sufficiently trained before the bit assignments are updated, we delay the training of $L_{val}$ on the validation set for $N_1$ epochs (lines 4 and 5). For each sub-group, the number of bits with the maximum probability is chosen as its bit assignment.
4. After sufficient epochs, we are supposed to obtain a set of bit assignments that is close to optimal. Based on the current assignments, we then randomly initialize the weights of the inner network and train it from scratch (line 6).

With these steps, we obtain the outputs of the whole learning procedure, which contain the optimized bit assignments and weight matrices.

## 4 Experimental Evaluation

In this section, we evaluate the proposed AQ-BERT from the following aspects:

1. How does AQ-BERT perform compared to the state-of-the-art BERT compression method based on quantization (e.g., Q-BERT)?
2. How do the parameter settings affect the performance of AQ-BERT?
3. Is it possible to integrate the proposed AQ-BERT with an orthogonal method based on knowledge distillation?

To answer these questions, we first compare our AQ-BERT with the baseline under different constraints on model size, then we study the effect of the number of groups on performance, and finally we present the results of integrating our method with a knowledge distillation approach.

### 4.1 Datasets and Settings

We evaluate our proposed AQ-BERT and other baselines (BERT-base, Q-BERT, and DistilBERT-base) on four NLP tasks: SST-2, MNLI, CoNLL-2003, and SQuAD. Our implementation is based on the huggingface transformers library (https://github.com/huggingface/transformers). The AdamW optimizer is used with a learning rate of 2e-5 for the model weights, and SGD with a learning rate of 0.1 is used for the architecture optimization.
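A small sketch of the optimizer setup described above, assuming the quantized BERT weights and the super-network logits are exposed as two separate modules; `build_optimizers` is an illustrative helper, with learning rates mirroring the ones reported in this section.

```python
from torch.optim import SGD, AdamW

def build_optimizers(model, supernet, weight_lr=2e-5, arch_lr=0.1):
    """Two optimizers, as in Section 4.1: AdamW for the quantized BERT weights
    and plain SGD for the bit-assignment logits of the super network."""
    weight_opt = AdamW(model.parameters(), lr=weight_lr)
    arch_opt = SGD(supernet.parameters(), lr=arch_lr)
    return weight_opt, arch_opt

# Usage with the training-loop sketch from Section 3.5:
# weight_opt, arch_opt = build_optimizers(bert_model, supernet)
```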
### 4.2 Main Results

In this section, we report the results of comparing our proposed method with baselines on the development sets of the four tasks: SST-2, MNLI, CoNLL-03, and SQuAD.

#### Performance on Different Sizes

Table 1 compares our results with Q-BERT on the four NLP tasks: SST-2, MNLI, CoNLL-03, and SQuAD. Several observations can be made from the table.

| Model | Size/MB | SST-2 (Acc) | MNLI-m (Acc) | MNLI-mm (Acc) | SQuAD (EM) | SQuAD (F1) | CoNLL (F1) |
|---|---|---|---|---|---|---|---|
| BERT-base | 324.5 | 93.50 | 84.00 | 84.40 | 81.54 | 88.69 | 95.00 |
| Q-BERT | 30 | 92.50 | 83.50 | 83.50 | 79.07 | 87.49 | 94.55 |
| Ours | 30 | 92.70 | 83.50 | 83.70 | 79.85 | 87.00 | 94.50 |
| Q-BERT | 25 | 92.00 | 81.75 | 82.20 | 79.00 | 86.95 | 94.37 |
| Ours | 25 | 92.50 | 82.90 | 82.90 | 79.25 | 87.00 | 94.40 |
| Q-BERT | 20 | 84.60 | 76.50 | 77.00 | 69.68 | 79.60 | 91.06 |
| Ours | 20 | 91.10 | 81.80 | 81.90 | 75.00 | 83.50 | 93.20 |

Table 1: Quantization results of Q-BERT and our method for BERT-base on natural language understanding tasks. Results are obtained with 128 groups in each layer. Both Q-BERT and our method use 8-bit activations. All model sizes reported here exclude the embedding layer, as we uniformly quantize the embedding to 8 bits.

**Overall comparison.** As can be seen from the table, in all four tasks AQ-BERT generally performs better than Q-BERT, regardless of the compressed model size. Moreover, the performance gap between our method and Q-BERT becomes more obvious as the model size decreases. These observations indicate that, compared to Q-BERT, our proposed AQ-BERT can learn parameters correctly and perform stably on various tasks.

**Obvious advantage in the ultra-low-bit setting.** Our advantage is more obvious in the ultra-low-bit setting. Specifically, when the model size is as small as 20MB, AQ-BERT achieves significant improvements over Q-BERT, as measured by the difference in development-set scores on the four representative NLP tasks: SST-2 (+6.5%), MNLI (+5.3%), SQuAD (+5.3%), and CoNLL (+2.1%). This phenomenon further confirms the effectiveness of our automatic parameter search. As the model size decreases, the optimal set of parameters becomes tighter. In this situation, it is less likely that good parameter settings can be found through the manual assignment adopted by the baseline Q-BERT.

#### Effects of Group-wise Quantization

Table 2 shows the performance gains with different group numbers. A larger group number means each sub-group in the network contains fewer neurons. For example, in the 12-group setting each group contains 768/12 = 64 neurons, while in the 128-group setting each group contains only six neurons. Several observations can be made from Table 2.

| Model | Group | SST-2 | MNLI-m | MNLI-mm | CoNLL |
|---|---|---|---|---|---|
| BERT-base | N/A | 93.00 | 84.00 | 84.40 | 95.00 |
| Q-BERT | 1 | 85.67 | 76.69 | 77.00 | 89.86 |
| Ours | 1 | 89.60 | 77.70 | 78.20 | 91.90 |
| Q-BERT | 12 | 92.31 | 82.37 | 82.95 | 94.42 |
| Ours | 12 | 92.70 | 83.50 | 83.70 | 94.80 |
| Q-BERT | 128 | 92.66 | 83.89 | 84.17 | 94.90 |
| Ours | 128 | 92.90 | 83.40 | 83.90 | 95.00 |
| Q-BERT | 768 | 92.78 | 84.00 | 84.20 | 94.99 |
| Ours | 768 | 92.90 | 83.70 | 84.10 | 95.10 |

Table 2: Effects of group-wise quantization for AQ-BERT. The quantization bits were set to 8 for embeddings and activations on all tasks. From top to bottom, we increase the number of groups. Note that Q-BERT reports its group-wise quantization performance with a 40MB model; to make a fair comparison, we add the choice of 8-bit in this experiment to produce a model of comparable size.

**Overall comparison.** As can be seen from the table, in terms of accuracy our proposed AQ-BERT is better than the baseline Q-BERT under most settings of group numbers.

**A larger group number results in better performance.** Theoretically, a setting with a larger group number should result in better performance, as there is more flexibility in the weight assignments. This hypothesis is confirmed by the results in Table 2: the performance grows significantly as the number of groups increases. For example, by changing the group number from 1 to 128, the performance of both our AQ-BERT and the baseline method increases by at least 2%.

**The trade-off between performance and complexity.** Although a larger group number always brings better performance, such improvement is not without cost, as the model has to infer more parameters. As can be seen from the table, the growth in improvement slows down as the number of groups increases. For example, there is at least a 2% improvement when increasing the number of groups from 1 to 128, but only a 0.1% performance gain when further increasing the number of groups from 128 to 768.

#### Combination with Knowledge Distillation

Knowledge distillation methods are orthogonal to the present work. The results of combining our method with the knowledge distillation method DistilBERT are shown in Table 3.

| Model | Size/MB | SST-2 (Acc) | MNLI-m | SQuAD (EM) | SQuAD (F1) |
|---|---|---|---|---|---|
| BERT | 324.5 | 93.50 | 84.00 | 81.54 | 88.69 |
| DistilBERT | 162 | 91.30 | 82.20 | 77.70 | 85.80 |
| DistilBERT + Ours | 15 | 91.20 | 81.50 | 72.80 | 82.10 |
| DistilBERT + Ours | 12.5 | 90.70 | 80.00 | 72.70 | 82.10 |
| DistilBERT + Ours | 10 | 89.70 | 78.00 | 68.30 | 78.30 |

Table 3: Results of combining our proposed AQ-BERT with the knowledge distillation model DistilBERT. All model sizes reported in this table exclude the embedding layer, as we uniformly quantize the embedding to 8 bits.

As can be seen from the table, DistilBERT incurs a considerable performance loss even at a small compression ratio. For example, on the SST-2 dataset, the original DistilBERT brings more than a 2% drop in accuracy while only reducing the size of base BERT by half.
In contrast, after integrating with our AQ-BERT, the model is further compressed to 1/20 of the original BERT, with only 0.1% extra performance loss compared to using DistilBERT alone. This phenomenon points to two practical conclusions:

- It is safe to integrate our AQ-BERT with knowledge distillation methods to achieve an extremely lightweight compact model.
- Compared to the method based on knowledge distillation, our proposed quantization method is more efficient, with a relatively higher compression ratio and a lower performance loss.

## 5 Conclusion

Recently, the compression of large Transformer-based models has attracted more and more research attention. In this work, we proposed a two-level framework to achieve automatic mixed-precision quantization for BERT. Under this framework, both the weights and the precision assignments are updated through gradient-based optimization. The evaluation results show that our proposed AQ-BERT outperforms the baseline Q-BERT on all four NLP tasks, especially in the ultra-low-bit setting. Moreover, our AQ-BERT is orthogonal to knowledge distillation solutions, and together they can deliver extremely lightweight compact models with little performance loss. These advantages make our method a practical solution for resource-limited devices (e.g., smartphones).

## References

[Bengio et al., 2013] Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.

[Choi et al., 2018] Jungwook Choi, Zhuo Wang, Swagath Venkataramani, Pierce I-Jen Chuang, Vijayalakshmi Srinivasan, and Kailash Gopalakrishnan. PACT: Parameterized clipping activation for quantized neural networks. arXiv preprint arXiv:1805.06085, 2018.

[Devlin et al., 2019] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL, pages 4171-4186, 2019.

[Dong and Yang, 2019] Xuanyi Dong and Yi Yang. Network pruning via transformable architecture search. In NeurIPS, pages 759-770, 2019.

[Frankle and Carbin, 2018] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In ICLR, 2018.

[Gordon et al., 2020] Mitchell A. Gordon, Kevin Duh, and Nicholas Andrews. Compressing BERT: Studying the effects of weight pruning on transfer learning. arXiv preprint arXiv:2002.08307, 2020.

[Han et al., 2015] Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149, 2015.

[Hinton et al., 2015] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. stat, 1050:9, 2015.

[Hu et al., 2020] Shoukang Hu, Sirui Xie, Hehui Zheng, Chunxiao Liu, Jianping Shi, Xunying Liu, and Dahua Lin. DSNAS: Direct neural architecture search without parameter retraining. arXiv preprint arXiv:2002.09128, 2020.

[Hubara et al., 2017] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. JMLR, 18:6869-6898, 2017.

[Jiao et al., 2019] Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu.
TinyBERT: Distilling BERT for natural language understanding. arXiv preprint arXiv:1909.10351, 2019.

[Kovaleva et al., 2019] Olga Kovaleva, Alexey Romanov, Anna Rogers, and Anna Rumshisky. Revealing the dark secrets of BERT. arXiv preprint arXiv:1908.08593, 2019.

[Liu et al., 2018] Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search. arXiv preprint arXiv:1806.09055, 2018.

[Liu et al., 2019] Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. Multi-task deep neural networks for natural language understanding. In ACL, pages 4487-4496, 2019.

[Louizos et al., 2018] Christos Louizos, Max Welling, and Diederik P. Kingma. Learning sparse neural networks through L0 regularization. In ICLR, 2018.

[Luketina et al., 2016] Jelena Luketina, Mathias Berglund, Klaus Greff, and Tapani Raiko. Scalable gradient-based tuning of continuous regularization hyperparameters. In ICML, pages 2952-2960, 2016.

[Michel et al., 2019] Paul Michel, Omer Levy, and Graham Neubig. Are sixteen heads really better than one? In NeurIPS, pages 14014-14024, 2019.

[Molchanov et al., 2017] Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. Variational dropout sparsifies deep neural networks. In ICML, pages 2498-2507, 2017.

[Pham et al., 2018] Hieu Pham, Melody Y. Guan, Barret Zoph, Quoc V. Le, and Jeff Dean. Efficient neural architecture search via parameter sharing. arXiv preprint arXiv:1802.03268, 2018.

[Rastegari et al., 2016] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In ECCV, pages 525-542. Springer, 2016.

[Sanh et al., 2019] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108, 2019.

[Shen et al., 2019] Sheng Shen, Zhen Dong, Jiayu Ye, Linjian Ma, Zhewei Yao, Amir Gholami, Michael W. Mahoney, and Kurt Keutzer. Q-BERT: Hessian based ultra low precision quantization of BERT. arXiv preprint arXiv:1909.05840, 2019.

[Sun et al., 2019] Siqi Sun, Yu Cheng, Zhe Gan, and Jingjing Liu. Patient knowledge distillation for BERT model compression. arXiv preprint arXiv:1908.09355, 2019.

[Wu et al., 2018] Bichen Wu, Yanghan Wang, Peizhao Zhang, Yuandong Tian, Peter Vajda, and Kurt Keutzer. Mixed precision quantization of ConvNets via differentiable neural architecture search. arXiv preprint arXiv:1812.00090, 2018.

[Xie et al., 2018] Sirui Xie, Hehui Zheng, Chunxiao Liu, and Liang Lin. SNAS: Stochastic neural architecture search. arXiv preprint arXiv:1812.09926, 2018.

[Zhai et al., 2016] Shuangfei Zhai, Yu Cheng, Zhongfei Mark Zhang, and Weining Lu. Doubly convolutional neural networks. In NeurIPS, pages 1082-1090, 2016.

[Zhou et al., 2018] Yiren Zhou, Seyed-Mohsen Moosavi-Dezfooli, Ngai-Man Cheung, and Pascal Frossard. Adaptive quantization for deep neural network. In AAAI, 2018.

[Zoph and Le, 2016] Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.