# Any-Precision Deep Neural Networks*

Haichao Yu¹, Haoxiang Li², Humphrey Shi¹,³, Thomas S. Huang¹, Gang Hua²
¹UIUC, ²Wormpex AI Research, ³University of Oregon
{haichao3, hshi10, t-huang1}@illinois.edu, haoxiang.li@bianlifeng.com, ganghua@gmail.com

*Correspondence to H. Shi and G. Hua. This work was done during Yu's internship at Wormpex AI Research.

We present any-precision deep neural networks (DNNs), which are trained with a new method that allows the learned DNNs to be flexible in numerical precision during inference. At runtime, the same model can be flexibly and directly set to different bit-widths, by truncating the least significant bits, to support a dynamic speed/accuracy trade-off. When all layers are set to low bit-widths, we show that the model achieves accuracy comparable to dedicated models trained at the same precision. This property facilitates flexible deployment of deep learning models in real-world applications, where trade-offs between model accuracy and runtime efficiency are often sought. Previous literature presents solutions to train a model at each individual fixed efficiency/accuracy trade-off point, but how to produce a model flexible in runtime precision is largely unexplored. When the demanded efficiency/accuracy trade-off varies from time to time, or even changes dynamically at runtime, it is infeasible to re-train models accordingly, and the storage budget may forbid keeping multiple models. Our proposed framework achieves this flexibility without performance degradation. More importantly, we demonstrate that this achievement is agnostic to model architectures and applicable to multiple vision tasks. Our code is released at https://github.com/SHILabs/Any-Precision-DNNs.

## Introduction

While state-of-the-art deep learning models can achieve very high accuracy on various benchmarks, runtime cost is another crucial factor to consider in practice. In general, the capacity of a deep learning model is positively correlated with its complexity. As a result, accurate models mostly run slower, consume more power, and have a larger memory footprint as well as model size. In practice, it is inevitable to balance efficiency and accuracy to reach a good trade-off when deploying any deep learning model. To alleviate this issue, a number of approaches have been proposed from different perspectives. We observe active research (Liu et al. 2018a; Cai, Zhu, and Han 2018; Chen et al. 2019) in looking for more efficient deep neural network architectures to support practical usage (Howard et al. 2019; Yu and Huang 2019; Tan and Le 2019).

Figure 1: Illustrations of deep neural networks (DNNs) in different numerical precisions: a) weights and activations of a typical full-precision DNN are 32-bit floating-point values; b) a binary DNN with 1-bit weights and activations is an example of a low-precision DNN; c) different layers in a mixed-precision DNN can have arbitrary bit-widths n_i that are fixed after training; d) the proposed any-precision DNN keeps pre-trained weights in full precision, while at runtime the weights and activations can be quantized into an arbitrary bit-width k (k in {1, 2, ..., 8, 32}).
Researchers have also considered adaptively modifying the inference of general deep learning models, dynamically determining the execution during the feed-forward pass to save computation at the cost of a potential accuracy drop (Figurnov et al. 2017; Teerapittayanon, McDanel, and Kung 2016; Wu et al. 2018; Veit and Belongie 2018). Besides these explorations, another important line of research proposes a low-level solution: using fewer bits to represent a deep learning model and its runtime data to largely reduce the runtime cost. It has been shown in various works that full precision is over-abundant in many applications, and that 8-bit or even 4-bit models can be used without obvious performance degradation. Some previous works went further in this direction. For example, BNN, XNOR-Net, and others (Courbariaux et al. 2016; Rastegari et al. 2016; Zhou et al. 2016) use as few as 1 bit for both the weights and activations of the deep neural networks to reduce power usage, memory footprint, running time, and model size. However, ultra-low-precision models always suffer an obvious accuracy drop (Courbariaux et al. 2016). While many methods have been proposed to improve the accuracy of low-precision models, so far we see no silver bullet. Stepping back from uniformly ultra-low-precision models, mixed-precision models have been proposed to serve as a better trade-off (Wang et al. 2019; Dong et al. 2019). Effective ways have been found to train accurate models with some layers processed in ultra-low precision and some layers in high precision. We illustrate these different paradigms in Figure 1.a-c.

When we look at this spectrum of deep learning models in terms of numerical precision, from full precision at one end to low precision at the other, with mixed precision in between, we have to admit that the efficiency/accuracy trade-off always exists in reality, and to deploy a model in a specific application scenario we have to find the right trade-off point. Previous methods can provide a specific operating point, but what if we demand flexibility as well? It would be a highly favorable property if we could dynamically change the efficiency/accuracy trade-off point given a single model. Preferably, we want to be able to adjust the model, without re-training or re-calibration, to run in a high-accuracy mode when resources are sufficient and switch to a low-accuracy mode when resources are limited.

In this paper, we propose a method to train deep learning models that are flexible in numerical precision, namely any-precision deep neural networks. After training, we can freely quantize the model layers into various precision levels, without fine-tuning or calibration and without any data. We illustrate this in Figure 1.d. Whether running in low precision, full precision, or precision levels in between, the model achieves accuracy comparable to models specifically trained under the matched settings. Furthermore, given a fixed computational budget, it can potentially find a better operating point than a dedicatedly trained one.

To summarize, our contributions are:

- We introduce the concept of any-precision DNN: at runtime, we can quantize its layers into different bit-widths,
and its accuracy changes smoothly with respect to the precision level without drastic performance degradation;
- We propose a novel model-agnostic method to train any-precision DNNs and validate its effectiveness over multiple vision tasks, with multiple widely used benchmarks and multiple neural network architectures;
- We demonstrate that the proposed training framework trains better low-bit models with knowledge distillation.

## Related Work

Low-Precision Deep Neural Networks. Recent progress in deep learning inference hardware motivates the research of using low-bit integers instead of floating-point values to represent network weights and activations. Binarized Neural Networks (Courbariaux et al. 2016) and XNOR-Net (Rastegari et al. 2016) are early works in this direction that use only 1 bit to represent the weights and activations in DNNs. When training these 1-bit networks, a floating-point copy of the parameters is maintained under the hood to calculate approximated gradients, and usually a sign function is used to quantize the floating-point copy into binary values in the feed-forward pass. Since using only 1-bit numerical precision leads to an obvious drop in accuracy in most scenarios, Zhou et al. (Zhou et al. 2016) proposed DoReFa-Net to train networks with arbitrary bit-widths for weights, activations, and gradients. Since the gradients are also in low bit-widths, a proper implementation could accelerate both the forward and backward passes.

One of the essential problems in learning low-precision DNNs is the quantization operator. Quantization of the real-valued parameters in the feed-forward pass and approximation of the gradients through the quantization operator in the backward pass heavily influence the final model accuracy. For example, the sign function adopted in Binarized Neural Networks (Courbariaux et al. 2016) discards the value distribution variations across layers and hurts performance. In XNOR-Net (Rastegari et al. 2016), a scaling factor is added to each layer to minimize the information loss. Choi et al. (Choi et al. 2018) proposed a parameterized clipping activation to support arbitrary-bit quantization of activations. Zhang et al. (Zhang et al. 2018) and Jung et al. (Jung et al. 2019) pointed out that a uniform quantization pattern across layers is suboptimal and proposed a learnable quantizer for each layer to improve model accuracy. In the backward pass, most prior works use the Straight-Through Estimator (STE) (Bengio, Léonard, and Courville 2013) to approximate the gradients over the quantizers. Cai et al. (Cai et al. 2017) proposed a half-wave Gaussian quantization operator to replace the sign function for better learning efficiency, together with a piece-wise continuous function in the back-propagation step to alleviate the gradient mismatch issue in the prior design. Liu et al. (Liu et al. 2018b) also attacked the gradient mismatch problem by introducing a piecewise polynomial function to approximate the sign function. Another interesting recent work from Ding et al. (Ding et al. 2019) addressed this problem by introducing a new loss function over the value distribution of layer activations.

Besides the performance gap to full-precision models, training binary networks has reportedly been unstable. Tang et al. (Tang, Hua, and Wang 2017) carefully analyzed the training process and concluded that using the PReLU (He et al. 2015) activation function, a low learning rate, and a bipolar regularization on the weights could lead to a more stable training process with a better optimum.
Zhuang et al. (Zhuang et al. 2018) looked at the overall training strategy and proposed a progressive training process. They suggested first training the network with quantized weights and then quantized activations, first training with high precision and then low precision, and jointly training the low-bit model with the full-precision one. A similar joint training strategy has been observed to be effective in this work as well. Since our work is along an orthogonal direction to low-precision DNN training and design, our method can be complementary to these approaches for training better and more flexible DNNs.

Post-training Quantization. Quantization of a pre-trained model with fine-tuning or calibration on a dataset is another related research topic in this area. Although methods in this area work on a different problem from ours, we partially share the motivation of having the flexibility of quantization control at runtime. Without special treatment, many models collapse even at 8-bit precision in post-training quantization. One recent work from Nagel et al. (Nagel et al. 2019) identified two issues leading to the large accuracy drop: the large variation of the weight ranges across channels, and biased output errors due to quantization errors affecting the following layers. With their method, they are able to alleviate the bias and equalize the weight ranges by rescaling and reparameterization. In this paper, the model we produce can be readily quantized into lower precision without further processing.

In the research area of neural architecture search, the slimmable neural networks of Yu et al. (Yu et al. 2018) are related to ours in terms of methodology. They presented a method to train a single neural network with an adjustable number of channels in each layer at runtime. Their exploration is limited to the search space of the network architecture instead of the weights.

## Any-Precision Deep Neural Networks

### Overview

Neural networks are generally constructed layer by layer. We denote the input to the i-th layer of a neural network as $x_i$, the weights of the layer as $w_i$, and the biases as $b_i$. The output of this layer can be calculated as

$$y_i = F(x_i \mid w_i, b_i). \tag{1}$$

Without loss of generality, we take one channel of a fully-connected layer as a concrete example in the following description and drop the subscript $i$ for simplicity, i.e.,

$$y = w \cdot x + b, \tag{2}$$

where $w, x \in \mathbb{R}^D$ and $b$ is a scalar. For better computational efficiency, we would like to avoid the floating-point dot product of D-dimensional vectors. Instead, we use N-bit fixed-point integers to represent the weights as $w_Q$ and the input activations as $x_Q$. Hereafter, we assume $w_Q$ and $x_Q$ are stored as signed integers in their bitwise format. Note that in some related works (Courbariaux et al. 2016), elements of $w_Q$ and $x_Q$ are represented as vectors over $\{-1, 1\}$, and the conversion between these two formats is trivial. With N-bit integer weights and activations, as discussed in prior art (Zhou et al. 2016; Rastegari et al. 2016), the computation can be accelerated by leveraging bit-wise operations (and, xnor, bit-count), or even dedicated DNN hardware.

Early works (Tang, Hua, and Wang 2017) show that adding a layer-wise real-valued scaling factor $s$ can largely reduce the output range variation and hence achieve better model accuracy. Since the scaling factor is shared across channels within the same layer, the extra computational cost is marginal. Following this setting, with the quantized weights and inputs, we have

$$y' = s\,(w_Q \cdot x_Q) + b. \tag{3}$$

The activations $y'$ are then quantized into N-bit fixed-point integers and serve as the input to the next layer.
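To make the arithmetic of Equation 3 concrete, below is a minimal sketch (our own illustration, not the authors' released implementation) that emulates the quantized computation of one output channel in plain PyTorch. On integer-friendly hardware the accumulation would be realized with bit-wise and/xnor plus bit-count kernels; here it is only simulated with integer tensors.

```python
import torch

def quantized_channel(x_q: torch.Tensor, w_q: torch.Tensor,
                      s: float, b: float) -> torch.Tensor:
    """Emulate y' = s * (w_Q . x_Q) + b (Eq. 3) for one output channel.

    x_q, w_q: N-bit codes stored as signed integer tensors of shape [D].
    s: shared per-layer floating-point scaling factor.
    b: floating-point bias.
    """
    acc = torch.sum(w_q.to(torch.int32) * x_q.to(torch.int32))  # integer-only accumulation
    return s * acc.float() + b                                   # one float multiply-add per channel

# Toy usage with hypothetical 4-bit codes.
w_q = torch.randint(-7, 8, (16,))   # signed 4-bit weight codes
x_q = torch.randint(0, 16, (16,))   # unsigned 4-bit activation codes
y = quantized_channel(x_q, w_q, s=0.01, b=0.1)
```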
### Inference

We will discuss our quantization functions in detail in the next subsection. Here we describe the runtime behavior of a trained any-precision DNN. Once training is finished, we can keep the weights at a higher precision level for storage, for example at 8-bit. As shown in Figure 2, we can then quantize the weights into a lower bit-width simply by bit-shifting. We experimentally observe that with the proposed training framework, the model accuracy changes smoothly and is consistently on par with, or even better than, dedicated models trained at the same bit-width.

Figure 2: Quantization of a kernel weight in the trained model into different precision levels: since we follow a uniform quantization pattern, when weight values are represented as signed integers the quantization can be implemented as a simple bit-shift (e.g., the 8-bit code 10010100 truncates to the 4-bit code 1001).

### Training

A number of quantization functions have been proposed in the literature for weights and activations respectively. Given a pre-trained DNN model, one can quantize its weights into low bit-widths and apply a suitable quantization function to the activations accordingly. However, when the number of bits gets smaller, the accuracy quickly drops due to the rough approximation of the weights and the large variation of the activations. The most widely adopted framework to obtain low-bit models is quantization-aware training, and the proposed method follows this framework.

We take the same fully-connected layer as an example. In training, we maintain floating-point weights $w$ underlying the actual quantized layer weights $w_Q$. In the feed-forward pass, given input $x_Q$, we follow Equation 3 to compute the raw output $y'$. Prior art shows the importance of the batch normalization (BN) (Ioffe and Szegedy 2015) layer in low-precision DNN training, and we follow accordingly: $y'$ is passed into a BN layer and then quantized into $y_Q$ as the input to the next layer.

Weights. We use a uniform quantization strategy similar to Zhou et al. (Zhou et al. 2016) with a scaling factor to approximate the weights. Given the floating-point weight $w$, we first apply the tanh function to normalize it into $[-1, 1]$ and then transform it into $\bar{w} \in [0, 1]$, i.e.,

$$\bar{w} = \frac{\tanh(w)}{2\max(|\tanh(w)|)} + 0.5. \tag{4}$$

We then quantize the normalized value into an N-bit integer $w_Q$ with scaling factor $s$, where

$$w_Q = \mathrm{INT}(\mathrm{round}(\bar{w} \cdot \mathrm{MAX}_N)) \quad \text{and} \quad s = \frac{1}{\mathrm{MAX}_N}. \tag{5}$$

Hereafter $\mathrm{MAX}_N$ denotes the upper bound of an N-bit integer and $\mathrm{INT}(\cdot)$ converts a floating-point value into an integer. Finally, the values are re-mapped back to approximate the range of the floating-point weights, giving

$$w_Q \leftarrow 2\,w_Q - \mathrm{MAX}_N \quad \text{and} \quad s \leftarrow \frac{E(|w|)}{\mathrm{MAX}_N}, \tag{6}$$

where $E(|w|)$ is the mean absolute value of all floating-point weights in the same layer. Eventually, we approximate $w$ with $s \cdot w_Q$ and execute the feed-forward pass with the quantized weights as shown in Equation 3; the scaling factor can be applied after the dot product of the integer vectors. In the backward pass, gradients are computed with respect to the underlying floating-point variable $w$, and updates are applied to $w$ as well. In this way, the relatively unreliable and nuanced update signals are accumulated gradually, which stabilizes the overall training process.
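As an illustration, the weight quantizer of Equations 4-6 can be sketched in a few lines of PyTorch. This is our own reading of the equations rather than the released code; in particular, the re-mapping constants follow the reconstruction of Equation 6 above, and gradients are routed to the latent floating-point weights via the straight-through estimator discussed in the next paragraph.

```python
import torch

def quantize_weights(w: torch.Tensor, n_bits: int):
    """Sketch of the per-layer weight quantizer (Eqs. 4-6).

    Returns the integer codes w_q, the scaling factor s, and a
    "fake-quantized" float copy that can be used directly in a standard
    PyTorch forward pass while gradients flow to the latent weights w.
    """
    max_n = 2 ** n_bits - 1                                             # MAX_N
    w_bar = torch.tanh(w) / (2 * torch.abs(torch.tanh(w)).max()) + 0.5  # Eq. 4: normalize into [0, 1]
    w_q = torch.round(w_bar * max_n)                                    # Eq. 5: integer codes in [0, MAX_N]
    w_q = 2 * w_q - max_n                                               # Eq. 6: re-map to a signed range
    s = torch.abs(w).mean() / max_n                                     # Eq. 6: per-layer scaling factor
    w_fake = s * w_q                                                    # float approximation of w
    # Straight-through estimator: forward uses w_fake, backward sees identity w.r.t. w.
    w_ste = w + (w_fake - w).detach()
    return w_q.to(torch.int64), s, w_ste
```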
Since not all operations involved are smooth functions that support back-propagation, we use the straight-through estimator (STE) (Bengio, Léonard, and Courville 2013) to approximate the gradients. For example, the round operation in Equation 5 has zero derivative almost everywhere; with STE, we assign $\partial\,\mathrm{round}(x)/\partial x := 1$.

Activations. For activation quantization in the feed-forward pass, we obtain the N-bit fixed-point representation by first clipping the value into $[0, 1]$ and then quantizing:

$$y_c = \mathrm{clip}(y', 0, 1), \qquad y_Q = \mathrm{INT}(\mathrm{round}(y_c \cdot \mathrm{MAX}_N)) \cdot \frac{1}{\mathrm{MAX}_N}. \tag{7}$$

In practice, we only compute the integer part of $y_Q$ and absorb the constant scaling factor into the persistent network parameters of the next layer. Let $L$ denote the final loss function. The gradient with respect to the activation $y'$ is then approximated as

$$\frac{\partial L}{\partial y'} = \frac{\partial L}{\partial y_Q}\,\frac{\partial y_Q}{\partial y_c}\,\frac{\partial y_c}{\partial y'}, \tag{8}$$

where

$$\frac{\partial y_c}{\partial y'} = \begin{cases} 1, & \text{if } 0 \le y' \le 1,\\ 0, & \text{otherwise}, \end{cases} \tag{9}$$

and the gradient of the round function, $\partial y_Q / \partial y_c$, is approximated with STE to be 1.

Dynamic Model-wise Quantization. In prior low-precision models, the bit-width N is fixed during the training process; at runtime, if we alter N, the model accuracy drops drastically. To encourage flexibility in the produced model, we propose to dynamically change N within the training stage to align the training and inference processes. However, the distribution of activations varies under different bit-widths N, especially when N is small (e.g., 1-bit), as shown in Figure 3. As a result, without special treatment, the dynamically changed N creates conflicts in learning the model, and it fails to converge in our experiments.

Figure 3: Activation distributions under different bit-widths for weights and inputs: we randomly generate a single-channel fully-connected layer and 1000 16-dimensional inputs; we then quantize the weights and inputs into 1, 2, 4, and 8 bits respectively and summarize the distributions of the activations under each bit-width. As observed in the figure, 1-bit quantization leads to a significant distribution shift compared to the 8-bit model, and the discrepancy under 2-bit is also obvious.

One widely adopted technique to adjust internal feature/activation distributions is Batch Normalization (BatchNorm) (Ioffe and Szegedy 2015). It works by normalizing the layer output across the batch dimension as

$$\hat{x}_i = \gamma\,\frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta, \quad i = 1, \dots, B, \tag{10}$$

where $B$ is the batch size, $i$ denotes the index within the current batch, and $\epsilon$ is a small value added to avoid numerical issues. $\mu$ and $\sigma^2$ are the mean and variance respectively, defined as

$$\mu = \frac{1}{B}\sum_{i=1}^{B} x_i \quad \text{and} \quad \sigma^2 = \frac{1}{B}\sum_{i=1}^{B}(x_i - \mu)^2. \tag{11}$$

During training, the BatchNorm layer keeps running averages of $\mu$ and $\sigma^2$, i.e.,

$$\mu = \lambda\mu + (1 - \lambda)\mu_t \quad \text{and} \quad \sigma^2 = \lambda\sigma^2 + (1 - \lambda)\sigma_t^2, \tag{12}$$

where $\mu_t$ and $\sigma_t^2$ are the values before the current update, and the decay rate $\lambda$ is a hyper-parameter set a priori.

But even with the BatchNorm layer, a dynamically changed N leads to failure of convergence in training due to the value distribution variations shown in the toy example in Figure 3. In our proposed framework, we therefore adopt dynamically changed BatchNorm layers to work with different N in training. More specifically, assume we have a list of bit-width candidates $\{n_k\}_{k=1}^{K}$; we keep $K$ copies of BatchNorm layer parameters and internal states $\{\Phi_k\}_{k=1}^{K}$. When the current training iteration works with $N = n_k$, we switch the BatchNorm layers to use $\Phi_k$ and update the corresponding copy.
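Keeping one BatchNorm copy per candidate bit-width can be implemented with a small wrapper module. The sketch below is illustrative only; the class and method names are our own and not taken from the paper's code.

```python
import torch.nn as nn

class SwitchableBatchNorm2d(nn.Module):
    """One BatchNorm2d copy (parameters + running statistics) per candidate
    bit-width, mirroring the per-bit-width copies Phi_k described above."""

    def __init__(self, num_features: int, bit_widths=(1, 2, 4, 8, 32)):
        super().__init__()
        self.bit_widths = list(bit_widths)
        self.bns = nn.ModuleList([nn.BatchNorm2d(num_features) for _ in self.bit_widths])
        self.index = len(self.bit_widths) - 1   # default to the full-precision copy

    def set_bit_width(self, n: int):
        # Select which copy the next forward pass will use and update.
        self.index = self.bit_widths.index(n)

    def forward(self, x):
        return self.bns[self.index](x)
```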
Algorithm 1: Training of the proposed any-precision DNN
Require: candidate bit-widths P = {n_1, ..., n_K}
1: Initialize the model M with floating-point parameters
2: Initialize K BatchNorm layer copies Φ_1, ..., Φ_K
3: for t = 1, ..., T_iters do
4:   Sample a data batch (x, y) from the training set D_train
5:   for n_p in P do
6:     Set the quantization bit-width N ← n_p
7:     Set the BatchNorm layers: M.replace(Φ_p)
8:     Feed-forward pass: y_{n_p} ← M(x)
9:     L ← L + loss(y_{n_p}, y)
10:  end for
11:  Back-propagate to update the network parameters
12: end for

A similar technique has been adopted by Yu et al. (Yu and Huang 2019) when dealing with varied network architectures. Parameters of all BatchNorm copies are kept after training and used in inference. Note that, compared with the total number of network parameters, the additional amount from the BatchNorm layers is negligible. We summarize the proposed method in Algorithm 1. With the proposed algorithm, we can train a DNN that is flexible to runtime bit-width adjustment.

Another optional component of our method is adding knowledge distillation (Hinton, Vinyals, and Dean 2015) to training. Knowledge distillation works by matching the outputs of two networks: when training a network, we can use a more complicated model or an ensemble of models to produce soft targets, by adjusting the temperature of the final softmax layer, and then use the soft targets to guide the network learning. In our framework, we apply this idea by generating soft targets from a high-precision model. More specifically, in each training iteration, we first set the quantization bit-width to the highest candidate $n_K$ and run a feed-forward pass to obtain soft targets $y_{\mathrm{soft}}$. Then, instead of accumulating the cross-entropy loss for each precision candidate, we use the KL divergence between the model prediction and $y_{\mathrm{soft}}$ as the loss. In our experiments, we observe that knowledge distillation generally leads to better performance at low-bit precision levels.

## Experiments

We first validate our method with several network architectures and datasets on the image classification task. The networks include an 8-layer CNN (named Model C in (Zhou et al. 2016)), AlexNet (Krizhevsky, Sutskever, and Hinton 2012), MobileNetV2 (Sandler et al. 2018), Resnet-8, Resnet-18, Resnet-20, and Resnet-50 (He et al. 2016). The datasets include Cifar-10 (Krizhevsky, Hinton et al. 2009), Street View House Numbers (SVHN) (Netzer et al. 2011), and ImageNet (Deng et al. 2009). We also evaluate our method on the image segmentation task to demonstrate its generalization.

### Implementation Details

We implement the whole framework in PyTorch (Paszke et al. 2017). On Cifar-10, we train the AlexNet, MobileNetV2, and Resnet-20 models for 400 epochs with initial learning rate 0.001, decayed by 0.1 at epochs {150, 250, 350}. On SVHN, the 8-layer CNN (named CNN-8) and Resnet-8 models are trained for 100 epochs with initial learning rate 0.001, decayed by 0.1 at epochs {50, 75, 90}. We combine the training and extra training data of SVHN as our training set. All models on Cifar-10 and SVHN are optimized with the Adam optimizer (Kingma and Ba 2014) without weight decay. On ImageNet, we train the dedicated Resnet-18 and Resnet-50 models for 120 epochs with initial learning rate 0.1, decayed by 0.1 at epochs {30, 60, 85, 95, 105}, using the SGD optimizer. For the any-precision models, we train for 80 epochs with initial learning rate 0.3, decayed by 0.1 at epochs {45, 60, 70}. For all models, following Zhou et al. (Zhou et al. 2016), we keep the first and last layers real-valued.
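For concreteness, here is a minimal PyTorch-style sketch of one iteration of Algorithm 1 combined with the recursive distillation variant used in our experiments (described in the next paragraph): the highest bit-width is supervised by the labels, and each lower bit-width is distilled from the logits of the next higher one. The helper model.set_precision(n) is hypothetical; it is assumed to switch both the quantizers and the per-bit-width BatchNorm copies (e.g., the SwitchableBatchNorm2d sketch above) to n bits.

```python
import torch
import torch.nn.functional as F

def train_step(model, x, y, optimizer, bit_widths=(32, 8, 4, 2, 1)):
    """One joint-training iteration over all candidate bit-widths."""
    optimizer.zero_grad()
    total_loss = 0.0
    teacher_logits = None
    for n in bit_widths:                          # highest precision first
        model.set_precision(n)                    # hypothetical helper: quantizers + BatchNorm copies
        logits = model(x)
        if teacher_logits is None:
            loss = F.cross_entropy(logits, y)     # ground-truth supervision for the highest bit-width
        else:
            loss = F.kl_div(F.log_softmax(logits, dim=1),
                            F.softmax(teacher_logits, dim=1),
                            reduction="batchmean")  # soft targets from the next higher bit-width
        total_loss = total_loss + loss
        teacher_logits = logits.detach()          # teacher for the next, lower bit-width
    total_loss.backward()                         # gradients accumulate on the shared float weights
    optimizer.step()
    return float(total_loss)
```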
In training, we train the networks with bit-width candidates {1, 2, 4, 8, 32}. Note that when the bit-width is set to 32, the model is a full-precision model with floating-point weights and activations. In testing, we evaluate the model at each bit-width in the list respectively. By default, we recursively apply knowledge distillation (KD) in training: concretely, we use the full-precision model to generate soft targets as supervision for the 8-bit model, the outputs of the 8-bit model for the 4-bit model, and so on.

### Comparison to Dedicated Models

We compare our method to very competitive baseline models at each precision level. For each bit-width we test, we dedicatedly train a low-precision model following the same training pipeline with a fixed bit-width for weights and activations. We compare the accuracy of these dedicated low-bit models to other recent works in this field to make sure the baselines are strong. For example, on Cifar-10, our 1-bit baseline achieves an accuracy of 92.07%, while the recent work from Ding et al. (Ding et al. 2019) reported 89.90%.

Our results are summarized in Table 1. We observed that binarizing the depth-wise convolution layers in MobileNetV2 may lead to training divergence, which is also reported by Phan et al. (Phan et al. 2020); therefore, we replace depth-wise convolutions with group convolutions in the MobileNetV2 models. As shown in the table, on all three datasets the proposed any-precision DNN generally achieves performance comparable to the competitive dedicated models. At the same time, our model is more compact: as shown in Table 2, compared with five individual models (220MB in total), our any-precision model (104MB) saves more than 50% of the parameters.

| Dataset | Model | 1-bit (Dedi. / Ours) | 2-bit (Dedi. / Ours) | 4-bit (Dedi. / Ours) | 8-bit (Dedi. / Ours) | FP32 (Dedi. / Ours) |
|---|---|---|---|---|---|---|
| Cifar-10 | Resnet-20 | 92.07 / 92.15 | 93.55 / 93.97 | 93.71 / 93.95 | 93.66 / 93.80 | 94.08 / 93.98 |
| Cifar-10 | AlexNet | 92.56 / 93.00 | 94.06 / 94.08 | 94.02 / 94.26 | 93.82 / 94.24 | 93.74 / 94.22 |
| Cifar-10 | MobileNetV2 | 80.17 / 80.06 | 89.69 / 90.74 | 92.01 / 92.27 | 92.44 / 92.32 | 94.01 / 93.84 |
| SVHN | Resnet-8 | 92.94 / 91.65 | 95.91 / 94.78 | 95.15 / 95.46 | 94.64 / 95.39 | 94.60 / 95.36 |
| SVHN | CNN-8 | 90.94 / 88.21 | 96.45 / 94.94 | 97.04 / 96.19 | 97.04 / 96.22 | 97.10 / 96.29 |
| ImageNet | Resnet-18 | 55.06 / 54.62 | 63.65 / 64.19 | 68.15 / 67.96 | 68.48 / 68.04 | 69.27 / 68.16 |
| ImageNet | Resnet-50 | 61.08 / 63.18 | 69.44 / 73.24 | 71.24 / 74.75 | 74.71 / 74.91 | 75.95 / 74.96 |

Table 1: Comparison of the proposed any-precision DNN to dedicated models: the proposed method matches the strong baseline accuracy in most cases and occasionally even outperforms the baselines at low precision. We hypothesize that the gain mainly comes from the knowledge distillation from high-precision models in training. Dedi.: dedicated models.

| Model | Shared | 1-bit | 2-bit | 4-bit | 8-bit | 32-bit | Total |
|---|---|---|---|---|---|---|---|
| Dedicated | 0 | 22.3 | 24.9 | 30.1 | 40.4 | 102.4 | 220.1 |
| Ours | 102.0 | 0.4 | 0.4 | 0.4 | 0.4 | 0.4 | 104.0 |

Table 2: Resnet-50 model size comparison (in megabytes): the proposed model brings only a small overhead over the size of an FP32 model, yet achieves a flexibility that would otherwise require 5 independent models.

### Post-Training Quantization Methods

We compare our method with three alternative post-training quantization methods, experimenting with Resnet-50 on ImageNet. The first, naive baseline directly quantizes dedicated models by bit-shifting. In other words, to obtain an (n-1)-bit
model from a trained n-bit model, as is done for the any-precision DNN in Figure 2, we simply drop the least-significant bit of all weights. Unsurprisingly, this strategy fails dramatically on the challenging large-scale benchmark, as shown in Table 3. The second strategy follows the same bit-shifting but adds a BatchNorm calibration process, in which the BatchNorm statistics are re-calculated by feed-forwarding a number of training samples. As shown in Table 3, the BatchNorm calibration helps at 4-bit but still fails in the 1- and 2-bit settings.

With the proposed method, we can leverage this post-training calibration technique to fill in the gaps of the training candidate bit-width list: after training for the 1, 2, 4, 8, and 32-bit precision levels, we can further calibrate the model under the remaining 3, 5, 6, and 7-bit settings to obtain the missing copies of the BatchNorm layer parameters (a sketch of this calibration step is given below). At runtime, we can then freely choose any precision level from 1 to 8 bits.

In addition, we compare with a recently proposed method, ACIQ (Banner, Nahshan, and Soudry 2019), which introduces analytical weight clipping, adaptive bit allocation, and bias correction for post-training quantization. As shown in Table 3, our method achieves much better accuracy under the low bit-width settings.

| Models \ Runtime bits | 1 | 2 | 4 | 8 | FP32 |
|---|---|---|---|---|---|
| BS | 0.100 | 0.136 | 42.70 | 74.68 | 75.95 |
| BS+BC | 0.106 | 0.352 | 57.12 | 73.73 | 75.95 |
| ACIQ | 0.116 | 0.324 | 71.64¹ | 75.83 | 75.95 |
| Ours | 63.18 | 73.24 | 74.75 | 74.91 | 74.96 |

Table 3: Comparison to other post-training quantization methods. All models are Resnet-50 trained on ImageNet. When the bit-width drops below the original training setting, our method consistently outperforms them. BS: bit-shifting from the FP32 model. BC: BatchNorm calibration from the FP32 model.

¹We used the paper's official code (https://github.com/submission2019/cnn-quantization), but we could not reproduce the claimed accuracy of 73.8% for the 4-bit model.
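The BatchNorm calibration used to fill in the untrained bit-widths might be sketched as follows. This is an illustrative snippet, not the authors' code; model.set_precision is the same hypothetical helper as above, and the number of calibration batches is an arbitrary choice.

```python
import torch

@torch.no_grad()
def calibrate_batchnorm(model, data_loader, bit_width, num_batches=100):
    """Recompute BatchNorm running statistics for an untrained bit-width
    (e.g., 3, 5, 6, or 7 bits) by forwarding a few training batches."""
    model.set_precision(bit_width)     # hypothetical: switch quantizers + BatchNorm copy
    model.train()                      # train() mode updates BN running mean/variance in forward
    for i, (x, _) in enumerate(data_loader):
        if i >= num_batches:
            break
        model(x)                       # forward only; no gradients, no parameter updates
    model.eval()
```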
### Dynamically Changed BatchNorm Layers

To understand how the dynamically changed BatchNorm layers help in our framework, we visualize the activation value distributions of several layers of an any-precision AlexNet. More specifically, we look at how the activation value distribution changes after the 2nd convolutional layer and the BatchNorm layer following it when the runtime precision level is set to 1, 2, 4, and 8 bits respectively. As shown in Figure 4, when running at 1-bit precision, the activation distribution after the convolutional layer is obviously off from the others; the following BatchNorm layer rectifies the distributions; the next convolutional layer then creates this distribution variation again. It is clear that by keeping multiple copies of the BatchNorm layer parameters for different bit-widths, we can minimize the input variations to the convolutional layers, and hence the same set of convolutional layer parameters can support any precision at runtime.

Figure 4: Activation value distributions of several layers in an any-precision AlexNet (a: 2nd convolutional layer; b: 2nd BatchNorm layer): low-bit quantization leads to value distributions different from the others after the convolutional layers, but the correspondingly switched BatchNorm layer rectifies the mismatch.

### Ablation Studies

Candidate bit-width list. We study how the candidate bit-width list used when training the any-precision DNN influences the testing performance at other bit-widths. Table 4 shows the testing accuracy of models trained under different bit-width combinations. We observe that training with more candidate bit-widths generally leads to better generalization to the others, and that the candidate list should cover the extreme cases encountered at runtime. For example, the {1, 8}-bit combination performs more stably across different runtime bit-widths than the {2, 8}- and {4, 8}-bit combinations. Since better coverage in training takes longer for the model to converge, this observation can guide the bit-width selection under limited training resources.

| Train \ Test | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| 1,2,4,8 | 91.80 | 93.48 | 93.22 | 93.70 | 93.58 | 93.52 | 93.53 | 93.61 |
| 1,8 | 91.95 | 89.57 | 92.97 | 93.23 | 93.35 | 93.34 | 93.36 | 93.39 |
| 2,8 | 10.00 | 93.58 | 93.50 | 93.83 | 93.91 | 93.90 | 94.00 | 93.93 |
| 4,8 | 10.00 | 72.78 | 93.22 | 93.80 | 93.75 | 93.79 | 93.81 | 93.79 |

Table 4: Classification accuracy of Resnet-20 on Cifar-10 with different bit-width combinations in training.

Knowledge distillation. We study the influence of knowledge distillation during any-precision training. The first case is w/o KD, i.e., no KD is used (as in Algorithm 1). The second case is KD, i.e., the outputs at the highest bit-width are supervised by the ground truth, and the outputs at every other bit-width are supervised by the output logits of the nearest superior bit-width. As shown in Table 5, on both Resnet-20 and Resnet-50, KD improves accuracy in the low-bit settings. Our hypothesis is that, for lower-bit models, soft logits from higher-bit models regularize the training better than the ground-truth labels.

| Models \ Bits | 1 | 2 | 4 | 8 | FP32 |
|---|---|---|---|---|---|
| Resnet-20 on Cifar-10, w/o KD | 91.67 | 93.72 | 93.83 | 93.94 | 93.82 |
| Resnet-20 on Cifar-10, KD | 92.15 | 93.97 | 93.95 | 93.80 | 93.98 |
| Resnet-50 on ImageNet, w/o KD | 62.42 | 72.88 | 74.43 | 74.68 | 74.98 |
| Resnet-50 on ImageNet, KD | 63.18 | 73.24 | 74.75 | 74.91 | 74.96 |

Table 5: Impact of knowledge distillation on classification accuracy.

### Application to Semantic Segmentation

To demonstrate the generalization of our method to other tasks, we apply the any-precision scheme to semantic segmentation. We train DeepLabV3 (Chen et al. 2017) with a Resnet-50 backbone on the Pascal VOC 2012 segmentation dataset (Everingham et al. 2015), with the SBD dataset (Hariharan et al. 2011) as ground-truth augmentation. We use a publicly available PyTorch training codebase². All models are pretrained from the corresponding classification models on ImageNet, and we do not use any post-processing for segmentation. We summarize our results in Table 6 under two evaluation metrics, mIoU and top-1 accuracy. Our low-bit models perform better than the dedicated counterparts, and in the 8-bit and 32-bit settings our models achieve comparable results. Similar to the results on the classification tasks, we attribute this improvement to the joint optimization and knowledge distillation.

²https://github.com/kazuto1011/deeplab-pytorch

| Models | Metric | 1 | 2 | 4 | 8 | FP32 |
|---|---|---|---|---|---|---|
| Dedicated | mIoU | 59.5 | 71.5 | 74.1 | 75.6 | 76.2 |
| Dedicated | Acc. | 89.7 | 93.5 | 94.2 | 94.5 | 94.6 |
| Ours | mIoU | 61.3 | 72.1 | 74.3 | 75.4 | 75.9 |
| Ours | Acc. | 90.6 | 93.7 | 94.3 | 94.4 | 94.6 |

Table 6: Segmentation performance comparison on the Pascal VOC 2012 segmentation dataset under mIoU and top-1 accuracy.

Finally, we study how the multiple bit-width joint training influences parameter learning.
With STE, the parameter being updated in training is essentially the FP32 weight $w$, and the gradient on $w$ comes from the losses at different bit-widths. A necessary condition for the joint training to work is that the gradients from different bit-widths are consistent with each other. Inspired by (Zhao et al. 2018), we analyze the gradient consistency between bit-widths by computing the Update Compliance Average (UCA), defined as the average cosine similarity of the gradients from two bit-widths over multiple training steps. During joint training, we observed that different bit-widths share a very large UCA (within [0.9, 1]), indicating consistent gradient directions and thus training convergence. We also found that neighboring bit-widths share a larger UCA than the others, e.g., the UCA between 1 and 2 bits (0.929) is larger than that between 1 bit and the other bit-widths (0.909 between 1 and 32 bits). This motivated us to employ recursive knowledge distillation in joint training.

As indicated by UCA, we also observe a small gradient direction gap between bit-widths. We think this makes our model naturally more robust to adversarial attacks than a single dedicated model, as a by-product of joint training: by switching between bit-widths, an any-precision model running at one bit-width can show robustness to adversarial attacks targeting another bit-width. We experimented with the FGSM attack (Goodfellow, Shlens, and Szegedy 2015): any-precision models show improved defensive performance compared with dedicated ones, e.g., the classification accuracy of the FP32 model is improved by more than 10%. We believe our any-precision scheme can serve as an add-on to other defensive methods against adversarial attacks.

## Conclusion

In this paper, we introduce any-precision DNNs to address the practical efficiency/accuracy trade-off dilemma from a new perspective. Instead of seeking a better operating point, we enable runtime adjustment of the model precision level to support flexible efficiency/accuracy trade-offs without additional storage or computation cost. The model can be stored at 8-bit or higher precision and run at a lower bit-width specified at runtime. Our flexible model achieves accuracy comparable to dedicatedly trained low-precision models and surpasses other post-training quantization methods.

## References

Banner, R.; Nahshan, Y.; and Soudry, D. 2019. Post training 4-bit quantization of convolutional networks for rapid deployment. In Advances in Neural Information Processing Systems, 7948-7956.

Bengio, Y.; Léonard, N.; and Courville, A. 2013. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432.

Cai, H.; Zhu, L.; and Han, S. 2018. ProxylessNAS: Direct neural architecture search on target task and hardware. arXiv preprint arXiv:1812.00332.

Cai, Z.; He, X.; Sun, J.; and Vasconcelos, N. 2017. Deep learning with low precision by half-wave Gaussian quantization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5918-5926.

Chen, L.-C.; Papandreou, G.; Schroff, F.; and Adam, H. 2017. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587.

Chen, X.; Xie, L.; Wu, J.; and Tian, Q. 2019. Progressive Differentiable Architecture Search: Bridging the Depth Gap between Search and Evaluation. arXiv preprint arXiv:1904.12760.

Choi, J.; Wang, Z.; Venkataramani, S.; Chuang, P. I.-J.; Srinivasan, V.; and Gopalakrishnan, K. 2018. PACT: Parameterized clipping activation for quantized neural networks. arXiv preprint arXiv:1805.06085.
Courbariaux, M.; Hubara, I.; Soudry, D.; El-Yaniv, R.; and Bengio, Y. 2016. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830.

Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09.

Ding, R.; Chin, T.-W.; Liu, Z.; and Marculescu, D. 2019. Regularizing activation distribution for training binarized deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 11408-11417.

Dong, Z.; Yao, Z.; Gholami, A.; Mahoney, M.; and Keutzer, K. 2019. HAWQ: Hessian AWare Quantization of Neural Networks with Mixed-Precision. arXiv preprint arXiv:1905.03696.

Everingham, M.; Eslami, S. A.; Van Gool, L.; Williams, C. K.; Winn, J.; and Zisserman, A. 2015. The Pascal Visual Object Classes Challenge: A Retrospective. International Journal of Computer Vision 111(1): 98-136.

Figurnov, M.; Collins, M. D.; Zhu, Y.; Zhang, L.; Huang, J.; Vetrov, D.; and Salakhutdinov, R. 2017. Spatially adaptive computation time for residual networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1039-1048.

Goodfellow, I.; Shlens, J.; and Szegedy, C. 2015. Explaining and Harnessing Adversarial Examples. In International Conference on Learning Representations. URL http://arxiv.org/abs/1412.6572.

Hariharan, B.; Arbelaez, P.; Bourdev, L.; Maji, S.; and Malik, J. 2011. Semantic Contours from Inverse Detectors. In International Conference on Computer Vision (ICCV).

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2015. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, 1026-1034.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770-778.

Hinton, G.; Vinyals, O.; and Dean, J. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.

Howard, A.; Sandler, M.; Chu, G.; Chen, L.-C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. 2019. Searching for MobileNetV3. arXiv preprint arXiv:1905.02244.

Ioffe, S.; and Szegedy, C. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.

Jung, S.; Son, C.; Lee, S.; Son, J.; Han, J.-J.; Kwak, Y.; Hwang, S. J.; and Choi, C. 2019. Learning to quantize deep networks by optimizing quantization intervals with task loss. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4350-4359.

Kingma, D. P.; and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Krizhevsky, A.; Hinton, G.; et al. 2009. Learning multiple layers of features from tiny images. Technical report, Citeseer.

Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 1097-1105.

Liu, C.; Zoph, B.; Neumann, M.; Shlens, J.; Hua, W.; Li, L.-J.; Fei-Fei, L.; Yuille, A.; Huang, J.; and Murphy, K. 2018a. Progressive neural architecture search. In Proceedings of the European Conference on Computer Vision (ECCV), 19-34.
Liu, Z.; Wu, B.; Luo, W.; Yang, X.; Liu, W.; and Cheng, K.-T. 2018b. Bi-Real Net: Enhancing the performance of 1-bit CNNs with improved representational capability and advanced training algorithm. In Proceedings of the European Conference on Computer Vision (ECCV), 722-737.

Nagel, M.; van Baalen, M.; Blankevoort, T.; and Welling, M. 2019. Data-Free Quantization through Weight Equalization and Bias Correction. arXiv preprint arXiv:1906.04721.

Netzer, Y.; Wang, T.; Coates, A.; Bissacco, A.; Wu, B.; and Ng, A. Y. 2011. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning.

Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; and Lerer, A. 2017. Automatic Differentiation in PyTorch. In NIPS Autodiff Workshop.

Phan, H.; He, Y.; Savvides, M.; Shen, Z.; et al. 2020. MoBiNet: A mobile binary network for image classification. In The IEEE Winter Conference on Applications of Computer Vision, 3453-3462.

Rastegari, M.; Ordonez, V.; Redmon, J.; and Farhadi, A. 2016. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, 525-542. Springer.

Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; and Chen, L.-C. 2018. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4510-4520.

Tan, M.; and Le, Q. V. 2019. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. arXiv preprint arXiv:1905.11946.

Tang, W.; Hua, G.; and Wang, L. 2017. How to train a compact binary neural network with high accuracy? In Thirty-First AAAI Conference on Artificial Intelligence.

Teerapittayanon, S.; McDanel, B.; and Kung, H.-T. 2016. BranchyNet: Fast inference via early exiting from deep neural networks. In 2016 23rd International Conference on Pattern Recognition (ICPR), 2464-2469. IEEE.

Veit, A.; and Belongie, S. 2018. Convolutional networks with adaptive inference graphs. In Proceedings of the European Conference on Computer Vision (ECCV), 3-18.

Wang, K.; Liu, Z.; Lin, Y.; Lin, J.; and Han, S. 2019. HAQ: Hardware-Aware Automated Quantization with Mixed Precision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 8612-8620.

Wu, Z.; Nagarajan, T.; Kumar, A.; Rennie, S.; Davis, L. S.; Grauman, K.; and Feris, R. 2018. BlockDrop: Dynamic inference paths in residual networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 8817-8826.

Yu, J.; and Huang, T. 2019. Universally slimmable networks and improved training techniques. arXiv preprint arXiv:1903.05134.

Yu, J.; Yang, L.; Xu, N.; Yang, J.; and Huang, T. 2018. Slimmable neural networks. arXiv preprint arXiv:1812.08928.

Zhang, D.; Yang, J.; Ye, D.; and Hua, G. 2018. LQ-Nets: Learned quantization for highly accurate and compact deep neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), 365-382.

Zhao, X.; Li, H.; Shen, X.; Liang, X.; and Wu, Y. 2018. A modulation module for multi-task learning with applications in image retrieval. In Proceedings of the European Conference on Computer Vision (ECCV), 401-416.

Zhou, S.; Wu, Y.; Ni, Z.; Zhou, X.; Wen, H.; and Zou, Y. 2016. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160.

Zhuang, B.; Shen, C.; Tan, M.; Liu, L.; and Reid, I. 2018. Towards effective low-bitwidth convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7920-7928.