# analysis_of_quantized_models__7f20a9e8.pdf

Published as a conference paper at ICLR 2019

ANALYSIS OF QUANTIZED MODELS

Lu Hou¹, Ruiliang Zhang¹,², James T. Kwok¹
¹Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong
{lhouab,jamesk}@cse.ust.hk
²TuSimple
ruiliang.zhang@tusimple.ai

ABSTRACT

Deep neural networks are usually huge, which significantly limits their deployment on low-end devices. In recent years, many weight-quantized models have been proposed. They have small storage and fast inference, but training can still be time-consuming. This can be improved with distributed learning. To reduce the high communication cost due to worker-server synchronization, gradient quantization has recently also been proposed to train deep networks with full-precision weights. In this paper, we theoretically study how the combination of both weight and gradient quantization affects convergence. We show that (i) weight-quantized models converge to an error related to the weight quantization resolution and weight dimension; (ii) quantizing gradients slows convergence by a factor related to the gradient quantization resolution and dimension; and (iii) clipping the gradient before quantization renders this factor dimension-free, thus allowing the use of fewer bits for gradient quantization. Empirical experiments confirm the theoretical convergence results, and demonstrate that quantized networks can speed up training and achieve performance comparable to full-precision networks.

1 INTRODUCTION

Deep neural networks are usually huge. The high demand in time and space can significantly limit deployment on low-end devices. To alleviate this problem, many approaches have recently been proposed to compress deep networks. One direction is network quantization, which represents each network weight with a small number of bits. Besides significantly reducing the model size, it also accelerates network training and inference. Many weight quantization methods aim at approximating the full-precision weights in each iteration (Courbariaux et al., 2015; Lin et al., 2016; Rastegari et al., 2016; Li & Liu, 2016; Lin et al., 2017; Guo et al., 2017). Recently, loss-aware quantization minimizes the loss directly w.r.t. the quantized weights (Hou et al., 2017; Hou & Kwok, 2018; Leng et al., 2018), and often achieves better performance than approximation-based methods.

Distributed learning can further speed up training of weight-quantized networks (Dean et al., 2012). A key challenge is reducing the expensive communication cost incurred during synchronization of the gradients and model parameters (Li et al., 2014a;b). Recently, algorithms that sparsify (Aji & Heafield, 2017; Wangni et al., 2017) or quantize the gradients (Seide et al., 2014; Wen et al., 2017; Alistarh et al., 2017; Bernstein et al., 2018) have been proposed.

In this paper, we consider quantization of both the weights and gradients in a distributed environment. Quantizing both weights and gradients has been explored in DoReFa-Net (Zhou et al., 2016), QNN (Hubara et al., 2017), WAGE (Wu et al., 2018) and ZipML (Zhang et al., 2017). We differ from them in two aspects. First, existing methods mainly consider learning on a single machine, and gradient quantization is used to reduce the computations in backpropagation. In contrast, we consider a distributed environment, and use gradient quantization to reduce communication cost and accelerate distributed learning of weight-quantized networks.
Second, while DoReFa-Net, QNN and WAGE show impressive empirical results on the quantized network, theoretical guarantees are not provided. ZipML provides a convergence analysis, but it is limited to stochastic weight quantization, the square loss with a linear model, and requires the stochastic gradients to be unbiased. This can be restrictive, as most state-of-the-art weight quantization methods (Rastegari et al., 2016; Lin et al., 2016; Li & Liu, 2016; Guo et al., 2017; Hou et al., 2017; Hou & Kwok, 2018) are deterministic, and the resultant stochastic gradients are biased. In this paper, we relax the restrictions on the loss function, and study in an online learning setting how the gradient precision affects convergence of weight-quantized networks in a distributed environment. The main findings are:

1. With either full-precision or quantized gradients, the average regret of loss-aware weight quantization does not converge to zero, but to an error related to the weight quantization resolution $\Delta_w$ and dimension $d$. The smaller the $\Delta_w$ or $d$, the smaller the error (Theorems 1 and 2).

2. With either full-precision or quantized gradients, the average regret converges to this error at a $O(1/\sqrt{T})$ rate, where $T$ is the number of iterations. However, gradient quantization slows convergence (relative to using full-precision gradients) by a factor related to the gradient quantization resolution $\Delta_g$ and $d$. The larger the $\Delta_g$ or $d$, the slower the convergence (Theorems 1 and 2). This can be problematic when (i) the weight-quantized model has a large $d$ (e.g., deep networks); and (ii) the communication cost is a bottleneck in the distributed setting, which favors a small number of bits for the gradients, and thus a large $\Delta_g$.

3. For gradients following the normal distribution, gradient clipping renders the speed degradation mentioned above dimension-free. However, an additional error is incurred. The convergence speedup and the error are related to how aggressively clipping is performed. More aggressive clipping results in faster convergence, but a larger error (Theorem 3).

4. Empirical results show that quantizing gradients significantly reduces communication cost, and gradient clipping makes the speed degradation caused by gradient quantization negligible. With quantized clipped gradients, distributed training of weight-quantized networks is much faster, while accuracy comparable to that obtained with full-precision gradients is maintained (Section 4).

Notations. For a vector $x$, $\sqrt{x}$ is the element-wise square root, $x^2$ is the element-wise square, $\mathrm{Diag}(x)$ returns a diagonal matrix with $x$ on the diagonal, and $x \odot y$ is the element-wise multiplication of vectors $x$ and $y$. For a matrix $Q$, $\|x\|_Q^2 = x^\top Q x$. For a matrix $X$, $\sqrt{X}$ is the element-wise square root, and $\mathrm{diag}(X)$ returns a vector extracted from the diagonal elements of $X$.

2 PRELIMINARIES

2.1 ONLINE LEARNING

Online learning continually adapts the model with a sequence of observations. It has been commonly used in the analysis of deep learning optimizers (Duchi et al., 2011; Kingma & Ba, 2015; Reddi et al., 2018). At time $t$, the algorithm picks a model with parameter $w_t \in S$, where $S$ is a convex compact set. The algorithm then incurs a loss $f_t(w_t)$. After $T$ rounds, the performance is usually evaluated by the regret $R(T) = \sum_{t=1}^{T} \big(f_t(w_t) - f_t(w^*)\big)$ and the average regret $R(T)/T$, where $w^* = \arg\min_{w \in S} \sum_{t=1}^{T} f_t(w)$ is the best model parameter in hindsight.
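To make the regret notion concrete, the following minimal, self-contained Python sketch (not from the paper; the quadratic losses, step-size schedule, and feasible set are illustrative choices) runs projected online gradient descent and reports the average regret $R(T)/T$ against the best fixed parameter in hindsight:

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, radius, lr = 10, 2000, 1.0, 0.5

def project(w):
    """Project w onto the convex compact set S = {w : ||w||_2 <= radius}."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

# Stream of quadratic losses f_t(w) = 0.5 * ||w - z_t||^2.
z = rng.normal(loc=0.5, scale=1.0, size=(T, d))

w = np.zeros(d)
incurred = []
for t in range(T):
    incurred.append(0.5 * np.sum((w - z[t]) ** 2))   # loss f_t(w_t)
    grad = w - z[t]                                   # gradient of f_t at w_t
    step = lr / np.sqrt(t + 1)                        # decaying step size
    w = project(w - step * grad)                      # projected OGD update

# Best fixed parameter in hindsight: for these losses, the minimizer of
# sum_t f_t(w) over S is the projection of the mean of the z_t onto S.
w_star = project(z.mean(axis=0))
best = 0.5 * np.sum((z - w_star) ** 2)

regret = sum(incurred) - best
print(f"average regret R(T)/T = {regret / T:.4f}")
```

With this decaying step size, the printed average regret shrinks as $T$ grows, which is the kind of $O(1/\sqrt{T})$ behaviour that the theorems below quantify for quantized models, up to the quantization-induced error terms.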
2.2 WEIGHT QUANTIZATION

In BinaryConnect (Courbariaux et al., 2015), each weight is binarized using the sign function, either deterministically or stochastically. In ternary-connect (Lin et al., 2016), each weight is stochastically quantized to $\{-1, 0, 1\}$. Stochastic weight quantization often suffers severe accuracy degradation, while deterministic weight quantization (as in the binary-weight-network (BWN) (Rastegari et al., 2016) and ternary weight network (TWN) (Li & Liu, 2016)) achieves much better performance. In this paper, we focus on loss-aware weight quantization, which further improves performance by considering the effect of weight quantization on the loss. Examples include loss-aware binarization (LAB) (Hou et al., 2017) and loss-aware quantization (LAQ) (Hou & Kwok, 2018).

Let the full-precision weights from all $L$ layers in the deep network be $w$. The corresponding quantized weight is denoted $Q_w(w) = \hat{w}$, where $Q_w(\cdot)$ is the weight quantization function. At the $(t+1)$th iteration, the second-order Taylor expansion of $f_t(\hat{w})$, i.e., $f_t(\hat{w}_t) + \nabla f_t(\hat{w}_t)^\top (\hat{w} - \hat{w}_t) + \frac{1}{2} (\hat{w} - \hat{w}_t)^\top H_t (\hat{w} - \hat{w}_t)$, is minimized w.r.t. $\hat{w}$, where $H_t$ is the Hessian at $\hat{w}_t$. A direct computation of $H_t$ is expensive. In practice, it is approximated by $\mathrm{Diag}(\sqrt{\hat{v}_t})$, where $\hat{v}_t$ is the moving average

$$\hat{v}_t = \beta \hat{v}_{t-1} + (1-\beta) \hat{g}_t^2 = \sum_{j=1}^{t} (1-\beta) \beta^{t-j} \hat{g}_j^2, \qquad (1)$$

with $\hat{g}_t$ the stochastic gradient and $\beta < 1$. This moving average is readily available in popular deep network optimizers such as RMSProp and Adam. $\mathrm{Diag}(\sqrt{\hat{v}_t})$ is also an estimate of $\mathrm{Diag}(\sqrt{\mathrm{diag}(H_t^2)})$ (Dauphin et al., 2015). Computationally, the quantized weight is obtained by first performing a preconditioned gradient descent step $w_{t+1} = w_t - \eta_t \mathrm{Diag}(\sqrt{\hat{v}_t})^{-1} \hat{g}_t$, followed by quantization via solving the following problem:

$$\hat{w}_{t+1} = Q_w(w_{t+1}) = \arg\min_{\hat{w}} \|w_{t+1} - \hat{w}\|^2_{\mathrm{Diag}(\sqrt{\hat{v}_t})} \quad \text{s.t.} \quad \hat{w} = \alpha b, \; \alpha > 0, \; b \in (S_w)^d. \qquad (2)$$

For simplicity of notation, we assume that the same scaling parameter $\alpha$ is used for all layers. Extension to layer-wise scaling is straightforward. For binarization, $S_w = \{-1, +1\}$, the weight quantization resolution is $\Delta_w = 1$, and a simple closed-form solution is obtained in (Hou et al., 2017). For $m$-bit linear quantization, $S_w = \{-M_k, \dots, -M_1, M_0, M_1, \dots, M_k\}$, where $k = 2^{m-1} - 1$, $0 = M_0 <$
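For the binarization case above ($S_w = \{-1, +1\}$), one update following Eqs. (1)-(2) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name and hyperparameter values are arbitrary, and the closed-form solution used for (2) is the one derived for LAB in Hou et al. (2017), i.e., $b = \mathrm{sign}(w_{t+1})$ with $\alpha$ a $\sqrt{\hat{v}_t}$-weighted average of $|w_{t+1}|$.

```python
import numpy as np

def lab_step(w, g_hat, v_hat, lr=0.01, beta=0.999, eps=1e-8):
    """One loss-aware binarization step, following Eqs. (1)-(2).

    w     : full-precision weights maintained alongside the quantized model
    g_hat : stochastic gradient evaluated at the current quantized weights
    v_hat : moving average of squared gradients (second-moment estimate)
    Returns the updated (w, quantized weights, v_hat).
    """
    # Eq. (1): moving average of squared gradients, as kept by RMSProp/Adam.
    v_hat = beta * v_hat + (1.0 - beta) * g_hat ** 2
    d = np.sqrt(v_hat) + eps                     # diagonal of Diag(sqrt(v_hat))

    # Preconditioned gradient descent on the full-precision weights.
    w = w - lr * g_hat / d

    # Eq. (2) with S_w = {-1, +1}: minimize ||w - alpha*b||^2_{Diag(d)}
    # over alpha > 0 and b in {-1, +1}^d.  Closed form (Hou et al., 2017):
    b = np.where(w >= 0.0, 1.0, -1.0)            # b = sign(w)
    alpha = np.dot(d, np.abs(w)) / np.sum(d)     # d-weighted mean of |w|
    w_hat = alpha * b
    return w, w_hat, v_hat

# Toy usage on a random "layer"; in practice g_hat comes from backpropagation
# through the network evaluated at the quantized weights.
rng = np.random.default_rng(0)
w = rng.normal(size=1000)
v_hat = np.zeros_like(w)
for _ in range(5):
    g_hat = rng.normal(size=w.shape)             # stand-in stochastic gradient
    w, w_hat, v_hat = lab_step(w, g_hat, v_hat)
print(w_hat[:5])
```

In the distributed setting considered in this paper, the stochastic gradient exchanged between workers and the server would itself be quantized (and possibly clipped), which is exactly the interaction the subsequent analysis studies.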