# Adaptive Gradient Quantization for Data-Parallel SGD

Fartash Faghri^{1,2}, Iman Tabrizian^{1,2}, Ilia Markov^{3}, Dan Alistarh^{3,4}, Daniel M. Roy^{1,2}, Ali Ramezani-Kebrya^{2}

^{1}University of Toronto, ^{2}Vector Institute, ^{3}IST Austria, ^{4}Neural Magic

faghri@cs.toronto.edu, iman.tabrizian@mail.utoronto.ca, alir@vectorinstitute.ai

## Abstract

Many communication-efficient variants of SGD use gradient quantization schemes. These schemes are often heuristic and fixed over the course of training. We empirically observe that the statistics of gradients of deep models change during training. Motivated by this observation, we introduce two adaptive quantization schemes, ALQ and AMQ. In both schemes, processors update their compression schemes in parallel by efficiently computing sufficient statistics of a parametric distribution. We improve the validation accuracy by almost 2% on CIFAR-10 and 1% on ImageNet in challenging low-cost communication setups. Our adaptive methods are also significantly more robust to the choice of hyperparameters.

## 1 Introduction

Figure 1: Changes in the average variance of normalized gradient coordinates in a ResNet-32 model trained on CIFAR-10 (average variance vs. iteration). Colors distinguish runs with different seeds. The learning rate is decayed by a factor of 10 twice, at 40K and 60K iterations. The variance changes rapidly during the first epoch; the next noticeable change happens after the first learning-rate drop, and another appears after the second drop.

Stochastic gradient descent (SGD) and its variants are currently the method of choice for training deep models. Yet, large datasets cannot always be processed on a single computational node due to memory and scalability limitations. Data-parallel SGD is a remarkably scalable variant, in particular on multi-GPU systems [1–10]. However, despite its many advantages, distribution introduces new challenges for optimization algorithms. In particular, data-parallel SGD has a large communication cost due to the need to transmit potentially huge gradient vectors. Ideally, we want distributed optimization methods that match the performance of SGD on a single hypothetical super machine while paying a negligible communication cost.

A common approach to reducing the communication cost of data-parallel SGD is gradient compression and quantization [4, 11–16]. In full-precision data-parallel SGD, each processor broadcasts its locally computed stochastic gradient vector at every iteration, whereas in quantized data-parallel SGD, each processor compresses its stochastic gradient before broadcasting. Current quantization methods are either designed heuristically or fixed prior to training. Convergence rates in stochastic optimization are controlled by the trace of the gradient covariance matrix, which we refer to as the gradient variance in this paper [17]. As Fig. 1 shows, no fixed method can be optimal throughout the entire training because the distribution of gradients changes. A quantization method that is optimal at the first iteration will not be optimal after only a single epoch.

In this paper, we propose two adaptive methods for quantizing the gradients in data-parallel SGD. We study methods that are defined by a norm and a set of quantization levels.
In Adaptive Level Quantization (ALQ), we minimize the excess variance of quantization given an estimate of the distribution of the gradients. In Adaptive Multiplier Quantization (AMQ), we minimize the same objective as ALQ but model the quantization levels as exponentially spaced; AMQ solves for the optimal value of the single multiplier that parametrizes these levels.

### 1.1 Summary of contributions

- We propose two adaptive gradient quantization methods, ALQ and AMQ, in which processors update their compression methods in parallel.
- We establish an upper bound on the excess variance for any arbitrary sequence of quantization levels under general normalization that is tight in dimension, an upper bound on the expected number of communication bits per iteration, and strong convergence guarantees for a number of problems under standard assumptions. Our bounds hold for any adaptive method, including ALQ and AMQ.
- We improve the validation accuracy by almost 2% on CIFAR-10 and 1% on ImageNet in challenging low-cost communication setups. Our adaptive methods are significantly more robust to the choice of hyperparameters. (Open-source code: http://github.com/tabrizian/learning-to-quantize)

### 1.2 Related work

Adaptive quantization has been used for speech communication and storage [18]. In machine learning, several biased and unbiased schemes have been proposed to compress networks and gradients. Recently, lattice-based quantization has been studied for distributed mean estimation and variance reduction [19]. In this work, we focus on unbiased, coordinate-wise schemes that compress gradients.

Alistarh et al. [20] proposed Quantized SGD (QSGD), focusing on uniform quantization of stochastic gradients normalized to have unit Euclidean norm. Their experiments illustrate that a similar quantization method, in which gradients are normalized to have unit $L_\infty$ norm, achieves better performance; we refer to this method as QSGDinf (Qinf for short). Wen et al. [15] proposed TernGrad, which can be viewed as a special case of QSGDinf with three quantization levels. Ramezani-Kebrya et al. [21] proposed nonuniform quantization levels (NUQSGD) and demonstrated superior empirical results compared to QSGDinf. Horváth et al. [22] proposed natural compression and dithering schemes, where the latter is a special case of logarithmic quantization.

There have been prior attempts at adaptive quantization. Zhang et al. [23] proposed ZipML, which is an optimal quantization method if all points to be quantized are known a priori. To find the optimal sequence of quantization levels, a dynamic program is solved whose computational and memory cost is quadratic in the number of points to be quantized, which in the case of gradients corresponds to their dimension. For this reason, ZipML is impractical for quantizing on the fly and is in fact used for (offline) dataset compression. They also proposed an approximation in which a subsampled set of points is used, obtained by scanning the data once. However, as we show in this paper, this one-time scan is not enough because the distribution of stochastic gradients changes during training. Zhang et al. [24] proposed LQ-Net, where weights and activations are quantized such that inner products can be computed efficiently with bitwise operations. Compared to LQ-Net, our methods do not need additional memory for encoding vectors. Concurrent with our work, Fu et al. [25] proposed to quantize activations and gradients by modelling them with Weibull distributions. In comparison, our proposed methods accommodate general distributions, and our approach does not require any assumptions on an upper bound of the gradients.
```
Input: local data, local copy of the parameter vector w_t, learning rate α,
       and set of update steps U
for t = 1 to T do
    if t ∈ U then
        for i = 1 to M do
            compute sufficient statistics and update the quantization levels ℓ
    for i = 1 to M do
        compute g_i(w_t), encode c_{i,t} ← ENCODE(g_i(w_t)), and broadcast c_{i,t}
    for j = 1 to M do
        receive c_{i,t} from each processor i and decode ĝ_i(w_t) ← DECODE(c_{i,t})
    aggregate: w_{t+1} ← P_Ω( w_t − (α/M) Σ_{i=1}^{M} ĝ_i(w_t) )
```

Algorithm 1: Adaptive data-parallel SGD. Loops are executed in parallel on each machine. At certain steps, each processor computes sufficient statistics of a parametric distribution to estimate the distribution of normalized gradient coordinates.

## 2 Preliminaries: data-parallel SGD

Consider the problem of training a model parametrized by a high-dimensional vector $w \in \mathbb{R}^d$. Let $\Omega \subseteq \mathbb{R}^d$ denote a closed and compact set. Our goal is to minimize $f : \Omega \to \mathbb{R}$. Assume we have access to unbiased stochastic gradients $g$ of $f$, i.e., $\mathbb{E}[g(w)] = \nabla f(w)$ for all $w \in \Omega$. The update rule for full-precision SGD is $w_{t+1} = P_\Omega\big(w_t - \alpha\, g(w_t)\big)$, where $w_t$ is the current parameter vector, $\alpha$ is the learning rate, and $P_\Omega$ is the Euclidean projection onto $\Omega$.

We consider data-parallel SGD, a synchronous and distributed framework consisting of $M$ processors. Each processor receives gradients from all other processors and aggregates them. In data-parallel SGD with compression, gradients are compressed by each processor before transmission and decompressed before aggregation [20–23]. A stochastic compression method is unbiased if the vector after decompression is, in expectation, the same as the original vector.

## 3 Adaptive quantization

In this section, we introduce novel adaptive compression methods that adapt during training (Algorithm 1). Let $v \in \mathbb{R}^d$ be a vector we seek to quantize and let $r_i = |v_i|/\|v\|$ be its normalized coordinates for $i = 1, \ldots, d$ (in this section, $\|\cdot\|$ denotes a general $L_q$ norm with $q \ge 1$ for simplicity). Let $q_\ell(r) : [0, 1] \to [0, 1]$ denote a random quantization function applied to a normalized coordinate $r$ using adaptable quantization levels $\ell = [\ell_0, \ldots, \ell_{s+1}]^\top$, where $0 = \ell_0 < \ell_1 < \cdots < \ell_s < \ell_{s+1} = 1$. For $r \in [0, 1]$, let $\tau(r)$ denote the index of a level such that $\ell_{\tau(r)} \le r < \ell_{\tau(r)+1}$, and let $\rho(r) = (r - \ell_{\tau(r)})/(\ell_{\tau(r)+1} - \ell_{\tau(r)})$ be the relative distance of $r$ to level $\tau(r)+1$. We define the random variable $h(r)$ such that $h(r) = \ell_{\tau(r)}$ with probability $1 - \rho(r)$ and $h(r) = \ell_{\tau(r)+1}$ with probability $\rho(r)$.

Figure 2: Random quantization of a normalized gradient.

We define the quantization of $v$ as $Q_\ell(v) \triangleq [q_\ell(v_1), \ldots, q_\ell(v_d)]^\top$, where $q_\ell(v_i) = \|v\|\,\mathrm{sign}(v_i)\, h(r_i)$ and $h = \{h(r_i)\}_{i=1,\ldots,d}$ are independent random variables. The encoding $\mathrm{ENCODE}(v)$ of a stochastic gradient combines a standard floating-point encoding of $\|v\|$ with an optimal encoding of $h(r_i)$ and a binary encoding of $\mathrm{sign}(v_i)$ for each coordinate $i$. The decoding $\mathrm{DECODE}$ recovers the norm, $h(r_i)$, and the sign. Additional details of the encoding method are described in Appendix D.

We define the variance of vector quantization to be the trace of the covariance matrix,

$$\mathbb{E}_h\big[\|Q_\ell(v) - v\|_2^2\big] = \|v\|^2 \sum_{i=1}^{d} \sigma^2(r_i), \qquad (1)$$

where $\sigma^2(r) = \mathbb{E}[(q_\ell(r) - r)^2]$ is the variance of quantization for a single coordinate, given by

$$\sigma^2(r) = \big(\ell_{\tau(r)+1} - r\big)\big(r - \ell_{\tau(r)}\big). \qquad (2)$$
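To make this construction concrete, here is a minimal NumPy sketch of $Q_\ell(v)$: it normalizes $v$, finds the bracketing levels $\ell_{\tau(r)}$ and $\ell_{\tau(r)+1}$ for each coordinate, and rounds up with probability $\rho(r)$. The function and variable names are ours for illustration; the authors' released implementation may differ, and a practical system would transmit the encoded representation rather than the dequantized vector.

```python
import numpy as np

def quantize(v, levels, rng=None, q=np.inf):
    """Unbiased stochastic quantization Q_l(v) with levels l_0 = 0 < ... < l_{s+1} = 1.

    Returns the dequantized vector ||v|| * sign(v) * h(r); a real implementation
    would transmit (||v||, sign bits, level indices) in encoded form instead.
    """
    rng = rng or np.random.default_rng()
    levels = np.asarray(levels, dtype=float)
    norm = np.linalg.norm(v, q)
    if norm == 0.0:
        return np.zeros_like(v)
    r = np.abs(v) / norm                              # normalized coordinates in [0, 1]
    tau = np.searchsorted(levels, r, side="right") - 1
    tau = np.clip(tau, 0, len(levels) - 2)            # keep r == 1 in the last interval
    lo, hi = levels[tau], levels[tau + 1]
    rho = (r - lo) / (hi - lo)                        # relative distance to the upper level
    h = np.where(rng.random(r.shape) < rho, hi, lo)   # round up with probability rho
    return norm * np.sign(v) * h

# Example: uniform levels [0, 0.25, 0.5, 0.75, 1] (s = 3 interior levels).
levels = np.linspace(0.0, 1.0, 5)
v = np.random.default_rng(0).standard_normal(8)
print(np.mean([quantize(v, levels) for _ in range(20000)], axis=0) - v)
```

Because $\mathbb{E}[h(r)] = r$, averaging many draws of `quantize(v, levels)` recovers $v$, which is the unbiasedness property used throughout the analysis.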
Let $v$ be a random vector corresponding to a stochastic gradient and let $h$ capture the randomness of the quantization of this vector, as defined above. We define two minimization problems, expected variance minimization and expected normalized variance minimization:

$$\min_{\ell \in \mathcal{L}} \mathbb{E}\big[\|Q_\ell(v) - v\|_2^2\big] \qquad \text{and} \qquad \min_{\ell \in \mathcal{L}} \mathbb{E}\!\left[\frac{\|Q_\ell(v) - v\|_2^2}{\|v\|^2}\right],$$

where $\mathcal{L} = \{\ell : \ell_j \le \ell_{j+1}\ \forall j,\ \ell_0 = 0,\ \ell_{s+1} = 1\}$ denotes the set of feasible solutions. We first focus on minimizing the expected normalized variance and extend our methods to minimize the expected variance in Section 3.4.

Let $F(r)$ denote the marginal cumulative distribution function (CDF) of a normalized coordinate $r$. Assuming the normalized coordinates $r_i$ are i.i.d. given $\|v\|$, the expected normalized variance minimization can be written as $\min_{\ell \in \mathcal{L}} \Phi(\ell)$, where

$$\Phi(\ell) \triangleq \int_0^1 \sigma^2(r)\, \mathrm{d}F(r). \qquad (3)$$

The following theorem suggests that solving (3) is challenging in general; however, the sub-problem of optimizing a single level given the other levels can be solved efficiently in closed form. Proofs are provided in Appendix B.

**Theorem 1** (Expected normalized variance minimization). Problem (3) is nonconvex in general. However, the problem of minimizing over one level with the other levels fixed, $\min_{\ell_i} \Phi(\ell)$, is solved by $\ell_i^* = \beta(\ell_{i-1}, \ell_{i+1})$, where

$$\beta(a, c) = F^{-1}\!\left(F(c) - \int_a^c \frac{r - a}{c - a}\, \mathrm{d}F(r)\right). \qquad (4)$$

### 3.1 ALQ: Adapting individual levels using coordinate descent

Using the single-level update rule in Eq. (4), we iteratively adapt individual levels to minimize the expected normalized variance in (3). We denote the quantization levels at iteration $t$ by $\ell(t)$, starting from $t = 0$. The update rule is

$$\ell_j(t + 1) = \beta\big(\ell_{j-1}(t), \ell_{j+1}(t)\big), \qquad j = 1, \ldots, s. \qquad (5)$$

Performing this update sequentially over the coordinates $j$ is a form of coordinate descent (CD) that is guaranteed to converge to a local minimum. CD is particularly attractive because it does not involve a projection step onto the feasible set $\mathcal{L}$. In practice, we initialize the levels with either the uniform levels of [20] or the exponentially spaced levels proposed in [21]. We observe that, starting from either initialization, CD converges in a small number of steps (fewer than 10).

### 3.2 Gradient descent

Computing $\nabla \Phi$ using Leibniz's rule [26], the gradient descent (GD) algorithm for solving (3) is based on the update rule

$$\ell_j(t + 1) = P_{\mathcal{L}}\!\left(\ell_j(t) - \eta(t)\, \frac{\partial \Phi(\ell(t))}{\partial \ell_j}\right), \quad \frac{\partial \Phi(\ell(t))}{\partial \ell_j} = \int_{\ell_{j-1}(t)}^{\ell_j(t)} \big(r - \ell_{j-1}(t)\big)\, \mathrm{d}F(r) - \int_{\ell_j(t)}^{\ell_{j+1}(t)} \big(\ell_{j+1}(t) - r\big)\, \mathrm{d}F(r), \qquad (6)$$

for $t = 0, 1, \ldots$ and $j = 1, \ldots, s$, where $\eta(t)$ is the step size. Note that the projection step in Eq. (6) is itself a convex optimization problem. We therefore propose a projection-free modification of the GD update rule that systematically ensures $\ell \in \mathcal{L}$. Let $\delta_j(t) = \min\{\ell_j(t) - \ell_{j-1}(t),\ \ell_{j+1}(t) - \ell_j(t)\}$ denote the distance from level $j$ to its nearest neighbouring level at iteration $t$, for $j = 1, \ldots, s$. If the change in each level $j$ is bounded by $\delta_j(t)/2$, the updated levels are guaranteed to remain in $\mathcal{L}$. We propose to replace Eq. (6) with the update rule

$$\ell_j(t + 1) = \ell_j(t) - \mathrm{sign}\!\left(\frac{\partial \Phi(\ell(t))}{\partial \ell_j}\right) \min\!\left\{\eta(t)\left|\frac{\partial \Phi(\ell(t))}{\partial \ell_j}\right|,\ \frac{\delta_j(t)}{2}\right\}. \qquad (7)$$

### 3.3 AMQ: Exponentially spaced levels

We now focus on $\ell = [-1, -p, \ldots, -p^s, p^s, \ldots, p, 1]^\top$, i.e., exponentially spaced levels with symmetry. We can update $p$ efficiently by gradient descent using the first-order derivative

$$\frac{\mathrm{d}\Phi}{\mathrm{d}p} = \int_0^{p^s} 2s\,p^{2s-1}\, \mathrm{d}F(r) + \sum_{j=0}^{s-1} \int_{p^{j+1}}^{p^j} \Big[\big(j\,p^{j-1} + (j+1)\,p^{j}\big)\,r - (2j+1)\,p^{2j}\Big]\, \mathrm{d}F(r). \qquad (8)$$
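As a concrete illustration of the coordinate-descent scheme of Section 3.1, the following sketch applies the single-level update of Theorem 1 with the population CDF $F$ replaced by the empirical CDF of a sample of normalized coordinates. The function names and the empirical-CDF substitution are our own simplifications; the paper instead estimates $F$ by fitting a parametric distribution from sufficient statistics (see Section 3.4 and Appendix C).

```python
import numpy as np

def beta_hat(a, c, r):
    """Empirical single-level update beta(a, c) from Theorem 1:
    F(l*) = F(c) - E[(r - a)/(c - a); a <= r <= c], with F the empirical CDF of r."""
    if c - a < 1e-12:
        return 0.5 * (a + c)
    in_gap = (r >= a) & (r <= c)
    target = np.mean(r <= c) - np.mean(np.where(in_gap, (r - a) / (c - a), 0.0))
    return float(np.clip(np.quantile(r, target), a, c))  # interpolated empirical F^{-1}

def alq_coordinate_descent(levels, r, passes=10):
    """Adapt the interior levels l_1, ..., l_s by coordinate descent on the
    expected normalized variance (3), keeping l_0 = 0 and l_{s+1} = 1 fixed."""
    levels = np.array(levels, dtype=float)
    for _ in range(passes):
        for j in range(1, len(levels) - 1):
            levels[j] = beta_hat(levels[j - 1], levels[j + 1], r)
    return levels

# Example: adapt two interior levels to |g_i| / ||g||_inf for a synthetic gradient.
rng = np.random.default_rng(0)
g = rng.standard_normal(10_000)
r = np.abs(g) / np.max(np.abs(g))
print(alq_coordinate_descent([0.0, 0.25, 0.75, 1.0], r))
```

On this synthetic sample, both interior levels shrink noticeably over the passes, consistent with the concentration of normalized gradient coordinates near zero.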
### 3.4 Expected variance minimization

In this section, we consider the problem of minimizing the expected variance of quantization,

$$\min_{\ell \in \mathcal{L}} \mathbb{E}\big[\|Q_\ell(v) - v\|_2^2\big]. \qquad (9)$$

To solve the expected variance minimization problem, suppose that we observe $N$ stochastic gradients $\{v_1, \ldots, v_N\}$. Let $F_n(r)$ and $p_n(r)$ denote the CDF and PDF of a normalized coordinate conditioned on observing $\|v_n\|$, respectively. By taking into account the randomness in $\|v\|$ and using the law of total expectation, an approximation of the expected variance in (9) is given by

$$\mathbb{E}\big[\|Q_\ell(v) - v\|_2^2\big] \approx \frac{d}{N} \sum_{n=1}^{N} \|v_n\|^2 \int_0^1 \sigma^2(r)\, \mathrm{d}F_n(r). \qquad (10)$$

The optimal levels minimizing Eq. (10) are a solution to

$$\ell^* = \arg\min_{\ell \in \mathcal{L}} \sum_{n=1}^{N} \|v_n\|^2 \int_0^1 \sigma^2(r)\, \mathrm{d}F_n(r) = \arg\min_{\ell \in \mathcal{L}} \int_0^1 \sigma^2(r)\, \mathrm{d}F(r),$$

where $\ell^* = [\ell_1^*, \ldots, \ell_s^*]^\top$ and $F(r) = \sum_{n=1}^{N} \gamma_n F_n(r)$ is the weighted sum of the conditional CDFs with weights $\gamma_n = \|v_n\|^2 / \sum_{m=1}^{N} \|v_m\|^2$. Note that we can accommodate both normal and truncated normal distributions by substituting the associated expressions for $p_n(r)$ and $F_n(r)$. Exact update rules and an analysis of the computational complexity of ALQ, GD, and AMQ are given in Appendix C.

## 4 Theoretical guarantees

One can alternatively design quantization levels to minimize the worst-case variance. However, compared to an optimal scheme, such a worst-case scheme increases the expected variance by a factor that grows with the dimension $d$, which is prohibitive for deep networks. We quantify this gap in Appendix E. Proofs are in the appendices.

A stochastic gradient has a second-moment upper bound $B$ when $\mathbb{E}[\|g(w)\|_2^2] \le B$ for all $w \in \Omega$. Similarly, it has a variance upper bound $\sigma^2$ when $\mathbb{E}[\|g(w) - \nabla f(w)\|_2^2] \le \sigma^2$ for all $w \in \Omega$.

We consider a general adaptively quantized SGD (AQSGD) algorithm, described in Algorithm 1, in which compression schemes are updated over the course of training; our results hold for any adaptive method, including ALQ and AMQ. Many convergence results in stochastic optimization rely on a variance bound. We establish such a variance bound for our adaptive methods and, further, verify that these optimization results can be made to rely only on the average variance. In the following, we provide theoretical guarantees for the AQSGD algorithm: we obtain variance and code-length bounds and establish convergence guarantees for convex, nonconvex, and momentum-based variants of AQSGD. The analyses of the nonadaptive methods in [20–23] can be recovered as special cases of our theorems with levels that are fixed over the course of training. A naive adoption of the available convergence guarantees results in worst-case variance bounds over the course of training; in this paper, we show that an average variance bound suffices for a number of problems. Under general normalization, we first obtain a variance upper bound for arbitrary levels, in particular for those obtained adaptively.

**Theorem 2** (Variance bound). Let $v \in \mathbb{R}^d$ and $q \ge 1$. The quantization of $v$ under $L_q$ normalization satisfies $\mathbb{E}[Q_\ell(v)] = v$. Furthermore, we have

$$\mathbb{E}\big[\|Q_\ell(v) - v\|_2^2\big] \le \epsilon_Q\, \|v\|^2, \quad \text{where } \epsilon_Q = \frac{(\ell_{j+1}/\ell_j - 1)^2}{4\,\ell_{j+1}/\ell_j} + \inf_{0<\dots}$$