# Adaptive Gradient Quantization for Data-Parallel SGD

Fartash Faghri^{1,2}, Iman Tabrizian^{1,2}, Ilia Markov^{3}, Dan Alistarh^{3,4}, Daniel M. Roy^{1,2}, Ali Ramezani-Kebrya^{2}

^{1}University of Toronto, ^{2}Vector Institute, ^{3}IST Austria, ^{4}Neural Magic

faghri@cs.toronto.edu, iman.tabrizian@mail.utoronto.ca, alir@vectorinstitute.ai

## Abstract

Many communication-efficient variants of SGD use gradient quantization schemes. These schemes are often heuristic and fixed over the course of training. We empirically observe that the statistics of gradients of deep models change during training. Motivated by this observation, we introduce two adaptive quantization schemes, ALQ and AMQ. In both schemes, processors update their compression schemes in parallel by efficiently computing sufficient statistics of a parametric distribution. We improve the validation accuracy by almost 2% on CIFAR-10 and 1% on ImageNet in challenging low-cost communication setups. Our adaptive methods are also significantly more robust to the choice of hyperparameters.

## 1 Introduction

Figure 1: Changes in the average variance of normalized gradient coordinates in a ResNet-32 model trained on CIFAR-10 (average variance vs. iteration). Colors distinguish runs with different seeds. The learning rate is decayed by a factor of 10 twice, at 40K and 60K iterations. The variance changes rapidly during the first epoch; the next noticeable change happens after the first learning-rate drop, and another appears after the second drop.

Stochastic gradient descent (SGD) and its variants are currently the method of choice for training deep models. Yet, large datasets cannot always be processed on a single computational node due to memory and scalability limitations. Data-parallel SGD is a remarkably scalable variant, in particular on multi-GPU systems [1–10]. However, despite its many advantages, distribution introduces new challenges for optimization algorithms. In particular, data-parallel SGD has a large communication cost due to the need to transmit potentially huge gradient vectors. Ideally, we want distributed optimization methods that match the performance of SGD on a single hypothetical super machine while paying a negligible communication cost.

A common approach to reducing the communication cost of data-parallel SGD is gradient compression and quantization [4, 11–16]. In full-precision data-parallel SGD, each processor broadcasts its locally computed stochastic gradient vector at every iteration, whereas in quantized data-parallel SGD, each processor compresses its stochastic gradient before broadcasting. Current quantization methods are either designed heuristically or fixed prior to training. Convergence rates in stochastic optimization are controlled by the trace of the gradient covariance matrix, which we refer to as the gradient variance in this paper [17]. As Fig. 1 shows, no fixed method can be optimal throughout the entire training because the distribution of gradients changes. A quantization method that is optimal at the first iteration will not be optimal after only a single epoch.

In this paper, we propose two adaptive methods for quantizing the gradients in data-parallel SGD. We study methods that are defined by a norm and a set of quantization levels.
In Adaptive Level Quantization (ALQ), we minimize the excess variance of quantization given an estimate of the distribution of the gradients. In Adaptive Multiplier Quantization (AMQ), we minimize the same objective as ALQ but model the quantization levels as exponentially spaced; AMQ solves for the optimal value of the single multiplier that parametrizes these levels.

### 1.1 Summary of contributions

- We propose two adaptive gradient quantization methods, ALQ and AMQ, in which processors update their compression methods in parallel.
- We establish an upper bound on the excess variance for any arbitrary sequence of quantization levels under general normalization that is tight in dimension, an upper bound on the expected number of communication bits per iteration, and strong convergence guarantees for a number of problems under standard assumptions. Our bounds hold for any adaptive method, including ALQ and AMQ.
- We improve the validation accuracy by almost 2% on CIFAR-10 and 1% on ImageNet in challenging low-cost communication setups. Our adaptive methods are significantly more robust to the choice of hyperparameters. (Open-source code: http://github.com/tabrizian/learning-to-quantize)

### 1.2 Related work

Adaptive quantization has been used for speech communication and storage [18]. In machine learning, several biased and unbiased schemes have been proposed to compress networks and gradients. Recently, lattice-based quantization has been studied for distributed mean estimation and variance reduction [19]. In this work, we focus on unbiased, coordinate-wise schemes that compress gradients.

Alistarh et al. [20] proposed Quantized SGD (QSGD), focusing on uniform quantization of stochastic gradients normalized to have unit Euclidean norm. Their experiments illustrate that a similar quantization method, in which gradients are normalized to have unit $L_\infty$ norm, achieves better performance; we refer to this method as QSGDinf (Qinf for short). Wen et al. [15] proposed TernGrad, which can be viewed as a special case of QSGDinf with three quantization levels. Ramezani-Kebrya et al. [21] proposed nonuniform quantization levels (NUQSGD) and demonstrated superior empirical results compared to QSGDinf. Horváth et al. [22] proposed natural compression and dithering schemes, where the latter is a special case of logarithmic quantization.

There have been prior attempts at adaptive quantization. Zhang et al. [23] proposed ZipML, which is an optimal quantization method if all points to be quantized are known a priori. To find the optimal sequence of quantization levels, a dynamic program is solved whose computational and memory cost is quadratic in the number of points to be quantized, which in the case of gradients corresponds to their dimension. For this reason, ZipML is impractical for quantizing on the fly and is in fact used for (offline) dataset compression. They also proposed an approximation in which a subsampled set of points is used, obtained by scanning the data once. However, as we show in this paper, this one-time scan is not enough because the distribution of stochastic gradients changes during training. Zhang et al. [24] proposed LQ-Net, where weights and activations are quantized such that inner products can be computed efficiently with bitwise operations. Compared to LQ-Net, our methods do not need additional memory for encoding vectors. Concurrent with our work, Fu et al. [25] proposed to quantize activations and gradients by modelling them with Weibull distributions. In comparison, our proposed methods accommodate general distributions, and our approach does not require any assumptions on an upper bound of the gradients.
```
Input: local data, local copy of the parameter vector w_t, learning rate α,
       and set of update steps U
for t = 1 to T do
    if t ∈ U then
        for i = 1 to M do
            compute sufficient statistics and update the quantization levels ℓ
    for i = 1 to M do
        compute g_i(w_t), encode c_{i,t} ← ENCODE(g_i(w_t)), and broadcast c_{i,t}
    for j = 1 to M do
        receive c_{i,t} from each processor i and decode ĝ_i(w_t) ← DECODE(c_{i,t})
    aggregate: w_{t+1} ← P_Ω( w_t − (α/M) Σ_{i=1}^{M} ĝ_i(w_t) )
```

Algorithm 1: Adaptive data-parallel SGD. Loops are executed in parallel on each machine. At certain steps, each processor computes sufficient statistics of a parametric distribution to estimate the distribution of normalized gradient coordinates.

## 2 Preliminaries: data-parallel SGD

Consider the problem of training a model parametrized by a high-dimensional vector $w \in \mathbb{R}^d$. Let $\Omega \subseteq \mathbb{R}^d$ denote a closed and compact set. Our goal is to minimize $f : \Omega \to \mathbb{R}$. Assume we have access to unbiased stochastic gradients $g$ of $f$, i.e., $\mathbb{E}[g(w)] = \nabla f(w)$ for all $w \in \Omega$. The update rule for full-precision SGD is $w_{t+1} = P_\Omega\big(w_t - \alpha\, g(w_t)\big)$, where $w_t$ is the current parameter vector, $\alpha$ is the learning rate, and $P_\Omega$ is the Euclidean projection onto $\Omega$.

We consider data-parallel SGD, a synchronous and distributed framework consisting of $M$ processors. Each processor receives gradients from all other processors and aggregates them. In data-parallel SGD with compression, gradients are compressed by each processor before transmission and decompressed before aggregation [20–23]. A stochastic compression method is unbiased if the vector after decompression is, in expectation, the same as the original vector.

## 3 Adaptive quantization

In this section, we introduce novel adaptive compression methods that adapt during training (Algorithm 1). Let $v \in \mathbb{R}^d$ be a vector we seek to quantize and let $r_i = |v_i|/\|v\|$ be its normalized coordinates for $i = 1, \ldots, d$ (in this section, $\|\cdot\|$ denotes a general $L_q$ norm with $q \ge 1$ for simplicity). Let $q_\ell(r) : [0, 1] \to [0, 1]$ denote a random quantization function applied to a normalized coordinate $r$ using adaptable quantization levels $\ell = [\ell_0, \ldots, \ell_{s+1}]^\top$, where $0 = \ell_0 < \ell_1 < \cdots < \ell_s < \ell_{s+1} = 1$. For $r \in [0, 1]$, let $\tau(r)$ denote the index of a level such that $\ell_{\tau(r)} \le r < \ell_{\tau(r)+1}$, and let $\rho(r) = (r - \ell_{\tau(r)})/(\ell_{\tau(r)+1} - \ell_{\tau(r)})$ be the relative distance of $r$ to level $\tau(r)+1$. We define the random variable $h(r)$ such that $h(r) = \ell_{\tau(r)}$ with probability $1 - \rho(r)$ and $h(r) = \ell_{\tau(r)+1}$ with probability $\rho(r)$.

Figure 2: Random quantization of a normalized gradient.

We define the quantization of $v$ as $Q_\ell(v) \triangleq [q_\ell(v_1), \ldots, q_\ell(v_d)]^\top$, where $q_\ell(v_i) = \|v\|\,\mathrm{sign}(v_i)\, h(r_i)$ and $h = \{h(r_i)\}_{i=1,\ldots,d}$ are independent random variables. The encoding $\mathrm{ENCODE}(v)$ of a stochastic gradient combines a standard floating-point encoding of $\|v\|$ with an optimal encoding of $h(r_i)$ and a binary encoding of $\mathrm{sign}(v_i)$ for each coordinate $i$. The decoding $\mathrm{DECODE}$ recovers the norm, $h(r_i)$, and the sign. Additional details of the encoding method are described in Appendix D.

We define the variance of vector quantization to be the trace of the covariance matrix,

$$\mathbb{E}_h\big[\|Q_\ell(v) - v\|_2^2\big] = \|v\|^2 \sum_{i=1}^{d} \sigma^2(r_i), \qquad (1)$$

where $\sigma^2(r) = \mathbb{E}[(q_\ell(r) - r)^2]$ is the variance of quantization for a single coordinate, given by

$$\sigma^2(r) = \big(\ell_{\tau(r)+1} - r\big)\big(r - \ell_{\tau(r)}\big). \qquad (2)$$
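To make this construction concrete, here is a minimal NumPy sketch of $Q_\ell(v)$: it normalizes $v$, finds the bracketing levels $\ell_{\tau(r)}$ and $\ell_{\tau(r)+1}$ for each coordinate, and rounds up with probability $\rho(r)$. The function and variable names are ours for illustration; the authors' released implementation may differ, and a practical system would transmit the encoded representation rather than the dequantized vector.

```python
import numpy as np

def quantize(v, levels, rng=None, q=np.inf):
    """Unbiased stochastic quantization Q_l(v) with levels l_0 = 0 < ... < l_{s+1} = 1.

    Returns the dequantized vector ||v|| * sign(v) * h(r); a real implementation
    would transmit (||v||, sign bits, level indices) in encoded form instead.
    """
    rng = rng or np.random.default_rng()
    levels = np.asarray(levels, dtype=float)
    norm = np.linalg.norm(v, q)
    if norm == 0.0:
        return np.zeros_like(v)
    r = np.abs(v) / norm                              # normalized coordinates in [0, 1]
    tau = np.searchsorted(levels, r, side="right") - 1
    tau = np.clip(tau, 0, len(levels) - 2)            # keep r == 1 in the last interval
    lo, hi = levels[tau], levels[tau + 1]
    rho = (r - lo) / (hi - lo)                        # relative distance to the upper level
    h = np.where(rng.random(r.shape) < rho, hi, lo)   # round up with probability rho
    return norm * np.sign(v) * h

# Example: uniform levels [0, 0.25, 0.5, 0.75, 1] (s = 3 interior levels).
levels = np.linspace(0.0, 1.0, 5)
v = np.random.default_rng(0).standard_normal(8)
print(np.mean([quantize(v, levels) for _ in range(20000)], axis=0) - v)
```

Because $\mathbb{E}[h(r)] = r$, averaging many draws of `quantize(v, levels)` recovers $v$, which is the unbiasedness property used throughout the analysis.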
Let $v$ be a random vector corresponding to a stochastic gradient and let $h$ capture the randomness of the quantization of this vector, as defined above. We define two minimization problems, expected variance minimization and expected normalized variance minimization:

$$\min_{\ell \in \mathcal{L}} \mathbb{E}\big[\|Q_\ell(v) - v\|_2^2\big] \qquad \text{and} \qquad \min_{\ell \in \mathcal{L}} \mathbb{E}\!\left[\frac{\|Q_\ell(v) - v\|_2^2}{\|v\|^2}\right],$$

where $\mathcal{L} = \{\ell : \ell_j \le \ell_{j+1}\ \forall j,\ \ell_0 = 0,\ \ell_{s+1} = 1\}$ denotes the set of feasible solutions. We first focus on minimizing the expected normalized variance and extend our methods to minimize the expected variance in Section 3.4.

Let $F(r)$ denote the marginal cumulative distribution function (CDF) of a normalized coordinate $r$. Assuming the normalized coordinates $r_i$ are i.i.d. given $\|v\|$, the expected normalized variance minimization can be written as $\min_{\ell \in \mathcal{L}} \Phi(\ell)$, where

$$\Phi(\ell) \triangleq \int_0^1 \sigma^2(r)\, \mathrm{d}F(r). \qquad (3)$$

The following theorem suggests that solving (3) is challenging in general; however, the sub-problem of optimizing a single level given the other levels can be solved efficiently in closed form. Proofs are provided in Appendix B.

**Theorem 1** (Expected normalized variance minimization). Problem (3) is nonconvex in general. However, the problem of minimizing over one level with the other levels fixed, $\min_{\ell_i} \Phi(\ell)$, is solved by $\ell_i^* = \beta(\ell_{i-1}, \ell_{i+1})$, where

$$\beta(a, c) = F^{-1}\!\left(F(c) - \int_a^c \frac{r - a}{c - a}\, \mathrm{d}F(r)\right). \qquad (4)$$

### 3.1 ALQ: Adapting individual levels using coordinate descent

Using the single-level update rule in Eq. (4), we iteratively adapt individual levels to minimize the expected normalized variance in (3). We denote the quantization levels at iteration $t$ by $\ell(t)$, starting from $t = 0$. The update rule is

$$\ell_j(t + 1) = \beta\big(\ell_{j-1}(t), \ell_{j+1}(t)\big), \qquad j = 1, \ldots, s. \qquad (5)$$

Performing this update sequentially over the coordinates $j$ is a form of coordinate descent (CD) that is guaranteed to converge to a local minimum. CD is particularly attractive because it does not involve a projection step onto the feasible set $\mathcal{L}$. In practice, we initialize the levels with either the uniform levels of [20] or the exponentially spaced levels proposed in [21]. We observe that, starting from either initialization, CD converges in a small number of steps (fewer than 10).

### 3.2 Gradient descent

Computing $\nabla \Phi$ using Leibniz's rule [26], the gradient descent (GD) algorithm for solving (3) is based on the update rule

$$\ell_j(t + 1) = P_{\mathcal{L}}\!\left(\ell_j(t) - \eta(t)\, \frac{\partial \Phi(\ell(t))}{\partial \ell_j}\right), \quad \frac{\partial \Phi(\ell(t))}{\partial \ell_j} = \int_{\ell_{j-1}(t)}^{\ell_j(t)} \big(r - \ell_{j-1}(t)\big)\, \mathrm{d}F(r) - \int_{\ell_j(t)}^{\ell_{j+1}(t)} \big(\ell_{j+1}(t) - r\big)\, \mathrm{d}F(r), \qquad (6)$$

for $t = 0, 1, \ldots$ and $j = 1, \ldots, s$, where $\eta(t)$ is the step size. Note that the projection step in Eq. (6) is itself a convex optimization problem. We therefore propose a projection-free modification of the GD update rule that systematically ensures $\ell \in \mathcal{L}$. Let $\delta_j(t) = \min\{\ell_j(t) - \ell_{j-1}(t),\ \ell_{j+1}(t) - \ell_j(t)\}$ denote the distance from level $j$ to its nearest neighbouring level at iteration $t$, for $j = 1, \ldots, s$. If the change in each level $j$ is bounded by $\delta_j(t)/2$, the updated levels are guaranteed to remain in $\mathcal{L}$. We propose to replace Eq. (6) with the update rule

$$\ell_j(t + 1) = \ell_j(t) - \mathrm{sign}\!\left(\frac{\partial \Phi(\ell(t))}{\partial \ell_j}\right) \min\!\left\{\eta(t)\left|\frac{\partial \Phi(\ell(t))}{\partial \ell_j}\right|,\ \frac{\delta_j(t)}{2}\right\}. \qquad (7)$$

### 3.3 AMQ: Exponentially spaced levels

We now focus on $\ell = [-1, -p, \ldots, -p^s, p^s, \ldots, p, 1]^\top$, i.e., exponentially spaced levels with symmetry. We can update $p$ efficiently by gradient descent using the first-order derivative

$$\frac{\mathrm{d}\Phi}{\mathrm{d}p} = \int_0^{p^s} 2s\,p^{2s-1}\, \mathrm{d}F(r) + \sum_{j=0}^{s-1} \int_{p^{j+1}}^{p^j} \Big[\big(j\,p^{j-1} + (j+1)\,p^{j}\big)\,r - (2j+1)\,p^{2j}\Big]\, \mathrm{d}F(r). \qquad (8)$$
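As a concrete illustration of the coordinate-descent scheme of Section 3.1, the following sketch applies the single-level update of Theorem 1 with the population CDF $F$ replaced by the empirical CDF of a sample of normalized coordinates. The function names and the empirical-CDF substitution are our own simplifications; the paper instead estimates $F$ by fitting a parametric distribution from sufficient statistics (see Section 3.4 and Appendix C).

```python
import numpy as np

def beta_hat(a, c, r):
    """Empirical single-level update beta(a, c) from Theorem 1:
    F(l*) = F(c) - E[(r - a)/(c - a); a <= r <= c], with F the empirical CDF of r."""
    if c - a < 1e-12:
        return 0.5 * (a + c)
    in_gap = (r >= a) & (r <= c)
    target = np.mean(r <= c) - np.mean(np.where(in_gap, (r - a) / (c - a), 0.0))
    return float(np.clip(np.quantile(r, target), a, c))  # interpolated empirical F^{-1}

def alq_coordinate_descent(levels, r, passes=10):
    """Adapt the interior levels l_1, ..., l_s by coordinate descent on the
    expected normalized variance (3), keeping l_0 = 0 and l_{s+1} = 1 fixed."""
    levels = np.array(levels, dtype=float)
    for _ in range(passes):
        for j in range(1, len(levels) - 1):
            levels[j] = beta_hat(levels[j - 1], levels[j + 1], r)
    return levels

# Example: adapt two interior levels to |g_i| / ||g||_inf for a synthetic gradient.
rng = np.random.default_rng(0)
g = rng.standard_normal(10_000)
r = np.abs(g) / np.max(np.abs(g))
print(alq_coordinate_descent([0.0, 0.25, 0.75, 1.0], r))
```

On this synthetic sample, both interior levels shrink noticeably over the passes, consistent with the concentration of normalized gradient coordinates near zero.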
### 3.4 Expected variance minimization

In this section, we consider the problem of minimizing the expected variance of quantization,

$$\min_{\ell \in \mathcal{L}} \mathbb{E}\big[\|Q_\ell(v) - v\|_2^2\big]. \qquad (9)$$

To solve the expected variance minimization problem, suppose that we observe $N$ stochastic gradients $\{v_1, \ldots, v_N\}$. Let $F_n(r)$ and $p_n(r)$ denote the CDF and PDF of a normalized coordinate conditioned on observing $\|v_n\|$, respectively. By taking into account the randomness in $\|v\|$ and using the law of total expectation, an approximation of the expected variance in (9) is given by

$$\mathbb{E}\big[\|Q_\ell(v) - v\|_2^2\big] \approx \frac{d}{N} \sum_{n=1}^{N} \|v_n\|^2 \int_0^1 \sigma^2(r)\, \mathrm{d}F_n(r). \qquad (10)$$

The optimal levels minimizing Eq. (10) are a solution to

$$\ell^* = \arg\min_{\ell \in \mathcal{L}} \sum_{n=1}^{N} \|v_n\|^2 \int_0^1 \sigma^2(r)\, \mathrm{d}F_n(r) = \arg\min_{\ell \in \mathcal{L}} \int_0^1 \sigma^2(r)\, \mathrm{d}F(r),$$

where $\ell^* = [\ell_1^*, \ldots, \ell_s^*]^\top$ and $F(r) = \sum_{n=1}^{N} \gamma_n F_n(r)$ is the weighted sum of the conditional CDFs with weights $\gamma_n = \|v_n\|^2 / \sum_{m=1}^{N} \|v_m\|^2$. Note that we can accommodate both normal and truncated normal distributions by substituting the associated expressions for $p_n(r)$ and $F_n(r)$. Exact update rules and an analysis of the computational complexity of ALQ, GD, and AMQ are given in Appendix C.

## 4 Theoretical guarantees

One can alternatively design quantization levels to minimize the worst-case variance. However, compared to an optimal scheme, such a worst-case scheme increases the expected variance by a factor that grows with the dimension $d$, which is prohibitive for deep networks. We quantify this gap in Appendix E. Proofs are in the appendices.

A stochastic gradient has a second-moment upper bound $B$ when $\mathbb{E}[\|g(w)\|_2^2] \le B$ for all $w \in \Omega$. Similarly, it has a variance upper bound $\sigma^2$ when $\mathbb{E}[\|g(w) - \nabla f(w)\|_2^2] \le \sigma^2$ for all $w \in \Omega$.

We consider a general adaptively quantized SGD (AQSGD) algorithm, described in Algorithm 1, in which compression schemes are updated over the course of training; our results hold for any adaptive method, including ALQ and AMQ. Many convergence results in stochastic optimization rely on a variance bound. We establish such a variance bound for our adaptive methods and, further, verify that these optimization results can be made to rely only on the average variance. In the following, we provide theoretical guarantees for the AQSGD algorithm: we obtain variance and code-length bounds and establish convergence guarantees for convex, nonconvex, and momentum-based variants of AQSGD. The analyses of the nonadaptive methods in [20–23] can be recovered as special cases of our theorems with levels that are fixed over the course of training. A naive adoption of the available convergence guarantees results in worst-case variance bounds over the course of training; in this paper, we show that an average variance bound suffices for a number of problems. Under general normalization, we first obtain a variance upper bound for arbitrary levels, in particular for those obtained adaptively.

**Theorem 2** (Variance bound). Let $v \in \mathbb{R}^d$ and $q \ge 1$. The quantization of $v$ under $L_q$ normalization satisfies $\mathbb{E}[Q_\ell(v)] = v$. Furthermore, we have

$$\mathbb{E}\big[\|Q_\ell(v) - v\|_2^2\big] \le \epsilon_Q\, \|v\|^2, \quad \text{where } \epsilon_Q = \frac{(\ell_{j+1}/\ell_j - 1)^2}{4\,\ell_{j+1}/\ell_j} + \inf_{0<\dots}$$