Published as a conference paper at ICLR 2022

INTSGD: ADAPTIVE FLOATLESS COMPRESSION OF STOCHASTIC GRADIENTS

Konstantin Mishchenko (CNRS, École Normale Supérieure, Inria), konsta.mish@gmail.com
Bokun Wang (KAUST), bokunw.wang@gmail.com
Dmitry Kovalev (KAUST), dakovalev1@gmail.com
Peter Richtárik (KAUST), peter.richtarik@kaust.edu.sa

(Work done when the author was a research intern at KAUST.)

ABSTRACT

We propose a family of adaptive integer compression operators for distributed Stochastic Gradient Descent (SGD) that do not communicate a single float. This is achieved by multiplying floating-point vectors by a number known to every device and then rounding to integers. In contrast to the prior work on integer compression for SwitchML by Sapio et al. (2021), our IntSGD method is provably convergent and computationally cheaper, as it estimates the scaling of the vectors adaptively. Our theory shows that the iteration complexity of IntSGD matches that of SGD up to constant factors for both convex and non-convex, smooth and non-smooth functions, with and without overparameterization. Moreover, our algorithm can be tailored to the popular all-reduce primitive and shows promising empirical performance.

1 INTRODUCTION

Many recent breakthroughs in machine learning were made possible by the introduction of large, sophisticated, high-capacity supervised models whose training requires days or even weeks of computation (Hinton et al., 2015; He et al., 2016; Huang et al., 2017; Devlin et al., 2018). However, it would not be possible to train them without corresponding advances in parallel and distributed algorithms capable of taking advantage of modern hardware. Very large models are typically trained on vast collections of training data stored in a distributed fashion across a number of compute nodes that need to communicate throughout the training process. In this scenario, reliance on efficient communication protocols is of utmost importance.

Communication in distributed systems. The training process of large models relies on fast synchronization of gradients computed in a parallel fashion. Formally, to train a model, we want to solve the problem of parallel/distributed minimization of the average of n functions:

$$f(x) \overset{\text{def}}{=} \frac{1}{n}\sum_{i=1}^{n} f_i(x), \qquad f_i(x) \overset{\text{def}}{=} \mathbb{E}_{\xi}\left[f_i(x;\xi)\right], \qquad (1)$$

where we compute the gradients of the stochastic realizations f_i(x; ξ). The two dominating protocols for gradient synchronization are all-reduce and all-gather aggregation, which may use either a Parameter Server or all-to-all communication under the hood. The core difference between them is that all-gather communicates all vectors, whereas all-reduce only outputs their average. As shown in previous works, current distributed deep learning algorithms predominantly use all-reduce, as it scales much better than all-gather (Vogels et al., 2019; Agarwal et al., 2021).

A popular way to reduce the communication cost of both the all-reduce and all-gather primitives is lossy compression of gradients (Ramesh et al., 2021). To study the benefit of lossy compression, large swaths of the recent literature on distributed training attribute the cost of sending a single vector from a worker to the server to the number of bits needed to represent it. Based on this abstraction, various elaborate vector compression techniques (see Table 1 in Beznosikov et al. 2020; Xu et al. 2020; Safaryan et al. 2020) and algorithms have been designed for higher and higher compression ratios.
However, in real systems the efficiency of sending a vector is not fully characterized by the number of bits alone, for several reasons. First, many compressors with a high compression ratio (e.g., natural compression (Horváth et al., 2019), quantization (Alistarh et al., 2017), top-k sparsification, sign (Bernstein et al., 2018)) are not compatible with the efficient all-reduce primitive and require an all-gather implementation. Second, some compressors rely on expensive operations such as low-rank decomposition (Wang et al., 2018; Vogels et al., 2019) or bit-level operations (Horváth et al., 2019), whose computation overhead may outweigh the benefits of the reduced communication load. Third, algorithms with biased compressors, such as Top-k SGD, SignSGD and PowerSGD (Vogels et al., 2019), require the error-feedback mechanism (EF-SGD) (Stich et al., 2018; Karimireddy et al., 2019) to ensure convergence. Alas, error feedback needs extra sequences that may not fit the low memory budget of GPUs. Moreover, to the best of our knowledge, no convergence guarantee has been established for EF-SGD on non-smooth objectives with multiple workers.

SwitchML. Another approach to combating long communication times is to improve the hardware itself. The recently proposed SwitchML is an alternative to the NVIDIA Collective Communications Library (NCCL) on real-world hardware (Sapio et al., 2021). The first key component of SwitchML is in-network aggregation (INA) by a programmable switch. INA reduces the communication cost and latency because the execution can be parallelized and pipelined. To be specific, it splits the vector to aggregate into chunks and processes them individually in the switch pipeline. The advantages of INA over a parameter server and all-reduce in terms of latency and communication cost have been theoretically and empirically justified by Sapio et al. (2021). The second key component of SwitchML is stochastic gradient descent with integer rounding and aggregation. Rather than reducing the volume of the exchanged data, the goal of integer rounding in SwitchML is to fit the limited computation capability of the modern programmable switch, which only supports integer additions and logic operations. To increase the rounding precision, the gradient g_i^k on device i is multiplied by a positive scaling factor α_k known to every worker and then rounded to an integer, Int(α_k g_i^k). As there is no additional scaling or decompression before aggregating the communicated vectors, their sums can be computed on the fly. Each worker can then divide the aggregated gradient by nα_k to update the model. However, Sapio et al. (2021) remark that the choice of the scaling factor α_k requires special care. In their presentation (https://youtu.be/gBPHFyBWVoM?t=606), one of the authors notes: "A bad choice of scaling factor can reduce the performance." To this end, they propose a heuristic profiling step that is executed before the gradient aggregation and keeps the rounded integers small enough to fit in 32 bits. We refer to their algorithm, including the profiling step, as Heuristic IntSGD. Unfortunately, no convergence guarantee for that algorithm has been established.

This is where our theory comes to the rescue. By rigorously and exhaustively analyzing integer rounding based on scaling, we find adaptive rules for the scaling factor α_k that do not require the profiling employed by Sapio et al. (2021).
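The scale, round, aggregate and divide pipeline described above is what lets the switch work with integer additions only. The following minimal sketch (ours, not from the SwitchML or IntSGD code, with an arbitrary illustrative scaling factor and plain deterministic rounding) shows why a scaling factor shared by every worker makes the aggregated message a vector of integers that any worker can decode with a single division.

```python
import torch

torch.manual_seed(0)
n, d = 4, 6                      # 4 workers, 6-dimensional gradients
alpha = 1024.0                   # scaling factor known to every worker (illustrative value)
grads = [torch.randn(d) for _ in range(n)]

# Each worker scales its float gradient and rounds it to integers.
ints = [torch.round(alpha * g).to(torch.int64) for g in grads]

# The switch (or all-reduce) only needs integer additions to aggregate.
agg = torch.stack(ints).sum(dim=0)

# Every worker decodes the average by dividing by n * alpha.
avg_estimate = agg / (n * alpha)
print((avg_estimate - torch.stack(grads).mean(dim=0)).abs().max())  # small rounding error
```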
As we will show in the remainder of the paper, our algorithm is perfectly suited both for the in-network aggregation (INA) of SwitchML and for other efficient primitives such as all-reduce.

1.1 CONTRIBUTIONS

We summarize the key differences between our algorithm and prior work in Table 1, and we also list our main contributions below.

Adaptive IntSGD. We develop a family of computationally cheap adaptive scaling factors for provably convergent IntSGD. It is a better alternative to the Heuristic IntSGD of Sapio et al. (2021), which requires expensive operations and does not ensure convergence.

Table 1: Conceptual comparison of our method to the related literature along six criteria: support for all-reduce, support for the switch, provable convergence, fast compression, no need for error feedback, and adaptivity. If all-reduce is supported, the method does not need any decompression. If all-reduce is not supported, the expensive all-gather operation is required and decompression is slow. See also Section 5 for numerical comparisons.

| Algorithm | Reference |
| --- | --- |
| IntSGD | Ours |
| Heuristic IntSGD | Sapio et al. (2021) |
| PowerSGD (theoretical) (1) (2) | Vogels et al. (2019) |
| PowerSGD (practical) (1) (2) | Vogels et al. (2019) |
| NatSGD | Horváth et al. (2019) |
| QSGD | Alistarh et al. (2017) |
| SignSGD | Karimireddy et al. (2019) |

(1) In theory, PowerSGD requires computing low-rank decompositions. In practice, an approximation is found by power iteration, which requires just a few matrix-vector multiplications but is not analyzed theoretically and might be less stable. (2) PowerSGD requires tuning the rank of the low-rank decomposition. Vogels et al. (2019) reported that rank-1 PowerSGD consistently underperformed in their experiments and, moreover, that rank 2 was optimal for image classification while language modeling required rank 4. Ramesh et al. (2021) reported that a much larger rank was needed to avoid a gap in the training loss.

Rates. We obtain the first analysis of integer rounding and aggregation for distributed machine learning. For all of the proposed variants, we prove convergence rates of IntSGD that match those of full-precision SGD up to constant factors. Our results are tight and apply to both convex and non-convex problems. Our analysis does not require any extra assumption beyond those typically invoked for SGD. In contrast to other compression-based methods, IntSGD has the same rate as full-precision SGD even on non-smooth problems.

IntDIANA. We observe empirically that IntSGD struggles when the devices have heterogeneous (non-identical) data (an issue it shares with vanilla SGD) and propose an alternative method, IntDIANA, that can provably alleviate this issue. We also show that our tools are useful for extending the methodology beyond SGD methods, for example to variance-reduced methods (Johnson & Zhang, 2013; Allen-Zhu & Hazan, 2016; Kovalev et al., 2020; Gower et al., 2020) with integer rounding. Please refer to Appendix A.2 for the theoretical results and to Appendix C.5 for the empirical verification.

2 ADAPTIVE INTEGER ROUNDING AND INTSGD

By randomized integer rounding we mean the mapping Int : ℝ → ℤ defined by

$$\mathrm{Int}(t) \overset{\text{def}}{=} \begin{cases} \lfloor t\rfloor + 1, & \text{with probability } p_t \overset{\text{def}}{=} t - \lfloor t\rfloor,\\ \lfloor t\rfloor, & \text{with probability } 1 - p_t,\end{cases}$$

where ⌊t⌋ denotes the floor of t ∈ ℝ, i.e., ⌊t⌋ = k ∈ ℤ, where k is such that k ≤ t < k + 1. Note that

$$\mathbb{E}\left[\mathrm{Int}(t)\right] = (t - \lfloor t\rfloor)(\lfloor t\rfloor + 1) + (\lfloor t\rfloor + 1 - t)\lfloor t\rfloor = t.$$

We extend this mapping to vectors x ∈ ℝ^d by applying it element-wise: Int(x)_i := Int(x_i).
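The randomized rounding Int defined above is straightforward to implement. Below is a minimal PyTorch sketch (ours, not the authors' code) together with a quick Monte Carlo check of the unbiasedness property E[Int(t)] = t.

```python
import torch

def rand_int_round(x: torch.Tensor) -> torch.Tensor:
    """Randomized integer rounding: round x_j up with probability x_j - floor(x_j)."""
    low = torch.floor(x)
    prob_up = x - low                              # p_t = t - floor(t)
    return low + (torch.rand_like(x) < prob_up).to(x.dtype)

# Unbiasedness check: the empirical mean of Int(t) approaches t.
t = torch.tensor([0.3, -1.7, 2.25])
samples = torch.stack([rand_int_round(t) for _ in range(100_000)])
print(samples.mean(dim=0))   # close to [0.3, -1.7, 2.25]
```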
2.1 ADAPTIVE INTEGER ROUNDING

Given a scaling vector α ∈ ℝ^d with nonzero entries, we further define the adaptive integer rounding operator Q : ℝ^d → ℝ^d by

$$Q(x) \overset{\text{def}}{=} \frac{1}{\alpha}\circ \mathrm{Int}(\alpha \circ x), \qquad (2)$$

where a ∘ b := (a_1 b_1, . . . , a_d b_d) ∈ ℝ^d denotes the Hadamard (element-wise) product of two vectors a = (a_1, . . . , a_d) ∈ ℝ^d and b = (b_1, . . . , b_d) ∈ ℝ^d, and 1/α is understood element-wise.

Algorithm 1 IntSGD. Default setting for the tested problems: β = 0.9, ε = 10⁻⁸.
1: Params: stepsizes η_k, scaling vectors α_k ∈ ℝ^d
2: Init: x⁰ ∈ ℝ^d, x¹ = x⁰ − η₀ (1/n) Σ_{i=1}^n g_i⁰
3: for k = 1, 2, . . . do
4:   for each device i = 1, 2, . . . , n do
5:     Compute stochastic gradient g_i^k (with E[g_i^k | x^k] ∈ ∂f_i(x^k))
6:     Maintain the moving average: r_k = β r_{k−1} + (1 − β) ‖x^k − x^{k−1}‖²
7:     Compute the adaptive scaling factor: α_k = √d / √(2n r_k / η_k² + ε²)
8:     Scale and round the local gradient: Q(g_i^k) = (1/α_k) Int(α_k g_i^k)
9:   end for
10:  Aggregate the integers Int(α_k g_i^k) by either all-reduce or in-network aggregation (INA)
11:  for each device i = 1, 2, . . . , n do
12:    Compute the (sub)gradient estimator: ḡ^k = (1/(n α_k)) Σ_{i=1}^n Int(α_k g_i^k) = (1/n) Σ_{i=1}^n Q(g_i^k)
13:    Update the model parameters: x^{k+1} = x^k − η_k ḡ^k
14:  end for
15: end for

As we show below, the adaptive integer rounding operator (2) has several properties which will be useful in our analysis. In particular, the operator is unbiased, and its variance can be controlled by the choice of a possibly random scaling vector α ∈ ℝ^d_{++}.

Lemma 1. For any x ∈ ℝ^d and α ∈ ℝ^d_{++}, we have

$$\frac{1}{\alpha}\circ\mathbb{E}\left[\mathrm{Int}(\alpha\circ x)\right] = x, \qquad (3)$$

$$\mathbb{E}\left[\left\|\frac{1}{\alpha}\circ \mathrm{Int}(\alpha\circ x) - x\right\|^2\right] \le \sum_{j=1}^{d}\frac{1}{4\alpha_j^2}. \qquad (4)$$

The expectations above are taken with respect to the randomness inherent in the rounding operator.

2.2 NEW ALGORITHM: INTSGD

We are ready to present our algorithm, IntSGD. At iteration k, each device i computes a stochastic (sub)gradient vector g_i^k, i.e., a vector satisfying

$$\mathbb{E}\left[g_i^k \mid x^k\right] \in \partial f_i(x^k). \qquad (5)$$

Prior to communication, each worker i rescales its stochastic (sub)gradient g_i^k using the same vector α_k ∈ ℝ^d_{++} and applies the randomized rounding operator Int. The resulting vectors Int(α_k ∘ g_i^k) are aggregated to obtain Σ_{i=1}^n Int(α_k ∘ g_i^k), which is also a vector of integers. Each device subsequently performs division by n and inverse scaling to decode the message, obtaining the vector

$$\bar g^k \overset{\text{def}}{=} \frac{1}{n\alpha_k}\circ \sum_{i=1}^{n} \mathrm{Int}(\alpha_k\circ g_i^k) = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{\alpha_k}\circ\mathrm{Int}(\alpha_k\circ g_i^k) \overset{(2)}{=} \frac{1}{n}\sum_{i=1}^{n} Q(g_i^k).$$

Here α_k is a random adaptive scaling factor computed from historical information; we defer the design of α_k to Section 4. By combining (5) and (3), we observe that ḡ^k is a stochastic (sub)gradient of f at x^k. Finally, all devices perform in parallel an SGD-type step of the form x^{k+1} = x^k − η_k ḡ^k, and the process is repeated. Our IntSGD method is formally stated as Algorithm 1 with the suggested rule for α_k.

Relation to QSGD (Alistarh et al., 2017). QSGD bears some similarity to IntSGD: both scale g_i^k by a factor before quantization (the scaling factor in QSGD is 1/‖g_i^k‖, used for normalization). However, a key difference makes the communication efficiency of IntSGD much better than that of QSGD. The normalization factors 1/‖g_i^k‖ in QSGD differ across workers, so the quantized values of the various workers need to be gathered and decompressed before aggregation. In contrast, the scaling factor α_k in our IntSGD is the same for all workers, so the sum of integers can be computed on the fly. Thus, IntSGD supports the efficient all-reduce primitive while QSGD does not. As seen in the experimental results in Section 5, this makes a big difference in empirical performance.
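To make Algorithm 1 concrete, here is a self-contained sketch (ours) of one IntSGD iteration that simulates the n workers in a single process. In a real deployment the integer sum on line 10 would be produced by all-reduce or INA; the initialization of the moving average and the exact form of the scaling rule follow our reconstruction of Algorithm 1 above, so treat them as assumptions rather than the authors' reference implementation.

```python
import torch

def rand_round(v):
    low = torch.floor(v)
    return low + (torch.rand_like(v) < v - low).to(v.dtype)

def intsgd_step(x, x_prev, r_prev, grads, eta, beta=0.9, eps=1e-8):
    """One IntSGD iteration; `grads` holds the n local stochastic gradients."""
    n, d = len(grads), x.numel()

    # Lines 6-7: moving average of the last model update and the adaptive scaling factor.
    r = beta * r_prev + (1 - beta) * (x - x_prev).norm() ** 2
    alpha = d ** 0.5 / torch.sqrt(2 * n * r / eta ** 2 + eps ** 2)

    # Line 8: every worker scales and stochastically rounds its gradient to integers.
    ints = [rand_round(alpha * g) for g in grads]

    # Line 10: the aggregation only ever sees integers (all-reduce / INA in practice).
    agg = torch.stack(ints).sum(dim=0)

    # Lines 12-13: decode the averaged (sub)gradient estimator and take the SGD step.
    g_bar = agg / (n * alpha)
    return x - eta * g_bar, x, r

# Toy usage on quadratics f_i(x) = 0.5 * ||x - b_i||^2 with heterogeneous targets b_i.
torch.manual_seed(0)
d, n, eta = 10, 4, 0.1
b = [torch.randn(d) for _ in range(n)]
x, x_prev = torch.zeros(d), torch.zeros(d)
r = torch.tensor(1.0)          # crude initialization; Algorithm 1 makes the first communication exact instead
for k in range(200):
    grads = [x - b_i for b_i in b]        # exact local gradients for simplicity
    x, x_prev, r = intsgd_step(x, x_prev, r, grads, eta)
print((x - torch.stack(b).mean(0)).norm())  # close to zero: the iterates approach the minimizer mean(b_i)
```

Note that with heterogeneous targets b_i the transmitted integers grow as the iterates converge, since α_k increases while the local gradients stay bounded away from zero; this is exactly the issue that motivates IntDIANA in Appendix A.2.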
Moreover, the proof technique for Int SGD is also intrinsically different from that of QSGD. Please see the next section for the details. 3 ANALYSIS OF INTSGD2 To establish convergence of Int SGD, we introduce the following assumption on the scaling vector αk = (αk,1, . . . , αk,d) Rd ++. Assumption 1. There exists β [0, 1) and a sufficiently small ε > 0 such that j=1 E h η2 k α2 k,j bounded above by η2 kε2 + 2n(1 β) k 1 P t=0 βt E xk t xk t 1 2 . While this assumption may look exotic, it captures precisely what we need to establish the convergence of Int SGD, and it holds for several practical choices of αk, including the one shown in Section 4 and more choices in Appendix A.1. Challenges in Int SGD analysis. Although the Int operation is unbiased and has finite variance as shown in Lemma 1, we highlight that it is non-trivial to obtain the convergence analysis of Int SGD and the analysis is different from that of QSGD and similar methods. Indeed, QSGD, Rank-k, and Nat SGD all use unbiased operators Q with variance satisfying E[ Q(g) g 2] ω g 2 for some ω > 0. For them, the convergence theory is simply a plug-in of the analysis of Compressed SGD (Khirirat et al., 2018). However, the integer rounding operator Int does not satisfy this property, and the variance of the integer compressor will not decrease to zero when g 2 0. Moreover, to estimate αk adaptively, we use the values of past iterates, which makes its value itself random, so the analysis of Compressed SGD cannot apply. As shown in our proofs, an extra trick is required: we reserve an additional term Pk t=0 E[ xt+1 xt 2] to control the variance of the rounding. Furthermore, the analysis of moving-average estimation is particularly challenging since the value of αk is affected by all past model updates, starting from the very first iteration. 3.1 NON-SMOOTH ANALYSIS: GENERIC RESULT Let us now show that Int SGD works well even on non-smooth functions. Assumption 2. Stochastic (sub)gradients gk 1, . . . , gk n sampled at iteration k satisfy the inequalities 1 i=1 Ek[gk i ] 2 G2, 1 n i=1 Ek h gk i Ek[gk i ] 2i σ2, (6) where the former inequality corresponds to G-Lipschitzness of f and the latter to bounded variance of stochastic (sub)gradients. Theorem 1. Let functions f1, . . . , fn be convex and Assumptions 1 and 2 be satisfied. Then E f(ˆxk) f(x ) x0 x 2+2 Pk t=0 η2 t 2 Pk 1 t=0 ηt , where ˆxk = 1 Pk t=0 ηt Pk t=0 ηtxt is a weighted average of iterates. 3.2 SMOOTH ANALYSIS: GENERIC RESULT We now develop a theory for smooth objectives. Assumption 3. There exist constants L, σ 0 such that the stochastic gradients gk 1, . . . , gk n at iteration k satisfy Ek[gk i ] = fi(xk) and i=1 gk i 2 L(f(xk) f(x )) + σ2 n . (7) 2In our analysis, we use the red color to highlight the extra terms coming from our integer compression, in contrast to the blue error terms, which come from SGD itself. Published as a conference paper at ICLR 2022 Assumption 3 is known as the expected smoothness assumption (Gower et al., 2019). In its formulation, we divide the constant term σ2 by n, which is justified by the following proposition. Proposition 1 (Section 3.3 in Gower et al. 2019). Let fi(x) = Eξ[fi(x; ξ)], gk i = fi(xk; ξk i ), and fi( ; ξ) be convex and its gradient be Li-Lipschitz for any ξ. Then, the second part of Assumption 3 is satisfied with σ2 def = 2 n Pn i=1 Eξ fi(x ; ξ) 2 and L def = 4 maxi=1,...,n Li. Gower et al. (2019) state and prove this result in a more general form, so for the reader s convenience, we provide a proof in the appendix. 
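The variance term Σ_j 1/(4α_j²) coming from Lemma 1 is precisely what Assumption 1 is designed to control. As a quick numerical sanity check (ours, not part of the paper), the snippet below verifies on a random vector that Q is unbiased and that its mean squared error stays below the bound of Equation (4).

```python
import torch

torch.manual_seed(1)
d = 5
x = torch.randn(d)
alpha = torch.rand(d) + 0.5              # arbitrary positive per-coordinate scalings

def Q(x, alpha):
    """Adaptive integer rounding operator from Eq. (2): (1/alpha) * Int(alpha * x)."""
    y = alpha * x
    low = torch.floor(y)
    rounded = low + (torch.rand_like(y) < y - low).to(y.dtype)
    return rounded / alpha

samples = torch.stack([Q(x, alpha) for _ in range(100_000)])
bias = (samples.mean(0) - x).abs().max()
mse = ((samples - x) ** 2).sum(dim=1).mean()
print(bias)                                   # close to 0: unbiasedness, Eq. (3)
print(mse, (1 / (4 * alpha ** 2)).sum())      # mse stays below the bound of Eq. (4)
```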
Theorem 2. Assume that f is convex and Assumption 3 holds. If ηk 1 2L and ˆxk = 1 Pk t=0 ηt Pk t=0 ηtxt is a weighted average of iterates, then E f(ˆxk) f(x ) x0 x 2+2 Pk t=0 η2 t 2 Pk t=0 ηt . Corollary 1 (Overparameterized regime). When the model is overparameterized (i.e., the losses can be minimized to optimality simultaneously: σ = 0), we can set ε = 0 and obtain O 1 3.3 NON-CONVEX ANALYSIS: GENERIC RESULT We now develop a theory for non-convex objectives. Assumption 4. The gradient of f is L-Lipschitz and there exists f inf R such that f inf f(x) for all x. Furthermore, for all i and k we have E gk i fi(xk) 2 σ2. (8) Our main result in the non-convex regime follows. Theorem 3. Let f be L-smooth and let Assumption 1 hold. If ηk 1 2L for all k, then E f(ˆxk) 2 2 f(x0) f inf+ Pk t=0 η2 t L Pk t=0 ηt . where ˆxk is sampled from {x0, . . . , xk} with probabilities proportional to η0, . . . , ηk. 3.4 EXPLICIT COMPLEXITY RESULTS Having developed generic complexity results for Int SGD in the non-smooth (Section 3.1), smooth (Section 3.2) and non-convex (Section 3.3) regimes, we now derive explicit convergence rates. Corollary 2. For any sequence of scaling vectors αk satisfying Assumption 1, we recover the following complexities: (i) if f1, . . . , fn are convex, Assumption 2 is satisfied and ηt = η = x0 x k(G2+σ2/n) = O 1 t = 0, . . . , k, then E f(ˆxk) f(x ) = O (ii) if f is convex, Assumption 3 holds and ηt = min n 1 2L, x0 x n E f(ˆxk) f(x ) = O (iii) if f is non-convex, Assumption 4 holds and ηt = min 1 2L, (f(x0) f inf)n E f(ˆxk) 2 = O kn + f(x0) f inf Based on Corollary 2, our Int SGD has linear speed-up in case (ii) and (iii). Published as a conference paper at ICLR 2022 Comparison with error-feedback (EF-SGD). Distributed SGD with biased compressors (like Power SGD, Sign SGD, Top-k SGD) requires the error-feedback modification to converge. In the non-convex and smooth case, EF-SGD leads to the O σ k 2/3 rate in Koloskova et al. (2019) when assuming the second moment of stochastic gradient is bounded by G2. Compared to their result, our rate is never worse and does not require the second moment of stochastic gradient to be bounded, which is often violated in practice even for quadratic objectives and simplest neural networks. The convergence guarantee of EF-SGD for the convex and non-smooth function (Assumption 2) is even weaker: Karimireddy et al. (2019) show the O σ convergence rate of EF-SGD only for the single-worker case (n = 1), which is 1 δ-times worse than Int SGD (δ could be fairly small, e.g., δ = 1/d in Top-1 compression). To the best of our knowledge, there is no convergence guarantee of EF-SGD for the non-smooth function when there are multiple workers. In contrast, our Int SGD has the same rate as SGD under the same set of assumptions. 4 DESIGN OF SCALING FACTORS 4.1 ADAPTIVE SCALING FACTOR WITH THE MOVING AVERAGE AND SAFEGUARD We now present an effective rule of adaptive αk (presented in Algorithm 1) that satisfies Assumption 1 for the convergence rates listed in previous section. In the appendix, we provide more options that also satisfy Assumption 1 and the proof still goes through. For simplicity, we assume that the first communication is exact, which allows us to estimate αk adaptively without worrying about α0. Proposition 2. Assumption 1 holds if we choose β [0, 1), ε 0 and αk = 2nrk/η2 k+ε2 , where rk = βrk 1 + (1 β) xk xk 1 2. Remark 1. 
Here β ∈ [0, 1) is a constant factor controlling the moving-average update of r_k, which prevents the scaling factor α_k from changing too rapidly. ε² can be any sufficiently small number; it serves as a safeguard against a potential division by zero. We study the sensitivity of our IntSGD to β and ε in Appendix C.4.

4.2 COMPRESSION EFFICIENCY

Let us now discuss the number of bits needed for the compressed vectors. Although the main attraction of IntSGD is that it can perform efficient in-network communication, we may also hope to gain from the smaller size of the updates. Consider for simplicity the case where ‖x^k − x^{k−1}‖ ≈ η_k ‖g_i^k‖ for some i. The adaptive scheme with β = 0, ε = 0 gives

$$\alpha_k = \frac{\eta_k\sqrt{d}}{\sqrt{2n}\,\|x^k - x^{k-1}\|} \approx \frac{\sqrt{d}}{\sqrt{2n}\,\|g_i^k\|},$$

so that ‖α_k g_i^k‖_∞ ≤ α_k ‖g_i^k‖ = √(d/(2n)). Since we only use signed integers, we need at most 1 + log₂ √(d/(2n)) bits for each coordinate. For instance, for d ≤ 10¹⁰ and n ≥ 100, the upper bound is 1 + log₂(√(5 · 10⁷)) < 14 bits. The situation becomes even better when ‖g_i^k‖_∞ ≪ ‖g_i^k‖, i.e., when the stochastic gradients are dense. This property has been observed in certain empirical evaluations for deep neural networks; see for example the study in Bernstein et al. (2018).

5 EXPERIMENTS

We empirically compare our IntSGD algorithm with several representative and strong baselines: SGD (both all-reduce- and all-gather-based), Heuristic IntSGD (Sapio et al., 2021), PowerSGD with error feedback (EF) (Vogels et al., 2019), NatSGD (Horváth et al., 2019), and QSGD (Alistarh et al., 2017). The experiments are performed on 16 NVIDIA Tesla V100 GPUs located on 8 compute nodes of a cluster (2 GPUs per node), following the PowerSGD paper. The compute nodes are connected via an InfiniBand HDR-100 Director Switch at 100 Gbps. The cluster also supports the NVIDIA Collective Communications Library (NCCL).

We consider two tasks: image classification with ResNet18 (He et al., 2016) on the CIFAR-10 dataset, and language modeling with a 3-layer LSTM on the Wikitext-2 dataset. The neural network architectures and hyperparameters follow public PyTorch implementations (ResNet18: https://github.com/kuangliu/pytorch-cifar; LSTM: https://github.com/pytorch/examples/tree/master/word_language_model). Our code is built on the codebase of PowerSGD (https://github.com/epfml/powersgd), and we also borrow their all-reduce-based implementations of SGD and PowerSGD. It is worth noting that QSGD and NatSGD do not support all-reduce, so we implement their collective communication with all-gather; the compression and decompression routines for QSGD and NatSGD are from the authors of NatSGD (https://github.com/sands-lab/grace). For the sake of comparison, we also implement an all-gather-based SGD. We report the results of 3 repetitions with varying seeds.

Apart from IntSGD with randomized integer rounding (IntSGD (Random)), which is the variant analyzed in our theory, we also consider a variant with deterministic integer rounding (IntSGD (Determ.)), which can use the PyTorch built-in function torch.round. For all IntSGD variants, we clip the local stochastic gradients to ensure that each aggregated value fits in either 8 bits or 32 bits. For more details of the experimental setup, please refer to Appendix C.1.

5.2 INTSGD VS. HEURISTIC INTSGD

First, we compare our IntSGD with the most closely related algorithm, Heuristic IntSGD (Sapio et al., 2021). For both algorithms, we consider two potential communication data types: int8 and int32.
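The two rounding variants and the clipping step just described can be sketched as follows (ours, not the released code). In particular, the way the clipping threshold divides the integer range by the number of workers is our assumption about how to keep each aggregated value within the chosen data type.

```python
import torch

def round_and_clip(g, alpha, n_workers, bits=8, stochastic=True):
    """Scale, clip so that the sum over n workers fits a signed `bits`-bit integer, then round."""
    limit = (2 ** (bits - 1) - 1) // n_workers        # assumed per-worker budget
    y = torch.clamp(alpha * g, -limit, limit)
    if stochastic:                                    # IntSGD (Random)
        low = torch.floor(y)
        y = low + (torch.rand_like(y) < y - low).to(y.dtype)
    else:                                             # IntSGD (Determ.)
        y = torch.round(y)
    return y.to(torch.int32)

g = torch.randn(1000)
ints = round_and_clip(g, alpha=5.0, n_workers=16, bits=8, stochastic=True)
print(ints.dtype, ints.abs().max())                   # integers small enough to aggregate in int8
```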
Note that the scaling rule in Heuristic IntSGD is α = 2^(n_b − 1) / (n · 2^(max_exp)), where n_b is the number of bits used to encode each coordinate and max_exp is the rounded exponent of the largest absolute value in the communicated package. Although this scaling rule is straightforward and avoids overflows, it cannot guarantee convergence, even with int32 as the communication data type. Indeed, Heuristic IntSGD may fail to match the test performance of full-precision SGD, as shown in Figure 1. In contrast, our IntSGD matches the performance of full-precision SGD on both the image classification and the language modeling task, in accordance with our theory that IntSGD is provably convergent.

[Figure 1: Comparison of IntSGD (8-bit and 32-bit), Heuristic IntSGD (8-bit and 32-bit), and full-precision SGD on the tasks of training ResNet18 (test accuracy vs. epochs) and the 3-layer LSTM (test perplexity vs. epochs).]

5.3 INTSGD VS. OTHER BASELINES

We also compare our IntSGD algorithms to the other baselines, namely the all-gather-based SGD, QSGD and NatSGD, and the all-reduce-based SGD and PowerSGD (EF), on the two tasks. The test performance and time breakdown are reported in Table 2 and Table 3.

Table 2: Test accuracy and time breakdown in one iteration (on average) of training ResNet18 on the CIFAR-10 dataset with 16 workers. All times are in milliseconds (ms).

| Algorithm | Test Accuracy (%) | Computation Overhead | Communication | Total Time |
| --- | --- | --- | --- | --- |
| SGD (All-gather) | 94.65 ± 0.08 | - | 261.29 ± 0.98 | 338.76 ± 0.76 |
| QSGD | 93.69 ± 0.03 | 129.25 ± 1.58 | 138.16 ± 1.29 | 320.49 ± 2.11 |
| NatSGD | 94.57 ± 0.13 | 36.01 ± 1.30 | 106.27 ± 1.43 | 197.18 ± 0.25 |
| SGD (All-reduce) | 94.67 ± 0.17 | - | 18.48 ± 0.09 | 74.32 ± 0.06 |
| PowerSGD (EF) | 94.33 ± 0.15 | 7.07 ± 0.03 | 5.03 ± 0.07 | 67.08 ± 0.06 |
| IntSGD (Determ.) | 94.43 ± 0.12 | 2.51 ± 0.04 | 6.92 ± 0.07 | 64.95 ± 0.15 |
| IntSGD (Random) | 94.55 ± 0.13 | 3.20 ± 0.02 | 6.21 ± 0.13 | 65.22 ± 0.08 |

First, we observe that the all-gather-based SGD combined with compressors (QSGD, NatSGD) is indeed faster than the all-gather-based full-precision SGD, which shows the benefit of lossy compression. However, these methods are still much slower than the all-reduce-based full-precision SGD; unfortunately, QSGD and NatSGD do not support the more efficient all-reduce primitive. Similar observations were made in previous works (Vogels et al., 2019; Agarwal et al., 2021). PowerSGD (EF), IntSGD (Random), and IntSGD (Determ.) are all consistently faster than the all-reduce-based full-precision SGD on both tasks. Compared to IntSGD (Determ.), IntSGD (Random) incurs slightly more computation overhead due to the randomized rounding. However, IntSGD (Determ.) fails to match the test performance of SGD on the language modeling task. Compared to PowerSGD (EF), our IntSGD variants are better on the task of training ResNet18 but inferior on the task of training the 3-layer LSTM.
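Returning to the Heuristic IntSGD rule quoted at the beginning of Section 5.2, the following small sketch (ours, based only on that formula) illustrates how the profiling-based scaling factor is computed; the interpretation of "rounded exponent" as a ceiling, and the omission of SwitchML's actual profiling machinery, are our assumptions.

```python
import math

def heuristic_scale(values, n_workers, n_bits=32):
    """Heuristic scaling alpha = 2^(n_b - 1) / (n * 2^max_exp) from the largest magnitude."""
    largest = max(abs(v) for v in values)
    max_exp = math.ceil(math.log2(largest))           # "rounded exponent" (assumed to be the ceiling)
    return 2 ** (n_bits - 1) / (n_workers * 2 ** max_exp)

grads = [0.013, -0.4, 0.0021, 0.08]
alpha = heuristic_scale(grads, n_workers=16, n_bits=32)
print(alpha, [round(alpha * v) for v in grads])        # scaled integers stay within the int32 budget
```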
Although IntSGD is not always better than PowerSGD (EF), there are several scenarios where IntSGD is preferable, as explained in Section 1 and Section 3.4. In addition, as seen in Figure 3 of Appendix C.3, PowerSGD (EF) converges much more slowly than SGD and IntSGD in the first 150 epochs of the ResNet training (which uses non-smooth activations).

Table 3: Test loss and time breakdown in one iteration (on average) of training a 3-layer LSTM on the Wikitext-2 dataset with 16 workers. All times are in milliseconds (ms).

| Algorithm | Test Loss | Computation Overhead | Communication | Total Time |
| --- | --- | --- | --- | --- |
| SGD (All-gather) | 4.52 ± 0.01 | - | 733.07 ± 1.04 | 796.23 ± 1.03 |
| QSGD | 4.63 ± 0.01 | 43.67 ± 0.11 | 307.63 ± 1.16 | 399.10 ± 1.25 |
| NatSGD | 4.52 ± 0.01 | 64.63 ± 0.12 | 309.87 ± 1.32 | 422.49 ± 2.15 |
| SGD (All-reduce) | 4.54 ± 0.03 | - | 22.33 ± 0.02 | 70.46 ± 0.05 |
| PowerSGD (EF) | 4.52 ± 0.01 | 4.22 ± 0.01 | 2.10 ± 0.01 | 54.89 ± 0.02 |
| IntSGD (Determ.) | 4.70 ± 0.02 | 3.04 ± 0.01 | 6.94 ± 0.05 | 57.93 ± 0.03 |
| IntSGD (Random) | 4.54 ± 0.01 | 4.76 ± 0.01 | 7.14 ± 0.04 | 59.99 ± 0.01 |

6 CONCLUSION

In this paper, we propose the provably convergent and computationally cheap IntSGD algorithm for efficient distributed machine learning. The core component of IntSGD is the adaptively estimated scaling factor shared by all workers, which makes it compatible with the widely used all-reduce communication primitive and with the recently proposed in-network aggregation (INA) (Sapio et al., 2021). The convergence rates of IntSGD match those of SGD up to constant factors on a broad spectrum of problems. Experimental results on two deep learning tasks show its promising empirical performance. A limitation of our algorithm is that its compression ratio is bounded by 4, but we hope to address this in future work.

Reproducibility statement. Regarding the theoretical results: we describe the mathematical setting and algorithms in Sections 1 and 2 and Appendix A; the assumptions and main theoretical results are presented in Section 3; and we provide complete proofs for those results in Appendix B. Regarding the experimental results: we report the number of repetitions, the computing infrastructure used, the range of hyperparameters considered, and the evaluation metrics in Section 5 and Appendix C.1, and we attach our code in the supplementary material.

REFERENCES

Saurabh Agarwal, Hongyi Wang, Shivaram Venkataraman, and Dimitris Papailiopoulos. On the utility of gradient compression in distributed training systems. arXiv preprint arXiv:2103.00543, 2021.

Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. QSGD: Communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems, pp. 1709-1720, 2017.

Zeyuan Allen-Zhu and Elad Hazan. Variance reduction for faster non-convex optimization. In The 33rd International Conference on Machine Learning, pp. 699-707, 2016.

Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, and Animashree Anandkumar. SignSGD: Compressed optimisation for non-convex problems. In Jennifer Dy and Andreas Krause (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 560-569, Stockholmsmässan, Stockholm, Sweden, 10-15 Jul 2018. PMLR.

Aleksandr Beznosikov, Samuel Horváth, Peter Richtárik, and Mher Safaryan. On biased compression for distributed learning. arXiv:2002.12410, 2020.
Lisandro Dalcín, Rodrigo Paz, and Mario Storti. MPI for Python. Journal of Parallel and Distributed Computing, 65(9):1108 1115, 2005. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. ar Xiv preprint ar Xiv:1810.04805, 2018. Robert M. Gower, Mark Schmidt, Francis Bach, and Peter Richtárik. Variance-reduced methods for machine learning. Proceedings of the IEEE, 108(11):1968 1983, 2020. Robert Mansel Gower, Nicolas Loizou, Xun Qian, Alibek Sailanbayev, Egor Shulgin, and Peter Richtárik. SGD: General analysis and improved rates. In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 5200 5209, Long Beach, California, USA, 09 15 Jun 2019. PMLR. Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch sgd: Training imagenet in 1 hour. ar Xiv preprint ar Xiv:1706.02677, 2017. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770 778, 2016. Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. ar Xiv preprint ar Xiv:1503.02531, 2015. Samuel Horváth, Chen-Yu Ho, L udovít Horváth, Atal Narayan Sahu, Marco Canini, and Peter Richtárik. Natural compression for distributed deep learning. ar Xiv preprint ar Xiv:1905.10988, 2019. Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700 4708, 2017. Published as a conference paper at ICLR 2022 Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. Advances in neural information processing systems, 26:315 323, 2013. Sai Praneeth Karimireddy, Quentin Rebjock, Sebastian Stich, and Martin Jaggi. Error feedback fixes Sign SGD and other gradient compression schemes. In International Conference on Machine Learning, pp. 3252 3261, 2019. Sarit Khirirat, Hamid Reza Feyzmahdavian, and Mikael Johansson. Distributed learning with compressed gradients. ar Xiv preprint ar Xiv:1806.06573, 2018. Anastasia Koloskova, Tao Lin, Sebastian U Stich, and Martin Jaggi. Decentralized deep learning with arbitrary communication compression. ar Xiv preprint ar Xiv:1907.09356, 2019. Dmitry Kovalev, Samuel Horváth, and Peter Richtárik. Don t jump through hoops and remove those loops: SVRG and Katyusha are better without the outer loop. In Algorithmic Learning Theory, pp. 451 467. PMLR, 2020. Konstantin Mishchenko, Eduard Gorbunov, Martin Takáˇc, and Peter Richtárik. Distributed learning with compressed gradient differences. ar Xiv preprint ar Xiv:1901.09269, 2019. Yurii Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013. Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 8821 8831. PMLR, 18 24 Jul 2021. Mher Safaryan, Egor Shulgin, and Peter Richtárik. 
Uncertainty principle for communication compression in distributed and federated learning and the search for an optimal compressor. ar Xiv preprint ar Xiv:2002.08958, 2020. Amedeo Sapio, Marco Canini, Chen-Yu Ho, Jacob Nelson, Panos Kalnis, Changhoon Kim, Arvind Krishnamurthy, Masoud Moshref, Dan R. K. Ports, and Peter Richtárik. Scaling distributed machine learning with in-network aggregation. To appear in 18th USENIX Symposium on Networked Systems Design and Implementation, 2021. Sebastian U. Stich, Jean-Baptiste Cordonnier, and Martin Jaggi. Sparsified SGD with memory. In Advances in Neural Information Processing Systems, pp. 4447 4458, 2018. Thijs Vogels, Sai Praneeth Karinireddy, and Martin Jaggi. Powersgd: Practical low-rank gradient compression for distributed optimization. Advances In Neural Information Processing Systems 32 (Nips 2019), 32(CONF), 2019. Hongyi Wang, Scott Sievert, Shengchao Liu, Zachary Charles, Dimitris Papailiopoulos, and Stephen Wright. ATOMO: Communication-efficient learning via atomic sparsification. Advances in Neural Information Processing Systems, 31:9850 9861, 2018. Hang Xu, Chen-Yu Ho, Ahmed M. Abdelmoniem, Aritra Dutta, El Houcine Bergou, Konstantinos Karatsenidis, Marco Canini, and Panos Kalnis. Compressed communication for distributed deep learning: Survey and quantitative evaluation. Technical report, 2020. Published as a conference paper at ICLR 2022 A OTHER VARIANTS OF INTSGD A.1 OTHER CHOICES OF SCALING FACTOR αk In Section 4.1, we provide an effective scaling factor with the moving average and the safeguard. However, there are other choices of scaling factor that also satisfy Assumption 1, and the convergence proof still goes through. Proposition 3 (Adaptive αk). If we choose 2n xk xk 1 , then Assumption 1 holds with ε = 0 and β = 0. One can also consider applying an integer quantization with individual values of αt for each coordinate or block, for instance, with an αt,l corresponding to the l-th layer in a neural network. It is straightforward to see that this modification leads to the error PB l=1 dl η2 k α2 k,l , where B is the total number of blocks and dl is the dimension of the l-th block. Proposition 4 (Adaptive block αk). Assume we are given a partition of all coordinates into B d blocks with dimensions d1, . . . , d B, and denote by (xk)l the l-th block of coordinates of xk. Then Assumption 1 holds with αk,(l) = ηk dl 2n (xk)l (xk 1)l , for l = 1, . . . , B. There are two extreme cases in terms of how we can choose the blocks. One extreme is to set B = 1, in which case we have a single scalar for the whole vector. The other extreme is to use B = d, which means that αk = ηk 2 n|xk xk 1|, where the division and absolute values are computed coordinate-wise. Algorithm 2 Int SGD: adaptive block quantization 1: Input: x0 Rd, β [0, 1), ε 0, x1 = x0 η0 1 n Pn i=1 g0 i a partitioning of Rd into B blocks of sizes d1, . . . , d B such that Rd = Rd1 Rd B 2: for k = 1, 2, . . . do 3: for each device i = 1, 2, . . . , n do 4: Compute independent stochastic gradients gk i (Ek[gk i ] fi(xk)) 5: Maintain the exponetial moving average: rk,l = βrk 1,l +(1 β) (xk)l (xk 1)l 2 {for each block l = 1, . . . , B} 6: Compute the adaptive scaling factors: αk,l = ηk dl q 2nrk,l+η2 k dl 7: Scale and round the local gradient (Q(gk i ))l = Int(αk,l(gk i )l) 8: end for 9: Aggregate Q(gk i ) by either all-reduce or in-network aggregation (INA) 10: for each device i = 1, 2, . . . 
, n do 11: Compute the (sub)gradient estimator: ( gk)l = 1 nαk,l Pn i=1(Q(gk i ))l 12: xk+1 = xk ηk gk 13: end for 14: end for Compression efficiency of Int SGD with adaptive block quantization. Our block-wise and coordinate-wise compression can further benefit from reduced dimension factors in the upper bounds, leading to the estimate of log2 dl 2 n bits for block with dimension dl. However, for smaller blocks it is less likely to happen that (xk)l (xk 1)l ηk(gk i )l , so the estimate should be taken Published as a conference paper at ICLR 2022 Algorithm 3 Int DIANA 1: Params: Stepsizes ηk, scaling vectors αk Rd 2: Init: x0 Rd, x1 = x0 η0 1 n Pn i=1 g0 i , h1 i = 0, h1 = 0 3: for k = 1, 2, . . . do 4: for each device i = 1, 2, . . . , n do 5: Compute stochastic gradient gk i (E gk i | xk fi(xk)). 6: Compute the adaptive scaling factor: αk = ηk 2n xk xk 1 7: Scale and round the local gradient Q(gk i ) = Int(αk (gk i hk i )) 8: Update the local shift hk+1 i = hk i + Q(gk i ) 9: end for 10: Aggregate Q(gk i ) by either all-reduce or in-network aggregation (INA) 11: for each device i = 1, 2, . . . , n do 12: Compute the (sub)gradient estimator: gk = hk + 1 nαk Pn i=1 Q(gk i ) 13: Update the model parameter xk+1 = xk ηk gk 14: Update global shift hk+1 = hk + 1 nαk Pn i=1 Q(gk i ) 15: end for 16: end for with a grain of salt. We hypothesize that using ε as in Proposition 2 is required to make block compression robust. Notice that if stochastic gradients have bounded coordinates, i.e., gk i G for all i, k, then we would need at most 1 + log2 ε bits to encode the integers. Since any ε σ + n G does not change the rate in the non-smooth case (see Equation (9)), we get for free the upper bound of 1 + log2 d G n G bits. A.2 HANDLING HETEROGENEOUS DATA Int SGD can be equipped with the full gradient or variance-reduced gradient estimator to enjoy faster convergence than O 1 shown in Corollary 2. For example, if we plug σ = 0 (no variance) and k (sufficiently small safeguard) into item 2 of Corollary 2, the convergence rate of Int SGD is O 1 k . However, when the data are heterogeneous (i.e., minimizing f(x) = 1 n Pn i=1 fi(x) will not make fi(x ) = 0, i [n]), the transmitted integer of Int SGD with σ = 0, ε = 0 can be gigantically large, which leads to very inefficient communications or even exception value error. E.g., if we choose the adaptive αk and the full gradient gk i = fi(xk), the largest integer to transmit from worker i to the master is αk fi(xk) fi(xk) xk xk 1 , where the denominator is 0 while the numerator is nonzero as the iterate converges to the optimum. To alleviate this issue, one needs to compress gradient differences as is done for example by Mishchenko et al. (2019) in their DIANA method. By marrying Int SGD with the DIANA trick, we obtain Int DIANA (Algorithm 3). For Int DIANA with adaptive αk, the largest transmitted integer from worker to the master is αk(gk i hk i ) gk i hk i xk xk 1 . We will show that both the nominator and the denominator are infinitesimal when xk converges to the optimum, such that the issue mentioned above can hopefully be solved. Note that we can either use the full gradient gk i = fi(xk) or the L-SVRG estimator Kovalev et al. (2020) gk i = fi(xk; ξk i ) fi(wk i ; ξk i ) + E fi(wk i ; ξ) on the i-th worker. For the variance-reduced method, we further assume that fi has a finite-sum structure, i.e., such that fi(x; ξ) = fil(x), E [ fi(x; ξ)] = 1 m Pm l =1 fil (x) and l is sampled from [m] uniformly at random by ξ. 
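A compact sketch (ours) of one IntDIANA round (Algorithm 3), again simulating the n workers locally: each worker quantizes the difference g_i^k − h_i^k, the shifts are updated by the quantized difference, and the estimator adds the global shift back. The simple scaling rule used here follows the adaptive α_k = η_k √d / (√(2n) ‖x^k − x^{k−1}‖) form discussed above, and the small clamp guarding against division by zero is our own safeguard, so both should be read as assumptions.

```python
import torch

def rand_round(v):
    low = torch.floor(v)
    return low + (torch.rand_like(v) < v - low).to(v.dtype)

def intdiana_step(x, x_prev, grads, h_local, h_global, eta, n):
    d = x.numel()
    # Adaptive scaling factor from the distance between the last two iterates (assumed form).
    alpha = eta * d ** 0.5 / ((2 * n) ** 0.5 * (x - x_prev).norm().clamp_min(1e-12))

    # Each worker rounds the *difference* between its gradient and its local shift.
    ints = [rand_round(alpha * (g - h)) for g, h in zip(grads, h_local)]
    h_local = [h + q / alpha for h, q in zip(h_local, ints)]   # local shift update

    agg = torch.stack(ints).sum(dim=0)                         # integer aggregation (all-reduce / INA)
    g_bar = h_global + agg / (n * alpha)                       # add the global shift back
    h_global = h_global + agg / (n * alpha)                    # global shift update
    return x - eta * g_bar, x, h_local, h_global
```

In a real run the integer sums would again be produced by all-reduce or INA, and, per Theorem 4 below, `grads` can hold either the full local gradients ∇f_i(x^k) or L-SVRG estimators.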
Our main convergence theorem describing the behavior of Int DIANA follows: Published as a conference paper at ICLR 2022 Theorem 4. Assume that f is µ-strongly convex (µ 0) and f( ; ξ) has Li-Lipschitz gradient for any ξ, L def = 4 maxi Li. 1. If µ > 0, the iterates of Int DIANA with adaptive αk = η d n xk xk 1 satisfy For Int DIANA with the GD estimator gk i = fi(xk), we have θ def = max 1 ηµ, 3 1 and Ψk def = xk x 2 + xk xk 1 2 + η2L2 i=1 hk i fi(x ) 2, where ηk = η 1 2(L+ L 32n ); For Int DIANA with the L-SVRG estimator, we have θ def = max 1 ηµ, 3 4, 1 3 8m < 1 and Ψk def = xk x 2 + xk xk 1 2 + 8η2 n2 Pn i=1 Pm l=1 fil(wk i ) fil(x ) 2 + 4n2 Pn i=1 hk i fi(x ) 2, where p = 1 m, and ηk = η 1 2(L+2L/n). 2. If µ = 0, the iterates of Int DIANA with adaptive αk = η d n xk xk 1 satisfy E f(ˆxk) f(x ) Ψ0 η(k+1), where ˆxk = 1 k+1 Pk i=0 xi. Int DIANA with the GD estimator requires that ηk = η 1 4(L+ L 32n ), Int DIANA with the L-SVRG estimator requires that ηk = η 1 4(L+2L/n). The above theorem establishes linear convergence of two versions of Int DIANA in the strongly convex regime and sublinear convergence in the convex regime. Compression efficiency of Int DIANA. If µ > 0, for Int DIANA with adaptive αk and either GD or L-SVRG estimator, both hk i fi(x ) 2 and xk xk 1 2 converge to 0 linearly at the same rate, while gk i fi(x ). Thus, the largest integer to transmit is αk(gk i hk i ) gk i hk i xk xk 1 is hopefully upper bounded. B.1 PROOFS FOR INTSGD In the section, we provide the complete convergence proof of Int SGD. B.1.1 PROOFS FOR LEMMA 1 Proof. Take y = α x and let py = y [y], where [y] is the coordinate-wise floor, and py is the vector of probabilities in the definition of Int(y). By definition it holds E [Int(y)] = py([y] + 1) + (1 py)[y] = py + [y] = y [y] + [y] = y. Plugging back y = α x, we obtain the first claim. y Int(y) = max j=1,...,d yj Int(yj) max z R max z [z], [z] + 1 z = 1. After substituting y = α x, it remains to mention 1 α Int(α x) x Int(α x) α x max j=1,...,d 1 αj . Published as a conference paper at ICLR 2022 To obtain the last fact, notice that Int(y) [y] is a vector of Bernoulli random variables. Since the variance of any Bernoulli variable is bounded by 1 " 1 α Int(α x) x 1 α2 j E (Int(yj) yj)2 The staring point of our analysis is the following recursion. Let ρk def = xk x 2, δk def = f(xk) f(x ) and ζk def = 1 n Pn i=1 gk i 2. Lemma 2. Assume that either i) functions f1, . . . , fn are convex, or ii) f is convex and f1, . . . , fn are differentiable. Then Ek ρk+1 ρk 2ηkδk + Ak + Bk, where Ak def = 2η2 k Ek ζk and Bk def = 1 2n Pd j=1 η2 k α2 k,j xk+1 xk 2 are the SGD and quantization error terms, respectively. B.1.2 PROOF OF LEMMA 2 Proof. The last term in the expression that we want to prove is needed to be later used to compensate quantization errors. For this reason, let us save one xk+1 xk 2 for later when expanding xk+1 x 2. Consider the Int SGD step xk+1 xk = ηk 1 n Pn i=1 Q(gk i ), where Q(gk i ) = 1 αk Int(αk gk i ). xk+1 x 2 = xk x 2 + 2 xk+1 xk, xk x + xk+1 xk 2 = xk x 2 + 2 xk+1 xk, xk x + 2 xk+1 xk 2 xk+1 xk 2 = xk x 2 2ηk i=1 Q(gk i ), xk x + 2η2 k i=1 Q(gk i ) 2 xk+1 xk 2. Now let us forget about the first and last terms, which will directly go into the final bound, and work with the other terms. By our assumptions, we either have Ek[Q(gk i )] = Ek[gk i ] fi(xk) or 1 n Pn i=1 Ek[gk i ] = f(xk), so we obtain by convexity i=1 Q(gk i ), xk x i=1 Ek[gk i ], xk x 2ηk(f(xk) f(x )). 
Moreover, using the tower property of expectation, we can decompose the penultimate term in (10) as follows: i=1 Q(gk i ) i=1 Q(gk i ) i=1 EQ[Q(gk i )] i=1 (Q(gk i ) EQ[Q(gk i )]) i=1 Ek Q(gk i ) gk i 2 where in the last step we also used independence of the quantization errors (Q(gk 1) gk 1), . . . , (Q(gk n) gk n). Next, we are going to deal with the quantization terms: i=1 Ek Q(gk i ) gk i 2 " 1 αk Int(αk gk i ) gk i 1 4α2 k,j = n Dividing both sides by n2 and plugging it into (10), we obtain the desired decomposition into SGD and quantization terms. Published as a conference paper at ICLR 2022 We now show how to control the quantization error by choosing the scaling vector αk in accordance with Assumption 1. Lemma 3. If the assumptions of Lemma 2 hold together with Assumption 1, then E ρk+1 ρ0 2 k P t=0 ηt E [δt] + 2 k P t=0 η2 t E [ζt] + ε2 B.1.3 PROOF OF LEMMA 3 Proof. Firstly, let us recur the bound in Lemma 2 from k to 0: E xk+1 x 2 x0 x 2 2 t=0 ηt E f(xt) f(x ) + 2 " η2 k α2 k,j t=0 E xt+1 xt 2 . Note that in the bound we do not have α0,j for any j as we assume that the first communication is done without compression. Assumption 1 implies for the quantization error " η2 t α2 t,j η2 t ε2 + (1 β) l=0 βl E xt l xt l 1 2 t=1 η2 t + (1 β) E xt xt 1 2 k t X t=1 η2 t + (1 β) E xt xt 1 2 X t=1 E xt xt 1 2 . (11) It is clear that the latter terms get canceled when we plug this bound back into the first recursion. B.1.4 PROOF OF THEOREM 1 Proof. Most of the derivation has been already obtained in Lemma 3 and we only need to take care of the SGD terms. To do that, we decompose the gradient error into expectation and variance: i=1 Ek[gk i ] i=1 (gk i Ek[gk i ]) i=1 Ek[gk i ] " gk i Ek[gk i ] (6) G2 + σ2 Thus, we arrive at the following corollary of Lemma 3: 0 E xk+1 x 2 x0 x 2 2 t=0 ηt E f(xt) f(x ) + 2 Furthermore, by convexity of f we have f(ˆxk) f(x ) 1 Pk t=0 ηt t=0 ηt(f(xt) f(x )). (12) Plugging it back, rearranging the terms and dropping E xk+1 x 2 gives the result. Published as a conference paper at ICLR 2022 B.1.5 PROOF OF PROPOSITION 1 Proof. Fix any i. By Young s inequality and independence of ξk 1, . . . , ξk n we have i=1 fi(xk; ξk i ) i=1 fi(x ; ξk i ) fi(xk; ξk i ) fi(x ; ξk i ) i=1 E fi(x ; ξk i ) 2 + 2E fi(xk; ξk i ) fi(x ; ξk i ) Substituting the definition of σ2 and applying Jensen s inequality, we derive i=1 E fi(xk; ξk i ) fi(x ; ξk i ) 2 . By our assumption, fi( ; ξ) is convex and has Li-Lipschitz gradient, so we can use Equation (2.1.7) in Theorem 2.1.5 in Nesterov (2013): 2E fi(xk; ξk i ) fi(x ; ξk i ) 2 4Li E fi(xk; ξk i ) fi(x ; ξk i ) fi(x ; ξk i ), xk x = 4Li E fi(xk) fi(x ) fi(x ), xk x LE fi(xk) fi(x ) fi(x ), xk x . Taking the average over i = 1, . . . , n and noticing Pn i=1 fi(x ) = 0 yields i=1 E fi(xk) fi(x ) fi(x ), xk x = σ2 n + LE f(xk) f(x ) , which is exactly our claim. B.1.6 PROOF OF THEOREM 2 Proof. The proof is almost identical to that of Theorem 1, but now we directly use Assumption 3 and plug it in inside Lemma 3 to get E xk+1 x 2 x0 x 2 2 t=0 ηt E f(xt) f(x ) + 2 t=0 2ηt(1 ηt L)E f(xt) f(x ) + 2 σ2 n + ε2 t=0 ηt E f(xt) f(x ) + 2 σ2 n + ε2 Rearranging this inequality yields t=0 ηt E f(xt) f(x ) x0 x 2 E xk+1 x 2 + 2 σ2 n + ε2 x0 x 2 + 2 σ2 n + ε2 To finish the proof, it remains to upper bound f(ˆxk) using convexity the same way as it was done in Equation (12). Published as a conference paper at ICLR 2022 B.1.7 PROOF OF THEOREM 3 Proof. 
By L-smoothness of f we have Ek[f(xk+1)] f(xk) + Ek[ f(xk), xk+1 xk ] + L 2 Ek[ xk+1 xk 2] i=1 Ek[ f(xk), Q(gk i ) ] + L 2 Ek[ xk+1 xk 2] (3)= f(xk) ηk f(xk) 2 + L 2 Ek[ xk+1 xk 2] = f(xk) ηk f(xk) 2 + LEk[ xk+1 xk 2] L 2 Ek[ xk+1 xk 2] = f(xk) ηk f(xk) 2 + η2 k LEk i=1 Q(gk i ) 2 Ek[ xk+1 xk 2]. Similarly to Lemma 2, we get a decomposition into SGD and quantization errors: i=1 Q(gk i ) i=1 Q(gk i ) i=1 Ek[ Q(gk i ) gk i 2] We proceed with the two terms separately. To begin with, we further decompose the SGD error into its expectation and variance: i=1 Ek[gk i ] i=1 (gk i Ek[gk i ]) = f(xk) 2 + 1 i=1 Ek gk i fi(xk) 2 (8) f(xk) 2 + σ2 Moving on, we plug it back into the upper bound Ek[f(xk+1)]. Assuming ηk 1 2L, we get Ek f(xk+1) f(xk) ηk(1 ηk L) f(xk) 2 + η2 k Lσ2 η2 k α2 k,j L 2 Ek[ xk+1 xk 2] 2 f(xk) 2 + η2 k Lσ2 η2 k α2 k,j L 2 Ek[ xk+1 xk 2]. Finally, reusing Equation (11) produces the bound E f(xk+1) f(x0) 2 E f(xt) 2 + σ2 t=0 η2 t L + ε2 t=0 η2 t L. Notice that by Assumption 4 f inf f(xk+1), so we have 1 Pk t=0 ηt t=0 ηt E f(xt) 2 2 f(x0) f inf + Pk t=0 η2 t L Pk t=0 ηt . The left-hand side is equal to E f(ˆxk) 2 by definition of ˆxk, and we conclude the proof. Published as a conference paper at ICLR 2022 B.1.8 PROOF OF COROLLARY 2 Proof. For the first part, we have Pk t=0 η2 t 2 Pk t=0 ηt = O 1 Pk t=0 ηt The other complexities follow similarly. B.1.9 PROOF OF PROPOSITION 2 Proof. By definition of αk " η2 k α2 k,j = η2 kε2 + 2n E [rk] = η2 kε2 + 2n(1 β) t=0 βt xk t xk t 1 2. B.1.10 PROOF OF PROPOSITION 3 Proof. Indeed, we only need to plug in the values of αk,j: " η2 k α2 k,j = 2n E xk xk 1 2 β=0 = 2n(1 β) t=0 βt E xk t xk t 1 2 . B.1.11 PROOF OF PROPOSITION 4 Proof. Since the l-th block has dl coordinates, we get " η2 k α2 k,j " η2 k α2 k,(l) l=1 E (xk)l (xk 1)l 2 = 2n E xk xk 1 2 . B.2 PROOFS FOR INTDIANA Assumption 5. fil(x) has Lil-Lipschitz gradient. We define L def = 4 maxi [n] maxl [m] Lil. Proposition 5. Suppose that Assumption 5 holds. Then, we have the following for Int DIANA and any x Rd: 1 mn l=1 fil(x) fil(x ) 2 L 2 (f(x) f(x )) (13) Proof. Based on Assumption 5 and Theorem 2.1.5 Nesterov (2013), we have: fil(x) fil(x ) 2 2Lil (fil(x) fil(x ) fil(x ), x x ) Thus, double averaging leads to: l=1 fil(x) fil(x ) 2 l=1 Lil (fil(x) fil(x ) fil(x ), x x ) 2 max i max l Lil f(x) f(x ) 1 l=1 fil(x ), x x Considering that 1 mn Pn i=1 Pm l=1 fil(x ) = 0 and defining 4 maxi [n] maxl [m] Lil leads to the claim in the proposition. Published as a conference paper at ICLR 2022 Lemma 4. For Int DIANA (Algorithm 3) and gk def = 1 n Pn i=1(hk i + Q(gk i )), we have Ek gk = f(xk) and: Proof. By definition, gk = 1 n Pn i=1(hk i + Q(gk i )), so i=1 hk i + 1 i=1 Ek Q(gk i ) (3) = 1 i=1 hk i + 1 i=1 Ek gk i 1 i=1 hk i = f(xk). Thus, we have shown that gk is an unbiased estimate of f(xk). Let us proceed with the second moment of gk: Ek gk 2 = Ek αk Int(αk (gk i hk i )) (gk i hk i ) + gk i αk Int(αk (gk i hk i )) (gk i hk i ) " 1 αk Int(αk (gk i hk i )) (gk i hk i ) Lemma 5. If L-SVRG estimator gk i = fil(xk; ξk i ) fil(wk i ; ξk i ) + uk i is used in Int DIANA, we have Ek gk i = fi(xk) and f(xk) f(x ) + 2 nσk 1, (15) Ek σk+1 1 (1 p)σk 1 + p L 2 f(xk) f(x ) , (16) where σk 1 = 1 mn Pn i=1 Pm l=1 fil(wk i ) fil(x ) 2. Proof. Recall that E X E [X] 2 E X 2 for any random variable X. 
For the L-SVRG estimator gk i = fil(xk; ξk i ) fil(wk i ; ξk i ) + uk i , we have: gk i fi(xk) 2L f(xk) f(x ) + 1 fil(xk) fil(wk i ) 1 fil (xk) fil (wk i ) 2L f(xk) f(x ) + 1 fil(xk) fil(wk i ) 2 2L f(xk) f(x ) + 2 fil(xk) fil(x ) 2 + 2 fil(wk i ) fil(x ) 2 (13) 2L + L f(xk) f(x ) + 2 Published as a conference paper at ICLR 2022 where σk 1 = 1 mn Pn i=1 Pm l=1 fil(wk i ) fil(x ) 2. Based on the update of control sequence in L-SVRG, we have: Ek σk+1 1 = 1 p fil(wk i ) fil(x ) 2 + p mn fil(xk) fil(x ) 2 (13) (1 p)σk 1 + p L 2 f(xk) f(x ) . Lemma 6. Define σk 2 def = 1 n Pn i=1 hk i fi(x ) 2. For Int DIANA algorithm, we have: Ek σk+1 2 1 i=1 Ek gk i fi(x ) 2 + For the full gradient, we have 1 n Pn i=1 Ek gk i fi(x ) 2 L 2 (f(xk) f(x )). For the LSVRG estimator, we have 1 n Pn i=1 Ek gk i fi(x ) 2 4σk 1 + 3L(f(xk) f(x )). Proof. We define σk 2 def = 1 n Pn i=1 hk i fi(x ) 2. Consider the step hk+1 i = hk i + Q(gk i ), where Q(gk i ) = 1 αk Int(αk (gk i hk i )). Note that Ek gk i hk i , gk i 2 fi(x ) + hk i = Ek gk i fi(x ) 2 hk i fi(x ) 2 , which explains the last equality below: Ek σk+1 2 = Ek i=1 hk i fi(x ) + Q(gk i ) 2 # " 1 αk Int αk (gk i hk i ) i=1 Ek Q(gk i ), hk i fi(x ) (4) = σk 2 + 1 i=1 Ek gk i hk i 2 + 2 1 i=1 gk i hk i , hk i fi(x ) + i=1 Ek gk i hk i , gk i 2 fi(x ) + hk i + i=1 Ek gk i fi(x ) 2 + For the full gradient gk i = fi(xk), we have: i=1 Ek gk i fi(x ) 2 = 1 i=1 Ek fi(xk) fi(x ) 2 L 2 (f(xk) f(x )). For the L-SVRG estimator, we have by Young s inequality: i=1 Ek gk i fi(x ) 2 2 i=1 Ek gk i fi(xk) 2 + 2 i=1 fi(xk) fi(x ) 2 l=1 fil(xk) fil(wk i ) 2 + L(f(xk) f(x )) l=1 fil(xk) fil(x ) 2 + 4 mn l=1 fil(wk i ) fil(x ) 2 + L(f(xk) f(x )) (13) 4σk 1 + 3L(f(xk) f(x )). Published as a conference paper at ICLR 2022 Lemma 7. Suppose that Assumption 5 holds. Besides, we assume that f( ) is µ-strongly convex (µ 0). For Int DIANA with adaptive αk = ηk d n xk xk 1 and GD gradient estimator, we have: Ek xk+1 x 2 + Ek xk+1 xk 2 (1 ηkµ) xk x 2 + 1 2 xk xk 1 2 2ηk(1 2ηk L)(f(xk) f(x )), Ek σk+1 2 L 2 (f(xk) f(x )) + n xk xk 1 2. For Int DIANA with adaptive αk and L-SVRG gradient estimator, we have: Ek xk+1 x 2 + Ek xk+1 xk 2 (1 ηkµ) xk x 2 + 1 2 xk xk 1 2 2ηk (f(xk) f(x )) + 4η2 k n σk 1, Ek σk+1 1 (1 p)σk 1 + p L 2 (f(xk) f(x )), Ek σk+1 2 4σk 1 + 3L(f(xk) f(x )) + n xk xk 1 2. Proof. By µ-strong convexity, we have: i=1 Q(gk i ), xk x i=1 Ek[gk i ], xk x 2ηk(f(xk) f(x )) ηkµ xk x 2. Besides, xk+1 xk 2 = 2η2 k gk 2 xk+1 xk 2, so Ek xk+1 x 2 + Ek xk+1 xk 2 = (1 ηkµ) xk x 2 2ηk(f(xk) f(x )) + 2η2 k Ek h gk 2i (14) (1 ηkµ) xk x 2 2ηk(f(xk) f(x )) + 2η2 k Ek η2 k α2 k,j Applying Proposition 3 to the obtained bound results in the following recursion Ek xk+1 x 2 + Ek xk+1 xk 2 (1 ηkµ) xk x 2 + 1 2Ek xk xk 1 2 2ηk(f(xk) f(x )) + 2η2 k Ek With the GD estimator, the produced bound simplifies to Ek xk+1 x 2 + Ek xk+1 xk 2 (1 ηkµ) xk x 2 + 1 2 xk xk 1 2 2ηk(1 2ηk L)(f(xk) f(x )). Based on Lemma 6, the following is satisfied for Int DIANA with GD estimator and adaptive αk = ηk d n xk xk 1 : Ek σk+1 2 L 2 (f(xk) f(x )) + n xk xk 1 2. In turn, Lemma 5 gives for L-SVRG estimator Ek h 1 n Pn i=1 gk i 2i 2L + L n f(xk) f(x ) + 2 nσk 1, so we can derive that Ek xk+1 x 2 + Ek xk+1 xk 2 (1 ηkµ) xk x 2 + 1 2 xk xk 1 2 2ηk (f(xk) f(x )) + 4η2 k n σk 1. Published as a conference paper at ICLR 2022 Let us now combine Equation (16) and Lemma 6: Ek σk+1 1 (1 p)σk 1 + p L 2 (f(xk) f(x )), Ek σk+1 2 4σk 1 + 3L(f(xk) f(x )) + n xk xk 1 2. Lemma 8. 
We define the Lyapunov function as Ψk def = xk x 2 + xk xk 1 2 + c1η2 kσk 1 + c2η2 kσk 2. Assume that the conditions of Lemma 7 hold. If µ > 0, we have: E Ψk+1 θE Ψk , where θ def = max (1 ηkµ), 3 4 , c1 = 0, c2 = L2 4n and ηk 1 2(L+ L 32n ) for Int DIANA with GD estimator. Alternatively, for Int DIANA with L-SVRG estimator, we have θ def = max (1 ηkµ), 3 4, 1 3 8m and set c1 = 8m n , c2 = L2 4n, p = 1 m, and ηk 1 2(L+2L/n). If µ = 0, we have ηk E f(xk) f(x ) E Ψk E Ψk+1 , where ηk 1 4(L+ L 32n ) for the GD variant and ηk 1 4(L+2L/n) for the L-SVRG variant. Proof. We define the Lyapunov function as Ψk def = xk x 2 + xk xk 1 2 + c1η2 kσk 1 + c2η2 kσk 2. For Int DIANA with GD estimator, we can set c1 = 0 and derive the following inequality from Lemma 7 and ηk+1 ηk: E Ψk+1 (1 ηkµ)E xk x 2 + 1 2 + c2η2 kn E xk xk 1 2 E f(xk) f(x ) . We first consider µ > 0 case. Let c2 = L2 4n, and ηk 1 2(L+ L 32n ). We have E Ψk+1 max (1 ηkµ), 3 For Int DIANA with L-SVRG estimator, we have the following based on Lemma 7: E Ψk+1 (1 ηkµ)E xk x 2 + 1 2 + c2η2 kn E xk xk 1 2 n + 4c2 + (1 p)c1 E f(xk) f(x ) Let c1 = 8m n , c2 = L2 4n, p = 1 m, and ηk 1 2(L+2L/n). Plugging these values into the recursion, we get E Ψk+1 max (1 ηkµ), 3 4, 1 3 8m E Ψk . If µ = 0, we instead let ηk 1 4(L+ L 32n ) for the GD variant and ηk 1 4(L+2L/n) for the L-SVRG variant to obtain from the same recursions: ηk E f(xk) f(x ) E Ψk E Ψk+1 . C DETAILS AND ADDITIONAL RESULTS OF EXPERIMENTS C.1 MORE DETAILS Here we provide more details of our experimental setting. We use the learning rate scaling technique (Goyal et al., 2017; Vogels et al., 2019) with 5 warm-up epochs. As suggested in previous works Published as a conference paper at ICLR 2022 (Vogels et al., 2019; Alistarh et al., 2017; Horváth et al., 2019), we tune the initial single-worker learning rate on the full-precision SGD and then apply it to Power SGD, QSGD, and Nat SGD. For the task of training Res Net18 on the CIFAR-10 dataset, we utilize momentum β = 0.9 and weight decay with factor 10 4 (except the Batchnorm parameters) for all algorithms. All algorithms run for 300 epochs. The learning rate decays by 10 times at epoch 150 and 250. The initial learning rate is tuned in the range {0.05, 0.1, 0.2, 0.5} and we choose 0.1. For the task of training a 3-layer LSTM, all algorithms run for 90 epochs. We set the size of word embeddings to 650, the sequence length to 30, the number of hidden units per layer to 650, and the dropout rate to 0.4. Besides, we tie the word embedding and softmax weights. We tune the initial learning rate in the range of {0.6, 1.25, 2.5, 5} and we choose 1.25. For both tasks, we report the results based on 3 repetitions with random seeds {0, 1, 2}. To measure the time of computation, communication, and compression/decompression of the algorithms, we use the timer (a Python context manager) implemented in the Power SGD code6. For Power SGD, we use rank = 2 in the task of training Res Net18 on the CIFAR-10 dataset and rank = 4 in the task of training LSTM on the Wikitext-2 dataset as suggested by Vogels et al. (2019). For QSGD, we use the gradient matrix of each layer as a bucket and set the number of quantization levels to be 64 (6-bit). C.2 TOY EXPERIMENT ON TIMINGS #Coordinates All-Reduce Time (MS) 8 Nodes x 2 GPUs/Node Figure 2: Time of communicating FP32 and Int8 messages based on all-reduce. 
C.2 TOY EXPERIMENT ON TIMINGS

[Figure 2: Time of communicating FP32 and Int8 messages based on all-reduce (all-reduce time in ms versus the number of coordinates; 8 nodes × 2 GPUs/node).]

Figure 2 shows the different ways in which PowerSGD and IntSGD save communication time under all-reduce compared to full-precision SGD: 1) IntSGD (8-bit) communicates the data with the int8 data type but does not reduce the number of coordinates; 2) PowerSGD does not change the data type but breaks one communication round over a large number of coordinates into three communication rounds over much smaller numbers of coordinates.

C.3 CONVERGENCE CURVES OF THE DEEP LEARNING TASKS

Please see Figure 3 and Figure 4.

C.4 SENSITIVITY ANALYSIS OF HYPERPARAMETERS

We analyze the sensitivity of IntSGD to its hyperparameters β and ε. As shown in Figure 5, the performance of IntSGD is quite stable across the choices β ∈ {0.0, 0.3, 0.6, 0.9} and ε ∈ {10⁻⁴, 10⁻⁶, 10⁻⁸} on the two considered tasks. Overall, β = 0.9 and ε = 10⁻⁸ is a good default setting for our IntSGD algorithm.

[Figure 3: Convergence curves (including test accuracy) of IntSGD (Random), IntSGD (Determ.), and the baseline algorithms (SGD, PowerSGD (EF), QSGD, NatSGD) on the task of training ResNet18 on the CIFAR-10 dataset; x-axis: epochs (0–300).]

[Figure 4: Convergence curves (including test perplexity) of IntSGD (Random), IntSGD (Determ.), and the baseline algorithms (SGD, PowerSGD (EF), QSGD, NatSGD) on the task of training a 3-layer LSTM on the Wikitext-2 dataset; x-axis: epochs (0–90).]

[Figure 5: Test accuracy (on the image classification task) and test loss (on the language modeling task) of IntSGD under different hyperparameters β ∈ {0.0, 0.3, 0.6, 0.9} and ε ∈ {10⁻⁴, 10⁻⁶, 10⁻⁸}; ↑ or ↓ denotes "the larger, the better" or vice versa. Test accuracy varies only between 94.3% and 94.7%, and test loss only between 4.57 and 4.60.]

C.5 LOGISTIC REGRESSION EXPERIMENT

Setup: We run the experiments on the ℓ2-regularized logistic regression problem with four datasets (a5a, mushrooms, w8a, real-sim) from the LibSVM repository (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html), where
$$f_i(x) = \frac{1}{m}\sum_{l=1}^m \log\big(1 + \exp(-A_{i,l}^\top x\, b_{i,l})\big) + \frac{\lambda_2}{2}\|x\|^2,$$
$x \in \mathbb{R}^d$, $\lambda_2$ is chosen proportionally to $\frac{1}{mn}$, and $A_{i,l} \in \mathbb{R}^d$, $b_{i,l} \in \{-1, 1\}$ are the feature vector and label of the $l$-th data point on the $i$-th worker. The experiments are performed on a machine with 24 Intel(R) Xeon(R) Gold 6246 CPU @ 3.30GHz cores, where 12 cores are connected to each socket (there are two sockets in total). All experiments use 12 CPU cores, and each core is utilized as a worker. The communication is implemented with the MPI4PY library (Dalcín et al., 2005). The optimum $x^*$ is obtained by running GD on the whole dataset using one CPU core until either 5000 iterations are reached or $\|\nabla f(x)\|^2 \le 10^{-30}$.
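For reference, the local objective and the 5% minibatch stochastic gradient used on each worker can be written in a few lines of NumPy. This is a self-contained sketch under our own naming (local_loss, minibatch_grad), not the code used for the experiments.

```python
import numpy as np

def local_loss(x, A, b, lam2):
    """l2-regularized logistic loss on one worker: A is m x d, b is in {-1, +1}^m."""
    z = b * (A @ x)
    return np.mean(np.logaddexp(0.0, -z)) + 0.5 * lam2 * (x @ x)

def minibatch_grad(x, A, b, lam2, batch_frac=0.05, rng=None):
    """Stochastic gradient g_i^k computed on a uniformly sampled 5% minibatch."""
    rng = np.random.default_rng() if rng is None else rng
    m = A.shape[0]
    idx = rng.choice(m, size=max(1, int(batch_frac * m)), replace=False)
    A_b, b_b = A[idx], b[idx]
    z = b_b * (A_b @ x)
    # derivative of log(1 + exp(-z)) w.r.t. x is -b * a / (1 + exp(z))
    coef = -b_b / (1.0 + np.exp(z))
    return A_b.T @ coef / len(idx) + lam2 * x
```

The full local gradient $\nabla f_i(x^k)$ used by IntGD and by IntDIANA with the GD estimator corresponds to batch_frac = 1.0, i.e., to using all m local data points.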
Table 4: Information of the experiments on ℓ2-regularized logistic regression.

Dataset      #Instances N    Dimension d    λ2
a5a          6,414           123            5 × 10⁻⁴
mushrooms    8,124           112            6 × 10⁻⁴
w8a          49,749          300            10⁻⁴
real-sim     72,309          20,958         5 × 10⁻⁵

The whole dataset is split according to its original indices into n folds, and each fold is assigned to a local worker, i.e., the data are heterogeneous. There are m = N/n data points on each worker. For each dataset, we run each algorithm multiple times with 20 random seeds per worker. For the stochastic algorithms, we randomly sample 5% of the local data as a minibatch (i.e., batch size $\tau = m/20$) to estimate the stochastic gradient $g_i^k$ on each worker. We set $p = \tau/m$ in VR-IntDIANA.

Apart from IntSGD with $g_i^k = \nabla f_i(x^k)$ (IntGD), we also evaluate IntDIANA (Algorithm 3) with the GD or L-SVRG estimator (called IntDIANA and VR-IntDIANA, respectively).

[Figure 6: Objective gaps (top row) and the largest integer in the aggregated vector $\sum_{i=1}^n Q(g_i^k)$ (bottom row) against the number of gradient evaluations (#grad/mn) for IntGD, IntDIANA, and VR-IntDIANA on a5a, mushrooms, w8a, and real-sim with n = 12 workers.]

As shown in Figure 6, IntSGD with $g_i^k = \nabla f_i(x^k)$ (IntGD) suffers from low compression efficiency: the largest integer in the aggregated vector $\sum_{i=1}^n Q(g_i^k)$ becomes very large. IntDIANA resolves this issue and requires fewer than 3 bits per coordinate in communication. IntDIANA with the L-SVRG gradient estimator (VR-IntDIANA) further improves on IntDIANA with the GD estimator in terms of gradient oracle calls.
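To relate the bottom row of Figure 6 to the "fewer than 3 bits per coordinate" observation, one can convert the largest aggregated integer into a per-coordinate bit count. The helper below is our own back-of-the-envelope accounting (one sign bit plus magnitude bits), not the accounting used in the paper.

```python
import math

def bits_per_coordinate(max_abs_int):
    """Bits needed for a signed integer coordinate of magnitude at most max_abs_int."""
    return 1 + max(1, math.ceil(math.log2(max_abs_int + 1)))

# If the largest integer in sum_i Q(g_i^k) stays at 3 or below, each coordinate
# of the aggregated vector fits into 1 + 2 = 3 bits.
print(bits_per_coordinate(3))      # -> 3
print(bits_per_coordinate(10**9))  # -> 31, the regime IntGD ends up in
```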