# On Biased Compression for Distributed Learning

Journal of Machine Learning Research 24 (2023) 1-50. Submitted 12/21; Revised 8/23; Published 10/23.

Aleksandr Beznosikov (anbeznosikov@gmail.com)
Computer, Electrical and Math. Sciences and Engineering Division, King Abdullah University of Science and Technology, 23955, Thuwal, KSA; Skolkovo Institute of Science and Technology, 121205, Moscow, Russia; School of Applied Mathematics and Informatics, Moscow Institute of Physics and Technology, 141701, Moscow, Russia

Samuel Horváth (samuel.horvath@kaust.edu.sa)
Computer, Electrical and Math. Sciences and Engineering Division, King Abdullah University of Science and Technology, 23955, Thuwal, KSA

Peter Richtárik (peter.richtarik@kaust.edu.sa)
Computer, Electrical and Math. Sciences and Engineering Division, King Abdullah University of Science and Technology, 23955, Thuwal, KSA

Mher Safaryan (mher.safaryan.1@kaust.edu.sa)
Computer, Electrical and Math. Sciences and Engineering Division, King Abdullah University of Science and Technology, 23955, Thuwal, KSA

Editor: Silvia Villa

Abstract: In the last few years, various communication compression techniques have emerged as an indispensable tool helping to alleviate the communication bottleneck in distributed learning. However, despite the fact that biased compressors often show superior performance in practice when compared to the much more studied and understood unbiased compressors, very little is known about them. In this work we study three classes of biased compression operators, two of which are new, and their performance when applied to (stochastic) gradient descent and distributed (stochastic) gradient descent. We show for the first time that biased compressors can lead to linear convergence rates both in the single-node and distributed settings.
We prove that the distributed compressed SGD method, employed with an error feedback mechanism, enjoys the ergodic rate $O\left(\delta L \exp\left[-\frac{\mu K}{\delta L}\right] + \frac{C + \delta D}{K\mu}\right)$, where $\delta \ge 1$ is a compression parameter which grows when more compression is applied, $L$ and $\mu$ are the smoothness and strong convexity constants, $C$ captures stochastic gradient noise ($C = 0$ if full gradients are computed on each node) and $D$ captures the variance of the gradients at the optimum ($D = 0$ for over-parameterized models). Further, via a theoretical study of several synthetic and empirical distributions of communicated gradients, we shed light on why and by how much biased compressors outperform their unbiased variants. Finally, we propose several new biased compressors with promising theoretical guarantees and practical performance.

Keywords: compression operators, biased compressors, distributed learning, linear convergence, error feedback

The work in Sections 1-5 was conducted while A. Beznosikov was a research intern in the Optimization and Machine Learning Lab of Peter Richtárik at KAUST; this visit was funded by the KAUST Baseline Research Funding Scheme. The work of A. Beznosikov in Section 6 was conducted in Skoltech and was supported by Ministry of Science and Higher Education grant No. 075-10-2021-068. Alphabetical author order. Samuel Horváth is currently affiliated with Mohamed bin Zayed University of Artificial Intelligence (MBZUAI). Mher Safaryan is currently affiliated with Institute of Science and Technology Austria (ISTA).

©2023 Aleksandr Beznosikov, Samuel Horváth, Peter Richtárik and Mher Safaryan. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v24/21-1548.html.
1. Introduction

In order to achieve state-of-the-art performance, modern machine learning models need to be trained using large corpora of training data, and they often feature an even larger number of trainable parameters (Vaswani et al., 2019; Brown et al., 2020). The data is typically either collected in a distributed manner and stored across a network of edge devices, as is the case in federated learning (Konečný et al., 2016; McMahan et al., 2017; Li et al., 2019; Kairouz et al., 2019), or collected centrally in a data warehouse composed of a large collection of commodity clusters. In either scenario, communication among the workers is typically the bottleneck. Motivated by the need for more efficient training methods in traditional distributed and emerging federated environments, we consider optimization problems of the form
$$\min_{x\in\mathbb{R}^d}\ f(x) := \frac{1}{n}\sum_{i=1}^{n} f_i(x), \quad (1)$$
where $x \in \mathbb{R}^d$ collects the parameters of a statistical model to be trained, $n$ is the number of workers/devices, and $f_i(x)$ is the loss incurred by model $x$ on the data stored on worker $i$. The loss function $f_i : \mathbb{R}^d \to \mathbb{R}$ often has the form $f_i(x) := \mathbb{E}_{\xi\sim P_i}[f_\xi(x)]$, with $P_i$ being the distribution of the training data owned by worker $i$. In federated learning applications these local distributions can be very different, and we do not impose any similarity assumption on them.

1.1 Distributed optimization

A fundamental baseline for solving problem (1) is (distributed) gradient descent (GD), performing updates of the form
$$x^{k+1} = x^k - \frac{\eta_k}{n}\sum_{i=1}^{n}\nabla f_i(x^k),$$
where $\eta_k > 0$ is a stepsize.
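The distributed GD baseline above is straightforward to simulate. Below is a minimal sketch on synthetic per-worker quadratic losses; the names (`A`, `b`, `grad_i`, `distributed_gd`) and the problem data are illustrative, not from the paper:

```python
import numpy as np

# Synthetic setup: n workers, each holding a local quadratic loss
# f_i(x) = 0.5 * ||A_i x - b_i||^2 (illustrative stand-in for the f_i in (1)).
rng = np.random.default_rng(0)
n, d = 4, 5
A = [rng.standard_normal((10, d)) for _ in range(n)]
b = [rng.standard_normal(10) for _ in range(n)]

def grad_i(i, x):
    """Gradient of worker i's loss at x."""
    return A[i].T @ (A[i] @ x - b[i])

def distributed_gd(x0, eta, steps):
    """Baseline update x^{k+1} = x^k - (eta/n) * sum_i grad f_i(x^k)."""
    x = x0.copy()
    for _ in range(steps):
        g = sum(grad_i(i, x) for i in range(n)) / n
        x = x - eta * g
    return x

x_star = distributed_gd(np.zeros(d), eta=0.02, steps=3000)
# At the minimizer of the average loss, the averaged gradient vanishes.
avg_grad_norm = np.linalg.norm(sum(grad_i(i, x_star) for i in range(n)) / n)
```

In a real distributed system each `grad_i` call would happen on a separate worker, and the full gradient vector would be communicated each round — which is exactly the cost that the compression techniques discussed next aim to reduce.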
Due to the communication issues inherent to distributed systems, several enhancements to this baseline have been proposed that better deal with the communication cost challenges of distributed environments, including acceleration (Nesterov, 2013; Beck and Teboulle, 2009; Allen-Zhu, 2017), which reduces the number of iterations via momentum; local methods (McMahan et al., 2017; Khaled et al., 2020a; Karimireddy et al., 2019a), which reduce the number of communication rounds by performing multiple local updates before each communication round; and communication compression (Seide et al., 2014; Alistarh et al., 2017; Zhang et al., 2017; Lim et al., 2018; Alistarh et al., 2018; Lin et al., 2018; Safaryan and Richtárik, 2021), which reduces the size of the communicated messages via compression operators.

1.2 Contributions

In this paper we contribute to a better understanding of the latter approach to alleviating the communication bottleneck: communication compression. In particular, we study the theoretical properties of gradient-type methods which employ biased gradient compression operators, such as Top-k sparsification (Alistarh et al., 2018) or deterministic rounding (Sapio et al., 2019). Surprisingly, the current theoretical understanding of such methods is very limited. For instance, there is no general theory of such methods even in the n = 1 case; only a handful of biased compression techniques have been proposed in the literature; and we do not have any theoretical understanding of why, when, and by how much biased compression operators can outperform their unbiased counterparts. More importantly, there is no good convergence theory for any gradient-type method with biased compression in the crucial n > 1 setting. In this work we address all of the above problems.
In particular, our main contributions are:

(a) We define and study three parametric classes of biased compression operators (see Section 2), which we denote B1(α, β), B2(γ, β) and B3(δ), the first two of which are new. We prove that, despite being alternative parameterizations of the same collection of operators, the last two classes are more favorable than the first, thus highlighting the importance of the parameterization and providing further reductions. We show how the commonly used class of unbiased compression operators, which we denote U(ζ), relates to these biased classes. We also study scalings and compositions of such compressors.

(b) We then proceed to give a long list of new and known biased (and some unbiased) compression operators which belong to the above classes in Section 2.2. A summary of all compressors considered can be found in Table 3.

(c) In Section 3 we analyze compressed GD in the n = 1 case for compressors belonging to all three classes, under smoothness and strong convexity assumptions. Our theorems tightly generalize existing results which hold for unbiased operators, and also recover the rate of GD in this regime. Our linear convergence results are summarized in Table 1.

(d) We ask the question: do biased compressors outperform their unbiased counterparts in theory, and if so, by how much? We answer this question by studying the performance of compressors under various synthetic and empirical statistical assumptions on the distribution of the entries of the gradient vectors which need to be compressed. In particular, we quantify the gains of the Top-k sparsifier when compared against the unbiased Rand-k sparsifier in Section 4.

Here we refer to the initial online appearance of our work in February 2020, after which several enhancements were developed. See Section 1.3 for more details.
| | C ∈ B1(α, β) | C ∈ B2(γ, β) | C ∈ B3(δ) |
|---|---|---|---|
| Theorem | Theorem 17 | Theorem 18 | Theorem 19 |
| Complexity | $O\left(\frac{\beta^2}{\alpha}\frac{L}{\mu}\log\frac{1}{\varepsilon}\right)$ | $O\left(\frac{\beta}{\gamma}\frac{L}{\mu}\log\frac{1}{\varepsilon}\right)$ | $O\left(\delta\frac{L}{\mu}\log\frac{1}{\varepsilon}\right)$ |

Table 1: Complexity results for GD with biased compression. The identity compressor C(x) ≡ x belongs to all classes with α = β = γ = δ = 1; all three results recover the standard rate of GD.

Table 2: Ergodic convergence of distributed SGD with biased compression and error feedback (Algorithm 1) for L-smooth and µ-strongly convex functions (K communication rounds), under three stepsize and iterate-weighting schedules. Details are given in Theorem 21.

(e) Finally, we study the important n > 1 setting in Section 5 and argue, by giving a counter-example, that a naive application of biased compression to distributed GD might diverge. We then show that the distributed SGD method equipped with an error-feedback mechanism (Stich and Karimireddy, 2019) provably handles biased compressors. In our main result (Theorem 21; also see Table 2) we consider three stepsize schedules and iterate averaging schemes, yielding three distinct convergence rates. Our analysis provides the first convergence guarantee for a distributed gradient-type method which provably converges with biased compressors, and we thus solve a major open problem in the literature.

1.3 Related work

There has been extensive work related to communication compression, mostly focusing on unbiased compressors (Alistarh et al., 2017), as these are much easier to analyze. In particular, it was shown (Gorbunov et al., 2020a) that both the classical method with unbiased compression (Alistarh et al., 2017) and more advanced modifications (Mishchenko et al., 2019; Horváth et al., 2019b) can be viewed as special versions of SGD. Subsequently, the results of Gorbunov et al. (2020a) for strongly convex problems were extended to general convex (Khaled et al., 2020b) and non-convex (Li and Richtárik, 2020) target functions.
In the meantime, works concerning biased compression show stronger empirical results, but with limited or no analysis (Vogels et al., 2019; Lin et al., 2017a; Sun et al., 2019). There have been several attempts to address this issue: e.g., Wu et al. (2018) provided an analysis for quadratics in the distributed setting, and Zhao et al. (2019) gave an analysis for momentum SGD with a specific biased compression, but under unreasonably strong assumptions, namely bounded gradient norm and bounded memory. The first linear convergence result for biased compression was obtained by Karimireddy et al. (2019b), but only for one node and under a bounded gradient norm assumption; the latter restriction was later removed by Stich and Karimireddy (2019). After the initial online appearance of our work, there have been several enhancements in the literature. In particular, Ajalloeian and Stich (2021) developed a theory for non-convex objectives in the single-node setup; Gorbunov et al. (2020b) designed a novel error compensated SGD algorithm converging linearly in a more relaxed setting with the help of an additional unbiased compressor; and Horváth and Richtárik (2021) proposed a simple trick to convert any biased compressor into a corresponding induced (unbiased) compressor, leading to improved theoretical guarantees. Recently, a new variant of the error feedback mechanism was introduced in (Richtárik et al., 2021; Fatkhullin et al., 2021), showing improved rates for distributed non-convex problems.

1.4 Basic notation and definitions

We use $\langle x, y\rangle := \sum_{i=1}^d x_i y_i$ to denote the standard inner product of $x, y \in \mathbb{R}^d$, where $x_i$ corresponds to the $i$-th component of $x$ in the standard basis of $\mathbb{R}^d$. This induces the $\ell_2$-norm on $\mathbb{R}^d$ via $\|x\|_2 := \sqrt{\langle x, x\rangle}$. We denote $\ell_p$-norms by $\|x\|_p := \left(\sum_{i=1}^d |x_i|^p\right)^{1/p}$ for $p \in (1, \infty)$. By $\mathbb{E}[\cdot]$ we denote mathematical expectation. We say that a given differentiable function $f : \mathbb{R}^d \to \mathbb{R}$ is $L$-smooth if
$$f(y) \le f(x) + \langle\nabla f(x), y - x\rangle + \frac{L}{2}\|y - x\|_2^2, \quad \forall x, y \in \mathbb{R}^d.$$
We say that it is $\mu$-strongly convex if
$$f(y) \ge f(x) + \langle\nabla f(x), y - x\rangle + \frac{\mu}{2}\|y - x\|_2^2, \quad \forall x, y \in \mathbb{R}^d.$$

2. Biased Compressors

By a compression operator we mean a (possibly random) mapping $C : \mathbb{R}^d \to \mathbb{R}^d$ satisfying some constraints. Typically, the literature considers unbiased compression operators $C$ with a bounded second moment:

Definition 1 Let $\zeta \ge 1$. We say that $C \in U(\zeta)$ if $C$ is unbiased (i.e., $\mathbb{E}[C(x)] = x$ for all $x$) and if the second moment is bounded as
$$\mathbb{E}\left[\|C(x)\|_2^2\right] \le \zeta\|x\|_2^2, \quad \forall x \in \mathbb{R}^d. \quad (2)$$
Note that (2) can also be written as $\mathbb{E}[\|C(x) - x\|_2^2] \le (\zeta - 1)\|x\|_2^2$.

2.1 Three classes of biased compressors

We instead focus on understanding biased compression operators, or compressors in short. We now introduce three classes of biased compressors, the first two of which are new; they can be seen as natural extensions of the class of unbiased compressors.

Definition 2 We say that $C \in B_1(\alpha, \beta)$ for some $\alpha, \beta > 0$ if
$$\alpha\|x\|_2^2 \le \mathbb{E}\left[\|C(x)\|_2^2\right] \le \beta\,\langle\mathbb{E}[C(x)], x\rangle, \quad \forall x \in \mathbb{R}^d. \quad (3)$$

As we show next, the second inequality in (3) implies $\mathbb{E}[\|C(x)\|_2^2] \le \beta^2\|x\|_2^2$.

Lemma 3 For any $x \in \mathbb{R}^d$, if $\mathbb{E}[\|C(x)\|_2^2] \le \beta\langle\mathbb{E}[C(x)], x\rangle$, then
$$\mathbb{E}\left[\|C(x)\|_2^2\right] \le \beta^2\|x\|_2^2. \quad (4)$$

Proof Fix any $x \in \mathbb{R}^d$. Applying Jensen's inequality, the second inequality in (3), and Cauchy-Schwarz, we get
$$\|\mathbb{E}[C(x)]\|_2^2 \le \mathbb{E}\left[\|C(x)\|_2^2\right] \overset{(3)}{\le} \beta\langle\mathbb{E}[C(x)], x\rangle \le \beta\,\|\mathbb{E}[C(x)]\|_2\|x\|_2. \quad (5)$$
If $\mathbb{E}[C(x)] \ne 0$, this implies $\|\mathbb{E}[C(x)]\|_2 \le \beta\|x\|_2$. Plugging this back into (5), we get (4). If $\mathbb{E}[C(x)] = 0$, then from (3) we see that $\mathbb{E}[\|C(x)\|_2^2] = 0$, and (4) holds trivially.

In the second class, we require the inner product between the uncompressed vector $x$ and the compressed vector $C(x)$ to dominate the squared norms of both vectors in expectation.

Definition 4 We say that $C \in B_2(\gamma, \beta)$ for some $\gamma, \beta > 0$ if
$$\max\left\{\gamma\|x\|_2^2,\ \tfrac{1}{\beta}\mathbb{E}\left[\|C(x)\|_2^2\right]\right\} \le \langle\mathbb{E}[C(x)], x\rangle, \quad \forall x \in \mathbb{R}^d. \quad (6)$$

Finally, in the third class, we require the compression error $\|C(x) - x\|_2^2$ to be strictly smaller than the squared norm $\|x\|_2^2$ of the input vector $x$ in expectation.

Definition 5 We say $C \in B_3(\delta)$ for some $\delta > 0$ if
$$\mathbb{E}\left[\|C(x) - x\|_2^2\right] \le \left(1 - \tfrac{1}{\delta}\right)\|x\|_2^2, \quad \forall x \in \mathbb{R}^d. \quad (7)$$
This last definition was also considered by Stich et al. (2018) and Cordonnier (2018). All three definitions require the compressed vector $C(x)$ to lie in a neighborhood of the uncompressed vector $x$, so that the initial information is preserved with some accuracy. We now establish several basic properties of and connections between the classes. We first show that the three classes of biased compressors defined above are equivalent in the following sense: a compressor from any of these classes can be shown to belong to all three classes with different parameters and after possible scaling.

Theorem 6 (Equivalence between biased compressors) Let $\lambda > 0$ be a free scaling parameter.
1. If $C \in B_1(\alpha, \beta)$, then (i) $\beta^2 \ge \alpha$ and $\lambda C \in B_1(\lambda^2\alpha, \lambda\beta)$, and (ii) $C \in B_2(\alpha/\beta, \beta)$ and $\frac{1}{\beta}C \in B_3(\beta^2/\alpha)$.
2. If $C \in B_2(\gamma, \beta)$, then (i) $\beta \ge \gamma$ and $\lambda C \in B_2(\lambda\gamma, \lambda\beta)$, and (ii) $C \in B_1(\gamma^2, \beta)$ and $\frac{1}{\beta}C \in B_3(\beta/\gamma)$.
3. If $C \in B_3(\delta)$, then (i) $\delta \ge 1$, and (ii) $C \in B_2\left(\frac{1}{2\delta}, 2\right)$.

Proof Let us prove the implications for each class separately.

1. Case $C \in B_1(\alpha, \beta)$:
(i) From the first inequality in (3) together with Lemma 3 we get $\alpha\|x\|_2^2 \le \mathbb{E}[\|C(x)\|_2^2] \le \beta^2\|x\|_2^2$ for all $x$, whence $\alpha \le \beta^2$. The scaling property $\lambda C \in B_1(\lambda^2\alpha, \lambda\beta)$ follows directly from (3).
(ii) The inclusion $C \in B_2(\alpha/\beta, \beta)$ follows directly from inequalities (3) and (6). In view of (i), $\lambda C \in B_1(\lambda^2\alpha, \lambda\beta)$. If we choose $\lambda \le \frac{2}{\beta}$, then
$$\mathbb{E}\left[\|\lambda C(x) - x\|_2^2\right] = \mathbb{E}\left[\|\lambda C(x)\|_2^2\right] - 2\langle\mathbb{E}[\lambda C(x)], x\rangle + \|x\|_2^2 \overset{(3)}{\le} (\beta\lambda - 2)\langle\mathbb{E}[\lambda C(x)], x\rangle + \|x\|_2^2 \overset{(3)}{\le} (\beta\lambda - 2)\frac{\alpha\lambda^2}{\beta\lambda}\|x\|_2^2 + \|x\|_2^2 = \left(1 - \frac{2\alpha\lambda}{\beta} + \alpha\lambda^2\right)\|x\|_2^2.$$
Minimizing the above expression in $\lambda$, we get $\lambda = \frac{1}{\beta}$, for which $\mathbb{E}\left[\left\|\frac{1}{\beta}C(x) - x\right\|_2^2\right] \le \left(1 - \frac{\alpha}{\beta^2}\right)\|x\|_2^2$, i.e., $\frac{1}{\beta}C \in B_3(\beta^2/\alpha)$.

2. Case $C \in B_2(\gamma, \beta)$:
(i) Using (6) twice, we get
$$\gamma\|x\|_2^2 \le \langle\mathbb{E}[C(x)], x\rangle \le \|\mathbb{E}[C(x)]\|_2\|x\|_2 \le \sqrt{\mathbb{E}\left[\|C(x)\|_2^2\right]}\,\|x\|_2 \le \sqrt{\beta\langle\mathbb{E}[C(x)], x\rangle}\,\|x\|_2,$$
where the first and last inequalities follow from (6), and the middle ones from the Cauchy-Schwarz and Jensen inequalities. Squaring the bound $\langle\mathbb{E}[C(x)], x\rangle \le \sqrt{\beta\langle\mathbb{E}[C(x)], x\rangle}\,\|x\|_2$ gives $\langle\mathbb{E}[C(x)], x\rangle \le \beta\|x\|_2^2$, and combining with the first inequality yields $\gamma \le \beta$. The scaling property $\lambda C \in B_2(\lambda\gamma, \lambda\beta)$ follows directly from (6).
(ii) If $C \in B_2(\gamma, \beta)$, then $\mathbb{E}[\|C(x)\|_2^2] \le \beta\langle\mathbb{E}[C(x)], x\rangle$ by (6), and
$$\gamma^2\|x\|_2^4 \overset{(6)}{\le} \langle\mathbb{E}[C(x)], x\rangle^2 \le \|\mathbb{E}[C(x)]\|_2^2\|x\|_2^2 \le \mathbb{E}\left[\|C(x)\|_2^2\right]\|x\|_2^2,$$
where the second inequality is Cauchy-Schwarz and the third is Jensen. Therefore, $C \in B_1(\gamma^2, \beta)$. Further, for any $\lambda > 0$ we get
$$\mathbb{E}\left[\|\lambda C(x) - x\|_2^2\right] = \lambda^2\mathbb{E}\left[\|C(x)\|_2^2\right] - 2\lambda\langle\mathbb{E}[C(x)], x\rangle + \|x\|_2^2 \overset{(6)}{\le} (\lambda\beta - 2)\lambda\langle\mathbb{E}[C(x)], x\rangle + \|x\|_2^2.$$
If we choose $\lambda = \frac{1}{\beta}$, then we can continue as follows:
$$\mathbb{E}\left[\left\|\tfrac{1}{\beta}C(x) - x\right\|_2^2\right] \le -\tfrac{1}{\beta}\langle\mathbb{E}[C(x)], x\rangle + \|x\|_2^2 \overset{(6)}{\le} \left(1 - \tfrac{\gamma}{\beta}\right)\|x\|_2^2,$$
whence $\frac{1}{\beta}C \in B_3(\beta/\gamma)$.

3. Case $C \in B_3(\delta)$:
(i) Pick $x \ne 0$. Since $0 \le \mathbb{E}[\|C(x) - x\|_2^2] \le \left(1 - \frac{1}{\delta}\right)\|x\|_2^2$ and we assume $\delta > 0$, we must necessarily have $\delta \ge 1$.
(ii) If $C \in B_3(\delta)$, then expanding the square in (7) gives
$$\mathbb{E}\left[\|C(x)\|_2^2\right] - 2\langle\mathbb{E}[C(x)], x\rangle + \|x\|_2^2 \le \left(1 - \tfrac{1}{\delta}\right)\|x\|_2^2,$$
which implies both $\frac{1}{2\delta}\|x\|_2^2 \le \langle\mathbb{E}[C(x)], x\rangle$ and $\mathbb{E}[\|C(x)\|_2^2] \le 2\langle\mathbb{E}[C(x)], x\rangle$. Therefore, $C \in B_2\left(\frac{1}{2\delta}, 2\right)$.

Next, we show that, with a proper scaling, any unbiased compressor also belongs to all three classes of biased compressors.

Theorem 7 (From unbiased to biased with scaling) If $C \in U(\zeta)$, then for the scaled operator $\lambda C$ we have (i) $\lambda C \in B_1(\lambda^2, \lambda\zeta)$ for any $\lambda > 0$, (ii) $\lambda C \in B_2(\lambda, \lambda\zeta)$ for any $\lambda > 0$, and (iii) $\lambda C \in B_3\left(\frac{1}{\lambda(2 - \zeta\lambda)}\right)$ for $\zeta\lambda < 2$.

Proof Let $C \in U(\zeta)$.
(i) Given any $\lambda > 0$, consider the scaled operator $\lambda C$. We have
$$\lambda^2\|x\|_2^2 = \|\mathbb{E}[\lambda C(x)]\|_2^2 \le \mathbb{E}\left[\|\lambda C(x)\|_2^2\right] \le \lambda^2\zeta\|x\|_2^2 = \lambda\zeta\langle\mathbb{E}[\lambda C(x)], x\rangle,$$
whence $\lambda C \in B_1(\lambda^2, \lambda\zeta)$.
(ii) Given any $\lambda > 0$, we have $\lambda\|x\|_2^2 = \langle\mathbb{E}[\lambda C(x)], x\rangle$ and $\mathbb{E}[\|\lambda C(x)\|_2^2] \le \lambda^2\zeta\|x\|_2^2 = \lambda\zeta\langle\mathbb{E}[\lambda C(x)], x\rangle$, whence $\lambda C \in B_2(\lambda, \lambda\zeta)$.
(iii) Given $\lambda > 0$ such that $\lambda\zeta < 2$, consider the scaled operator $\lambda C$.
We have
$$\mathbb{E}\left[\|\lambda C(x) - x\|_2^2\right] = \mathbb{E}\left[\|\lambda C(x)\|_2^2\right] - 2\langle\mathbb{E}[\lambda C(x)], x\rangle + \|x\|_2^2 \le (\zeta\lambda^2 - 2\lambda + 1)\|x\|_2^2,$$
whence $\lambda C \in B_3\left(\frac{1}{\lambda(2 - \zeta\lambda)}\right)$.

2.2 Examples of biased compressors: old and new

We now give some examples of compression operators belonging to the classes B1, B2, B3 and U. Several of them are new. For a summary, refer to Table 3.

| Compression operator C | Unbiased? | α | β | γ | δ | ζ |
|---|---|---|---|---|---|---|
| Unbiased random sparsification | yes | — | — | — | — | $d/k$ |
| Biased random sparsification [NEW] | no | $q$ | $1$ | $q$ | $1/q$ | — |
| Adaptive random sparsification [NEW] | no | $1/d$ | $1$ | $1/d$ | $d$ | — |
| Top-$k$ sparsification (Alistarh et al., 2018) | no | $k/d$ | $1$ | $k/d$ | $d/k$ | — |
| General unbiased rounding [NEW] | yes | — | — | — | — | $\frac{1}{4}\sup_k\left(\frac{a_k}{a_{k+1}} + \frac{a_{k+1}}{a_k} + 2\right)$ |
| Unbiased exponential rounding [NEW] | yes | — | — | — | — | $\frac{(b+1)^2}{4b}$ |
| Biased exponential rounding [NEW] | no | $\left(\frac{2}{b+1}\right)^2$ | $\frac{2b}{b+1}$ | $\frac{2}{b+1}$ | $\frac{(b+1)^2}{4b}$ | — |
| Natural compression (Horváth et al., 2019a) | yes | — | — | — | — | $9/8$ |
| General exponential dithering [NEW] | yes | — | — | — | — | $\zeta_b$ |
| Natural dithering (Horváth et al., 2019a) | yes | — | — | — | — | $\zeta_2$ |
| Top-$k$ + exponential dithering [NEW] | no | $k/d$ | $\zeta_b$ | $k/d$ | $\zeta_b d/k$ | — |

Table 3: Compressors C described in Section 2.2 and their membership in B1(α, β), B2(γ, β), B3(δ) and U(ζ).

(a) For $k \in [d] := \{1, \dots, d\}$, the unbiased random (aka Rand-$k$) sparsification operator is defined via
$$C(x) := \frac{d}{k}\sum_{i\in S} x_i e_i, \quad (8)$$
where $S \subseteq [d]$ is the $k$-nice sampling, i.e., a subset of $[d]$ of cardinality $k$ chosen uniformly at random, and $e_1, \dots, e_d$ are the standard unit basis vectors in $\mathbb{R}^d$.

Lemma 8 The Rand-$k$ sparsifier (8) belongs to $U(d/k)$.

(b) Let $S \subseteq [d]$ be a random set with probability vector $p := (p_1, \dots, p_d)$, where $p_i := \mathrm{Prob}(i \in S) > 0$ for all $i$ (such a set is called a proper sampling (Richtárik and Takáč, 2016)). Define the biased random sparsification operator via
$$C(x) := \sum_{i\in S} x_i e_i. \quad (9)$$

Lemma 9 Letting $q := \min_i p_i$, the biased random sparsification operator (9) belongs to $B_1(q, 1)$, $B_2(q, 1)$ and $B_3(1/q)$.

(c) Adaptive random sparsification is defined via
$$C(x) := x_i e_i \quad \text{with probability } \frac{|x_i|}{\|x\|_1}. \quad (10)$$

Lemma 10 The adaptive random sparsification operator (10) belongs to $B_1(1/d, 1)$, $B_2(1/d, 1)$ and $B_3(d)$.
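The three sparsifiers defined so far are simple to implement. Below is a minimal NumPy sketch (the function names are ours, not the paper's), together with a Monte Carlo sanity check of Definition 1 for Rand-k:

```python
import numpy as np

rng = np.random.default_rng(1)

def rand_k(x, k):
    """Unbiased Rand-k (8): keep k coordinates chosen uniformly, rescale by d/k."""
    d = x.size
    S = rng.choice(d, size=k, replace=False)
    out = np.zeros(d)
    out[S] = (d / k) * x[S]
    return out

def biased_rand_sparsify(x, p):
    """Biased random sparsification (9): keep coordinate i with probability p[i]."""
    keep = rng.random(x.size) < p
    return np.where(keep, x, 0.0)

def adaptive_rand_sparsify(x):
    """Adaptive random sparsification (10): keep a single coordinate i,
    drawn with probability |x_i| / ||x||_1."""
    probs = np.abs(x) / np.linalg.norm(x, 1)
    i = rng.choice(x.size, p=probs)
    out = np.zeros_like(x)
    out[i] = x[i]
    return out

# Monte Carlo check of Definition 1 for Rand-k with d = 20, k = 5:
# E[C(x)] = x (unbiasedness) and E||C(x)||^2 = (d/k)||x||^2.
x = rng.standard_normal(20)
samples = np.stack([rand_k(x, 5) for _ in range(20000)])
mean_estimate = samples.mean(axis=0)               # ~ x
second_moment = np.mean(np.sum(samples**2, axis=1))  # ~ (20/5) * ||x||^2
```

Note how the biased variants (9) and (10) skip the rescaling that makes Rand-k unbiased; this is exactly what keeps their output inside a ball around x, as required by B3.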
(d) The greedy (aka Top-$k$) sparsification operator is defined via
$$C(x) := \sum_{i=d-k+1}^{d} x_{(i)} e_{(i)}, \quad (11)$$
where the coordinates are ordered by their magnitudes so that $|x_{(1)}| \le |x_{(2)}| \le \dots \le |x_{(d)}|$.

Lemma 11 The Top-$k$ sparsification operator (11) belongs to $B_1(k/d, 1)$, $B_2(k/d, 1)$ and $B_3(d/k)$.

(e) Let $(a_k)_{k\in\mathbb{Z}}$ be an arbitrary increasing sequence of positive numbers such that $\inf a_k = 0$ and $\sup a_k = \infty$. Then general unbiased rounding $C$ is defined as follows: if $a_k \le |x_i| \le a_{k+1}$ for some coordinate $i \in [d]$, then
$$C(x)_i := \begin{cases} \mathrm{sign}(x_i)\,a_k & \text{with probability } \frac{a_{k+1} - |x_i|}{a_{k+1} - a_k}, \\ \mathrm{sign}(x_i)\,a_{k+1} & \text{with probability } \frac{|x_i| - a_k}{a_{k+1} - a_k}. \end{cases} \quad (12)$$

Lemma 12 The general unbiased rounding operator (12) belongs to $U(\zeta)$, where
$$\zeta = \frac{1}{4}\sup_{k\in\mathbb{Z}}\left(\frac{a_k}{a_{k+1}} + \frac{a_{k+1}}{a_k} + 2\right) = \sup_{k\in\mathbb{Z}}\frac{(a_k + a_{k+1})^2}{4a_k a_{k+1}}.$$
Notice that $\zeta$ is minimized for exponential roundings $a_k = b^k$ with some base $b > 1$, in which case $\zeta = \frac{1}{4}\left(b + \frac{1}{b} + 2\right) = \frac{(b+1)^2}{4b}$.

(f) Let $(a_k)_{k\in\mathbb{Z}}$ be an arbitrary increasing sequence of positive numbers such that $\inf a_k = 0$ and $\sup a_k = \infty$. Then general biased rounding is defined via
$$C(x)_i := \mathrm{sign}(x_i)\,\arg\min_{t\in(a_k)}\big|t - |x_i|\big|, \quad i \in [d]. \quad (13)$$

Lemma 13 The general biased rounding operator (13) belongs to $B_1(\alpha, \beta)$, $B_2(\gamma, \beta)$ and $B_3(\delta)$, where
$$\beta = \sup_{k\in\mathbb{Z}}\frac{2a_{k+1}}{a_k + a_{k+1}}, \quad \gamma = \inf_{k\in\mathbb{Z}}\frac{2a_k}{a_k + a_{k+1}}, \quad \alpha = \gamma^2, \quad \delta = \sup_{k\in\mathbb{Z}}\frac{(a_k + a_{k+1})^2}{4a_k a_{k+1}}.$$
In the special case of exponential rounding $a_k = b^k$ with some base $b > 1$, we get
$$\alpha = \left(\frac{2}{b+1}\right)^2, \quad \beta = \frac{2b}{b+1}, \quad \gamma = \frac{2}{b+1}, \quad \delta = \frac{(b+1)^2}{4b}.$$

Remark 14 Plugging these parameters into the iteration complexities of Table 1, we find that the class B3 gives the best iteration complexity, as
$$\frac{\beta^2}{\alpha} = b^2 \;>\; \frac{\beta}{\gamma} = b \;>\; \delta = \frac{(b+1)^2}{4b}.$$

(g) The natural compression operator $C_{\mathrm{nat}}$ of Horváth et al. (2019a) is the special case of the general unbiased rounding operator (12) with $b = 2$.
In particular, $C_{\mathrm{nat}} \in U(9/8)$.

(h) For $b > 1$, define the general exponential dithering operator with respect to the $\ell_p$-norm and with $s$ exponential levels $0 < b^{1-s} < b^{2-s} < \dots < b^{-1} < 1$ via
$$C(x)_i := \|x\|_p\,\mathrm{sign}(x_i)\,\xi\!\left(\frac{|x_i|}{\|x\|_p}\right), \quad (14)$$
where, for $t \in [b^{-u-1}, b^{-u}]$, the random variable $\xi(t)$ is set to either $b^{-u-1}$ or $b^{-u}$ with probabilities proportional to $b^{-u} - t$ and $t - b^{-u-1}$, respectively.

Lemma 15 The general exponential dithering operator (14) belongs to $U(\zeta_b)$, where, letting $r = \min(p, 2)$,
$$\zeta_b = \frac{1}{4}\left(b + \frac{1}{b} + 2\right) + d^{1/r}\,b^{1-s}\min\left(1,\ d^{1/r}\,b^{1-s}\right). \quad (15)$$

(i) Natural dithering, introduced by Horváth et al. (2019a) without norm compression, is the special case of general exponential dithering (14) with $b = 2$.

(j) Ternary quantization of Wen et al. (2017) is the extreme case of general exponential dithering (14) with $s = 1$ levels.

(k) Top-$k$ combined with exponential dithering. Let $C_{\mathrm{top}}$ be the Top-$k$ sparsification operator (11) and $C_{\mathrm{dith}}$ be the general exponential dithering operator (14) with some base $b > 1$ and parameter $\zeta_b$ from (15). Define a new compression operator as the composition of these two:
$$C(x) := C_{\mathrm{dith}}(C_{\mathrm{top}}(x)). \quad (16)$$

Lemma 16 The composition operator (16) of Top-$k$ sparsification and exponential dithering with base $b$ belongs to $B_1(k/d, \zeta_b)$, $B_2(k/d, \zeta_b)$ and $B_3(\zeta_b d/k)$, where $\zeta_b$ is as in (15).

3. Gradient Descent with Biased Compression

As we discussed in the previous section, compression operators can have different but equivalent parameterizations. We now investigate the influence of these parameterizations on the theoretical convergence rate of an algorithm employing compression operators. To achieve a clear understanding of the interaction between compressor parameterization and convergence rate, we first consider the single-node, unconstrained optimization problem $\min_{x\in\mathbb{R}^d} f(x)$, where $f : \mathbb{R}^d \to \mathbb{R}$ is $L$-smooth and $\mu$-strongly convex.
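The method analyzed in this section, compressed gradient descent $x^{k+1} = x^k - \eta\,C(\nabla f(x^k))$, is easy to simulate. Here is a minimal sketch with the deterministic Top-k compressor (which lies in B3(d/k)) on a synthetic strongly convex quadratic; the stepsize choice η = 1/L follows Theorem 19 below, and all problem data is illustrative:

```python
import numpy as np

def top_k(x, k):
    """Top-k (11): keep the k largest-magnitude coordinates, zero out the rest."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

# Synthetic strongly convex quadratic f(x) = 0.5 * x^T D x with D diagonal:
# mu = 1 (smallest entry), L = 5 (largest entry), minimizer x* = 0.
d, k = 10, 3
D = np.linspace(1.0, 5.0, d)
grad = lambda x: D * x

# CGD: x^{k+1} = x^k - eta * C(grad f(x^k)) with eta = 1/L.
x = np.ones(d)
eta = 1.0 / D.max()
for _ in range(500):
    x = x - eta * top_k(grad(x), k)

f_gap = 0.5 * np.sum(D * x**2)  # f(x) - f(x*); decays linearly since Top-k is in B3(d/k)
```

Even though only 3 of 10 gradient coordinates are transmitted per step, the functional gap contracts geometrically, consistent with the $(1 - \mu/(\delta L))$-type factor with $\delta = d/k$.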
We study the method
$$x^{k+1} = x^k - \eta\,C^k(\nabla f(x^k)), \quad \text{(CGD)}$$
where $C^k : \mathbb{R}^d \to \mathbb{R}^d$ are (potentially biased) compression operators belonging to one of the classes B1, B2 and B3 studied in the previous sections, and $\eta > 0$ is a stepsize. We refer to this method as CGD: Compressed Gradient Descent.

3.1 Complexity theory

We now establish three theorems, performing a complexity analysis for each of the three classes B1, B2 and B3 individually. Let $E_k := \mathbb{E}[f(x^k)] - f(x^*)$, with $E_0 = f(x^0) - f(x^*)$.

Theorem 17 Let $C \in B_1(\alpha, \beta)$. Then as long as $0 \le \eta \le \frac{2}{\beta L}$, we have
$$E_k \le \left(1 - \frac{\alpha}{\beta}\,\eta\mu(2 - \eta\beta L)\right) E_{k-1}.$$
If we choose $\eta = \frac{1}{\beta L}$, then $E_k \le \left(1 - \frac{\alpha}{\beta^2}\frac{\mu}{L}\right)^k E_0$.

Theorem 18 Let $C \in B_2(\gamma, \beta)$. Then as long as $0 \le \eta \le \frac{2}{\beta L}$, we have
$$E_k \le \left(1 - \gamma\eta\mu(2 - \eta\beta L)\right) E_{k-1}.$$
If we choose $\eta = \frac{1}{\beta L}$, then $E_k \le \left(1 - \frac{\gamma}{\beta}\frac{\mu}{L}\right)^k E_0$.

Theorem 19 Let $C \in B_3(\delta)$. Then as long as $0 \le \eta \le \frac{1}{L}$, we have
$$E_k \le \left(1 - \frac{1}{\delta}\,\eta\mu\right) E_{k-1}.$$
If we choose $\eta = \frac{1}{L}$, then $E_k \le \left(1 - \frac{1}{\delta}\frac{\mu}{L}\right)^k E_0$.

The iteration complexities for these results can be found in Table 1. Note that the identity compressor $C(x) \equiv x$ belongs to $B_1(1, 1)$, $B_2(1, 1)$ and $B_3(1)$; hence all these results exactly recover the rate of GD. In the first two theorems, scaling the compressor by a positive scalar $\lambda > 0$ does not influence the rate (see Theorem 6). One can note that it is possible to adapt the results of Theorems 17-19 to the case of time-varying compression parameters (Lin et al., 2017b; Agarwal et al., 2020). In the case of Theorem 19 and B3 compressors, it suffices to introduce a sequence $\{\delta_i\}$ tracking the changes of the parameter $\delta$, and we obtain the convergence
$$E_k \le \left[\prod_{i=0}^{k-1}\left(1 - \frac{1}{\delta_i}\frac{\mu}{L}\right)\right] E_0.$$
In the case of Theorems 17 and 18 the results are similar, but it is necessary to allow time-varying stepsizes $\eta_k = \frac{1}{\beta_k L}$ in (CGD).

3.2 B3 and B2 are better than B1

In light of the results above, we make the following observation. If $C \in B_1(\alpha, \beta)$, then by Theorem 6, $\frac{1}{\beta}C \in B_3(\beta^2/\alpha)$. Applying Theorem 19, we get the bound $O\left(\frac{\beta^2}{\alpha}\frac{L}{\mu}\log\frac{1}{\varepsilon}\right)$. This is the same result as that obtained by Theorem 17.
On the other hand, if $C \in B_3(\delta)$, then by Theorem 6, $C \in B_1\left(\frac{1}{4\delta^2}, 2\right)$. Applying Theorem 17, we get the bound $O\left(16\delta^2\frac{L}{\mu}\log\frac{1}{\varepsilon}\right)$. This is a worse result than what Theorem 19 offers, by a factor of $16\delta$. Similarly, if $C \in B_1(\alpha, \beta)$, then by Theorem 6, $C \in B_2(\alpha/\beta, \beta)$. Applying Theorem 18, we get the bound $O\left(\frac{\beta^2}{\alpha}\frac{L}{\mu}\log\frac{1}{\varepsilon}\right)$. This is the same result as that obtained by Theorem 17. On the other hand, if $C \in B_2(\gamma, \beta)$, then by Theorem 6, $C \in B_1(\gamma^2, \beta)$. Applying Theorem 17, we get the bound $O\left(\frac{\beta^2}{\gamma^2}\frac{L}{\mu}\log\frac{1}{\varepsilon}\right)$. This is a worse result than what Theorem 18 offers, by a factor of $\beta/\gamma \ge 1$. Hence, while B1, B2 and B3 describe the same classes of compressors, for the purposes of CGD it is better to parameterize them as members of B2 or B3.

4. Superiority of Biased Compressors Under Statistical Assumptions

Here we highlight some advantages of biased compressors by comparing them with their unbiased cousins. We evaluate compressors by their average capacity to preserve gradient information or, in other words, by the expected approximation error they produce. In the sequel, we assume that gradients have i.i.d. coordinates drawn from some distribution.

4.1 Top-k vs Rand-k

We now compare two sparsification operators: Rand-$k$ (8), which is unbiased and which we denote by $C^k_{\mathrm{rnd}}$, and Top-$k$ (11), which is biased and which we denote by $C^k_{\mathrm{top}}$. We define the variance of the approximation error of $x$ via
$$\omega^k_{\mathrm{rnd}}(x) := \mathbb{E}\left[\left\|\tfrac{k}{d}C^k_{\mathrm{rnd}}(x) - x\right\|_2^2\right], \qquad \omega^k_{\mathrm{top}}(x) := \left\|C^k_{\mathrm{top}}(x) - x\right\|_2^2,$$
and the energy saving via
$$s^k_{\mathrm{rnd}}(x) := \|x\|_2^2 - \omega^k_{\mathrm{rnd}}(x) = \mathbb{E}\left[\left\|\tfrac{k}{d}C^k_{\mathrm{rnd}}(x)\right\|_2^2\right], \qquad s^k_{\mathrm{top}}(x) := \|x\|_2^2 - \omega^k_{\mathrm{top}}(x) = \left\|C^k_{\mathrm{top}}(x)\right\|_2^2 = \sum_{i=d-k+1}^{d} x_{(i)}^2.$$
Expectations in these expressions are taken with respect to the randomization of the compression operator rather than the input vector $x$. Clearly, there exist vectors $x$ for which these two operators incur identical variance, e.g., $x_1 = \dots = x_d$. However, in practice we apply compression to gradients $x$ which evolve in time and which may have heterogeneous components.
In such situations, $\omega^k_{\mathrm{top}}(x)$ can be much smaller than $\omega^k_{\mathrm{rnd}}(x)$. This motivates a quantitative study of the average-case behavior, in which we make an assumption on the distribution of the coordinates of the compressed vector.

Uniform and exponential distributions. We first consider the case of uniformly and exponentially distributed entries, and quantify the difference.

Lemma 20 Assume the coordinates of $x \in \mathbb{R}^d$ are i.i.d.
(a) If they follow the uniform distribution over $[0, 1]$, then
$$\frac{\mathbb{E}[\omega^k_{\mathrm{top}}]}{\mathbb{E}[\omega^k_{\mathrm{rnd}}]} = \left(1 - \frac{k}{d+1}\right)\left(1 - \frac{k}{d+2}\right) \qquad \text{and} \qquad \frac{\mathbb{E}[s^1_{\mathrm{top}}]}{\mathbb{E}[s^1_{\mathrm{rnd}}]} = \frac{3d}{d+2}.$$
(b) If they follow the standard exponential distribution, then $\frac{\mathbb{E}[s^1_{\mathrm{top}}]}{\mathbb{E}[s^1_{\mathrm{rnd}}]} = O(\log^2 d)$.

| | Top-3: d = 10² | 10³ | 10⁴ | 10⁵ | Top-5: d = 10² | 10³ | 10⁴ | 10⁵ |
|---|---|---|---|---|---|---|---|---|
| N(0, 1): E[sᵏ_rnd] = k(σ² + µ²) | 3 | 3 | 3 | 3 | 5 | 5 | 5 | 5 |
| N(0, 1): E[sᵏ_top] | 18.65 | 31.10 | 43.98 | 57.08 | 27.14 | 47.70 | 69.07 | 90.85 |
| N(2, 1): E[sᵏ_rnd] = k(σ² + µ²) | 15 | 15 | 15 | 15 | 25 | 25 | 25 | 25 |
| N(2, 1): E[sᵏ_top] | 53.45 | 75.27 | 95.81 | 115.53 | 81.60 | 118.56 | 153.13 | 186.22 |

Table 4: Information savings of greedy and random sparsifiers for k = 3 and k = 5.

Empirical comparison. Next we compare these two sparsification methods on an empirical basis and show the significant advantage of the greedy sparsifier over the random one. We assume that the coordinates of the to-be-compressed vector are i.i.d. Gaussian random variables. First, we compare the savings $s^k_{\mathrm{top}}$ and $s^k_{\mathrm{rnd}}$ of these compressors. For random sparsification we have $\mathbb{E}[s^k_{\mathrm{rnd}}(x)] = k(\sigma^2 + \mu^2)$, where $\mu$ and $\sigma^2$ are the mean and variance of the Gaussian distribution. For computing $\mathbb{E}[s^k_{\mathrm{top}}(x)]$, we use the probability density function of the $k$-th order statistic (see (31) or Section 2.2.2 of Arnold et al. (1992)). Table 4 shows that the Top-3 and Top-5 sparsifiers save 3-40× more information in expectation, and the factor grows with the dimension. Next we compare the normalized variances $\omega^k_{\mathrm{top}}(x)/\|x\|_2^2$ and $\omega^k_{\mathrm{rnd}}(x)/\|x\|_2^2$ for randomly generated Gaussian vectors.
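The Gaussian savings comparison is easy to reproduce by Monte Carlo. A small sketch (sample sizes and names are ours) estimating $\mathbb{E}[s^k_{\mathrm{top}}]$ for N(0, 1) coordinates, to be compared with the exact $\mathbb{E}[s^k_{\mathrm{rnd}}] = k(\sigma^2 + \mu^2)$:

```python
import numpy as np

rng = np.random.default_rng(2)
d, k, trials = 100, 3, 20000

# Energy saving of Top-k on a vector: the sum of its k largest squared coordinates.
xs = rng.standard_normal((trials, d))
s_top_mc = np.mean(np.sort(xs**2, axis=1)[:, -k:].sum(axis=1))

# Exact Rand-k saving for i.i.d. N(mu, sigma^2) coordinates: k * (sigma^2 + mu^2).
s_rnd_exact = k * (1.0**2 + 0.0**2)  # = 3 for N(0, 1)
# For d = 100, Table 4 reports E[s^3_top] around 18.65: roughly a 6x gain over Rand-3.
```

Increasing `d` while keeping `k` fixed leaves `s_rnd_exact` unchanged but grows the Top-k estimate, which is the dimension-dependent gap reported in Table 4.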
In an attempt to give a dimension-independent comparison, we plot the normalized variance against the average number of encoding bits per coordinate, which is quite stable with respect to the dimension. Figure 1 reveals the superiority of the greedy sparsifier over the random one.

Figure 1: Comparison of Top-k and Rand-k sparsifiers with respect to the normalized variance and the number of encoding bits used for each coordinate on average. Each point/marker represents a single d = 10⁴-dimensional vector drawn from a Gaussian distribution and then compressed by the specified operator. Each curve was obtained by varying the free parameter k ∈ {1, 2, …, d}. Plots for different d look very similar. Notice that for random sparsification the normalized variance is perfectly linear with respect to the number of bits per coordinate: letting b be the total number of bits used to encode the compressed vector (say, in the binary32 system), the normalized variance produced by the random sparsifier is almost 1 − (b/d)/32. However, the greedy sparsifier achieves exponentially lower variance, roughly 0.86^{b/d}, utilizing the same amount of bits.

Figure 2: Energy savings of Rand-5 and Top-5 for practical gradient distributions ((a), (b), (c): quadratic problem; (d): logistic regression). (a) s⁵_rnd = 9·10⁵, s⁵_top = 28·10⁵; (b) s⁵_rnd = 624, s⁵_top = 1357; (c) s⁵_rnd = 11921, s⁵_top = 23488; (d) s⁵_rnd = 2·10⁻⁴, s⁵_top = 17·10⁻⁴. The results of Top-5 are 3-5× better.

Practical distributions. We obtained various gradient distributions via logistic regression (mushrooms LIBSVM dataset) and least squares. We used the sklearn package and built a Gaussian smoothing of the empirical gradient density. The second moments, i.e., energy
savings, were then calculated from it via the formula for the density function of the k-th order statistic; see Appendix A.4 or Arnold et al. (1992). We report experiments for Top-5 and Rand-5; see Figure 2 for details.

4.2 New compressor: Top-k combined with dithering

In Section 2.2 we gave a new biased compression operator (see (16)), in which we combined the Top-k sparsification operator (11) with general exponential dithering (14). Consider the composition operator with natural dithering, i.e., with base b = 2. We showed that it belongs to B1(k/d, 9/8) and B3(9d/(8k)). Figure 3 empirically confirms that it attains the lowest compression parameter δ ≥ 1 (see Definition 5) among all other known compressors. Furthermore, the iteration complexity $O\left(\delta\frac{L}{\mu}\log\frac{1}{\varepsilon}\right)$ of CGD for C ∈ B3(δ) implies that it enjoys the fastest convergence.

Figure 3: Comparison of various compressors (Rand-k, Top-k, standard dithering, natural dithering, natural compression, ternary quantization, scaled sign, Top-k + standard dithering, Top-k + natural dithering) with respect to the parameter δ ≥ 1 in log₁₀ scale and the number of encoding bits used for each coordinate on average. Each point/marker represents a single d = 10⁴-dimensional vector x drawn from a Gaussian distribution and then compressed by the specified operator. Each curve was obtained by varying the free parameters (k ∈ {1, 2, …, d} for sparsifiers, s ≥ 1 for ditherings) of the specified compressor. Parameter-free compressors, such as ternary quantization and natural compression, have fixed communication budgets, which explains the vertical arrangement of their points.

5. Distributed Setting

We now focus our attention on a distributed setup with n machines, each of which owns non-i.i.d. data defining one loss function fᵢ.
Our goal is to minimize the average loss:
$$\min_{x \in \mathbb{R}^d} \left\{ f(x) := \frac{1}{n} \sum_{i=1}^n f_i(x) \right\}. \qquad (17)$$

5.1 Distributed CGD with unbiased compressors

Perhaps the most straightforward extension of CGD to the distributed setting is the method
$$x^{k+1} = x^k - \eta \frac{1}{n} \sum_{i=1}^n C_i^k\big(\nabla f_i(x^k)\big). \qquad \text{(DCGD)}$$
Indeed, for $n = 1$ this method reduces to CGD. For unbiased compressors belonging to $\mathbb{U}(\zeta)$, this method converges under suitable assumptions on the functions. For instance, if the $f_i$ are $L$-smooth and $f$ is $\mu$-strongly convex, then as long as the stepsize is chosen appropriately, the method converges at a linear rate to a $O\big(\frac{\eta D (\zeta - 1)}{\mu n}\big)$ neighborhood of the (necessarily unique) solution $x^*$, where $D := \frac{1}{n} \sum_{i=1}^n \|\nabla f_i(x^*)\|_2^2$ (Gorbunov et al., 2020a). In particular, in the over-parameterized setting when $D = 0$, the method converges to the exact solution, and does so at the same rate as GD as long as $\zeta = O(n)$. These results hold even if a regularizer is considered and a proximal step is added to DCGD. Moreover, as shown by Mishchenko et al. (2019) and Horváth et al. (2019b), a variance reduction technique can be devised to remove the neighborhood convergence and replace it by convergence to $x^*$, at the negligible additional cost of $O\big((\zeta - 1) \log \frac{1}{\varepsilon}\big)$.

5.2 Failure of DCGD with biased compressors

However, as we now demonstrate by giving some counter-examples, DCGD may fail if the compression operators are allowed to be biased. In the first example below, DCGD used with the Top-1 compressor diverges at an exponential rate.

Example 1 Consider $n = d = 3$ and define
$$f_1(x) = \langle a, x \rangle^2 + \tfrac{1}{4}\|x\|_2^2, \qquad f_2(x) = \langle b, x \rangle^2 + \tfrac{1}{4}\|x\|_2^2, \qquad f_3(x) = \langle c, x \rangle^2 + \tfrac{1}{4}\|x\|_2^2,$$
where $a = (-3, 2, 2)$, $b = (2, -3, 2)$ and $c = (2, 2, -3)$. Let the starting iterate be $x^0 = (t, t, t)$, where $t > 0$. Then
$$\nabla f_1(x^0) = \tfrac{t}{2}(-11, 9, 9), \qquad \nabla f_2(x^0) = \tfrac{t}{2}(9, -11, 9), \qquad \nabla f_3(x^0) = \tfrac{t}{2}(9, 9, -11).$$
Using the Top-1 compressor, we get $C(\nabla f_1(x^0)) = \tfrac{t}{2}(-11, 0, 0)$, $C(\nabla f_2(x^0)) = \tfrac{t}{2}(0, -11, 0)$ and $C(\nabla f_3(x^0)) = \tfrac{t}{2}(0, 0, -11)$.
The next iterate of DCGD is
$$x^1 = x^0 - \eta \frac{1}{3} \sum_{i=1}^3 C(\nabla f_i(x^0)) = \left(1 + \frac{11\eta}{6}\right) x^0.$$
Repeated application gives
$$x^k = \left(1 + \frac{11\eta}{6}\right)^k x^0.$$
Since $\eta > 0$, the entries of $x^k$ diverge exponentially fast to $+\infty$. The above counter-example can be extended to the case of Top-$d_1$ when $d_1 < d$.

Example 2 Fix the dimension $d \ge 3$ and let $n = \binom{d}{d_1}$ be the number of nodes, where $d_1 < \frac{d}{2}$, so that $d_2 := d - d_1 > d_1$. Choose positive numbers $b, c > 0$ such that $-b d_1 + c d_2 = 1$ and $b > c + 1$; such a choice always exists since $d_2 > d_1$. Define the vectors $a_j \in \mathbb{R}^d$, $j \in [n]$, via
$$a_j = \sum_{i \in I_j} (-b)\, e_i + \sum_{i \in [d] \setminus I_j} c\, e_i,$$
where the sets $I_j \subseteq [d]$, $j \in [n]$, are all possible $d_1$-subsets of $[d]$ enumerated in some way. Define $f_j(x) = \langle a_j, x \rangle^2 + \tfrac{1}{2}\|x\|_2^2$, $j \in [n]$, and let the initial point be $x^0 = t e$, $t > 0$, where $e = \sum_{i=1}^d e_i$ is the vector of all 1s. Then
$$\nabla f_j(x^0) = 2 \langle a_j, x^0 \rangle\, a_j + x^0 = 2t(-b d_1 + c d_2)\, a_j + t e = t(2 a_j + e).$$
Since $|2(-b) + 1| > |2c + 1|$, using the Top-$d_1$ compressor we get
$$C(\nabla f_j(x^0)) = -t(2b - 1) \sum_{i \in I_j} e_i.$$
Therefore, since every coordinate $i \in [d]$ belongs to exactly a $\frac{d_1}{d}$ fraction of the sets $I_j$, the next iterate of DCGD is
$$x^1 = x^0 - \eta \frac{1}{n} \sum_{j=1}^n C(\nabla f_j(x^0)) = x^0 + \eta t (2b - 1) \frac{d_1}{d}\, e = \left(1 + \frac{\eta (2b - 1) d_1}{d}\right) x^0,$$
which implies
$$x^k = \left(1 + \frac{\eta (2b - 1) d_1}{d}\right)^k x^0.$$
Since $\eta > 0$ and $b > 1$, the entries of $x^k$ diverge exponentially fast to $+\infty$.

Finally, we present a more general counter-example, with a different type of failure, for DCGD with non-randomized compressors.

Example 3 Let $C : \mathbb{R}^d \to \mathbb{R}^d$ be any deterministic mapping for which there exist vectors $v_1, v_2, \ldots, v_m \in \mathbb{R}^d$ such that
$$\sum_{i=1}^m v_i \neq 0 \qquad \text{and} \qquad \sum_{i=1}^m C(v_i) = 0. \qquad (18)$$
Consider the distributed optimization problem (17) with $n = m$ devices and the following strongly convex loss functions:
$$f_i(x) = \langle v_i, x \rangle + \tfrac{1}{2}\|x\|^2, \qquad i \in [m].$$
Then $\nabla f_i(x) = v_i + x$ and $\nabla f(x) = \frac{1}{n}\sum_{i=1}^n v_i + x$. Hence, the optimal point is $x^* = -\frac{1}{n}\sum_{i=1}^n v_i \neq 0$. However, it can be easily checked that, with initialization $x^0 = 0$, we have
$$x^1 = x^0 - \gamma \frac{1}{n}\sum_{i=1}^n C(\nabla f_i(x^0)) = x^0 - \gamma \frac{1}{n}\sum_{i=1}^n C(v_i + x^0) = x^0.$$
Thus, when initialized at $x^0 = 0$, not only does DCGD not converge to the solution $x^* \neq 0$, it remains stuck at the initial point, namely $x^k = x^0 = 0$ for all $k \ge 1$. Condition (18) is easily satisfied for specific biased compressors. For instance, Top-1 satisfies (18) with, e.g., $v_1 = \big(1, \tfrac{1}{4}\big)$, $v_2 = \big(-1, \tfrac{1}{4}\big)$, $v_3 = (0, 0)$.

The above examples suggest that one needs to devise a different approach to solving the distributed problem (17) with biased compressors. We resolve this problem by employing an error feedback mechanism.

5.3 Error Feedback

We show that Distributed SGD with Error Feedback (Karimireddy et al., 2019b), displayed in Algorithm 1, resolves the issue. Moreover, this algorithm allows for the computation of stochastic gradients. Each step starts with all machines $i$ in parallel computing a stochastic gradient $g_i^k$ of the form
$$g_i^k = \nabla f_i(x^k) + \xi_i^k, \qquad (19)$$
where $\nabla f_i(x^k)$ is the true gradient and $\xi_i^k$ is a stochastic error. This is multiplied by a stepsize $\eta^k$, added to the memory/error-feedback term $e_i^k$, and subsequently compressed. The compressed messages are communicated and aggregated. The difference between the message we wanted to send and its compressed version is stored as $e_i^{k+1}$ for correction in the next communication round. The output $\bar{x}^K$ is an ergodic average of the form
$$\bar{x}^K := \frac{1}{W^K} \sum_{k=0}^K w^k x^k, \qquad W^K := \sum_{k=0}^K w^k. \qquad (20)$$

5.4 Complexity theory

We assume the stochastic error $\xi_i^k$ in (19) satisfies the following condition.

Assumption 1 The stochastic error $\xi_i^k$ is unbiased, i.e., $\mathbb{E}[\xi_i^k] = 0$, and for some constants $B, C \ge 0$,
$$\mathbb{E}\left[\|\xi_i^k\|_2^2\right] \le B\, \|\nabla f_i(x^k)\|_2^2 + C, \qquad i \in [n], \; k \ge 0. \qquad (24)$$
Note that this assumption is much weaker than the bounded variance assumption (i.e., $\mathbb{E}\big[\|\xi_i^k\|_2^2\big] \le C$) and the bounded gradient assumption (i.e., $\mathbb{E}\big[\|g_i^k\|_2^2\big] \le C$). We can now state the main result of this section.
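Before stating it, note that both failure modes above are easy to verify numerically. The sketch below (illustrative, not from the paper's code) checks Example 1, where averaging Top-1 messages inflates every coordinate by the factor $1 + \frac{11\eta}{6}$, and an instance of (18), where DCGD is stuck at $x^0 = 0$.

```python
def top1(v):
    # Deterministic Top-1 compressor: keep the largest-magnitude entry.
    j = max(range(len(v)), key=lambda i: abs(v[i]))
    out = [0.0] * len(v)
    out[j] = v[j]
    return out

# Example 1: f_i(x) = <a_i, x>^2 + ||x||^2/4, so grad f_i(x) = 2<a_i,x> a_i + x/2.
def grad_quad(a, x):
    inner = sum(ai * xi for ai, xi in zip(a, x))
    return [2.0 * inner * ai + 0.5 * xi for ai, xi in zip(a, x)]

vecs = [(-3.0, 2.0, 2.0), (2.0, -3.0, 2.0), (2.0, 2.0, -3.0)]
t, eta = 1.0, 0.1
x0 = [t, t, t]
msgs = [top1(grad_quad(a, x0)) for a in vecs]
x1 = [x0[j] - eta * sum(m[j] for m in msgs) / 3.0 for j in range(3)]
print(x1)  # every entry equals (1 + 11*eta/6) * t: the iterates grow

# Example 3 flavour: f_i(x) = <v_i, x> + ||x||^2/2, so grad f_i(x) = v_i + x.
vs = [(1.0, 0.25), (-1.0, 0.25), (0.0, 0.0)]
step = [sum(top1([vi + zi for vi, zi in zip(v, (0.0, 0.0))])[j] for v in vs) / 3.0
        for j in range(2)]
print(step)  # the Top-1 messages cancel at x^0 = 0, so DCGD never moves
```

Here the sum of the $v_i$ is nonzero (so $x^* \neq 0$), yet the averaged compressed step at zero vanishes.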
To the best of our knowledge, this was an open problem: we are not aware of any prior convergence results for distributed optimization that tolerate general classes of biased compression operators under reasonable assumptions on the stochastic gradient.

Algorithm 1 Distributed SGD with Biased Compression and Error Feedback
Parameters: Compressors $C_i^k \in \mathbb{B}^3(\delta)$; stepsizes $\{\eta^k\}_{k \ge 0}$; iteration count $K$
Initialization: Choose $x^0 \in \mathbb{R}^d$ and $e_i^0 = 0$ for all $i$
for $k = 0, 1, 2, \ldots, K$ do
  Server sends $x^k$ to all $n$ machines
  All machines in parallel compute $g_i^k$ and perform the updates:
    $\tilde{g}_i^k = C_i^k(e_i^k + \eta^k g_i^k)$  (21)
    $e_i^{k+1} = e_i^k + \eta^k g_i^k - \tilde{g}_i^k$  (22)
  Each machine $i$ sends $\tilde{g}_i^k$ to the server
  Server performs the aggregation:
    $x^{k+1} = x^k - \frac{1}{n}\sum_{i=1}^n \tilde{g}_i^k$  (23)
end for
Output: Weighted average $\bar{x}^K$ of the iterates, see (20)

Theorem 21 (Convergence guarantees for Algorithm 1) Let $\{x^k\}_{k \ge 0}$ denote the iterates of Algorithm 1 for solving problem (1), where each $f_i$ is $L$-smooth and $\mu$-strongly convex. Let $x^*$ be the minimizer of $f$, let $f^* := f(x^*)$, and let $D := \frac{1}{n}\sum_{i=1}^n \|\nabla f_i(x^*)\|_2^2$. Assume the compression operator used by all nodes belongs to $\mathbb{B}^3(\delta)$. Then we have the following convergence rates under three different stepsize and iterate weighting regimes:

(i) $O(1/k)$ stepsizes & $O(k)$ weights. Let, for all $k \ge 0$, the stepsizes and weights be set as $\eta^k = \frac{4}{\mu(\kappa + k)}$ and $w^k = \kappa + k$, respectively, where $\kappa = \frac{56(2\delta + B)L}{\mu}$. Then
$$\mathbb{E}\left[f(\bar{x}^K)\right] - f^* = O\left(\frac{A_1}{K^2} + \frac{A_2}{K}\right), \qquad \text{where } A_1 := \frac{L^2(2\delta + B)^2}{\mu}\|x^0 - x^*\|_2^2 \;\text{ and }\; A_2 := \frac{C(1 + 1/n) + D(2B/n + 3\delta)}{\mu}.$$

(ii) $O(1)$ stepsizes & $O(e^k)$ weights. Let, for all $k \ge 0$, the stepsizes and weights be set as $\eta^k = \eta$ and $w^k = (1 - \mu\eta/2)^{-(k+1)}$, respectively, where $\eta \le \frac{1}{14(2\delta + B)L}$. Then
$$\mathbb{E}\left[f(\bar{x}^K)\right] - f^* = O\left(A_3 \exp\left[-\frac{K}{A_4}\right] + \frac{A_2}{K}\right), \qquad \text{where } A_3 := L(2\delta + B)\|x^0 - x^*\|_2^2 \;\text{ and }\; A_4 := \frac{28L(2\delta + B)}{\mu}.$$

(iii) $O(1)$ stepsizes & equal weights. Let, for all $k \ge 0$, the stepsizes and weights be set as $\eta^k = \eta$ and $w^k = 1$, respectively, where $\eta \le \frac{1}{14(2\delta + B)L}$. Then
$$\mathbb{E}\left[f(\bar{x}^K)\right] - f^* = O\left(\frac{A_3}{K} + \frac{A_5}{\sqrt{K}}\right), \qquad \text{where } A_5 := \sqrt{C(1 + 1/n) + D(2B/n + 3\delta)}\;\|x^0 - x^*\|_2.$$
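In code, one pass of Algorithm 1 can be sketched as follows (a minimal single-process simulation with full local gradients, so $C = 0$ in (24); the problem instance, stepsize, and iteration budget are illustrative and not tuned to the theorem's constants):

```python
def top1(v):
    j = max(range(len(v)), key=lambda i: abs(v[i]))
    out = [0.0] * len(v)
    out[j] = v[j]
    return out

def ef_sgd(grads, compress, x0, eta, K):
    # Distributed SGD with biased compression and error feedback:
    # each node compresses the error-corrected scaled gradient (21), keeps
    # the residual in its memory (22); the server averages the messages (23).
    n, d = len(grads), len(x0)
    x, e = list(x0), [[0.0] * d for _ in range(n)]
    for _ in range(K):
        agg = [0.0] * d
        for i in range(n):
            g = grads[i](x)
            corrected = [e[i][j] + eta * g[j] for j in range(d)]
            msg = compress(corrected)                         # (21)
            e[i] = [corrected[j] - msg[j] for j in range(d)]  # (22)
            for j in range(d):
                agg[j] += msg[j] / n
        x = [x[j] - agg[j] for j in range(d)]                 # (23)
    return x

# The Example 1 instance that made plain DCGD diverge: f_i(x) = <a_i,x>^2 + ||x||^2/4,
# whose unique minimizer is x* = 0. With error feedback, Top-1 now converges.
def make_grad(a):
    return lambda x: [2.0 * sum(ai * xi for ai, xi in zip(a, x)) * aj + 0.5 * xj
                      for aj, xj in zip(a, x)]

grads = [make_grad(a) for a in [(-3.0, 2.0, 2.0), (2.0, -3.0, 2.0), (2.0, 2.0, -3.0)]]
xK = ef_sgd(grads, top1, [1.0, 1.0, 1.0], eta=0.0005, K=40000)
print(xK)  # close to the solution x* = (0, 0, 0)
```

The same run with the error-feedback memory removed (i.e., plain DCGD) blows up, as in Example 1.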
Let us make a few observations on these results. First, Algorithm 1, employing general biased compressors and an error feedback mechanism, indeed resolves the convergence issues of DCGD by converging to the optimal solution $x^*$. Second, note that the choice of stepsizes $\eta^k$ and weights $w^k$ leading to convergence is not unique, and several schedules are feasible. Third, all the rates are sublinear; however, based on the second rate (ii) above, linear convergence is guaranteed if $C = D = 0$. Based on (24), one setup in which the condition $C = 0$ holds is when all devices compute full local gradients (i.e., $g_i^k = \nabla f_i(x^k)$). Furthermore, the condition $D = 0$ is equivalent to $\nabla f_i(x^*) = 0$ for all $i \in [n]$, which is typically satisfied for over-parameterized models. Lastly, under these two assumptions (i.e., devices can compute full local gradients and the model is over-parameterized), we show that the distributed SGD method with error feedback converges with the same $O\big(\delta \frac{L}{\mu} \log \frac{1}{\epsilon}\big)$ linear rate as the single node CGD algorithm. To the best of our knowledge, this was the first regime in which a distributed first order method with biased compression is guaranteed to converge linearly.

Figure 4: Training/test loss and accuracy for VGG19 on CIFAR10 distributed among 4 nodes for 4 different compression operators (Top-5 with error feedback, Top-5 without error feedback, Random-5 without error feedback, and SGD), on both the train and test sets.

6. Experiments

In Sections 6.1-6.4, we present our experiments, which are primarily focused on supporting our theoretical findings. Therefore, we simulate these experiments on one machine, which
enables us to do rapid direct comparisons against the prior methods. In more detail, we use a machine with 24 Intel(R) Xeon(R) Gold 6146 CPU @ 3.20GHz cores and a GeForce GTX 1080 Ti GPU. Section 6.5 is devoted to real experiments with a large model and big data. For these experiments, we use a computational cluster with 10 Tesla T4 GPUs. We implement all methods in Python 3.7 using PyTorch (Paszke et al., 2019).

Figure 5: Training loss and test accuracy for VGG11 on CIFAR10 distributed among 4 nodes for 5 different compression operators (SGD, Rand-5, Top-5, natural dithering, and Top-5 + natural dithering), shown against both epochs and the number of communicated bits (log scale, $2^{19}$ to $2^{40}$).

6.1 Lower empirical variance induced by biased compressors during deep network training

Motivated by our theoretical results in Section 4, we show that similar behaviour can be seen in the empirical variance of gradients. We run two sets of experiments with ResNet18 on the CIFAR10 dataset. In Figure 6, we display the empirical variance obtained by running a training procedure with a specific compression. We compare unbiased and biased compressors with the same communication complexity: the deterministic (biased) rounding with the classic/unbiased natural compression $C_{\rm nat}$, and Top-$k$ with Rand-$k$, with $k$ set to 1/5 of the coordinates. One can clearly see that there is a gap in empirical variance between the biased and unbiased methods, similar to what we have shown in theory; see Section 4.

6.2 Error feedback is needed in distributed training with biased compression

The next experiment shows the need for error feedback in methods with biased compression operators. Based on Example 1, error feedback is necessary to prevent divergence from the optimal solution. Figure 4 displays training/test loss and accuracy for VGG19 on CIFAR10 with data equally distributed among 4 nodes.
We use plain SGD with a default step size of 0.01 for all methods, i.e., Top-5 with and without error feedback, Rand-5 without error feedback, and no compression. As suggested by the counter-example, not using error feedback can really hurt the performance when biased compression is used. Also note that the performance of Rand-5 is significantly worse than that of Top-5.

Figure 6: Comparison of the empirical variance $\|C(x) - x\|_2^2 / \|x\|_2^2$ during the training procedure for two pairs of methods: the deterministic rounding with the classic/unbiased $C_{\rm nat}$, and Top-$k$ with Rand-$k$, with $k$ set to 1/5 of the coordinates. The plots were produced using ResNet18, GoogleNet, and VGG19 on the CIFAR10 dataset.

6.3 Top-k mixed with natural dithering saves communication significantly

Next, we experimentally show the superiority of our newly proposed compressor, Top-$k$ combined with natural dithering. We compare it against the current state-of-the-art low-bandwidth approach: Top-$k$ for some small $k$. In Figure 5, we plot a comparison of 5 methods: Top-$k$, Rand-$k$, natural dithering, Top-$k$ combined with natural dithering, and plain SGD. We use 2 levels with the infinity norm for natural dithering and $k = 5$ for the sparsification methods. For all the compression operators, we train VGG11 on CIFAR10 with plain SGD as the optimizer and a default step size of 0.01. We can see that adding natural dithering after Top-$k$ has the same effect as natural dithering has compared to no compression: a significant reduction in communication with almost no effect on convergence or generalization.
Using this intuition, one can conclude that Top-$k$ with natural dithering is the best compression operator for any bandwidth, where we adjust to a given bandwidth by adjusting $k$. This exactly matches our previous theoretical variance estimates displayed in Figure 3.

6.4 Theoretical behavior predicts the actual performance in practice

In the next experiment, we provide numerical results to further show that our predicted theoretical behavior matches the actual performance observed in practice. We run two regression experiments optimized by gradient descent with step-size $\eta = \frac{1}{L}$. We use a slightly adjusted version of Theorem 19 with adaptive step-sizes, namely
$$\frac{f(x^k) - f(x^*)}{f(x^0) - f(x^*)} \le \prod_{i=0}^{k-1} \left(1 - \frac{\mu}{L}\left(1 - \frac{\|C(\nabla f(x^i)) - \nabla f(x^i)\|_2^2}{\|\nabla f(x^i)\|_2^2}\right)\right).$$
Note that this is a direct consequence of our analysis. We apply this property to display the theoretical convergence. In the first experiment, depicted in Figure 7, we generate a random square matrix $A$ of dimension 100 as follows: we sample a random diagonal matrix $D$ whose elements are independently drawn from the uniform distribution on $(1, 10)$, $(1, 100)$, and $(1, 1000)$, respectively. $A$ is then constructed as $Q^\top D Q$, where $Q$ is obtained from the QR decomposition $P = QR$ of a random matrix $P$. The label vector $y$ is generated the same way, with entries from the uniform distribution on $(0, 1)$. The optimization objective is then
$$\min_{x \in \mathbb{R}^d} \; x^\top A x - y^\top x.$$
For the second experiment, shown in Figure 8, we run standard linear regression on two scikit-learn datasets, Boston and Diabetes, with data normalization applied as the preprocessing step.
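The adjusted bound above is straightforward to evaluate alongside an actual CGD run. Below is a small self-contained sketch on a diagonal quadratic (an illustrative problem, not the paper's 100-dimensional instance): at every step the worst-case $1 - 1/\delta$ is replaced by the observed compression error.

```python
def topk(x, k):
    idx = sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)[:k]
    out = [0.0] * len(x)
    for i in idx:
        out[i] = x[i]
    return out

# f(x) = 0.5 * sum(lam_i * x_i^2): mu = min(lams), L = max(lams), f* = 0.
lams = [1.0, 2.0, 4.0, 7.0, 10.0]
mu, L = min(lams), max(lams)
eta = 1.0 / L
x = [1.0, -1.0, 1.0, -1.0, 1.0]
f = lambda z: 0.5 * sum(l * v * v for l, v in zip(lams, z))
f0, bound = f(x), 1.0
for _ in range(300):
    g = [l * v for l, v in zip(lams, x)]
    cg = topk(g, 1)
    # empirical per-step compression error, replacing the worst-case 1 - 1/delta
    rho = sum((a - b) ** 2 for a, b in zip(g, cg)) / sum(a * a for a in g)
    bound *= 1.0 - (mu / L) * (1.0 - rho)
    x = [v - eta * c for v, c in zip(x, cg)]
ratio = f(x) / f0
print(ratio, bound)  # the observed ratio never exceeds the adaptive bound
```

Since Top-$k$ is deterministic, the per-step inequality behind the product bound holds exactly, which is what the assertion-style comparison above illustrates.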
Figure 7: Theoretical vs. practical convergence of compressed gradient descent, measured by $\frac{f(x^k) - f(x^*)}{f(x^0) - f(x^*)}$ against the iteration number $k$, on quadratic problems with different condition numbers $\kappa \in \{10, 100, 1000\}$ for the Top-5 and Rand-5 compression operators (practical curves and their theoretical counterparts).

Looking at Figures 7 and 8, one can clearly see that, as predicted by our theory, biased compression with lower empirical variance leads to better convergence in practice, and the gap almost matches the predicted improvement.

6.5 Transformer training

In the last experiment, we work with a real big model. In particular, we train ALBERT-large (Lan et al., 2020) (18M parameters) with layer sharing on a combination of the Bookcorpus (Zhu et al., 2015) and Wikipedia (Devlin et al., 2018) datasets. We use the same optimizer (LAMB) and the same tuning for it as in the original paper (Lan et al., 2020). Our goal is to find an unbiased and a biased operator such that we maximize the improvement in terms of communication cost without losing much in terms of training quality.

Figure 8: Theoretical vs. practical convergence of compressed gradient descent on linear regression problems for the Boston and Diabetes datasets with the Top-5 and Rand-5 compression operators.

In this case,
we include in the communication cost both the time to perform the communication round and the time to perform the compression and decompression operations. It is important to note that we do not compress packages with gradients corresponding to LayerNorm scales, but these amount to less than 1 percent of the whole package. Among unbiased compressors, we try natural compression (Section 2.2 (g)) and random sparsification (Section 2.2 (a)). The best result is shown by natural compression, which compresses the packages by a factor of 4. The Rand-25 operator (which also compresses the information by a factor of 4) performs much worse, even with the use of the error feedback technique. For communications with natural compression, we use the classical allreduce procedure. Among biased compressors, we try Top-$k$ sparsification (Section 2.2 (d)) and Power compression (Vogels et al., 2019). The best result is shown by Power compression with rank parameter $r = 8$ and error feedback. The organization of communications (allreduce procedures) follows the original paper (Vogels et al., 2019). We measure how the training loss changes (Figure 9), and at the end of training we evaluate the final performance of each model on several popular tasks from Wang et al. (2018) (Table 5). The results show that the use of biased compression can significantly reduce the communication time cost compared to the uncompressed and even the unbiased compression setups. At the same time, the quality of the training does not drop much.

Figure 9: Training objective value for ALBERT-large with different compression operators (without compression, natural compression, and power compression), with a zoomed-in view of the later stage of training.
Setup                  Avg time       CoLA  MNLI  MRPC  QNLI  QQP   RTE   SST2  STS-B
Without compression    9.62 ± 0.03    46.2  81.1  82.5  87.9  88.0  66.3  85.1  88.0
Natural compression    4.05 ± 0.05    48.3  81.0  87.5  87.8  84.4  63.2  87.8  86.9
Power compression      1.04 ± 0.04    42.4  80.3  85.1  88.2  85.3  46.3  88.0  87.4

Table 5: Comparison of the average time per communication round and downstream evaluation scores on GLUE benchmark tasks.

Appendix A. Basic Facts and Inequalities

A.1 Strong convexity

A function $f$ is strongly convex on $\mathbb{R}^d$ when it is continuously differentiable and there is a constant $\mu > 0$ such that
$$\frac{\mu}{2}\|x - y\|_2^2 \le f(x) - f(y) - \langle \nabla f(y), x - y \rangle, \qquad \forall x, y \in \mathbb{R}^d. \qquad (25)$$

A.2 Smoothness

A function $f$ is called $L$-smooth on $\mathbb{R}^d$ with $L > 0$ when it is differentiable and its gradient is $L$-Lipschitz continuous, i.e.,
$$\|\nabla f(x) - \nabla f(y)\|_2 \le L\|x - y\|_2, \qquad \forall x, y \in \mathbb{R}^d.$$
If convexity is assumed as well, then the following inequality holds:
$$\frac{1}{2L}\|\nabla f(x) - \nabla f(y)\|_2^2 \le f(x) - f(y) - \langle \nabla f(y), x - y \rangle, \qquad \forall x, y \in \mathbb{R}^d. \qquad (26)$$
By plugging $y = x^*$ into (26), we get
$$\|\nabla f(x)\|_2^2 \le 2L\big(f(x) - f(x^*)\big), \qquad \forall x \in \mathbb{R}^d. \qquad (27)$$

A.3 Useful inequalities

For all $a, b, x_1, \ldots, x_n \in \mathbb{R}^d$ and $\xi > 0$, the following inequalities hold:
$$2\langle a, b \rangle \le \frac{\|a\|_2^2}{\xi} + \xi \|b\|_2^2, \qquad (28)$$
$$\|a + b\|_2^2 \le \left(1 + \frac{1}{\xi}\right)\|a\|_2^2 + (1 + \xi)\|b\|_2^2, \qquad (29)$$
$$\Big\|\sum_{i=1}^n x_i\Big\|_2^2 \le n \sum_{i=1}^n \|x_i\|_2^2. \qquad (30)$$

A.4 Facts from order statistics

For i.i.d. samples $x_1, x_2, \ldots, x_d$ from an absolutely continuous distribution with probability density function $\varphi$ and cumulative distribution function $\Phi$, let $x_{(1)} \le x_{(2)} \le \cdots \le x_{(d)}$ be the order statistics obtained by arranging the samples in increasing order of magnitude. Then the density function of $x_{(i)}$ ($1 \le i \le d$) is
$$\varphi_{x_{(i)}}(u) = \frac{d!}{(i-1)!\,(d-i)!}\,\{\Phi(u)\}^{i-1}\{1 - \Phi(u)\}^{d-i}\varphi(u), \qquad -\infty < u < \infty, \qquad (31)$$
and the joint density function of all $d$ order statistics is
$$\varphi_{x_{(1)},\ldots,x_{(d)}}(u_1, \ldots, u_d) = d! \prod_{i=1}^d \varphi(u_i), \qquad -\infty < u_1 \le \cdots \le u_d < \infty. \qquad (32)$$

Appendix B.
Proofs for Section 2.2

B.1 Proof of Lemma 8: Unbiased Random Sparsification

From the definition of $k$-nice sampling we have $p_i := \text{Prob}(i \in S) = \frac{k}{d}$, hence
$$\mathbb{E}[C(x)] = \frac{d}{k}\sum_{i=1}^d p_i x_i e_i = \sum_{i=1}^d x_i e_i = x, \qquad \mathbb{E}\left[\|C(x)\|_2^2\right] = \frac{d^2}{k^2}\sum_{i=1}^d p_i x_i^2 = \frac{d}{k}\sum_{i=1}^d x_i^2 = \frac{d}{k}\|x\|_2^2,$$
which implies $C \in \mathbb{U}\big(\frac{d}{k}\big)$.

B.2 Proof of Lemma 9: Biased Random Sparsification

Let $S \subseteq [d]$ be a proper sampling with probability vector $p = (p_1, \ldots, p_d)$, where $p_i := \text{Prob}(i \in S) > 0$ for all $i$. Then
$$\mathbb{E}[C(x)] = \text{Diag}(p)\, x = \sum_{i=1}^d p_i x_i e_i \qquad \text{and} \qquad \mathbb{E}\left[\|C(x)\|_2^2\right] = \sum_{i=1}^d p_i x_i^2.$$
Letting $q := \min_i p_i$, we get
$$q\|x\|_2^2 \le \sum_{i=1}^d p_i x_i^2 = \mathbb{E}\left[\|C(x)\|_2^2\right] = \langle \mathbb{E}[C(x)], x \rangle.$$
So, $C \in \mathbb{B}^1(q, 1)$ and $C \in \mathbb{B}^2(q, 1)$. For the third class, note that
$$\mathbb{E}\left[\|C(x) - x\|_2^2\right] = \sum_{i=1}^d (1 - p_i) x_i^2 \le (1 - q)\|x\|_2^2.$$
Hence, $C \in \mathbb{B}^3\big(\frac{1}{q}\big)$.

B.3 Proof of Lemma 10: Adaptive Random Sparsification

From the definition of the compression operator, we have
$$\mathbb{E}\left[\|C(x)\|_2^2\right] = \sum_{i=1}^d \frac{|x_i|}{\|x\|_1}\, x_i^2 = \frac{\|x\|_3^3}{\|x\|_1}, \qquad \mathbb{E}[\langle C(x), x \rangle] = \sum_{i=1}^d \frac{|x_i|}{\|x\|_1}\, x_i^2 = \frac{\|x\|_3^3}{\|x\|_1},$$
whence $\beta = 1$. Furthermore, by Chebyshev's sum inequality, we have
$$\frac{1}{d^2}\|x\|_1 \|x\|_2^2 = \left(\frac{1}{d}\sum_{i=1}^d |x_i|\right)\left(\frac{1}{d}\sum_{i=1}^d x_i^2\right) \le \frac{1}{d}\sum_{i=1}^d |x_i|\, x_i^2 = \frac{1}{d}\|x\|_3^3,$$
which implies that $\alpha = \frac{1}{d}$ and $\delta = d$. So, $C \in \mathbb{B}^1\big(\frac{1}{d}, 1\big)$, $C \in \mathbb{B}^2\big(\frac{1}{d}, 1\big)$, and $C \in \mathbb{B}^3(d)$.

B.4 Proof of Lemma 11: Top-k Sparsification

Clearly, $\|C(x)\|_2^2 = \sum_{i=d-k+1}^d x_{(i)}^2$ and $\|C(x) - x\|_2^2 = \sum_{i=1}^{d-k} x_{(i)}^2$. Hence
$$\frac{k}{d}\|x\|_2^2 \le \|C(x)\|_2^2 = \langle C(x), x \rangle \le \|x\|_2^2, \qquad \|C(x) - x\|_2^2 \le \left(1 - \frac{k}{d}\right)\|x\|_2^2,$$
so $C \in \mathbb{B}^1\big(\frac{k}{d}, 1\big)$, $C \in \mathbb{B}^2\big(\frac{k}{d}, 1\big)$, and $C \in \mathbb{B}^3\big(\frac{d}{k}\big)$.

B.5 Proof of Lemma 12: General Unbiased Rounding

The unbiasedness follows immediately from the definition (12):
$$\mathbb{E}[C(x)] = \sum_{i=1}^d \mathbb{E}[C(x)_i]\, e_i = \sum_{i=1}^d \text{sign}(x_i)\left(a_k \frac{a_{k+1} - |x_i|}{a_{k+1} - a_k} + a_{k+1}\frac{|x_i| - a_k}{a_{k+1} - a_k}\right) e_i = \sum_{i=1}^d x_i e_i = x. \qquad (33)$$
Since the rounding compression operator $C$ applies to each coordinate independently, without loss of generality we can consider the compression of scalar values $x = t > 0$ and show that $\mathbb{E}[C(t)^2] \le \zeta t^2$. From the definition we compute the second moment as follows:
$$\mathbb{E}\left[C(t)^2\right] = a_k^2\, \frac{a_{k+1} - t}{a_{k+1} - a_k} + a_{k+1}^2\, \frac{t - a_k}{a_{k+1} - a_k} = (a_k + a_{k+1})\, t - a_k a_{k+1} = t^2 + (t - a_k)(a_{k+1} - t), \qquad (34)$$
from which
$$\frac{\mathbb{E}\left[C(t)^2\right]}{t^2} = 1 + \left(1 - \frac{a_k}{t}\right)\left(\frac{a_{k+1}}{t} - 1\right), \qquad a_k \le t \le a_{k+1}. \qquad (35)$$
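The worst-case variance ratio in (35), and the corresponding biased-rounding error treated in B.6 below, can be checked numerically over one pair of adjacent levels. A short sketch with illustrative levels $a_k = 1$, $a_{k+1} = 2$ (the base-2 levels of natural rounding):

```python
def unbiased_second_moment(t, lo, hi):
    # E[C(t)^2] for stochastic rounding of t in [lo, hi] to {lo, hi}
    # with P(C(t) = hi) = (t - lo)/(hi - lo), so that E[C(t)] = t.
    p = (t - lo) / (hi - lo)
    return (1.0 - p) * lo * lo + p * hi * hi

lo, hi = 1.0, 2.0
grid = [lo + i * (hi - lo) / 10_000 for i in range(1, 10_000)]
worst_unbiased = max(unbiased_second_moment(t, lo, hi) / t**2 for t in grid)
# B.5: the max sits at the harmonic mean and equals (lo+hi)^2/(4*lo*hi) = 9/8.

def biased_error(t, lo, hi):
    # Deterministic rounding to the nearest level.
    c = lo if t - lo <= hi - t else hi
    return (c - t) ** 2 / t**2

worst_biased = max(biased_error(t, lo, hi) for t in grid)
# B.6: the max sits at the midpoint and equals ((hi-lo)/(hi+lo))^2 = 1/9.
print(worst_unbiased, worst_biased)
```

Note that $1 - 1/\delta = 1/9$ gives $\delta = 9/8$, the same value as the unbiased $\zeta = \sup_k (a_k + a_{k+1})^2 / (4 a_k a_{k+1})$ for base 2.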
Checking the optimality condition, one can show that the maximum is achieved at
$$t^* = \frac{2 a_k a_{k+1}}{a_k + a_{k+1}} = \frac{2}{\frac{1}{a_k} + \frac{1}{a_{k+1}}},$$
which, being the harmonic mean of $a_k$ and $a_{k+1}$, is in the range $[a_k, a_{k+1}]$. Plugging it into the expression for the variance, we get
$$\frac{\mathbb{E}\left[C(t^*)^2\right]}{(t^*)^2} = \frac{(a_k + a_{k+1})^2}{4\, a_k a_{k+1}}.$$
Thus, the parameter $\zeta$ for general unbiased rounding is
$$\zeta = \sup_{t > 0} \frac{\mathbb{E}\left[C(t)^2\right]}{t^2} = \sup_{k \in \mathbb{Z}}\; \sup_{a_k \le t \le a_{k+1}} \frac{\mathbb{E}\left[C(t)^2\right]}{t^2} = \sup_{k \in \mathbb{Z}} \frac{(a_k + a_{k+1})^2}{4\, a_k a_{k+1}}.$$

B.6 Proof of Lemma 13: General Biased Rounding

From the definition (13) of the compression operator $C$, we derive the following inequalities:
$$\left(\inf_{k \in \mathbb{Z}} \frac{2 a_k}{a_k + a_{k+1}}\right)^2 \|x\|_2^2 \le \|C(x)\|_2^2, \qquad \|C(x)\|_2^2 \le \sup_{k \in \mathbb{Z}} \frac{2 a_{k+1}}{a_k + a_{k+1}}\, \langle C(x), x \rangle, \qquad \inf_{k \in \mathbb{Z}} \frac{2 a_k}{a_k + a_{k+1}}\, \|x\|_2^2 \le \langle C(x), x \rangle,$$
which imply that $C \in \mathbb{B}^1(\alpha, \beta)$ and $C \in \mathbb{B}^2(\gamma, \beta)$, with
$$\beta = \sup_{k \in \mathbb{Z}} \frac{2 a_{k+1}}{a_k + a_{k+1}}, \qquad \gamma = \inf_{k \in \mathbb{Z}} \frac{2 a_k}{a_k + a_{k+1}}, \qquad \alpha = \gamma^2.$$
For the third class $\mathbb{B}^3(\delta)$, we need to upper bound the ratio $\|C(x) - x\|_2^2 / \|x\|_2^2$. Again, as $C$ applies to each coordinate independently, without loss of generality we consider the case when $x = t > 0$ is a scalar. From definition (13), we get
$$\frac{(C(t) - t)^2}{t^2} = \min\left(\left(1 - \frac{a_k}{t}\right)^2, \left(1 - \frac{a_{k+1}}{t}\right)^2\right), \qquad a_k \le t \le a_{k+1}. \qquad (36)$$
It can be easily checked that $\big(1 - \frac{a_k}{t}\big)^2$ is an increasing function and $\big(1 - \frac{a_{k+1}}{t}\big)^2$ is a decreasing function of $t \in [a_k, a_{k+1}]$. Thus, the maximum is achieved when they are equal. In contrast to unbiased general rounding, this happens at the middle of the interval,
$$t^* = \frac{a_k + a_{k+1}}{2} \in [a_k, a_{k+1}].$$
Plugging $t^*$ into (36), we get
$$\frac{(C(t^*) - t^*)^2}{(t^*)^2} = \left(\frac{a_{k+1} - a_k}{a_k + a_{k+1}}\right)^2.$$
Given this, the parameter $\delta$ can be computed from
$$1 - \frac{1}{\delta} = \sup_{k \in \mathbb{Z}}\; \sup_{a_k \le t \le a_{k+1}} \frac{(C(t) - t)^2}{t^2} = \sup_{k \in \mathbb{Z}} \left(\frac{a_{k+1} - a_k}{a_k + a_{k+1}}\right)^2,$$
which gives
$$\delta = \sup_{k \in \mathbb{Z}} \frac{(a_k + a_{k+1})^2}{4\, a_k a_{k+1}},$$
and $C \in \mathbb{B}^3(\delta)$.

B.7 Proof of Lemma 15: General Exponential Dithering

The proof follows the same steps as in Theorem 4 of Horváth et al. (2019a). To show the unbiasedness of $C$, we first show the unbiasedness of $\xi(t)$ for $t \in [0, 1]$ in the same way as in (33).
Then we note that E [C(x)] = sign(x) x p E ξ |x| = sign(x) x p |x| To compute the parameter ζ, we first estimate the second moment of ξ as follows: b + 2 x2 i x 2p + 1 |xi| x p < b1 s |xi| b + 2 x2 i x 2p + 1 |xi| x p < b1 s |xi| Then we use this bound to estimate the second moment of compressor C: E h C(x) 2 2 i = x 2 p b + 2 x2 i x 2p + 1 |xi| x p < b1 s |xi| b + 2 x 2 2 + x p < b1 s |xi| x pb1 s b + 2 x 2 2 + min x 1 x pb1 s, d x 2 pb2 2s b + 2 x 2 2 + min d 1/2 x 2 x pb1 s, d x 2 pb2 2s b + 2 + d 1/rb1 s min 1, d 1/rb1 s x 2 2 = ζb x 2 2 , where r = min(p, 2) and Hölder s inequality is used to bound x p d 1/p 1/2 x 2 in case of 0 p 2 and x p x 2 in the case p 2. B.8 Proof of Lemma 16: Top-k Combined with Exponential Dithering From the unbiasedness of general dithering operator Cdith we have E [C(x)] = E [Cdith(Ctop(x))] = Ctop(x), Beznosikov and Horváth and Richtárik and Safaryan from which we conclude E [C(x)] , x = Ctop(x), x = Ctop(x) 2 2. Next, using Lemma 15 on exponential dithering we get E h C(x) 2 2 i ζb Ctop(x) 2 2 = ζb E [C(x)] , x , which implies β = ζb. Using Lemma 11 we show γ = k d as E [C(x)] , x = Ctop(x) 2 2 k Utilizing the derivations (34) and (35) it can be shown that E h Cdith(x) 2 2 i x 2 2 and therefore E h C(x) 2 2 i Ctop(x) 2 2 k Hence, α = k d. To compute the parameter δ we use Theorem 6, which yields δ = β Appendix C. Proofs for Section 3 We now perform analysis of CGD for compression operators in B1, B2 and B3, establishing Theorems 17, 18 and 19, respectively. C.1 Analysis for C B1(α, β) Lemma 22 Assume f is L-smooth. Let C B1(α, β). 
Then as long as $0 \le \eta \le \frac{2}{\beta L}$, for each $x \in \mathbb{R}^d$ we have
$$\mathbb{E}\left[f\big(x - \eta C(\nabla f(x))\big)\right] \le f(x) - \alpha \eta \left(1 - \frac{\eta \beta L}{2}\right)\|\nabla f(x)\|_2^2.$$
Proof Letting $g = \nabla f(x)$, we have
$$\mathbb{E}\left[f(x - \eta C(g))\right] \le \mathbb{E}\left[f(x) + \langle g, -\eta C(g) \rangle + \frac{L}{2}\|\eta C(g)\|_2^2\right] = f(x) - \eta \langle \mathbb{E}[C(g)], g \rangle + \frac{\eta^2 L}{2}\,\mathbb{E}\left[\|C(g)\|_2^2\right]$$
$$\overset{(3)}{\le} f(x) - \eta \langle \mathbb{E}[C(g)], g \rangle + \frac{\eta^2 \beta L}{2}\, \langle \mathbb{E}[C(g)], g \rangle = f(x) - \eta\left(1 - \frac{\eta \beta L}{2}\right)\langle \mathbb{E}[C(g)], g \rangle \le f(x) - \alpha \eta \left(1 - \frac{\eta \beta L}{2}\right)\|g\|_2^2.$$
Alternatively, we can write
$$\mathbb{E}\left[f(x - \eta C(g))\right] \le f(x) - \eta \langle \mathbb{E}[C(g)], g \rangle + \frac{\eta^2 L}{2}\,\mathbb{E}\left[\|C(g)\|_2^2\right] \le f(x) - \frac{\eta}{\beta}\,\mathbb{E}\left[\|C(g)\|_2^2\right] + \frac{\eta^2 L}{2}\,\mathbb{E}\left[\|C(g)\|_2^2\right].$$
Both approaches lead to the same bound.

Proof of Theorem 17
Proof Since $f$ is $\mu$-strongly convex, $\|\nabla f(x^k)\|_2^2 \ge 2\mu(f(x^k) - f(x^*))$. Combining this with Lemma 22 applied to $x = x^k$ and $g = \nabla f(x^k)$, we get
$$\mathbb{E}\left[f\big(x^k - \eta C(\nabla f(x^k))\big)\right] - f(x^*) \le f(x^k) - f(x^*) - \alpha\eta\mu (2 - \eta\beta L)\, (f(x^k) - f(x^*)) = \big(1 - \alpha\eta\mu (2 - \eta\beta L)\big)\, (f(x^k) - f(x^*)).$$

C.2 Analysis for C in B2(γ, β)

Lemma 23 Assume $f$ is $L$-smooth. Let $C \in \mathbb{B}^2(\gamma, \beta)$. Then as long as $0 \le \eta \le \frac{2}{\beta L}$, for each $x \in \mathbb{R}^d$ we have
$$\mathbb{E}\left[f\big(x - \eta C(\nabla f(x))\big)\right] \le f(x) - \gamma \eta \left(1 - \frac{\eta \beta L}{2}\right)\|\nabla f(x)\|_2^2.$$
Proof Letting $g = \nabla f(x)$, we have
$$\mathbb{E}\left[f(x - \eta C(g))\right] \le \mathbb{E}\left[f(x) + \langle g, -\eta C(g) \rangle + \frac{L}{2}\|\eta C(g)\|_2^2\right] = f(x) - \eta \langle \mathbb{E}[C(g)], g \rangle + \frac{\eta^2 L}{2}\,\mathbb{E}\left[\|C(g)\|_2^2\right]$$
$$\overset{(6)}{\le} f(x) - \eta\left(1 - \frac{\eta \beta L}{2}\right)\langle \mathbb{E}[C(g)], g \rangle \overset{(6)}{\le} f(x) - \gamma \eta \left(1 - \frac{\eta \beta L}{2}\right)\|g\|_2^2.$$

Proof of Theorem 18
Proof Since $f$ is $\mu$-strongly convex, $\|\nabla f(x^k)\|_2^2 \ge 2\mu(f(x^k) - f(x^*))$. Combining this with Lemma 23 applied to $x = x^k$ and $g = \nabla f(x^k)$, we get
$$\mathbb{E}\left[f\big(x^k - \eta C(\nabla f(x^k))\big)\right] - f(x^*) \le f(x^k) - f(x^*) - \mu\gamma\eta(2 - \eta\beta L)(f(x^k) - f(x^*)) = \big(1 - \mu\gamma\eta(2 - \eta\beta L)\big)\, (f(x^k) - f(x^*)).$$

C.3 Analysis for C in B3(δ)

Lemma 24 Assume $f$ is $L$-smooth. Let $C \in \mathbb{B}^3(\delta)$. Then as long as $0 \le \eta \le \frac{1}{L}$, for each $x \in \mathbb{R}^d$ we have
$$\mathbb{E}\left[f\big(x - \eta C(\nabla f(x))\big)\right] \le f(x) - \frac{\eta}{2\delta}\, \|\nabla f(x)\|_2^2.$$
Proof Letting $g = \nabla f(x)$, note that for any stepsize $\eta \in \mathbb{R}$ we have
$$\mathbb{E}\left[f(x - \eta C(g))\right] \le \mathbb{E}\left[f(x) + \langle g, -\eta C(g) \rangle + \frac{L}{2}\|\eta C(g)\|_2^2\right] = f(x) - \eta \langle \mathbb{E}[C(g)], g \rangle + \frac{\eta^2 L}{2}\,\mathbb{E}\left[\|C(g)\|_2^2\right]. \qquad (37)$$
Since $C \in \mathbb{B}^3(\delta)$, we have $\mathbb{E}\left[\|C(g) - g\|_2^2\right] \le \left(1 - \frac{1}{\delta}\right)\|g\|_2^2$.
Expanding the square, we get
$$\|g\|_2^2 - 2\,\mathbb{E}[\langle C(g), g \rangle] + \mathbb{E}\left[\|C(g)\|_2^2\right] \le \left(1 - \frac{1}{\delta}\right)\|g\|_2^2.$$
Subtracting $\|g\|_2^2$ from both sides, and multiplying both sides by $\frac{\eta}{2}$ (now we assume that $\eta > 0$), we get
$$-\eta \langle \mathbb{E}[C(g)], g \rangle + \frac{\eta}{2}\,\mathbb{E}\left[\|C(g)\|_2^2\right] \le -\frac{\eta}{2\delta}\, \|g\|_2^2.$$
Assuming that $\eta L \le 1$, we can combine this with (37), and the lemma is proved.

Proof of Theorem 19
Proof Since $f$ is $\mu$-strongly convex, $\|\nabla f(x^k)\|_2^2 \ge 2\mu(f(x^k) - f(x^*))$. Combining this with Lemma 24 applied to $x = x^k$ and $g = \nabla f(x^k)$, we get
$$\mathbb{E}\left[f\big(x^k - \eta C(\nabla f(x^k))\big)\right] - f(x^*) \le f(x^k) - f(x^*) - \frac{\eta\mu}{\delta}\,(f(x^k) - f(x^*)) = \left(1 - \frac{\eta\mu}{\delta}\right)(f(x^k) - f(x^*)).$$

Appendix D. Proofs for Section 4

Proof [Proof of Lemma 20] (a) As already mentioned, we have the following expressions for $\omega^k_{\rm rnd}$ and $\omega^k_{\rm top}$:
$$\omega^k_{\rm rnd}(x) = \left(1 - \frac{k}{d}\right)\sum_{i=1}^d x_i^2, \qquad \omega^k_{\rm top}(x) = \sum_{i=1}^{d-k} x_{(i)}^2.$$
The expected variance $\mathbb{E}[\omega^k_{\rm rnd}]$ for Rand-$k$ is easy to compute, as all coordinates are independent and uniformly distributed on $[0, 1]$:
$$\mathbb{E}\left[x_i^2\right] = \int_{[0,1]^d} x_i^2\, dx = \int_0^1 x_i^2\, dx_i = \frac{1}{3}, \qquad (38)$$
which implies
$$\mathbb{E}\left[\omega^k_{\rm rnd}(x)\right] = \left(1 - \frac{k}{d}\right)\sum_{i=1}^d \mathbb{E}\left[x_i^2\right] = \frac{d - k}{3}. \qquad (39)$$
In order to compute the expected variance $\mathbb{E}[\omega^k_{\rm top}]$ for Top-$k$, we use the following formula from order statistics (see (31), (32) or (2.2.2), (2.2.3) of Arnold et al., 1992):
$$\mathbb{E}\left[x_{(i)}^2\right] = \int_{[0,1]^d} x_{(i)}^2\, dx = \frac{\Gamma(i + 2)\,\Gamma(d + 1)}{\Gamma(i)\,\Gamma(d + 3)} = \frac{i(i + 1)}{(d + 1)(d + 2)}, \qquad (40)$$
from which we derive
$$\mathbb{E}\left[\omega^k_{\rm top}\right] = \sum_{i=1}^{d-k} \mathbb{E}\left[x_{(i)}^2\right] = \frac{1}{(d + 1)(d + 2)}\sum_{i=1}^{d-k} i(i + 1) = \frac{(d - k)(d - k + 1)(d - k + 2)}{3(d + 1)(d + 2)}. \qquad (41)$$
Combining (39) and (41) completes the first relation. Thus, on average (w.r.t. the uniform distribution), Top-$k$ has roughly $(1 - k/d)^2$ times less variance than Rand-$k$. For the second relation, we use (38) and (40) for $i = d$ and get
$$\frac{\mathbb{E}\left[s^1_{\rm top}(x)\right]}{\mathbb{E}\left[s^1_{\rm rnd}(x)\right]} = \frac{\mathbb{E}\left[x_{(d)}^2\right]}{1/3} = \frac{\frac{d(d+1)}{(d+1)(d+2)}}{1/3} = \frac{3d}{d + 2}.$$
Clearly, one can extend this to any $k \in [d]$.

(b) Recall that for the standard exponential distribution (with $\lambda = 1$) the probability density function (PDF) is given by
$$\varphi(t) = e^{-t}, \qquad t \in [0, \infty).$$
Both the mean and the variance can be shown to be equal to 1.
The expected saving $\mathbb{E}[s^1_{\rm rnd}]$ can be computed directly:
$$\mathbb{E}\left[s^1_{\rm rnd}(x)\right] = \mathbb{E}\left[x_d^2\right] = \text{Var}[x_d] + \mathbb{E}[x_d]^2 = 2.$$
To compute the expected saving $\mathbb{E}[s^1_{\rm top}(x)] = \mathbb{E}[x_{(d)}^2]$, we prove the following lemma.

Lemma 25 Let $x_1, x_2, \ldots, x_d$ be an i.i.d. sample from the standard exponential distribution and
$$y_i := (d - i + 1)\left(x_{(i)} - x_{(i-1)}\right), \qquad 1 \le i \le d,$$
where $x_{(0)} := 0$. Then $y_1, y_2, \ldots, y_d$ is an i.i.d. sample from the standard exponential distribution (see also https://en.wikipedia.org/wiki/Order_statistic and https://www.sciencedirect.com/science/article/pii/S0167715212001940).

Proof The joint density function of $x_{(1)}, \ldots, x_{(d)}$ is given by (see (32))
$$\varphi_{x_{(1)},\ldots,x_{(d)}}(u_1, \ldots, u_d) = d! \prod_{i=1}^d \varphi(u_i) = d!\, \exp\left(-\sum_{i=1}^d u_i\right), \qquad 0 \le u_1 \le \cdots \le u_d < \infty.$$
Next we express the variables $x_{(i)}$ using the new variables $y_i$:
$$x_{(1)} = \frac{y_1}{d}, \qquad x_{(2)} = \frac{y_1}{d} + \frac{y_2}{d-1}, \qquad \ldots, \qquad x_{(d)} = \frac{y_1}{d} + \frac{y_2}{d-1} + \cdots + y_d,$$
with the transformation matrix
$$A = \begin{pmatrix} \frac{1}{d} & 0 & \cdots & 0 \\ \frac{1}{d} & \frac{1}{d-1} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ \frac{1}{d} & \frac{1}{d-1} & \cdots & 1 \end{pmatrix}.$$
Then the joint density $\psi_{y_1,\ldots,y_d}(u) = \psi_{y_1,\ldots,y_d}(u_1, \ldots, u_d)$ of the new variables $y_1, \ldots, y_d$ is given by
$$\psi_{y_1,\ldots,y_d}(u) = \varphi_{x_{(1)},\ldots,x_{(d)}}(Au)\, |\det A|.$$
Notice that $\sum_{i=1}^d u_i = \sum_{i=1}^d (Au)_i$ and $|\det A| = \frac{1}{d!}$. Hence
$$\psi_{y_1,\ldots,y_d}(u) = \exp\left(-\sum_{i=1}^d u_i\right), \qquad u_1, \ldots, u_d \ge 0,$$
which means that the variables $y_1, \ldots, y_d$ are independent and have the standard exponential distribution.

Using this lemma, we can compute the mean and the second moment of $x_{(d)} = \sum_{i=1}^d \frac{y_i}{d - i + 1}$ as follows:
$$\mathbb{E}\left[x_{(d)}\right] = \sum_{i=1}^d \frac{\mathbb{E}[y_i]}{d - i + 1} = \sum_{i=1}^d \frac{1}{i}, \qquad \text{Var}\left[x_{(d)}\right] = \sum_{i=1}^d \frac{\text{Var}[y_i]}{(d - i + 1)^2} = \sum_{i=1}^d \frac{1}{i^2},$$
from which we conclude the lemma, as
$$\mathbb{E}\left[s^1_{\rm top}(x)\right] = \mathbb{E}\left[x_{(d)}^2\right] = \text{Var}\left[x_{(d)}\right] + \mathbb{E}\left[x_{(d)}\right]^2 = \sum_{i=1}^d \frac{1}{i^2} + \left(\sum_{i=1}^d \frac{1}{i}\right)^2.$$

D.1 Proof of Theorem 21 (Convergence guarantees for Algorithm 1)

In this section, we include our analysis of distributed SGD with biased compression. Our analysis is closely related to that of Stich and Karimireddy (2019).
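Before proceeding, the moments implied by Lemma 25 are easy to sanity-check by simulation (a rough Monte Carlo sketch, not part of the paper's experiments):

```python
import random

random.seed(1)
d, N = 5, 200_000
acc = 0.0
for _ in range(N):
    m = max(random.expovariate(1.0) for _ in range(d))
    acc += m * m
mc_second_moment = acc / N

# Lemma 25: x_(d) = sum_i y_i/(d-i+1) with i.i.d. standard exponential y_i,
# hence E[x_(d)] = H_d = sum_{i<=d} 1/i and Var[x_(d)] = sum_{i<=d} 1/i^2.
H1 = sum(1.0 / i for i in range(1, d + 1))
H2 = sum(1.0 / i**2 for i in range(1, d + 1))
exact_second_moment = H2 + H1 * H1
print(mc_second_moment, exact_second_moment)
```

For $d = 5$, the exact value $H_d^{(2)} + H_d^2 \approx 6.68$ is reproduced by the empirical average up to Monte Carlo noise.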
We start with the definition of some auxiliary objects: Definition 26 The sequence {ak}k 0 of positive values is τ-slow decreasing for parameter τ: ak+1 ak, ak+1 1 + 1 ak, k 0 (42) The sequence {ak}k 0 of positive values is τ-slow increasing for parameter τ: ak+1 ak, ak+1 ak 1 + 1 i=1 ek i , k 0 (44) i=1 gk i (45) It is easy to see: xk+1 = xk+1 1 (22),(23) = i=1 [ek i + ηkgk i gk i ] i=1 gk i (46) Lemma 27 If ηk 1 4L(1+2B/n), k 0, then for { xk}k 0 defined as in (44), 2 E h f(xk) f i + 3Lηk E xk xk 2 + (ηk)2 C + 2BD Proof We consider the following equalities, using the relationship between xk+1 and xk: (45),(46) = xk x 2 2 2ηk gk, xk x + (ηk)2 gk 2 2 2ηk gk, xk x + (ηk)2 gk 2 2 + 2ηk gk, xk xk . Beznosikov and Horváth and Richtárik and Safaryan Taking the conditional expectation conditioned on previous iterates, we get 2 2ηk E h gki , xk x + (ηk)2 E gk 2 + 2ηk E h gki , xk xk (19),(45) = xk x 2 2 2ηk E h gki , xk x + 2ηk E h gki , xk xk 2 2ηk E h gki , xk x 2 + 2 f(xk), 1 + 2ηk E h gki , xk xk . Given the unbiased stochastic gradient (E ξk i = 0): 2 2ηk f(xk), xk x +(ηk)2 f(xk) 2 2 + (ηk)2 E + 2ηk f(xk), xk xk Using that ξk i mutually independent and E ξk i = 0 we have: (30) xk x 2 2 2ηk f(xk), xk x +(ηk)2 f(xk) 2 2 + (ηk)2 1 i=1 E ξk i 2 + 2ηk f(xk), xk xk (24) xk x 2 2 2ηk f(xk), xk x +(ηk)2 f(xk) 2 +2ηk f(xk), xk xk (27) xk x 2 2 2ηk f(xk), xk x +(ηk)2 2L(f(xk) f(x )) + (ηk)2 +2ηk f(xk), xk xk . (48) On Biased Compression for Distributed Learning All fi are L-smooth and µ-strongly convex, thus f is L-smooth and µ-strongly convex. We can rewrite 1 fi(xk) 2 2: fi(xk) fi(x ) + fi(x ) 2 fi(xk) fi(x ) 2 2 + fi(x ) 2 2 h 2L fi(xk) fi(x ) fi(x ), xk x + fi(x ) 2 2 i . Using definition of D = 1 n Pn i=1 fi(x ) 2 2: 2 4L f(xk) f(x ) + 2D (49) Substituting (49) to (48): 2 2ηk f(xk), xk x + (ηk)2 2L 1 + 2B (f(xk) f(x )) +(ηk)2 C + 2BD n + 2ηk f(xk), xk xk (50) By (25) we have for f: 2 f(xk), xk x µ xk x 2 2 2(f(xk) f ). 
(51)

Using Young's inequality (28) with $\xi = \frac{1}{2L}$ and the L-smoothness of $f$ (27):

$$2\langle\nabla f(x^k), x^k - \tilde{x}^k\rangle \le \frac{1}{2L}\|\nabla f(x^k)\|_2^2 + 2L\|x^k - \tilde{x}^k\|_2^2 \le f(x^k) - f^* + 2L\|x^k - \tilde{x}^k\|_2^2. \quad (52)$$

By (30), for $\|x^k - x^*\|_2^2$ we get

$$-\|x^k - x^*\|_2^2 \le -\frac{1}{2}\|\tilde{x}^k - x^*\|_2^2 + \|x^k - \tilde{x}^k\|_2^2. \quad (53)$$

Plugging (51), (52), (53) into (50):

$$\mathbb{E}\,\|\tilde{x}^{k+1} - x^*\|_2^2 \le \left(1 - \frac{\mu\eta^k}{2}\right)\|\tilde{x}^k - x^*\|_2^2 - \eta^k\left(1 - \eta^k\,2L\left(1 + \frac{2B}{n}\right)\right)\left(f(x^k) - f^*\right) + \eta^k(2L + \mu)\|x^k - \tilde{x}^k\|_2^2 + (\eta^k)^2\,\frac{C + 2BD}{n}.$$

The lemma follows by the choice $\eta^k \le \frac{1}{4L(1 + 2B/n)}$ and $L \ge \mu$.

Lemma 28 Let $\eta^k \le \frac{1}{14(2\delta + B)L}$ for all $k \ge 0$ and let $\{(\eta^k)^2\}_{k \ge 0}$ be 2δ-slow decreasing. Then

$$\mathbb{E}\,\|x^k - \tilde{x}^k\|_2^2 \le \frac{1}{49L(2\delta + B)}\sum_{j=0}^{k-1}\left(1 - \frac{1}{4\delta}\right)^{k-j-1}\left(\mathbb{E}\,f(x^j) - f(x^*)\right) + 4\delta(\eta^k)^2\left[C + 2D(2\delta + B)\right]. \quad (54)$$

Furthermore, for any 4δ-slow increasing non-negative sequence $\{w_k\}_{k \ge 0}$ it holds that

$$3L\sum_{k=0}^K w_k\,\mathbb{E}\,\|x^k - \tilde{x}^k\|_2^2 \le \frac{1}{4}\sum_{k=0}^K w_k\left(\mathbb{E}\left[f(x^k)\right] - f(x^*)\right) + (3\delta D + 3C)\sum_{k=0}^K w_k\eta^k. \quad (55)$$

Proof We prove the first part of the statement. Since $x^k - \tilde{x}^k = \bar{e}^k$ and, by convexity of the squared norm, $\|\bar{e}^k\|_2^2 \le \frac{1}{n}\sum_{i=1}^n\|e_i^k\|_2^2$, it suffices to bound the latter. By (22) and the compression inequality (19),

$$\mathbb{E}\,\|e_i^{k+1}\|_2^2 = \mathbb{E}\,\|e_i^k + \eta^k g_i^k - \mathcal{C}(e_i^k + \eta^k g_i^k)\|_2^2 \le \left(1 - \frac{1}{\delta}\right)\mathbb{E}\,\|e_i^k + \eta^k\nabla f_i(x^k) + \eta^k\xi_i^k\|_2^2.$$

Here we have taken into account that the full expectation is the composition of the expectation over the compression randomness and over the stochastic gradients, i.e. $\mathbb{E}[\cdot] = \mathbb{E}[\mathbb{E}_{\mathcal{C}}[\cdot]]$. Given the unbiasedness of the stochastic gradients ($\mathbb{E}\,\xi_i^k = 0$) and the noise assumption (24),

$$\mathbb{E}\,\|e_i^{k+1}\|_2^2 \le \left(1 - \frac{1}{\delta}\right)\left[\mathbb{E}\,\|e_i^k + \eta^k\nabla f_i(x^k)\|_2^2 + (\eta^k)^2\left(B\,\mathbb{E}\,\|\nabla f_i(x^k)\|_2^2 + C\right)\right].$$

Using Young's inequality (29) with some $\xi > 0$,

$$\mathbb{E}\,\|e_i^{k+1}\|_2^2 \le \left(1 - \frac{1}{\delta}\right)\left[(1 + \xi)\,\mathbb{E}\,\|e_i^k\|_2^2 + (\eta^k)^2\left(1 + \frac{1}{\xi} + B\right)\mathbb{E}\,\|\nabla f_i(x^k)\|_2^2 + (\eta^k)^2 C\right].$$

Averaging over $i$, applying (49), choosing $\xi = \frac{1}{2(\delta - 1)}$, so that $1 + \frac{1}{\xi} = 2\delta - 1 \le 2\delta$ and $(1 - \frac{1}{\delta})(1 + \xi) = 1 - \frac{1}{2\delta}$, and unrolling the recurrence for $\frac{1}{n}\sum_{i=1}^n\|e_i^k\|_2^2$ (with $e_i^0 = 0$), we obtain

$$\frac{1}{n}\sum_{i=1}^n\mathbb{E}\,\|e_i^k\|_2^2 \le \sum_{j=0}^{k-1}(\eta^j)^2\left(1 - \frac{1}{2\delta}\right)^{k-j-1}(2\delta + B)\left[4L\left(\mathbb{E}\,f(x^j) - f(x^*)\right) + 2D + \frac{C}{2\delta + B}\right].$$

Since $\{(\eta^k)^2\}_{k \ge 0}$ is 2δ-slow decreasing, by (42) we get $(\eta^j)^2 \le (\eta^k)^2\left(1 + \frac{1}{4\delta}\right)^{k-j}$. Due to the fact that $(1 - \frac{1}{2\delta})(1 + \frac{1}{4\delta}) \le 1 - \frac{1}{4\delta}$, we have

$$\frac{1}{n}\sum_{i=1}^n\mathbb{E}\,\|e_i^k\|_2^2 \le (\eta^k)^2(2\delta + B)\sum_{j=0}^{k-1}\left(1 - \frac{1}{4\delta}\right)^{k-j-1}4L\left(\mathbb{E}\,f(x^j) - f(x^*)\right) + 4\delta(\eta^k)^2\left[C + 2D(2\delta + B)\right].$$
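The elementary scalar inequalities in δ used in this unrolling (the choice $\xi = \frac{1}{2(\delta-1)}$, the contraction factor, and the geometric sum) can be confirmed numerically; a quick check for illustration, not part of the proof:

```python
def check_delta_inequalities(delta):
    """Verify, for a given delta > 1, the scalar facts used in Lemma 28:
    with xi = 1/(2(delta-1)):
      (1 - 1/delta)(1 + xi) = 1 - 1/(2 delta)   and   1 + 1/xi <= 2 delta,
    the contraction bound
      (1 - 1/(2 delta))(1 + 1/(4 delta)) <= 1 - 1/(4 delta),
    and the geometric sum  sum_j (1 - 1/(4 delta))^j <= 4 delta."""
    xi = 1.0 / (2 * (delta - 1))
    ok = abs((1 - 1 / delta) * (1 + xi) - (1 - 1 / (2 * delta))) < 1e-12
    ok &= (1 + 1 / xi) <= 2 * delta + 1e-12
    ok &= (1 - 1 / (2 * delta)) * (1 + 1 / (4 * delta)) <= 1 - 1 / (4 * delta) + 1e-12
    geo = sum((1 - 1 / (4 * delta)) ** j for j in range(10_000))
    ok &= geo <= 4 * delta + 1e-9
    return ok

# delta = d/k > 1 for Top-k; check a range of representative values.
results = [check_delta_inequalities(d) for d in (1.5, 2.0, 4.0, 16.0, 100.0)]
```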
As the last step, we use the formula for the sum of a geometric progression, $\sum_{j \ge 0}(1 - \frac{1}{4\delta})^j \le 4\delta$, together with the choice of the stepsize $\eta^k \le \frac{1}{14(2\delta + B)L}$, which gives

$$4L(2\delta + B)(\eta^k)^2 \le \frac{4L(2\delta + B)}{196(2\delta + B)^2 L^2} = \frac{1}{49L(2\delta + B)}$$

and concludes the proof of (54).

For the second part, we use the previous result. Multiplying (54) by $3Lw_k$ and summing over $k$:

$$3L\sum_{k=0}^K w_k\,\mathbb{E}\,\|x^k - \tilde{x}^k\|_2^2 \le \frac{3}{49(2\delta + B)}\sum_{k=0}^K w_k\sum_{j=0}^{k-1}\left(1 - \frac{1}{4\delta}\right)^{k-j-1}\left(\mathbb{E}\,f(x^j) - f(x^*)\right) + 12\delta L\sum_{k=0}^K w_k(\eta^k)^2\left[C + 2D(2\delta + B)\right].$$

For the 4δ-slow increasing $\{w_k\}_{k \ge 0}$, by (43) we have $w_k \le w_j\left(1 + \frac{1}{8\delta}\right)^{k-j}$, and $(1 - \frac{1}{4\delta})(1 + \frac{1}{8\delta}) \le 1 - \frac{1}{8\delta}$, so that after swapping the order of summation

$$3L\sum_{k=0}^K w_k\,\mathbb{E}\,\|x^k - \tilde{x}^k\|_2^2 \le \frac{3}{49(2\delta + B)}\sum_{j=0}^K w_j\left(\mathbb{E}\,f(x^j) - f(x^*)\right)\left(1 + \frac{1}{8\delta}\right)\sum_{i \ge 0}\left(1 - \frac{1}{8\delta}\right)^i + 12\delta L\sum_{k=0}^K w_k(\eta^k)^2\left[C + 2D(2\delta + B)\right].$$

Observing $\sum_{i \ge 0}(1 - \frac{1}{8\delta})^i \le 8\delta$, and using $\frac{\delta}{2\delta + B} \le \frac{1}{2}$ together with $\eta^k \le \frac{1}{14(2\delta + B)L}$ once more, concludes the proof of (55).

Lemma 29 (Lemma 11, Stich and Karimireddy (2019)) For decreasing stepsizes $\eta^k := \frac{2}{a(\kappa + k)}$, $k \ge 0$, and weights $\{w_k := \kappa + k\}_{k \ge 0}$ with parameter $\kappa \ge 1$, it holds for every non-negative sequence $\{r_k\}_{k \ge 0}$ and any $a > 0$, $c \ge 0$ that

$$\Psi_K := \frac{1}{W_K}\sum_{k=0}^K\left[\frac{1 - a\eta^k}{\eta^k}\,w_k r_k - \frac{w_k}{\eta^k}\,r_{k+1} + c\,\eta^k w_k\right] \le \frac{a\kappa^2 r_0}{2W_K} + \frac{4c}{a(2\kappa + K)},$$

where $W_K := \sum_{k=0}^K w_k$.

Proof We start by observing that

$$\frac{1 - a\eta^k}{\eta^k}\,w_k r_k = \frac{a}{2}(\kappa + k)(\kappa + k - 2)\,r_k = \frac{a}{2}\left((\kappa + k - 1)^2 - 1\right)r_k \le \frac{a}{2}(\kappa + k - 1)^2\,r_k.$$
(56)

By plugging the definitions of $\eta^k$ and $w_k$ into $\Psi_K$, we end up with a telescoping sum:

$$\Psi_K \overset{(56)}{\le} \frac{1}{W_K}\sum_{k=0}^K\left[\frac{a}{2}(\kappa + k - 1)^2 r_k - \frac{a}{2}(\kappa + k)^2 r_{k+1} + \frac{2c}{a}\right] \le \frac{a(\kappa - 1)^2 r_0}{2W_K} + \frac{2c(K+1)}{aW_K}.$$

The lemma now follows from $(\kappa - 1)^2 \le \kappa^2$ and $W_K = \sum_{k=0}^K(\kappa + k) = \frac{(2\kappa + K)(K+1)}{2}$.

Lemma 30 (Lemma 12, Stich and Karimireddy (2019)) For every non-negative sequence $\{r_k\}_{k \ge 0}$ and any parameters $d \ge a > 0$, $c \ge 0$, $K \ge 0$, there exists a constant $\eta \le \frac{1}{d}$ such that for constant stepsizes $\{\eta^k = \eta\}_{k \ge 0}$ and weights $w_k := (1 - a\eta)^{-(k+1)}$ it holds that

$$\Psi_K := \frac{1}{W_K}\sum_{k=0}^K\left[\frac{1 - a\eta^k}{\eta^k}\,w_k r_k - \frac{w_k}{\eta^k}\,r_{k+1} + c\,\eta^k w_k\right] = \tilde{\mathcal{O}}\left(d\,r_0\exp\left[-\frac{aK}{d}\right] + \frac{c}{aK}\right).$$

Proof By plugging in the values of $\eta^k$ and $w_k$, we observe that we again end up with a telescoping sum and estimate

$$\Psi_K = \frac{1}{\eta W_K}\sum_{k=0}^K\left[w_{k-1} r_k - w_k r_{k+1}\right] + c\eta \le \frac{r_0}{\eta W_K} + c\eta \le \frac{r_0}{\eta}\exp[-a\eta K] + c\eta,$$

where we used the estimate $W_K \ge w_K = (1 - a\eta)^{-(K+1)} \ge \exp[a\eta K]$ for the last inequality. The lemma now follows by carefully tuning η.

Lemma 31 (Lemma 13, Stich and Karimireddy (2019)) For every non-negative sequence $\{r_k\}_{k \ge 0}$ and any parameters $d > 0$, $c \ge 0$, $K \ge 0$, there exists a constant $\eta \le \frac{1}{d}$ such that for constant stepsizes $\{\eta^k = \eta\}_{k \ge 0}$ it holds that

$$\Psi_K := \frac{1}{K+1}\sum_{k=0}^K\left[\frac{1 - a\eta^k}{\eta^k}\,r_k - \frac{r_{k+1}}{\eta^k} + c\,\eta^k\right] \le \frac{(d - a)\,r_0}{K+1} + \frac{2\sqrt{c\,r_0}}{\sqrt{K+1}}.$$

Proof For constant stepsizes $\eta^k = \eta$ we can derive the estimate

$$\Psi_K = \frac{1}{\eta(K+1)}\sum_{k=0}^K\left[(1 - a\eta)r_k - r_{k+1}\right] + c\eta \le \frac{(1 - a\eta)r_0}{\eta(K+1)} + c\eta.$$

We distinguish two cases. If $\frac{r_0}{c(K+1)} \le \frac{1}{d^2}$, then we choose the stepsize $\eta = \sqrt{\frac{r_0}{c(K+1)}}$ and get

$$\Psi_K \le 2\sqrt{\frac{c\,r_0}{K+1}};$$

on the other hand, if $\frac{r_0}{c(K+1)} > \frac{1}{d^2}$, then we choose $\eta = \frac{1}{d}$, which gives $\Psi_K \le \frac{(d-a)r_0}{K+1} + \frac{c}{d} \le \frac{(d-a)r_0}{K+1} + \sqrt{\frac{c\,r_0}{K+1}}$ and concludes the proof.

The proof of the main theorem follows.

Proof [Proof of Theorem 21] It is easy to see that $\frac{1}{14(2\delta + B)L} \le \frac{1}{4L(1 + 2B/n)}$ (since $\delta \ge 1$ and $n \ge 1$), so the stepsize condition of Lemma 27 is satisfied. With the notation

$$r_k := \mathbb{E}\,\|\tilde{x}^k - x^*\|_2^2, \qquad s_k := \mathbb{E}\,f(x^k) - f^*,$$

we have from (47), for any $w_k > 0$:

$$\frac{w_k s_k}{2} \le w_k\left[\frac{1 - \mu\eta^k/2}{\eta^k}\,r_k - \frac{r_{k+1}}{\eta^k}\right] + \eta^k w_k\,\frac{C + 2BD}{n} + 3w_k L\,\mathbb{E}\,\|x^k - \tilde{x}^k\|_2^2.$$

Substituting (55) and summing over $k$, we get

$$\frac{1}{4}\sum_{k=0}^K w_k s_k \le \sum_{k=0}^K\left[w_k\left(\frac{1 - \mu\eta^k/2}{\eta^k}\,r_k - \frac{r_{k+1}}{\eta^k}\right) + \eta^k w_k\,\tilde{C}\right], \qquad \text{where } \tilde{C} := \frac{C + 2BD}{n} + 3C + 3\delta D.$$

This can be rewritten as

$$\frac{1}{W_K}\sum_{k=0}^K w_k s_k \le \frac{4}{W_K}\sum_{k=0}^K\left[w_k\left(\frac{1 - \mu\eta^k/2}{\eta^k}\,r_k - \frac{r_{k+1}}{\eta^k}\right) + \eta^k w_k\,\tilde{C}\right].$$
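Lemma 29, which is applied next, can be sanity-checked numerically: with stepsizes $\eta^k = \frac{2}{a(\kappa+k)}$ and weights $w_k = \kappa + k$, the weighted sum $\Psi_K$ telescopes and stays below $\frac{a\kappa^2 r_0}{2W_K} + \frac{4c}{a(2\kappa+K)}$. A small check with arbitrary made-up values (an illustrative sketch, not the paper's code):

```python
def psi_K(r, a, c, kappa):
    """Evaluate Psi_K from Lemma 29 for a nonnegative sequence r[0..K+1],
    with eta_k = 2 / (a * (kappa + k)) and weights w_k = kappa + k."""
    K = len(r) - 2
    total = 0.0
    W = 0.0
    for k in range(K + 1):
        eta = 2.0 / (a * (kappa + k))
        w = kappa + k
        total += (1 - a * eta) / eta * w * r[k] - w / eta * r[k + 1] + c * eta * w
        W += w
    return total / W, W

# Arbitrary nonnegative decreasing sequence and parameters, for illustration.
K, a, c, kappa = 100, 0.5, 2.0, 10.0
r = [1.0 / (k + 1) for k in range(K + 2)]
psi, W = psi_K(r, a, c, kappa)
bound = a * kappa ** 2 * r[0] / (2 * W) + 4 * c / (a * (2 * kappa + K))
```

Here `W` equals $(2\kappa + K)(K+1)/2$ exactly, and `psi` stays strictly below `bound`, with the slack coming from the dropped negative telescoping terms.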
First, consider the stepsizes $\eta^k = \frac{4}{\mu(\kappa + k)}$ with $\kappa = \frac{56(2\delta + B)L}{\mu}$. It is easy to see that $\eta^k \le \frac{1}{14(2\delta + B)L}$:

$$\eta^k \le \frac{4}{\mu\kappa} = \frac{4\mu}{56\mu(2\delta + B)L} = \frac{1}{14(2\delta + B)L}.$$

It is not difficult to check that $\{(\eta^k)^2\}_{k \ge 0}$ is 2δ-slow decreasing:

$$\frac{(\eta^k)^2}{(\eta^{k+1})^2} = \left(\frac{\kappa + k + 1}{\kappa + k}\right)^2 = \left(1 + \frac{1}{\kappa + k}\right)^2 \le \left(1 + \frac{\mu}{56(2\delta + B)L}\right)^2 \le 1 + \frac{1}{4\delta},$$

using $\mu \le L$ and $2\delta + B \ge 2\delta$. Furthermore, the weights $\{w_k = \kappa + k\}_{k \ge 0}$ are 4δ-slow increasing:

$$\frac{w_{k+1}}{w_k} = \frac{\kappa + k + 1}{\kappa + k} = 1 + \frac{1}{\kappa + k} \le 1 + \frac{1}{\kappa} = 1 + \frac{\mu}{56(2\delta + B)L} \le 1 + \frac{1}{8\delta}.$$

The conditions of Lemma 29 are therefore satisfied (with $a = \mu/2$), and we obtain the desired statement. For the second case, the conditions of Lemma 30 are easy to check (see the previous paragraph), and the claim follows from this lemma. Finally, for the third claim, we invoke Lemma 31.

References

Saurabh Agarwal, Hongyi Wang, Kangwook Lee, Shivaram Venkataraman, and Dimitris Papailiopoulos. Accordion: Adaptive gradient communication via critical learning regime identification. arXiv preprint arXiv:2010.16248, 2020.

Ahmad Ajalloeian and Sebastian U. Stich. On the convergence of SGD with biased gradients. arXiv preprint arXiv:2008.00051, 2021.

Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. QSGD: Communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems, pages 1709–1720, 2017.

Dan Alistarh, Torsten Hoefler, Mikael Johansson, Sarit Khirirat, Nikola Konstantinov, and Cédric Renggli. The convergence of sparsified gradient methods. In Advances in Neural Information Processing Systems, pages 5977–5987, 2018.

Zeyuan Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 1200–1205. ACM, 2017.

Barry C. Arnold, N. Balakrishnan, and H. N. Nagaraja. A First Course in Order Statistics. John Wiley and Sons Inc., 1992.

Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

Tom B. Brown et al.
Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.

Jean-Baptiste Cordonnier. Convex optimization using sparsified stochastic gradient descent with memory. Technical report, École Polytechnique Fédérale de Lausanne, 2018.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

Ilyas Fatkhullin, Igor Sokolov, Eduard Gorbunov, Zhize Li, and Peter Richtárik. EF21 with bells & whistles: Practical algorithmic extensions of modern error feedback. arXiv preprint arXiv:2110.03294, 2021.

Eduard Gorbunov, Filip Hanzely, and Peter Richtárik. A unified theory of SGD: Variance reduction, sampling, quantization and coordinate descent. In The 23rd International Conference on Artificial Intelligence and Statistics, 2020a.

Eduard Gorbunov, Dmitry Kovalev, Dmitry Makarenko, and Peter Richtárik. Linearly converging error compensated SGD. In 34th Conference on Neural Information Processing Systems, 2020b.

Samuel Horváth and Peter Richtárik. A better alternative to error feedback for communication-efficient distributed learning. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=vYVI1CHPaQg.

Samuel Horváth, Chen-Yu Ho, Ľudovít Horváth, Atal Narayan Sahu, Marco Canini, and Peter Richtárik. Natural compression for distributed deep learning. arXiv preprint arXiv:1905.10988, 2019a.

Samuel Horváth, Dmitry Kovalev, Konstantin Mishchenko, Sebastian Stich, and Peter Richtárik. Stochastic distributed learning with gradient quantization and variance reduction. arXiv preprint arXiv:1904.05115, 2019b.

Peter Kairouz et al. Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977, 2019.

Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank J. Reddi, Sebastian U. Stich, and Ananda Theertha Suresh.
SCAFFOLD: Stochastic controlled averaging for on-device federated learning. arXiv preprint arXiv:1910.06378, 2019a.

Sai Praneeth Karimireddy, Quentin Rebjock, Sebastian U. Stich, and Martin Jaggi. Error feedback fixes SignSGD and other gradient compression schemes. arXiv preprint arXiv:1901.09847, 2019b.

Ahmed Khaled, Konstantin Mishchenko, and Peter Richtárik. Tighter theory for local SGD on identical and heterogeneous data. In The 23rd International Conference on Artificial Intelligence and Statistics (AISTATS 2020), 2020a.

Ahmed Khaled, Othmane Sebbouh, Nicolas Loizou, Robert M. Gower, and Peter Richtárik. Unified analysis of stochastic gradient methods for composite convex and smooth optimization. arXiv preprint arXiv:2006.11573, 2020b.

Jakub Konečný, H. Brendan McMahan, Felix Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. Federated learning: Strategies for improving communication efficiency. In NIPS Private Multi-Party Machine Learning Workshop, 2016.

Zhen-Zhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. ALBERT: A lite BERT for self-supervised learning of language representations. In International Conference on Learning Representations, 2020.

Tian Li, Anit Kumar Sahu, Ameet Talwalkar, and Virginia Smith. Federated learning: Challenges, methods, and future directions. arXiv preprint arXiv:1908.07873, 2019.

Zhize Li and Peter Richtárik. A unified analysis of stochastic gradient methods for nonconvex federated optimization. arXiv preprint arXiv:2006.07013, 2020.

Hyeontaek Lim, David G. Andersen, and Michael Kaminsky. 3LC: Lightweight and effective traffic compression for distributed machine learning. arXiv preprint arXiv:1802.07389, 2018.

Yujun Lin, Song Han, Huizi Mao, Yu Wang, and William J. Dally. Deep gradient compression: Reducing the communication bandwidth for distributed training. CoRR, abs/1712.01887, 2017a. URL http://arxiv.org/abs/1712.01887.
Yujun Lin, Song Han, Huizi Mao, Yu Wang, and William J. Dally. Deep gradient compression: Reducing the communication bandwidth for distributed training. arXiv preprint arXiv:1712.01887, 2017b.

Yujun Lin, Song Han, Huizi Mao, Yu Wang, and Bill Dally. Deep gradient compression: Reducing the communication bandwidth for distributed training. In ICLR 2018 - International Conference on Learning Representations, 2018.

H. Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.

Konstantin Mishchenko, Eduard Gorbunov, Martin Takáč, and Peter Richtárik. Distributed learning with compressed gradient differences. arXiv preprint arXiv:1901.09269, 2019.

Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer Science & Business Media, 2013.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.

Peter Richtárik and Martin Takáč. Parallel coordinate descent methods for big data optimization. Mathematical Programming, 156(1-2):433–484, 2016.

Peter Richtárik, Igor Sokolov, and Ilyas Fatkhullin. EF21: A new, simpler, theoretically better, and practically faster error feedback. In 35th Conference on Neural Information Processing Systems, 2021.

Mher Safaryan and Peter Richtárik.
Stochastic sign descent methods: New algorithms and better theory. In Proceedings of the 38th International Conference on Machine Learning (ICML), 2021.

Amedeo Sapio, Marco Canini, Chen-Yu Ho, Jacob Nelson, Panos Kalnis, Changhoon Kim, Arvind Krishnamurthy, Masoud Moshref, Dan R. K. Ports, and Peter Richtárik. Scaling distributed machine learning with in-network aggregation. arXiv preprint arXiv:1903.06701, 2019.

Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu. 1-bit stochastic gradient descent and application to data-parallel distributed training of speech DNNs. In Interspeech 2014, September 2014.

Sebastian U. Stich and Sai Praneeth Karimireddy. The error-feedback framework: Better rates for SGD with delayed gradients and compressed communication. arXiv preprint arXiv:1909.05350, 2019.

Sebastian U. Stich, Jean-Baptiste Cordonnier, and Martin Jaggi. Sparsified SGD with memory. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 4447–4458. Curran Associates, Inc., 2018. URL http://papers.nips.cc/paper/7697-sparsified-sgd-with-memory.pdf.

Haobo Sun, Yingxia Shao, Jiawei Jiang, Bin Cui, Kai Lei, Yu Xu, and Jiang Wang. Sparse gradient compression for distributed SGD. In Guoliang Li, Jun Yang, Joao Gama, Juggapong Natwichai, and Yongxin Tong, editors, Database Systems for Advanced Applications, pages 139–155, Cham, 2019. Springer International Publishing. ISBN 978-3-030-18579-4.

Sharan Vaswani, Francis Bach, and Mark Schmidt. Fast and faster convergence of SGD for over-parameterized models and an accelerated perceptron. In 22nd International Conference on Artificial Intelligence and Statistics, volume 89 of PMLR, pages 1195–1204, 2019.

Thijs Vogels, Sai Praneeth Karimireddy, and Martin Jaggi. PowerSGD: Practical low-rank gradient compression for distributed optimization.
In Advances in Neural Information Processing Systems 32 (NeurIPS), 2019.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461, 2018.

Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. TernGrad: Ternary gradients to reduce communication in distributed deep learning. In Advances in Neural Information Processing Systems, pages 1509–1519, 2017.

Jiaxiang Wu, Weidong Huang, Junzhou Huang, and Tong Zhang. Error compensated quantized SGD and its applications to large-scale distributed optimization. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 5325–5333, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. PMLR.

Hantian Zhang, Jerry Li, Kaan Kara, Dan Alistarh, Ji Liu, and Ce Zhang. ZipML: Training linear models with end-to-end low precision, and a little bit of deep learning. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 4035–4043, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.

Shen-Yi Zhao, Yinpeng Xie, Hao Gao, and Wu-Jun Li. Global momentum compression for sparse communication in distributed SGD. arXiv preprint arXiv:1905.12948, 2019.

Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision, pages 19–27, 2015.