# Adaptive Compression for Communication-Efficient Distributed Training

Published in Transactions on Machine Learning Research (07/2023)

**Maksim Makarenko** (maksim.makarenko@kaust.edu.sa), King Abdullah University of Science and Technology
**Elnur Gasanov** (elnur.gasanov@kaust.edu.sa), King Abdullah University of Science and Technology
**Rustem Islamov** (rustem.islamov@ip-paris.fr), Institut Polytechnique de Paris
**Abdurakhmon Sadiev** (abdurakhmon.sadiev@kaust.edu.sa), King Abdullah University of Science and Technology
**Peter Richtárik** (peter.richtarik@kaust.edu.sa), King Abdullah University of Science and Technology

Reviewed on OpenReview: https://openreview.net/forum?id=Rb6VDOHebB

## Abstract

We propose Adaptive Compressed Gradient Descent (AdaCGD), a novel optimization algorithm for communication-efficient training of supervised machine learning models with an adaptive compression level. Our approach is inspired by the recently proposed three point compressor (3PC) framework of Richtárik et al. (2022), which includes error feedback (EF21), lazily aggregated gradients (LAG), and their combination as special cases, and offers the current state-of-the-art rates for these methods under weak assumptions. While the above mechanisms offer a fixed compression level, or adapt between two extreme compression levels only, we propose a much finer adaptation. In particular, we allow the user to choose among selected contractive compression mechanisms, such as Top-$K$ sparsification with a user-defined selection of sparsification levels $K$, or quantization with a user-defined selection of quantization levels, or their combination. AdaCGD chooses the appropriate compressor and compression level adaptively during the optimization process. Besides i) proposing a theoretically grounded multi-adaptive communication compression mechanism, we further ii) extend the 3PC framework to bidirectional compression, i.e., we allow the server to compress as well, and iii) provide sharp convergence bounds in the strongly convex, convex, and nonconvex settings. The convex-regime results are new even for several key special cases of our general mechanism, including 3PC and EF21. In all regimes, our rates are superior to those of all existing adaptive compression methods.

## 1 Introduction

Training machine learning models is computationally expensive and time-consuming. In recent years, researchers have tended to use ever larger datasets, often distributed over several devices, in order to improve the generalization ability of contemporary machine learning models (Vaswani et al., 2019). By the word device or node, we refer to any gadget that can store data, compute loss values and gradients (or stochastic gradients), and communicate with other devices. This distributed setting appears, for example, in federated learning (FL) (Konečný et al., 2016; McMahan et al., 2017; Lin et al., 2018). FL is a machine learning setting in which a large number of strongly heterogeneous clients attempt to cooperatively train a model on the varied data stored on these devices, without violating the clients' information privacy (Kairouz et al.). In this setting, distributed methods are very efficient (Goyal et al., 2017; You et al., 2019), and federated frameworks have therefore attracted tremendous attention in recent years.
Dealing with the distributed environment, we consider the following optimization problem:

$$\min_{x \in \mathbb{R}^d} \left\{ f(x) := \frac{1}{n} \sum_{i=1}^{n} f_i(x) \right\}, \tag{1}$$

where $x \in \mathbb{R}^d$ is the parameter vector of the training model, $d$ is the dimensionality of the problem (the number of trainable features), $n$ is the number of workers/devices/nodes, and $f_i(x)$ is the loss incurred by model $x$ on the data stored on worker $i$. The loss function $f_i : \mathbb{R}^d \to \mathbb{R}$ often has the form of an expectation of some random function, $f_i(x) := \mathbb{E}_{\xi \sim \mathcal{D}_i}[f_\xi(x)]$, with $\mathcal{D}_i$ being the distribution of the training data owned by worker $i$. In federated learning, these distributions can be different (the so-called heterogeneous case). This finite-sum form captures the distributed nature of the problem in a very efficient way.

The most effective models are frequently over-parameterized, which means that they contain more parameters than there are training data samples (Arora et al., 2018). In this case, distributed methods may experience a communication bottleneck: the communication cost for the workers to transfer information during joint optimization can exceed the cost of local computation by multiple orders of magnitude (Dutta et al., 2020). One of the practical ways to transfer information more efficiently is to apply a local compression operator (Seide et al., 2014; Suresh et al., 2017; Konečný & Richtárik, 2018; Zhang et al., 2017) to the model's parameters (gradients or tensors) communicated across different clients.

**Definition 1.** A (possibly randomized) mapping $\mathcal{C} : \mathbb{R}^d \to \mathbb{R}^d$ is called a compression operator if transmission of the compressed tensor $\mathcal{C}(x)$ incurs less communication cost than the transfer of the initial tensor $x$.

Although compression decreases the number of bits transferred during each communication round, it also introduces extra compression errors. As a result, the number of rounds necessary to obtain a solution of the required accuracy typically increases. However, as the trade-off frequently favors compression over no compression, compression has proven to be effective in practice.

Distributed Compressed Gradient Descent (DCGD) (Khirirat et al., 2018) provides the simplest universal mechanism for distributed communication-efficient training with compression. With a given learning rate $\gamma$, DCGD implements the following update rule:

$$x^{t+1} = x^t - \gamma \frac{1}{n} \sum_{i=1}^{n} g_i^t, \qquad g_i^t = \mathcal{M}_i^t(\nabla f_i(x^t)). \tag{2}$$

Here, $g_i^t$ represents an estimated gradient: the result of mapping the original dense and high-dimensional gradient $\nabla f_i(x^t) \in \mathbb{R}^d$ into a vector of the same size that can be transferred efficiently with far fewer bits via the compression mechanism $\mathcal{M}_i^t$. In some cases (Tang et al., 2020; Philippenko & Dieuleveut, 2020; Fatkhullin et al., 2021), it is desirable to add compression on the server side as well, so that communication between the server and the clients is efficient in both directions. One can easily extend the general DCGD framework to the case of bidirectional compression. If we define the general master compression mechanism as $\mathcal{M}^{M,t}$ and the worker compression mechanism as $\mathcal{M}^{W,t}_i$, we can formally write the general bidirectional DCGD algorithm as Algorithm 1.

**Algorithm 1** DCGD method with master compression
1: **Input:** starting point $x^0 \in \mathbb{R}^d$; $g^0, g_i^0 \in \mathbb{R}^d$ for $i = 1, \dots, n$ (known by nodes), $g^0 = \frac{1}{n}\sum_{i=1}^{n} g_i^0$ (known by master); learning rate $\gamma > 0$; worker compressor $\mathcal{M}^W$; master compressor $\mathcal{M}^M$
2: **for** $t = 0, 1, 2, \dots, T-1$ **do**
3: Server broadcasts $g^t$ to all workers
4: **for** all devices $i = 1, \dots, n$ in parallel **do**
5: $x^{t+1} = x^t - \gamma\, g^t$
6: $g_i^{t+1} = \mathcal{M}^{W,t}_i(\nabla f_i(x^{t+1}))$
7: Communicate $g_i^{t+1}$ to the server
8: **end for**
9: Server aggregates received gradient estimators: $\hat{g}^{t+1} = \frac{1}{n}\sum_{i=1}^{n} g_i^{t+1}$
10: $g^{t+1} = \mathcal{M}^{M,t}(\hat{g}^{t+1})$
11: **end for**
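To make the communication pattern of Algorithm 1 concrete, here is a minimal single-process Python sketch that simulates one server and $n$ workers. The function and variable names (`dcgd_bidirectional`, `topk`, `grads`) are ours for illustration and are not part of the paper; plain DCGD with a biased compressor and no error feedback is used only to show where the worker and master compressors enter the loop.

```python
import numpy as np

def topk(v, k):
    """Top-k sparsification: keep the k largest-magnitude entries, zero the rest."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def dcgd_bidirectional(grads, x0, gamma, steps, worker_comp, master_comp):
    """Minimal sketch of Algorithm 1 (bidirectional DCGD).

    grads       : list of callables, grads[i](x) returns the local gradient of f_i at x
    worker_comp : compressor applied by each worker to its fresh local gradient
    master_comp : compressor applied by the server to the aggregated estimate
    """
    n, x = len(grads), x0.copy()
    g = np.zeros_like(x0)            # g^0, known by both server and workers
    for _ in range(steps):
        x = x - gamma * g            # every worker applies the broadcast estimate g^t
        g_i = [worker_comp(grads[i](x)) for i in range(n)]   # compressed local gradients
        g_hat = np.mean(g_i, axis=0)                         # server-side aggregation
        g = master_comp(g_hat)                               # master (downlink) compression
    return x

# Toy usage: quadratics f_i(x) = 0.5*||x - b_i||^2, Top-5 uplink, identity downlink.
rng = np.random.default_rng(0)
b = [rng.normal(size=50) for _ in range(10)]
grads = [lambda x, bi=bi: x - bi for bi in b]
x_final = dcgd_bidirectional(grads, np.zeros(50), gamma=0.5, steps=200,
                             worker_comp=lambda v: topk(v, 5),
                             master_comp=lambda v: v)
```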
## 2 Motivation and Background

Our primary motivation in this work is to design a compression mechanism that can vary the level of gradient compression depending on a local optimality condition. This section introduces several key concepts essential for the proposed adaptive scheme. We describe constant contractive compressors, discuss which adaptive compressors already exist in the literature, and revisit the lazy aggregation mechanism, the precursor of our adaptive compression.

### 2.1 Constant contractive compressors

Most methods employing gradient compression use a compressor with a constant compression level. In this approach (Beznosikov et al., 2020; Khirirat et al., 2018), one sets

$$\mathcal{M}_i^t(x) \equiv \mathcal{C}(x), \tag{3}$$

where $\mathcal{C} : \mathbb{R}^d \to \mathbb{R}^d$ is a compression operator. Two large classes of compression operators (or compressors) are widely represented in the literature: i) unbiased compression operators and ii) biased or contractive compression operators. Our focus in this paper is on biased operators, whose definition we provide below.

**Definition 2** (Biased or contractive compression operator). A mapping $\mathcal{C} : \mathbb{R}^d \to \mathbb{R}^d$ is called a biased or contractive compression operator if there exists $0 < \alpha \le 1$ such that

$$\mathbb{E}\left[\|\mathcal{C}(x) - x\|^2\right] \le (1 - \alpha)\|x\|^2, \qquad \forall x \in \mathbb{R}^d. \tag{4}$$

Top-$K$ (Alistarh et al., 2018) and adaptive random (Beznosikov et al., 2020) sparsification compressors are typical examples of contractive compressors. Due to the biased nature of these compressors, until recently there was a gap between the experimental and theoretical development of gradient descent methods based on contractive compressors. On the one hand, algorithmic work over the last years has produced several methods of high practical importance, most notable of which is the so-called error feedback mechanism (Seide et al., 2014), fixing a divergence issue that appeared in both practice and theory (Beznosikov et al., 2020). On the other hand, until very recently, analytical studies offered only weak sublinear convergence rates (Stich et al., 2018; Karimireddy et al., 2019; Horváth & Richtárik, 2021), in some cases under strong, unrealistic assumptions. Recently, Richtárik et al. (2021) fixed this by providing a novel algorithmic and theoretical development that recovers the $\mathcal{O}(1/T)$ rate of GD, with the analysis relying on standard assumptions only. Fatkhullin et al. (2021) subsequently extended the framework with several algorithmic and theoretical extensions, such as bidirectional compression and client stochasticity. Despite these advances, many challenges remain in the theoretical understanding of these methods. One such challenge is the lack of a thorough theoretical study of error feedback methods in the convex setting.
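As a concrete instance of Definition 2, the short sketch below implements the Top-$K$ operator in NumPy and numerically checks the contraction inequality (4) with $\alpha = K/d$. The helper name `topk` and the random test are ours for illustration, not part of the paper.

```python
import numpy as np

def topk(x, k):
    """Top-K sparsification: keep the K largest-magnitude coordinates, zero out the rest."""
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -k)[-k:]
    out[idx] = x[idx]
    return out

# Top-K is deterministic and satisfies (4) with alpha = K/d:
#   ||topk(x, K) - x||^2 <= (1 - K/d) * ||x||^2   for every x.
rng = np.random.default_rng(1)
d, K = 100, 10
for _ in range(1000):
    x = rng.normal(size=d)
    err = np.linalg.norm(topk(x, K) - x) ** 2
    assert err <= (1 - K / d) * np.linalg.norm(x) ** 2 + 1e-12
```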
### 2.2 Existing adaptive compressors

Using a static compression level for all clients could limit the capability of the optimization framework. Conversely, adjusting the compression level for every client could help reduce the overall training time, for example in hardware-heterogeneous settings (Horváth et al., 2021; Abdelmoniem & Canini, 2021). Ideally, the optimizer should be able to set the compression level for each client separately, based on the local client's data.

Despite the significant practical interest in developing such methods, there is currently little research on, and understanding of, adaptive mechanisms of this type. Only a few papers (Qu et al., 2021; Hönig et al., 2021; Mishchenko et al., 2022; Alimohammadi et al., 2022) provide convergence guarantees with explicit rates, and most of them are designed for specific types of compressors only, mostly quantizers (Guo et al., 2020; Wen et al., 2017). In (Jhunjhunwala et al., 2021), the authors design a mechanism for adaptively changing the quantization level $s_k \propto f(x^0) - f(x^k)$ of a random dithering operator (Alistarh et al., 2017). DAdaQuant (Hönig et al., 2021) and FedDQ (Qu et al., 2021) suggest ascending and descending quantization levels throughout training. AQUILA (Zhao et al., 2022) and AGQ (Mao et al., 2021) build adaptive quantization on top of the Lazily Aggregated Quantized (LAQ) gradient algorithm (Sun et al., 2019). Faghri et al. (2020) propose ALQ, an adaptive method for quantizing gradients that minimizes the excess variance of quantization given an estimate of the distribution of the gradients; it adapts individual quantization levels iteratively to minimize the expected normalized variance. L-GreCo (Alimohammadi et al., 2022) dynamically adapts the degree of compression across the model's layers during training using dynamic programming. IntSGD (Mishchenko et al., 2022) adaptively sets the scaling parameter $\alpha_k$ of a vector plugged into a randomized integer rounding operator. CLAN (Zhong et al., 2021) deals with constant contractive compressors and combines an adaptive stepsize rule similar to ADAM (Kingma & Ba, 2017) with error feedback applied on both the server and client sides. The work of Agarwal et al. (2021) presents Accordion, an adaptive compression algorithm that avoids over-compressing gradients in critical learning regimes, which can have a negative impact on model performance; the critical regime is detected by the rate of change of gradient norms.
CAT S+Q (Khirirat et al., 2021) proposes an adaptive way to choose $k$: at each iteration, clients send to the server only the signs of the top-$k$ elements of the gradient, along with the gradient norm. Table 1 provides a detailed comparison of these works.

**Table 1:** Summary of adaptive compressed methods, comparing for each paper whether an arbitrary contractive compressor $\mathcal{C}$ is supported, whether a convergence theory is given, and the strongly convex, convex, and general nonconvex rates. The methods compared are those of Jhunjhunwala et al. (2021), Abdelmoniem & Canini (2021), Hönig et al. (2021), Qu et al. (2021), Zhao et al. (2022), Mao et al. (2021), Khirirat et al. (2021), Alimohammadi et al. (2022), Zhong et al. (2021), Faghri et al. (2020), Mishchenko et al. (2022), and this work. AdaCGD, proposed in this work, is superior to its counterparts in all three settings.

Notes to Table 1:
(1) Rates, as suggested in the original paper, are derived from (Reisizadeh et al., 2020) and calculated for the non-local full-gradient setup ($\sigma^2 = 0$, $\tau = 1$).
(2) We show the rate for the non-local full-gradient setup, i.e., $\sigma^2 = 0$ and $\tau = 1$.
(3) Their work does not present any explicit rate.
(4) $\varepsilon > 0$ is a parameter of the IntSGD algorithm.
(5) This column incorporates rates for the Polyak-Łojasiewicz nonconvex setting due to its similarity with the strongly convex setting.
(6) These rates were obtained for nonstandard compressor types satisfying $\|\mathcal{C}(x) - x\|^2 \le B$, where $B$ is a constant.
Notation: $\kappa = L/\mu$; $\{C_i\}_{i=1}^{3}$ are scalar constants; $\Delta_x = \|x^0 - x^*\|^2$; $\Delta_f = f(x^0) - f^*$; $M_1 = \max\{L_- + L_+\sqrt{2B_{\max}/A_{\min}},\ 1/A_{\min}\}$; $M_2 = L_- + L_+\sqrt{2B_{\max}/A_{\min}}$ (see Lemma 2 for details).

### 2.3 Adaptive compression via selective (lazy) aggregation

We revisit the LAG mechanism, proposed by Chen et al. (2018), and consider it an alternative way to embed adaptivity into the framework by introducing communication skipping. According to the lazy aggregation communication scheme, each worker $i$ only shares its local gradient if it is significantly different from the last gradient it shared; otherwise, the worker skips the communication round. One can view LAG as an adaptive mechanism that chooses between two extremes: sending the full gradient or skipping the communication round. Richtárik et al. (2022) recently extended this idea by introducing the CLAG algorithm, which dispatches a compressed update when the old gradient estimate differs too much from the gradient at the new iterate, and skips the communication otherwise. At each iteration of the algorithm, worker $i$ updates its gradient estimate $g_i^{t+1}$ by the following rule:

$$g_i^{t+1} = \begin{cases} g_i^t + \mathcal{C}(\nabla f_i(x^{t+1}) - g_i^t) & \text{if } \|\nabla f_i(x^{t+1}) - g_i^t\|^2 > \zeta \|\nabla f_i(x^{t+1}) - \nabla f_i(x^t)\|^2, \\ g_i^t & \text{otherwise,} \end{cases} \tag{5}$$

where $\zeta > 0$ is a trigger parameter, $g_i^t$ is the gradient estimate from the previous iteration, and $\mathcal{C}$ is a biased compression operator. When the condition in the first line, called the trigger condition, fires, worker $i$ sends the update $\mathcal{C}(\nabla f_i(x^{t+1}) - g_i^t)$ to the server, which aggregates the updates from the workers and forms the new full gradient estimate $g^{t+1} = (1/n)\sum_{i=1}^{n} g_i^{t+1}$. When the trigger condition does not fire, worker $i$ skips communication, which means $g_i^{t+1} = g_i^t$. The trigger parameter $\zeta$ controls how frequently the trigger condition fires, i.e., how often clients skip communication.

The update rule in Equation (5) can also be seen as an adaptive compression switching between two extremes: compressing at a pre-defined compression level, or compressing at the maximum possible level, in other words, not sending anything at all. Although both LAG and CLAG perform well in practice, their fixed and limited compression levels could restrict their performance and make these methods sub-optimal. It is of particular practical interest to create a more general method with evolving, fine-tuned compression levels.
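The following is a minimal worker-side sketch of rule (5), assuming Top-$K$ as the contractive compressor $\mathcal{C}$. The function names `clag_update` and `topk` are illustrative and not from the paper.

```python
import numpy as np

def topk(x, k):
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -k)[-k:]
    out[idx] = x[idx]
    return out

def clag_update(g_prev, grad_new, grad_prev, zeta, k):
    """One worker-side CLAG step, following rule (5): send a compressed
    correction only if the old estimate has drifted too far."""
    drift = np.linalg.norm(grad_new - g_prev) ** 2
    change = np.linalg.norm(grad_new - grad_prev) ** 2
    if drift > zeta * change:                       # trigger condition fires
        return g_prev + topk(grad_new - g_prev, k)  # communicate compressed update
    return g_prev                                   # skip communication this round
```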
## 3 Summary of Contributions

We highlight our main contributions as follows:

**New class of adaptive compressors.** In (Richtárik et al., 2022), the authors propose different classes of compressors unified through a single theory. Despite the large variability of the compression mechanisms, including algorithms with a lazy aggregation rule, the compression level in all of the considered algorithms is pre-defined and constant during training. In this work, we take a step further and formulate an extended class of adaptive 3PC compressors (Ada3PC) with tunable compression levels defined by general trigger rules. Like the original 3PC compressors, this class is very general and includes several specific compressors such as AdaCGD, which in turn includes EF21 and CLAG as special cases. We build Ada3PC compressors upon a large class of biased compressors, such as Top-$K$ and adaptive random sparsification.

**Convergence guarantees with strong rates unified by a single 3PC theory.** We provide strong convergence bounds in the strongly convex, convex, and nonconvex settings. Compared with the adaptive methods outside the 3PC context, we provide a more elaborate theory with better convergence rates. For AdaCGD, we recover the $\mathcal{O}(1/T)$ rate of GD, up to a constant, in the general nonconvex case. This is superior to the $\mathcal{O}(1/\sqrt{T})$ rate (Hönig et al., 2021; Qu et al., 2021) of the state of the art in adaptive compression outside the 3PC context. The convergence theory in the convex setting is of particular interest due to its novelty, even within 3PC, for some key special cases of AdaCGD such as EF21 and CLAG. In other words, it is a new state-of-the-art result for the error feedback method in the convex setting.

**Extension of 3PC theory with bidirectional compression.** We extend 3PC methods with bidirectional compression, i.e., we allow the server to compress as well. Table 2 compares AdaCGD with the other lazy-aggregation algorithms described in the literature.

**Table 2:** Comparison of convergence guarantees for methods employing lazy aggregation, in terms of support for adaptive compression, bidirectional compression, and analyses in the strongly convex, convex, PŁ nonconvex, and general nonconvex cases. The methods compared are LAQ (Sun et al., 2019), LENA (Ghadikolaei et al., 2021), LAG (Richtárik et al., 2022), CLAG (Richtárik et al., 2022), and AdaCGD (this work).

## 4 Ada3PC: A Compression-Adaptive 3PC Method

### 4.1 3PC compressor

Richtárik et al. (2022) introduce the general concept of three point compressors. Here we provide its formal definition for consistency.

**Definition 3.** We say that a (possibly randomized) map $\mathcal{C}_{h,y}(x) : \mathbb{R}^d_h \times \mathbb{R}^d_y \times \mathbb{R}^d_x \to \mathbb{R}^d$ is a three point compressor (3PC) if there exist constants $0 < A \le 1$ and $B \ge 0$ such that the following relation holds for all $x, y, h \in \mathbb{R}^d$:

$$\mathbb{E}\left[\|\mathcal{C}_{h,y}(x) - x\|^2\right] \le (1 - A)\|h - y\|^2 + B\|x - y\|^2. \tag{6}$$

The authors show that the EF21 compression mechanism satisfies Definition 3. Let $\mathcal{C} : \mathbb{R}^d \to \mathbb{R}^d$ be a contractive compressor with contraction parameter $\alpha$, and define

$$\mathcal{C}^{\mathrm{EF}}_{h,y}(x) := h + \mathcal{C}(x - h). \tag{7}$$

If we use this mapping to define the compression mechanism $\mathcal{M}_i^t$ via (2) within DCGD, we obtain the EF21 method. Another variant of 3PC compressors introduced in (Richtárik et al., 2022) is the CLAG compressor. Let $\mathcal{C} : \mathbb{R}^d \to \mathbb{R}^d$ be a contractive compressor with contraction parameter $\alpha$. Choosing a trigger $\zeta > 0$, the authors define the CLAG rule

$$\mathcal{C}^{\mathrm{CL}}_{h,y}(x) := \begin{cases} h + \mathcal{C}(x - h) & \text{if } \|x - h\|^2 > \zeta \|x - y\|^2, \\ h & \text{otherwise.} \end{cases} \tag{8}$$

If we employ this mapping in the DCGD method (2) as the communication mechanism $\mathcal{M}_i^t$, we obtain CLAG. It includes the LAG compressor $\mathcal{C}^{\mathrm{L}}$ as a special case, obtained when the compressor $\mathcal{C}$ is the identity.

### 4.2 Adaptive 3PC compressor

We are ready to introduce the Adaptive Three Point Compressor (Ada3PC).

**Definition 4** (Ada3PC compressor). Let $\mathcal{C}^1, \mathcal{C}^2, \dots, \mathcal{C}^m$ be 3PC compressors, $\mathcal{C}^i : \mathbb{R}^{3d} \to \mathbb{R}^d$ for all $i$. Let $P_1, P_2, \dots, P_{m-1}$ be conditions depending on $h, y, x$, i.e., $P_j : \mathbb{R}^d_h \times \mathbb{R}^d_y \times \mathbb{R}^d_x \to \{0, 1\}$ for all $j$. Then the adaptive 3PC compressor associated with the above 3PC compressors and conditions is defined as follows:

$$\mathcal{C}^{\mathrm{ad}}_{h,y}(x) = \begin{cases} \mathcal{C}^1_{h,y}(x) & \text{if } P_1(h, y, x), \\ \mathcal{C}^2_{h,y}(x) & \text{else if } P_2(h, y, x), \\ \ \ \vdots & \\ \mathcal{C}^{m-1}_{h,y}(x) & \text{else if } P_{m-1}(h, y, x), \\ \mathcal{C}^m_{h,y}(x) & \text{otherwise.} \end{cases} \tag{9}$$

The Ada3PC compressor first finds the smallest index $j$ for which $P_j(h, y, x) = 1$ (if no such index exists, we set $j = m$). Then, Ada3PC applies the compressor $\mathcal{C}^j$ to the vector $x$.
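The sketch below is one way Definition 4 could be realized in code, using EF21 compressors (7) built on Top-$K$ as the member 3PC compressors and a drift test as the switching condition. All names (`ef21`, `ada3pc`) and the particular $\zeta$ and Top-$K$ levels are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def topk(x, k):
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -k)[-k:]
    out[idx] = x[idx]
    return out

def ef21(h, y, x, k):
    """EF21 3PC compressor (7): move the old state h toward x by a compressed step."""
    return h + topk(x - h, k)

def ada3pc(h, y, x, compressors, conditions):
    """Generic Ada3PC rule (Definition 4): apply the first compressor whose
    condition holds; fall back to the last compressor otherwise."""
    for comp, cond in zip(compressors[:-1], conditions):
        if cond(h, y, x):
            return comp(h, y, x)
    return compressors[-1](h, y, x)

# Example instance: a coarse EF21 (Top-1) while the state h is still close to x,
# and a finer EF21 (Top-10) otherwise; zeta and the Top-K levels are illustrative.
zeta = 1.0
compressors = [lambda h, y, x: ef21(h, y, x, k=1),
               lambda h, y, x: ef21(h, y, x, k=10)]
conditions = [lambda h, y, x: np.linalg.norm(x - h) ** 2 <= zeta * np.linalg.norm(x - y) ** 2]
out = ada3pc(np.zeros(20), np.ones(20), np.linspace(0, 1, 20), compressors, conditions)
```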
### 4.3 Adaptive Compressed Gradient Descent

There are many ways to define meaningful and practical compressors within the adaptive 3PC framework. Here we provide one particular, perhaps the simplest, scheme, which we call AdaCGD. In this scheme, we have a set of $m$ EF21 compressors $\{\mathcal{C}^{\mathrm{EF},j}_{h,y}(x)\}_{j=1,\dots,m}$ sorted from the highest compression level to the lowest, i.e., $\alpha_1 \le \alpha_2 \le \dots \le \alpha_m$, where $\alpha_j$ is the contraction parameter of the $j$-th compressor. For example, if we use Top-$K$ inside the $\mathcal{C}^{\mathrm{EF}}_{h,y}$ compressors, the first and last indices correspond to the compressors with the smallest and the largest $K$, respectively. We choose the first compressor, i.e., the one with the strongest compression, which satisfies a trigger rule. We design the $j$-th trigger rule following the intuition of the lazy aggregation rule:

$$P_j := \left\{\|x - \mathcal{C}^{\mathrm{EF},j}_{h,y}(x)\|^2 \le \zeta \|x - y\|^2\right\}. \tag{10}$$

As in the original LAG rule, the left side of (10) measures the difference between the actual gradient and its estimate, while the right side compares gradient differences at neighboring iterations. The key difference of the trigger (10) from the LAG and CLAG rule (5) is that its left side depends explicitly on the level of compression. This feature is essential, as it enables the desired rule-based compressor selection. Altogether, we can define the AdaCGD compressor formally.

**Definition 5** (AdaCGD compressor). Given a set of $m$ EF21 compressors $\{\mathcal{C}^{\mathrm{EF},j}_{h,y}(x)\}_{j=1,\dots,m}$ sorted in descending order of compression level, and $\zeta \ge 0$, the adaptive AdaCGD compressor is defined by the following switch condition:

$$\mathcal{C}^{\mathrm{AC}}_{h,y}(x) = \begin{cases} h & \text{if } \|x - h\|^2 \le \zeta \|x - y\|^2, \\ \mathcal{C}^{\mathrm{EF},1}_{h,y}(x) & \text{else if } \|x - \mathcal{C}^{\mathrm{EF},1}_{h,y}(x)\|^2 \le \zeta \|x - y\|^2, \\ \ \ \vdots & \\ \mathcal{C}^{\mathrm{EF},m-1}_{h,y}(x) & \text{else if } \|x - \mathcal{C}^{\mathrm{EF},m-1}_{h,y}(x)\|^2 \le \zeta \|x - y\|^2, \\ \mathcal{C}^{\mathrm{EF},m}_{h,y}(x) & \text{otherwise.} \end{cases} \tag{11}$$

If $\mathcal{C}^{\mathrm{EF},m}_{h,y}$ uses Top-$d$ compression, i.e., the identity operator, AdaCGD is an adaptive compressor spanning the whole spectrum of compressors, from full compression, i.e., a communication "skip", to zero compression, i.e., sending the full gradient. Since the standalone "skip" connection is not a 3PC operator, it may not be evident that AdaCGD is a special case of Ada3PC. For this reason, we provide the following lemma.

**Lemma 1.** AdaCGD is a special case of the Ada3PC compressor.

It is easy to see that if $\zeta = 0$, AdaCGD reduces to EF21. Similarly, CLAG is a special case of AdaCGD when $m = 1$.

## 5 Theory

In this section, we present theoretical convergence guarantees for Algorithm 1 with Ada3PC compressors in the two new cases presented in Table 2. The results for the general and PŁ nonconvex cases can be found in the appendix.

### 5.1 Assumptions

We rely on standard assumptions to obtain convergence rates for Algorithm 1.

**Assumption 1.** The functions $f_1, \dots, f_n : \mathbb{R}^d \to \mathbb{R}$ are convex, i.e.,

$$f_i(x) - f_i(y) - \langle \nabla f_i(y), x - y \rangle \ge 0, \qquad \forall x, y \in \mathbb{R}^d, \ \forall i. \tag{12}$$

**Assumption 2.** The function $f : \mathbb{R}^d \to \mathbb{R}$ is $L_-$-smooth, i.e.,

$$\|\nabla f(x) - \nabla f(y)\| \le L_- \|x - y\|, \qquad \forall x, y \in \mathbb{R}^d. \tag{13}$$

**Assumption 3.** There exists $L_+ > 0$ such that $\frac{1}{n}\sum_{i=1}^{n} \|\nabla f_i(x) - \nabla f_i(y)\|^2 \le L_+^2 \|x - y\|^2$ for all $x, y \in \mathbb{R}^d$. Let $L_+$ be the smallest such number.

We borrow the $\{L_-, L_+\}$ notation from (Szlendak et al., 2022). Assumption 3 avoids a stronger assumption on the Lipschitz smoothness of the individual functions $f_i$. Moreover, one can easily prove that $L_- \le L_+$: by Jensen's inequality, $\|\nabla f(x) - \nabla f(y)\|^2 \le \frac{1}{n}\sum_{i=1}^{n}\|\nabla f_i(x) - \nabla f_i(y)\|^2 \le L_+^2\|x - y\|^2$. The next assumption is standard in the analysis of practical methods (Ahn et al., 2020; Rajput et al., 2020). However, compared to previous works, we require a more general version.

**Assumption 4.** We assume that there exists a constant $\Omega > 0$ such that $\mathbb{E}[\|x^t - x^*\|^2] \le \Omega^2$, where $x^t$ is any iterate generated by Algorithm 1.

**Assumption 5.** The functions $f_1, \dots, f_n$ are differentiable. Moreover, $f$ is bounded from below by an infimum $f^{\inf} \in \mathbb{R}$, i.e., $f(x) \ge f^{\inf}$ for all $x \in \mathbb{R}^d$.
### 5.2 Adaptive 3PC is a 3PC compressor

While this may not be obvious at first glance, adaptive 3PC compressors belong to the class of 3PC compressors. We formalize this statement in the following lemma.

**Lemma 2.** Let $\mathcal{C}^{\mathrm{ad}}$ be an adaptive 3PC compressor. Let $\{\mathcal{C}^i\}_{i=1}^{m}$ be the 3PC compressors associated with $\mathcal{C}^{\mathrm{ad}}$, i.e., for all $i$ there exist constants $0 < A_i \le 1$ and $B_i \ge 0$ such that for all $h, y, x \in \mathbb{R}^d$

$$\mathbb{E}\left[\|\mathcal{C}^i_{h,y}(x) - x\|^2\right] \le (1 - A_i)\|h - y\|^2 + B_i \|x - y\|^2.$$

Then $\mathcal{C}^{\mathrm{ad}}$ is a 3PC compressor with constants $A_{\min} := \min\{A_1, \dots, A_m\}$ and $B_{\max} := \max\{B_1, \dots, B_m\}$.

*Proof.* Let us fix $h, y, x \in \mathbb{R}^d$ and let $j$ be the index such that $P_i(h, y, x) = 0$ for all $i < j$ and, if $j < m$, $P_j(h, y, x) = 1$. Then

$$\mathbb{E}\left[\|\mathcal{C}^{\mathrm{ad}}_{h,y}(x) - x\|^2\right] = \mathbb{E}\left[\|\mathcal{C}^j_{h,y}(x) - x\|^2\right] \overset{(6)}{\le} (1 - A_j)\|h - y\|^2 + B_j\|x - y\|^2 \le (1 - A_{\min})\|h - y\|^2 + B_{\max}\|x - y\|^2. \qquad \square$$

Note that in the definition of the Ada3PC compressor we never specify what the conditions $P_i$ are. They are completely arbitrary! This enables us to build infinitely many new compressors out of the few notable examples presented in (Richtárik et al., 2022).

### 5.3 Convergence

In this subsection, we show how the assumptions we make on the minimized functions and the compressors affect the convergence rate of Algorithm 1.

**Convergence for convex functions.** The first result assumes that $\mathcal{M}^W$ in Algorithm 1 is a 3PC compressor and $\mathcal{M}^M$ is the identity compressor: $\mathcal{M}^M(x) = x$ for all $x \in \mathbb{R}^d$.

**Theorem 6.** Let Assumptions 1, 2, 3 and 4 hold. In Algorithm 1, assume $\mathcal{M}^W$ is a 3PC compressor, $\mathcal{M}^M$ is the identity compressor, and the stepsize $\gamma$ satisfies $0 < \gamma \le 1/M$, where $M = L_- + L_+\sqrt{2B/A}$. Then, for any $T \ge 1$, we have

$$\mathbb{E}\left[f(x^T)\right] - f(x^*) \le \max\left\{\frac{1}{\gamma}, \frac{1}{A}\right\} \frac{2(\Omega^2 + \Phi^0)}{T},$$

where $\Phi^t := f(x^t) - f(x^*) + \frac{\gamma}{A}\frac{1}{n}\sum_{i=1}^{n} \|\nabla f_i(x^t) - g_i^t\|^2$ for any $t \ge 0$.

The theorem combined with Lemma 2 implies the following fact.

**Corollary 1.** Let the assumptions of Theorem 6 hold, assume $\mathcal{M}^W$ is an Ada3PC compressor, $\mathcal{M}^M$ is the identity compressor, and choose the stepsize $\gamma = \frac{1}{L_- + L_+\sqrt{2B_{\max}/A_{\min}}}$. Then, for any $T \ge 1$, we have

$$\mathbb{E}\left[f(x^T)\right] - f(x^*) \le \max\left\{L_- + L_+\sqrt{\frac{2B_{\max}}{A_{\min}}}, \frac{1}{A_{\min}}\right\} \frac{2(\Omega^2 + \Phi^0)}{T}.$$

Thus, to achieve $\mathbb{E}\left[f(x^T)\right] - f(x^*) \le \varepsilon$ for some $\varepsilon > 0$, the Ada3PC method requires

$$T = \mathcal{O}\left(\max\left\{L_- + L_+\sqrt{\frac{2B_{\max}}{A_{\min}}}, \frac{1}{A_{\min}}\right\} \frac{2(\Omega^2 + \Phi^0)}{\varepsilon}\right)$$

iterations.

**Convergence for the bidirectional method.** Here we analyze the case when meaningful compressors are applied in both communication directions, i.e., both $\mathcal{M}^M$ and $\mathcal{M}^W$ are 3PC compressors.

**Theorem 7.** Let Assumptions 3 and 5 hold. Let $\mathcal{M}^M$ and $\mathcal{M}^W$ be 3PC compressors and let the stepsize in Algorithm 1 satisfy

$$0 < \gamma \le \frac{1}{L_- + L_+\sqrt{\frac{6B^M(B^W+1)}{A^M} + \frac{2B^W}{A^W}\left(1 + \frac{3B^M(2 - A^W)}{A^M}\right)}}. \tag{14}$$

Fix $T$ and let $\hat{x}^T$ be chosen from $\{x^0, x^1, \dots, x^{T-1}\}$ uniformly at random. Then

$$\mathbb{E}\left[\|\nabla f(\hat{x}^T)\|^2\right] \le \frac{2\Psi^0}{\gamma T},$$

where $\Psi^t = f(x^t) - f^{\inf} + \frac{\gamma}{A^M}\|g^t - \hat{g}^t\|^2 + \frac{\gamma}{A^W}\left(1 + \frac{3B^M(2 - A^W)}{A^M}\right)\frac{1}{n}\sum_{i=1}^{n}\|g_i^t - \nabla f_i(x^t)\|^2$ for any $t \ge 0$. In the theorem, the superscripts $M$ and $W$ denote the master and worker compressor parameters, respectively.

Theorem 7 implies the following corollary.

**Corollary 2.** Let the assumptions of Theorem 7 hold, assume $\mathcal{M}^M$ and $\mathcal{M}^W$ are Ada3PC compressors, and choose the stepsize

$$\gamma = \frac{1}{L_- + L_+\sqrt{\frac{6B^M_{\max}(B^W_{\max}+1)}{A^M_{\min}} + \frac{2B^W_{\max}}{A^W_{\min}}\left(1 + \frac{3B^M_{\max}(2 - A^W_{\min})}{A^M_{\min}}\right)}}.$$

Fix $T$ and let $\hat{x}^T$ be chosen from $\{x^0, x^1, \dots, x^{T-1}\}$ uniformly at random. Then we have

$$\mathbb{E}\left[\|\nabla f(\hat{x}^T)\|^2\right] \le 2\left(L_- + L_+\sqrt{\frac{6B^M_{\max}(B^W_{\max}+1)}{A^M_{\min}} + \frac{2B^W_{\max}}{A^W_{\min}}\left(1 + \frac{3B^M_{\max}(2 - A^W_{\min})}{A^M_{\min}}\right)}\right)\frac{\Psi^0}{T}.$$

Thus, to achieve $\mathbb{E}\left[\|\nabla f(\hat{x}^T)\|\right]^2 \le \varepsilon^2$ for some $\varepsilon > 0$, Algorithm 1 requires

$$T = \mathcal{O}\left(\left(L_- + L_+\sqrt{\frac{6B^M_{\max}(B^W_{\max}+1)}{A^M_{\min}} + \frac{2B^W_{\max}}{A^W_{\min}}\left(1 + \frac{3B^M_{\max}(2 - A^W_{\min})}{A^M_{\min}}\right)}\right)\frac{\Psi^0}{\varepsilon^2}\right)$$

iterations.
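As a quick illustration of how the 3PC constants enter the stepsize bound (14), the following sketch simply evaluates the formula; the numerical values of $L_-$, $L_+$, and the $(A, B)$ pairs are hypothetical placeholders, not constants derived in the paper.

```python
import math

def bidirectional_stepsize(L_minus, L_plus, A_W, B_W, A_M, B_M):
    """Evaluate the stepsize bound (14) of Theorem 7 / Corollary 2."""
    inner = (6 * B_M * (B_W + 1) / A_M
             + (2 * B_W / A_W) * (1 + 3 * B_M * (2 - A_W) / A_M))
    return 1.0 / (L_minus + L_plus * math.sqrt(inner))

# Hypothetical 3PC constants for the worker and master compressors.
# With identity master compression (A_M = 1, B_M = 0) the expression reduces to
# 1 / (L_- + L_+ * sqrt(2 * B_W / A_W)), the unidirectional stepsize of Corollary 1.
print(bidirectional_stepsize(L_minus=1.0, L_plus=2.0, A_W=0.1, B_W=10.0, A_M=1.0, B_M=0.0))
```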
## 6 Experiments

In this work, we use a setup similar to that described in (Richtárik et al., 2022). Namely, we aim to solve the logistic regression problem with a nonconvex regularizer:

$$\min_{x \in \mathbb{R}^d} \frac{1}{N}\sum_{i=1}^{N} \log\left(1 + e^{-y_i a_i^\top x}\right) + \lambda \sum_{j=1}^{d} \frac{x_j^2}{1 + x_j^2}, \tag{15}$$

where $a_i \in \mathbb{R}^d$, $y_i \in \{-1, 1\}$ are the training samples and labels, and the regularization hyperparameter $\lambda > 0$ is set to $\lambda = 0.1$. For training we use the LIBSVM (Chang & Lin, 2011) datasets phishing, a1a, and a9a. Each dataset has been split into $n = 20$ equal parts, each representing a different client.

*Figure 1: Comparison of LAG, CLAG, EF21 and GD with AdaCGD on the phishing dataset. x1, x2, x4 (and so on) indicates the multiplication factor applied to the optimal stepsize predicted by theory. For example, GD-x1 means gradient descent (GD) with the theoretical stepsize multiplied by 1.*

Figure 1 compares the performance of our proposed algorithm, AdaCGD, with other popular 3PC methods. In order to make a fair comparison, we fine-tuned the stepsize of each considered algorithm over a set of multiples of the corresponding theoretical stepsize, ranging from $2^0$ to $2^8$. We implemented all simulations in Python 3.8 and ran them on a cluster of 48 nodes with Intel(R) Xeon(R) Gold 6230R CPUs. As the contractive compressor in each algorithm, we chose the Top-$K$ operator. For EF21 and CLAG, we used the Top-1 compressor, which has been shown to be the best in practice for these methods. For AdaCGD, we chose an array of compressors varying from full compression (skipping communication) to zero compression (sending the full gradient), with a step of 5. We used the communication cost of the algorithm as the stopping criterion in all experiments.

Our findings indicate that AdaCGD achieves performance comparable to CLAG, and in certain cases (notably the phishing dataset) it outperforms CLAG. Additionally, AdaCGD consistently outperforms LAG in all experimental settings. However, it is worth noting that on datasets such as a1a and a9a, where higher compression is more advantageous, CLAG exhibits slightly better convergence rates. These results highlight the valuable role of AdaCGD as a complement to existing 3PC methods, with the potential to enhance the convergence rate of distributed optimization algorithms. For further experimental details and results, please refer to the appendix.

*Figure 2: Distribution of compressors utilized during the training process on the phishing dataset for different $\zeta$ values, shown as multiple pie charts. The setting is the same as for Figure 1.*

Figure 2 provides insight into how adaptivity in compression is utilized during the training process. The plot displays the distribution of compressors selected during training by all clients. The intuition behind the use of adaptivity is that when the gap between the gradients at neighboring time steps is small, updates are small, allowing AdaCGD to apply stronger compression without losing information. Conversely, a larger gap between gradients means more informative updates that suffer more from aggressive compression, hence requiring AdaCGD to compress less. As expected, when $\zeta = 0$, AdaCGD is equivalent to EF21, which tends to choose the maximum compression level across the different compressors. Conversely, with a larger $\zeta$ value, AdaCGD behaves more like LAG/CLAG and performs an increased number of communication skips. However, there is a range of $\zeta$ values (0.16, 0.54, 1.85, and 6.35) for which the algorithm benefits from adaptivity and utilizes the full spectrum of compressors.
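To show how such a compressor menu and the $\zeta$-driven selection could fit together in code, here is a small sketch in the spirit of Definition 5 and the setup above. The function names, the illustrative dimension `d`, and the value of `zeta` are our assumptions, not the authors' implementation.

```python
import numpy as np

def topk(x, k):
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -k)[-k:]
    out[idx] = x[idx]
    return out

def adacgd_compress(h, grad_new, grad_prev, zeta, ks):
    """AdaCGD selection (Definition 5): try 'skip' first, then EF21 with increasing
    Top-K, and return the first candidate passing trigger (10), plus its label."""
    ref = zeta * np.linalg.norm(grad_new - grad_prev) ** 2
    if np.linalg.norm(grad_new - h) ** 2 <= ref:
        return h, "skip"
    for k in ks[:-1]:
        cand = h + topk(grad_new - h, k)
        if np.linalg.norm(grad_new - cand) ** 2 <= ref:
            return cand, f"top-{k}"
    k = ks[-1]
    return h + topk(grad_new - h, k), f"top-{k}"

# Menu of Top-K levels from 5 up to d in steps of 5 (d and zeta are illustrative).
d = 100
ks = list(range(5, d + 1, 5))
rng = np.random.default_rng(0)
h, gn, gp = rng.normal(size=d), rng.normal(size=d), rng.normal(size=d)
compressed, label = adacgd_compress(h, gn, gp, zeta=1.0, ks=ks)
```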
## 7 Discussion and Limitations

The main limitations of this work are the assumptions we impose on the functions $f_i$ of problem (1). On the other hand, these assumptions govern the convergence rates we report: for example, we cannot show linear rates for convex functions due to the fundamental lower bound (Nesterov et al., 2018). Another limitation comes from the analysis of the bidirectional 3PC algorithm (Theorem 7): we provide the analysis only for general nonconvex functions.

## References

Abdelmoniem, A. M. and Canini, M. Towards mitigating device heterogeneity in federated learning via adaptive model quantization. In Proceedings of EuroMLSys '21, April 2021.

Agarwal, S., Wang, H., Lee, K., Venkataraman, S., and Papailiopoulos, D. Adaptive gradient communication via critical learning regime identification. In Smola, A., Dimakis, A., and Stoica, I. (eds.), Proceedings of Machine Learning and Systems, volume 3, pp. 55-80, 2021. URL https://proceedings.mlsys.org/paper_files/paper/2021/file/1d7f7abc18fcb43975065399b0d1e48e-Paper.pdf.

Ahn, K., Yun, C., and Sra, S. SGD with shuffling: optimal rates without component convexity and large epoch requirements. Advances in Neural Information Processing Systems, 33:17526-17535, 2020.

Alimohammadi, M., Markov, I., Frantar, E., and Alistarh, D. L-GreCo: An efficient and general framework for layerwise-adaptive gradient compression, 2022.

Alistarh, D., Grubic, D., Li, J., Tomioka, R., and Vojnovic, M. QSGD: Communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems (NIPS), pp. 1709-1720, 2017.

Alistarh, D., Hoefler, T., Johansson, M., Khirirat, S., Konstantinov, N., and Renggli, C. The convergence of sparsified gradient methods. In Advances in Neural Information Processing Systems (NeurIPS), 2018.

Arora, S., Cohen, N., and Hazan, E. On the optimization of deep networks: Implicit acceleration by overparameterization. In Proceedings of the 35th International Conference on Machine Learning (ICML), 2018.

Beznosikov, A., Horváth, S., Richtárik, P., and Safaryan, M. On biased compression for distributed learning. arXiv preprint arXiv:2002.12410, 2020.

Chang, C.-C. and Lin, C.-J. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):1-27, 2011.

Chen, T., Giannakis, G., Sun, T., and Yin, W. LAG: Lazily aggregated gradient for communication-efficient distributed learning. Advances in Neural Information Processing Systems, 2018.

Dutta, A., Bergou, E. H., Abdelmoniem, A. M., Ho, C.-Y., Sahu, A. N., Canini, M., and Kalnis, P. On the discrepancy between the theoretical analysis and practical implementations of compressed communication for distributed deep learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 3817-3824, 2020.

Faghri, F., Tabrizian, I., Markov, I., Alistarh, D., Roy, D. M., and Ramezani-Kebrya, A. Adaptive gradient quantization for data-parallel SGD. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 3174-3185. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/20b5e1cf8694af7a3c1ba4a87f073021-Paper.pdf.
Fatkhullin, I., Sokolov, I., Gorbunov, E., Li, Z., and Richtárik, P. EF21 with bells & whistles: practical algorithmic extensions of modern error feedback. arXiv preprint arXiv:2110.03294, 2021.

Ghadikolaei, H. S., Stich, S., and Jaggi, M. LENA: Communication-efficient distributed learning with self-triggered gradient uploads. In International Conference on Artificial Intelligence and Statistics, pp. 3943-3951. PMLR, 2021.

Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and He, K. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.

Guo, J., Liu, W., Wang, W., Han, J., Li, R., Lu, Y., and Hu, S. Accelerating distributed deep learning by adaptive gradient quantization. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1603-1607, 2020. doi: 10.1109/ICASSP40776.2020.9054164.

Hönig, R., Zhao, Y., and Mullins, R. D. DAdaQuant: Doubly-adaptive quantization for communication-efficient federated learning. CoRR, abs/2111.00465, 2021. URL https://arxiv.org/abs/2111.00465.

Horváth, S. and Richtárik, P. A better alternative to error feedback for communication-efficient distributed learning. In 9th International Conference on Learning Representations (ICLR), 2021.

Horváth, S., Laskaridis, S., Almeida, M., Leontiadis, I., Venieris, S., and Lane, N. D. FjORD: Fair and accurate federated learning under heterogeneous targets with ordered dropout. In Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=4fLr7H5D_eT.

Jhunjhunwala, D., Gadhikar, A., Joshi, G., and Eldar, Y. C. Adaptive quantization of model updates for communication-efficient federated learning. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3110-3114, 2021. doi: 10.1109/ICASSP39728.2021.9413697.

Kairouz, P., McMahan, H. B., Avent, B., Bellet, A., Bennis, M., Bhagoji, A. N., Bonawitz, K., Charles, Z., Cormode, G., Cummings, R., D'Oliveira, R. G. L., Eichner, H., Rouayheb, S. E., Evans, D., Gardner, J., Garrett, Z., Gascón, A., Ghazi, B., Gibbons, P. B., Gruteser, M., Harchaoui, Z., He, C., He, L., Huo, Z., Hutchinson, B., Hsu, J., Jaggi, M., Javidi, T., Joshi, G., Khodak, M., Konecny, J., Korolova, A., Koushanfar, F., Koyejo, S., Lepoint, T., Liu, Y., Mittal, P., Mohri, M., Nock, R., Ozgur, A., Pagh, R., Raykova, M., Qi, H., Ramage, D., Raskar, R., Song, D., Song, W., Stich, S. U., Sun, Z., Suresh, A. T., Tramèr, F., Vepakomma, P., Wang, J., Xiong, L., Xu, Z., Yang, Q., Yu, F. X., Yu, H., and Zhao, S. Advances and open problems in federated learning. URL https://arxiv.org/abs/1912.04977.

Karimireddy, S. P., Rebjock, Q., Stich, S., and Jaggi, M. Error feedback fixes SignSGD and other gradient compression schemes. In 36th International Conference on Machine Learning (ICML), 2019.

Khirirat, S., Feyzmahdavian, H. R., and Johansson, M. Distributed learning with compressed gradients. arXiv preprint arXiv:1806.06573, 2018.
Khirirat, S., Magnússon, S., Aytekin, A., and Johansson, M. A flexible framework for communication-efficient machine learning. Proceedings of the AAAI Conference on Artificial Intelligence, 35(9):8101-8109, May 2021. URL https://ojs.aaai.org/index.php/AAAI/article/view/16987.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization, 2017.

Konečný, J. and Richtárik, P. Randomized distributed mean estimation: Accuracy vs. communication. Frontiers in Applied Mathematics and Statistics, 4:62, 2018.

Konečný, J., McMahan, H. B., Yu, F. X., Richtárik, P., Suresh, A. T., and Bacon, D. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016.

Li, Z., Bao, H., Zhang, X., and Richtárik, P. PAGE: A simple and optimal probabilistic gradient estimator for nonconvex optimization. In International Conference on Machine Learning (ICML), 2021. arXiv:2008.10898.

Lin, Y., Han, S., Mao, H., Wang, Y., and Dally, B. Deep gradient compression: Reducing the communication bandwidth for distributed training. In International Conference on Learning Representations, 2018.

Mao, Y., Zhao, Z., Yan, G., Liu, Y., Lan, T., Song, L., and Ding, W. Communication efficient federated learning with adaptive quantization. CoRR, abs/2104.06023, 2021. URL https://arxiv.org/abs/2104.06023.

McMahan, B., Moore, E., Ramage, D., Hampson, S., and y Arcas, B. A. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pp. 1273-1282. PMLR, 2017.

Mishchenko, K., Wang, B., Kovalev, D., and Richtárik, P. IntSGD: Adaptive floatless compression of stochastic gradients. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=pFyXqxChZc.

Nesterov, Y. et al. Lectures on Convex Optimization, volume 137. Springer, 2018.

Philippenko, C. and Dieuleveut, A. Bidirectional compression in heterogeneous settings for distributed or federated learning with partial participation: tight convergence guarantees. arXiv preprint arXiv:2006.14591, 2020.

Pishro-Nik, H. Introduction to Probability, Statistics, and Random Processes. Kappa Research, LLC, 2014. ISBN 9780990637202. URL https://books.google.com.sa/books?id=3yq_oQEACAAJ.

Qu, L., Song, S., and Tsui, C. FedDQ: Communication-efficient federated learning with descending quantization. CoRR, abs/2110.02291, 2021. URL https://arxiv.org/abs/2110.02291.

Rajput, S., Gupta, A., and Papailiopoulos, D. Closing the convergence gap of SGD without replacement. In International Conference on Machine Learning, pp. 7964-7973. PMLR, 2020.

Reisizadeh, A., Mokhtari, A., Hassani, H., Jadbabaie, A., and Pedarsani, R. FedPAQ: A communication-efficient federated learning method with periodic averaging and quantization. In Chiappa, S. and Calandra, R. (eds.), Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pp. 2021-2031. PMLR, 26-28 Aug 2020. URL https://proceedings.mlr.press/v108/reisizadeh20a.html.

Richtárik, P., Sokolov, I., and Fatkhullin, I. EF21: A new, simpler, theoretically better, and practically faster error feedback. arXiv preprint arXiv:2106.05203, 2021.
Richtárik, P., Sokolov, I., Fatkhullin, I., Gasanov, E., Li, Z., and Gorbunov, E. A. 3PC: Three point compressors for communication-efficient distributed training and a better theory for lazy aggregation. CoRR, abs/2202.00998, 2022. URL https://arxiv.org/abs/2202.00998.

Seide, F., Fu, H., Droppo, J., Li, G., and Yu, D. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In Fifteenth Annual Conference of the International Speech Communication Association, 2014.

Stich, S. U., Cordonnier, J.-B., and Jaggi, M. Sparsified SGD with memory. In Advances in Neural Information Processing Systems (NeurIPS), 2018.

Sun, J., Chen, T., Giannakis, G., and Yang, Z. Communication-efficient distributed learning via lazily aggregated quantized gradients. Advances in Neural Information Processing Systems, 32:3370-3380, 2019.

Suresh, A. T., Felix, X. Y., Kumar, S., and McMahan, H. B. Distributed mean estimation with limited communication. In International Conference on Machine Learning, pp. 3329-3337. PMLR, 2017.

Szlendak, R., Tyurin, A., and Richtárik, P. Permutation compressors for provably faster distributed nonconvex optimization. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=GugZ5DzzAu.

Tang, H., Lian, X., Yu, C., Zhang, T., and Liu, J. DoubleSqueeze: Parallel stochastic gradient descent with double-pass error-compensated compression. In Proceedings of the 36th International Conference on Machine Learning (ICML), 2020.

Vaswani, S., Bach, F., and Schmidt, M. Fast and faster convergence of SGD for over-parameterized models and an accelerated perceptron. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 1195-1204. PMLR, 2019.

Wen, W., Xu, C., Yan, F., Wu, C., Wang, Y., Chen, Y., and Li, H. TernGrad: Ternary gradients to reduce communication in distributed deep learning. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/file/89fcd07f20b6785b92134bd6c1d0fa42-Paper.pdf.

You, Y., Li, J., Reddi, S., Hseu, J., Kumar, S., Bhojanapalli, S., Song, X., Demmel, J., Keutzer, K., and Hsieh, C.-J. Large batch optimization for deep learning: Training BERT in 76 minutes. arXiv preprint arXiv:1904.00962, 2019.

Zhang, H., Li, J., Kara, K., Alistarh, D., Liu, J., and Zhang, C. ZipML: Training linear models with end-to-end low precision, and a little bit of deep learning. In Precup, D. and Teh, Y. W. (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 4035-4043. PMLR, 06-11 Aug 2017. URL https://proceedings.mlr.press/v70/zhang17e.html.

Zhao, Z., Mao, Y., Zeeshan, M., Liu, Y., Lan, T., and Ding, W. AQUILA: Communication efficient federated learning with adaptive quantization of lazily-aggregated gradients, 2022. URL https://openreview.net/forum?id=cdZLe5S0ur.

Zhong, Y., Xie, C., Zheng, S., and Lin, H. Compressed communication for distributed training: Adaptive methods and system, 2021.
In Appendix A, we state the basic facts needed for the detailed proofs of the propositions. In Appendix B, we provide the proofs missing from the main part of the paper. Appendix C contains experimental details and extra experiments. We briefly discuss the main limitations of the paper in Appendix D.

## A Basic facts

We start the appendix with common mathematical facts. Lemmas 3 and 4 present the classic Cauchy-Schwarz inequality for vectors in an inner product space and for random variables, respectively. Lemma 5 gives classic upper bounds on quadratics. Lemma 6 provides a sufficient condition ensuring that a quadratic inequality appearing in our convergence proofs holds.

**Lemma 3** (Cauchy-Schwarz inequality for arbitrary vectors). Let $x, y \in \mathbb{R}^d$ be arbitrary vectors. Then the following inequality holds:

$$|\langle x, y \rangle| \le \|x\| \|y\|, \tag{16}$$

where $\langle \cdot, \cdot \rangle$ and $\|\cdot\|$ denote the inner product and the induced norm, respectively.

**Lemma 4** (Cauchy-Schwarz inequality for random variables; Section 6.2.4 of (Pishro-Nik, 2014)). For any two random variables $X$ and $Y$, we have

$$\mathbb{E}[|XY|] \le \sqrt{\mathbb{E}[X^2]\,\mathbb{E}[Y^2]}, \tag{17}$$

where equality holds if and only if $X = \alpha Y$ for some constant $\alpha \in \mathbb{R}$.

**Lemma 5.** Let $a, b, c, d \in \mathbb{R}^d$ be arbitrary vectors. Then the following inequalities hold:

$$\|a - b\|^2 \le 2\left(\|a - c\|^2 + \|c - b\|^2\right), \tag{18}$$
$$\|a - b\|^2 \le 3\left(\|a - c\|^2 + \|c - d\|^2 + \|d - b\|^2\right). \tag{19}$$

**Lemma 6** (Lemma 5 of (Richtárik et al., 2021)). If $0 < \gamma \le \frac{1}{\sqrt{a} + b}$, then $a\gamma^2 + b\gamma \le 1$. Moreover, the bound is tight up to a factor of 2, since $\frac{1}{\sqrt{a} + b} \le \min\left\{\frac{1}{\sqrt{a}}, \frac{1}{b}\right\} \le \frac{2}{\sqrt{a} + b}$.

## B Proofs for Sections 4 and 5

### B.1 Lemma 1

At first glance, AdaCGD does not seem to be an Ada3PC compressor. However, we can construct an Ada3PC compressor which is equivalent to AdaCGD.

**Lemma 1.** AdaCGD is a special case of the Ada3PC compressor.

*Proof.* Consider the LAG compressor

$$\mathcal{C}^{\mathrm{L}}_{h,y}(x) = \begin{cases} h & \text{if } \|x - h\|^2 \le \zeta\|x - y\|^2, \\ x & \text{otherwise,} \end{cases}$$

and the following Ada3PC compressor constructed from this LAG compressor and the $m$ EF21 compressors:

$$\mathcal{C}_{h,y}(x) = \begin{cases} \mathcal{C}^{\mathrm{L}}_{h,y}(x) & \text{if } \|x - h\|^2 \le \zeta\|x - y\|^2, \\ \mathcal{C}^{\mathrm{EF},1}_{h,y}(x) & \text{else if } \|x - \mathcal{C}^{\mathrm{EF},1}_{h,y}(x)\|^2 \le \zeta\|x - y\|^2, \\ \ \ \vdots & \\ \mathcal{C}^{\mathrm{EF},m-1}_{h,y}(x) & \text{else if } \|x - \mathcal{C}^{\mathrm{EF},m-1}_{h,y}(x)\|^2 \le \zeta\|x - y\|^2, \\ \mathcal{C}^{\mathrm{EF},m}_{h,y}(x) & \text{otherwise.} \end{cases}$$

If $\|x - h\|^2 \le \zeta\|x - y\|^2$, then $\mathcal{C}_{h,y}$ applies the LAG compressor to $x$, which in turn outputs $h$, exactly as $\mathcal{C}^{\mathrm{AC}}_{h,y}$ does under the same condition. If the opposite is true, i.e., $\|x - h\|^2 > \zeta\|x - y\|^2$, then $\mathcal{C}_{h,y}$ checks the same conditions and chooses the same compressor as $\mathcal{C}^{\mathrm{AC}}_{h,y}$. Thus, $\mathcal{C}^{\mathrm{AC}}_{h,y}$ is equivalent to the Ada3PC compressor $\mathcal{C}_{h,y}$. $\square$

### B.2 Theorem 6

The proof of Theorem 6 requires several auxiliary results. Lemma 7 states the descent lemma typical for the analysis of biased compressors. Lemma 8 shows how the individual 3PC compressors applied at the clients affect the aggregated divergence of the gradient estimates from the gradients. Lemma 9 presents a technical upper bound on the Lyapunov function $\Psi^t$.

**Lemma 7** (Lemma 2 of (Li et al., 2021)). Suppose the function $f$ is $L_-$-smooth and $x^{t+1} = x^t - \gamma g^t$, where $g^t \in \mathbb{R}^d$ is any vector and $\gamma > 0$ is any scalar. Then we have

$$f(x^{t+1}) \le f(x^t) - \frac{\gamma}{2}\|\nabla f(x^t)\|^2 - \left(\frac{1}{2\gamma} - \frac{L_-}{2}\right)\|x^{t+1} - x^t\|^2 + \frac{\gamma}{2}\|g^t - \nabla f(x^t)\|^2. \tag{20}$$

**Lemma 8** (Lemma B.3 of (Richtárik et al., 2022)). Let Assumption 3 hold. Consider Algorithm 1 with a 3PC compressor $\mathcal{M}^W$ and the identity compressor $\mathcal{M}^M$. Then for all $t \ge 0$ the sequence

$$G^t := \frac{1}{n}\sum_{i=1}^{n}\|\nabla f_i(x^t) - g_i^t\|^2 \tag{21}$$

satisfies

$$\mathbb{E}\left[G^{t+1}\right] \le (1 - A)\,\mathbb{E}\left[G^t\right] + B L_+^2\, \mathbb{E}\left[\|x^{t+1} - x^t\|^2\right], \tag{22}$$

where $A$ and $B$ are the parameters of $\mathcal{M}^W$.

**Lemma 9.** Let Assumption 1 hold and let the Lyapunov function be $\Psi^t := f(x^t) - f^* + \frac{\gamma}{A}G^t$. Then, for any $t \ge 0$, the following inequality holds:

$$\mathbb{E}\,\Psi^t \le \sqrt{\mathbb{E}\left[\|\nabla f(x^t)\|^2\right] + \frac{\gamma}{A}\mathbb{E}\,G^t}\;\sqrt{\mathbb{E}\left[\|x^t - x^*\|^2\right] + \frac{\gamma}{A}\mathbb{E}\left[G^t\right]}, \tag{23}$$

where $x^*$ is any point belonging to $\mathrm{Argmin}\, f(x)$.
*Proof.* By the definition of convexity (12), we get

$$\mathbb{E}\,\Psi^t = \mathbb{E}\,f(x^t) - f^* + \frac{\gamma}{A}\mathbb{E}\,G^t \overset{(12)}{\le} \mathbb{E}\left[\langle \nabla f(x^t), x^t - x^*\rangle\right] + \frac{\gamma}{A}\mathbb{E}\,G^t = \mathbb{E}\left[\left\langle \left(\nabla f(x^t), \sqrt{\tfrac{\gamma}{A}}\,u^t\right), \left(x^t - x^*, \sqrt{\tfrac{\gamma}{A}}\,u^t\right)\right\rangle\right],$$

where $u^t := \frac{1}{\sqrt{n}}\left(\nabla f_1(x^t) - g_1^t, \dots, \nabla f_n(x^t) - g_n^t\right)$ is the stacked vector satisfying $\|u^t\|^2 = G^t$. By applying the Cauchy-Schwarz inequality for vectors and then for random variables, we finish the proof:

$$\mathbb{E}\left[\left\langle \left(\nabla f(x^t), \sqrt{\tfrac{\gamma}{A}}u^t\right), \left(x^t - x^*, \sqrt{\tfrac{\gamma}{A}}u^t\right)\right\rangle\right] \overset{(16)}{\le} \mathbb{E}\left[\sqrt{\|\nabla f(x^t)\|^2 + \tfrac{\gamma}{A}G^t}\,\sqrt{\|x^t - x^*\|^2 + \tfrac{\gamma}{A}G^t}\right] \overset{(17)}{\le} \sqrt{\mathbb{E}\left[\|\nabla f(x^t)\|^2\right] + \frac{\gamma}{A}\mathbb{E}\,G^t}\,\sqrt{\mathbb{E}\left[\|x^t - x^*\|^2\right] + \frac{\gamma}{A}\mathbb{E}\left[G^t\right]}. \qquad \square$$

Now we are ready to prove the main theorem.

**Theorem 6.** Let Assumptions 1, 2, 3 and 4 hold. Assume the stepsize $\gamma$ of the algorithm satisfies $0 < \gamma \le 1/M$, where $M = L_- + L_+\sqrt{2B/A}$. Then, for any $T \ge 1$, we have

$$\mathbb{E}\left[f(x^T)\right] - f(x^*) \le \max\left\{\frac{1}{\gamma}, \frac{1}{A}\right\}\frac{2(\Omega^2 + \Psi^0)}{T}.$$

*Proof.* Combining Lemma 7 with Jensen's inequality, we get

$$f(x^{t+1}) \le f(x^t) - \frac{\gamma}{2}\|\nabla f(x^t)\|^2 - \left(\frac{1}{2\gamma} - \frac{L_-}{2}\right)\|x^{t+1} - x^t\|^2 + \frac{\gamma}{2}\|g^t - \nabla f(x^t)\|^2 \le f(x^t) - \frac{\gamma}{2}\|\nabla f(x^t)\|^2 - \left(\frac{1}{2\gamma} - \frac{L_-}{2}\right)\|x^{t+1} - x^t\|^2 + \frac{\gamma}{2}\frac{1}{n}\sum_{i=1}^{n}\|g_i^t - \nabla f_i(x^t)\|^2 = f(x^t) - \frac{\gamma}{2}\|\nabla f(x^t)\|^2 - \left(\frac{1}{2\gamma} - \frac{L_-}{2}\right)\|x^{t+1} - x^t\|^2 + \frac{\gamma}{2}G^t. \tag{24}$$

Now, applying Equation (24) and Lemma 8 to the difference of Lyapunov functions, we obtain

$$\mathbb{E}\left[\Psi^{t+1}\right] - \mathbb{E}\left[\Psi^t\right] = \mathbb{E}\left[f(x^{t+1}) - f(x^t)\right] + \frac{\gamma}{A}\left(\mathbb{E}\left[G^{t+1}\right] - \mathbb{E}\left[G^t\right]\right) \le -\frac{\gamma}{2}\mathbb{E}\left[\|\nabla f(x^t)\|^2\right] - \left(\frac{1}{2\gamma} - \frac{L_-}{2}\right)\mathbb{E}\left[\|x^{t+1} - x^t\|^2\right] + \frac{\gamma}{2}\mathbb{E}\left[G^t\right] + \frac{\gamma}{A}\left[(1 - A)\mathbb{E}[G^t] + BL_+^2\,\mathbb{E}\|x^{t+1} - x^t\|^2 - \mathbb{E}[G^t]\right].$$

Rearranging the terms, we get

$$\mathbb{E}\left[\Psi^{t+1}\right] - \mathbb{E}\left[\Psi^t\right] \le -\frac{\gamma}{2}\mathbb{E}\left[\|\nabla f(x^t)\|^2\right] - \left(\frac{1}{2\gamma} - \frac{L_-}{2} - \frac{\gamma BL_+^2}{A}\right)\mathbb{E}\left[\|x^{t+1} - x^t\|^2\right] - \frac{\gamma}{2}\mathbb{E}\left[G^t\right].$$

We further note that

$$\frac{1}{2\gamma} - \frac{L_-}{2} - \frac{\gamma BL_+^2}{A} \ge 0 \iff L_+^2\,\frac{2B}{A}\,\gamma^2 + L_-\,\gamma \le 1 \overset{\text{Lemma 6}}{\impliedby} \gamma \le \frac{1}{L_- + L_+\sqrt{2B/A}}.$$

The appropriately chosen stepsize gives

$$\mathbb{E}\left[\Psi^{t+1}\right] - \mathbb{E}\left[\Psi^t\right] \le -\min\left\{\frac{\gamma}{2}, \frac{A}{2}\right\}\left(\mathbb{E}\left[\|\nabla f(x^t)\|^2\right] + \frac{\gamma}{A}\mathbb{E}\left[G^t\right]\right).$$

Rearranging the terms, we have

$$\mathbb{E}\left[\|\nabla f(x^t)\|^2\right] + \frac{\gamma}{A}\mathbb{E}\left[G^t\right] \le \frac{\mathbb{E}[\Psi^t] - \mathbb{E}\left[\Psi^{t+1}\right]}{\min\{\gamma/2, A/2\}}, \tag{25}$$

from which we deduce that $\mathbb{E}\left[\Psi^{t+1}\right] \le \mathbb{E}[\Psi^t]$. Using Lemma 9 and (25), we have

$$\mathbb{E}\,\Psi^{t+1}\,\mathbb{E}\,\Psi^t \le \left(\mathbb{E}\,\Psi^t\right)^2 \le \left(\mathbb{E}\left[\|\nabla f(x^t)\|^2\right] + \frac{\gamma}{A}\mathbb{E}\,G^t\right)\left(\mathbb{E}\left[\|x^t - x^*\|^2\right] + \frac{\gamma}{A}\mathbb{E}[G^t]\right) \le \frac{\mathbb{E}[\Psi^t] - \mathbb{E}[\Psi^{t+1}]}{\min\{\gamma/2, A/2\}}\left(\mathbb{E}\left[\|x^t - x^*\|^2\right] + \frac{\gamma}{A}\mathbb{E}[G^t]\right).$$

Using that $\frac{\gamma}{A}\mathbb{E}[G^t] \le \mathbb{E}\,\Psi^t \le \Psi^0$ and $\mathbb{E}\left[\|x^t - x^*\|^2\right] \le \Omega^2$, we obtain

$$\mathbb{E}\,\Psi^{t+1}\,\mathbb{E}\,\Psi^t \le \frac{\Omega^2 + \Psi^0}{\min\{\gamma/2, A/2\}}\left(\mathbb{E}\left[\Psi^t\right] - \mathbb{E}\left[\Psi^{t+1}\right]\right).$$

Rearranging again, we get

$$\frac{\min\{\gamma/2, A/2\}}{\Omega^2 + \Psi^0} \le \frac{1}{\mathbb{E}[\Psi^{t+1}]} - \frac{1}{\mathbb{E}[\Psi^t]}.$$

Summing up from $t = 0$ to $t = T - 1$, we finish the proof:

$$\mathbb{E}\left[f(x^T)\right] - f(x^*) \le \mathbb{E}\left[\Psi^T\right] \le \max\left\{\frac{2}{\gamma}, \frac{2}{A}\right\}\frac{\Omega^2 + \Psi^0}{T}. \qquad \square$$

### B.3 Theorem 7

**Algorithm 2** 3PC-BD (Bidirectional 3PC algorithm)
1: **Input:** starting point $x^0 \in \mathbb{R}^d$; $g^0, g_i^0 \in \mathbb{R}^d$ for $i = 1, \dots, n$ (known by nodes), $g^0 = \frac{1}{n}\sum_{i=1}^{n} g_i^0$ (known by master); learning rate $\gamma > 0$
2: **for** $t = 0, 1, 2, \dots, T-1$ **do**
3: Broadcast $g^t$ to all workers
4: **for** all devices $i = 1, \dots, n$ in parallel **do**
5: $x^{t+1} = x^t - \gamma g^t$
6: $g_i^{t+1} = \mathcal{C}^{W}_{g_i^t, \nabla f_i(x^t)}(\nabla f_i(x^{t+1}))$
7: Communicate $g_i^{t+1}$ to the server
8: **end for**
9: $\hat{g}^{t+1} = \frac{1}{n}\sum_{i=1}^{n} g_i^{t+1}$
10: $g^{t+1} = \mathcal{C}^{M}_{g^t, \hat{g}^t}(\hat{g}^{t+1})$
11: **end for**

For Theorem 7, we assume that both compressors $\mathcal{M}^W$ and $\mathcal{M}^M$ in Algorithm 1 are 3PC compressors. The main steps of the algorithm are

$$x^{t+1} = x^t - \gamma g^t, \qquad g_i^{t+1} = \mathcal{C}^{W}_{g_i^t, \nabla f_i(x^t)}(\nabla f_i(x^{t+1})), \qquad g^{t+1} = \mathcal{C}^{M}_{g^t, \hat{g}^t}(\hat{g}^{t+1}).$$

Unlike in the previous subsection, we use the additional notation

$$P_i^t := \|g_i^t - \nabla f_i(x^t)\|^2, \qquad P^t := \frac{1}{n}\sum_{i=1}^{n} P_i^t, \qquad R^t := \|x^{t+1} - x^t\|^2.$$

Lemma 10 is an analogue of Lemma 8 (in the bidirectional case we need slightly different arguments at some steps). Lemma 11 is another technical lemma that upper bounds $\mathbb{E}\left[\|g^t - \hat{g}^t\|^2\right]$.

**Lemma 10.** Let Assumption 3 hold, let $\mathcal{C}^W$ be a 3PC compressor, and let $g_i^{t+1}$ be the 3PC-BD estimator of $\nabla f_i(x^{t+1})$, i.e.,

$$g_i^{t+1} = \mathcal{C}^{W}_{g_i^t, \nabla f_i(x^t)}(\nabla f_i(x^{t+1})) \tag{27}$$

for arbitrary $g_i^0$, for all $i \in [n]$, $t \ge 0$. Then

$$\mathbb{E}\left[P^{t+1}\right] \le (1 - A^W)\,\mathbb{E}\left[P^t\right] + B^W L_+^2\, \mathbb{E}\left[R^t\right]. \tag{28}$$

*Proof.* Define $W^t := \{g_1^t, \dots, g_n^t, x^t, x^{t+1}\}$. Then

$$\mathbb{E}\left[P_i^{t+1}\right] = \mathbb{E}\left[\mathbb{E}\left[P_i^{t+1} \mid W^t\right]\right] = \mathbb{E}\left[\mathbb{E}\left[\|g_i^{t+1} - \nabla f_i(x^{t+1})\|^2 \mid W^t\right]\right] = \mathbb{E}\left[\mathbb{E}\left[\|\mathcal{C}^{W}_{g_i^t, \nabla f_i(x^t)}(\nabla f_i(x^{t+1})) - \nabla f_i(x^{t+1})\|^2 \mid W^t\right]\right] \overset{(6)}{\le} (1 - A^W)\,\mathbb{E}\left[\|g_i^t - \nabla f_i(x^t)\|^2\right] + B^W\,\mathbb{E}\left[\|\nabla f_i(x^{t+1}) - \nabla f_i(x^t)\|^2\right].$$

Averaging the above inequalities over $i \in [n]$, we obtain (28).
Indeed,

$$\mathbb{E}\left[P^{t+1}\right] = \frac{1}{n}\sum_{i=1}^{n}\mathbb{E}\left[P_i^{t+1}\right] \le \frac{1}{n}\sum_{i=1}^{n}(1 - A^W)\,\mathbb{E}\left[\|g_i^t - \nabla f_i(x^t)\|^2\right] + \frac{1}{n}\sum_{i=1}^{n} B^W\,\mathbb{E}\left[\|\nabla f_i(x^{t+1}) - \nabla f_i(x^t)\|^2\right] = (1 - A^W)\,\mathbb{E}\left[P^t\right] + B^W\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}\left[\|\nabla f_i(x^{t+1}) - \nabla f_i(x^t)\|^2\right] \overset{\text{Assumption 3}}{\le} (1 - A^W)\,\mathbb{E}\left[P^t\right] + B^W L_+^2\,\mathbb{E}\left[\|x^{t+1} - x^t\|^2\right] = (1 - A^W)\,\mathbb{E}\left[P^t\right] + B^W L_+^2\,\mathbb{E}\left[R^t\right]. \qquad \square$$

**Lemma 11.** Let Assumptions 3 and 5 hold, and let $\mathcal{C}^M, \mathcal{C}^W$ be 3PC compressors. Let $g_i^{t+1}$ be the 3PC-BD estimator of $\nabla f_i(x^{t+1})$, i.e.,

$$g_i^{t+1} = \mathcal{C}^{W}_{g_i^t, \nabla f_i(x^t)}(\nabla f_i(x^{t+1})), \tag{30}$$

and let $g^{t+1}$ be the 3PC-BD estimator of $\hat{g}^{t+1} = \frac{1}{n}\sum_{i=1}^{n} g_i^{t+1}$, i.e.,

$$g^{t+1} = \mathcal{C}^{M}_{g^t, \hat{g}^t}(\hat{g}^{t+1}), \tag{31}$$

for arbitrary $g^0, g_i^0$, for all $i \in [n]$, $t \ge 0$. Then

$$\mathbb{E}\left[\|g^{t+1} - \hat{g}^{t+1}\|^2\right] \le (1 - A^M)\,\mathbb{E}\left[\|g^t - \hat{g}^t\|^2\right] + 3B^M(2 - A^W)\,\mathbb{E}\left[P^t\right] + 3B^M(B^W + 1)L_+^2\,\mathbb{E}\left[R^t\right], \tag{32}$$

where $\hat{g}^t = \frac{1}{n}\sum_{i=1}^{n} g_i^t$.

*Proof.* Similarly to the proof of Lemma 10, we define $W^t := \{g_1^t, \dots, g_n^t, x^t, x^{t+1}\}$ and bound $\mathbb{E}\left[\|g^{t+1} - \hat{g}^{t+1}\|^2\right]$:

$$\mathbb{E}\left[\|g^{t+1} - \hat{g}^{t+1}\|^2\right] = \mathbb{E}\left[\mathbb{E}\left[\|g^{t+1} - \hat{g}^{t+1}\|^2 \mid W^t\right]\right] = \mathbb{E}\left[\mathbb{E}\left[\|\mathcal{C}^{M}_{g^t, \hat{g}^t}(\hat{g}^{t+1}) - \hat{g}^{t+1}\|^2 \mid W^t\right]\right] \overset{(6)}{\le} (1 - A^M)\,\mathbb{E}\left[\|g^t - \hat{g}^t\|^2\right] + B^M\,\mathbb{E}\left[\|\hat{g}^{t+1} - \hat{g}^t\|^2\right]. \tag{33}$$

Further, we bound the last term in (33). Recall that

$$\hat{g}^{t+1} = \frac{1}{n}\sum_{i=1}^{n} g_i^{t+1} = \frac{1}{n}\sum_{i=1}^{n} \mathcal{C}^{W}_{g_i^t, \nabla f_i(x^t)}(\nabla f_i(x^{t+1})). \tag{34}$$

Therefore,

$$\mathbb{E}\left[\|\hat{g}^{t+1} - \hat{g}^t\|^2\right] = \mathbb{E}\left[\left\|\frac{1}{n}\sum_{i=1}^{n}\left(\mathcal{C}^{W}_{g_i^t, \nabla f_i(x^t)}(\nabla f_i(x^{t+1})) - g_i^t\right)\right\|^2\right] \le \frac{1}{n}\sum_{i=1}^{n}\mathbb{E}\left[\|\mathcal{C}^{W}_{g_i^t, \nabla f_i(x^t)}(\nabla f_i(x^{t+1})) - g_i^t\|^2\right] \overset{(19)}{\le} \frac{3}{n}\sum_{i=1}^{n}\mathbb{E}\left[\|\mathcal{C}^{W}_{g_i^t, \nabla f_i(x^t)}(\nabla f_i(x^{t+1})) - \nabla f_i(x^{t+1})\|^2\right] + \frac{3}{n}\sum_{i=1}^{n}\mathbb{E}\left[\|\nabla f_i(x^{t+1}) - \nabla f_i(x^t)\|^2\right] + \frac{3}{n}\sum_{i=1}^{n}\mathbb{E}\left[\|\nabla f_i(x^t) - g_i^t\|^2\right] \overset{(6)}{\le} 3(1 - A^W)\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}\left[\|\nabla f_i(x^t) - g_i^t\|^2\right] + 3B^W\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}\left[\|\nabla f_i(x^{t+1}) - \nabla f_i(x^t)\|^2\right] + \frac{3}{n}\sum_{i=1}^{n}\mathbb{E}\left[\|\nabla f_i(x^{t+1}) - \nabla f_i(x^t)\|^2\right] + \frac{3}{n}\sum_{i=1}^{n}\mathbb{E}\left[\|\nabla f_i(x^t) - g_i^t\|^2\right] \overset{\text{Assumption 3}}{\le} 3(2 - A^W)\,\mathbb{E}\left[P^t\right] + 3(B^W + 1)L_+^2\,\mathbb{E}\left[\|x^{t+1} - x^t\|^2\right] = 3(2 - A^W)\,\mathbb{E}\left[P^t\right] + 3(B^W + 1)L_+^2\,\mathbb{E}\left[R^t\right], \tag{35}$$

where the first inequality follows from Young's (Jensen's) inequality. Plugging (35) into (33), we finish the proof:

$$\mathbb{E}\left[\|g^{t+1} - \hat{g}^{t+1}\|^2\right] \le (1 - A^M)\,\mathbb{E}\left[\|g^t - \hat{g}^t\|^2\right] + 3B^M(2 - A^W)\,\mathbb{E}\left[P^t\right] + 3B^M(B^W + 1)L_+^2\,\mathbb{E}\left[R^t\right]. \qquad \square$$

Having proved the previous lemmas, we can now show the convergence of the bidirectional 3PC algorithm.

**Theorem 7.** Let Assumptions 3 and 5 hold, and let the stepsize in Algorithm 2 be set as

$$0 < \gamma \le \frac{1}{L_- + L_+\sqrt{\frac{6B^M(B^W+1)}{A^M} + \frac{2B^W}{A^W}\left(1 + \frac{3B^M(2 - A^W)}{A^M}\right)}}.$$

Fix $T$ and let $\hat{x}^T$ be chosen from $\{x^0, x^1, \dots, x^{T-1}\}$ uniformly at random. Then

$$\mathbb{E}\left[\|\nabla f(\hat{x}^T)\|^2\right] \le \frac{2\Psi^0}{\gamma T},$$

where $\Psi^t = f(x^t) - f^{\inf} + \frac{\gamma}{A^M}\|g^t - \hat{g}^t\|^2 + \frac{\gamma}{A^W}\left(1 + \frac{3B^M(2 - A^W)}{A^M}\right)\frac{1}{n}\sum_{i=1}^{n}\|g_i^t - \nabla f_i(x^t)\|^2$ for any $t \ge 0$.

*Proof.* We apply Lemma 7 and split the error $\|g^t - \nabla f(x^t)\|^2$ into two parts:

$$f(x^{t+1}) \overset{(20)}{\le} f(x^t) - \frac{\gamma}{2}\|\nabla f(x^t)\|^2 - \left(\frac{1}{2\gamma} - \frac{L_-}{2}\right)R^t + \frac{\gamma}{2}\|g^t - \nabla f(x^t)\|^2 \overset{(18)}{\le} f(x^t) - \frac{\gamma}{2}\|\nabla f(x^t)\|^2 - \left(\frac{1}{2\gamma} - \frac{L_-}{2}\right)R^t + \gamma\|\hat{g}^t - \nabla f(x^t)\|^2 + \gamma\|g^t - \hat{g}^t\|^2 \le f(x^t) - \frac{\gamma}{2}\|\nabla f(x^t)\|^2 - \left(\frac{1}{2\gamma} - \frac{L_-}{2}\right)R^t + \gamma\frac{1}{n}\sum_{i=1}^{n}\|g_i^t - \nabla f_i(x^t)\|^2 + \gamma\|g^t - \hat{g}^t\|^2 = f(x^t) - \frac{\gamma}{2}\|\nabla f(x^t)\|^2 - \left(\frac{1}{2\gamma} - \frac{L_-}{2}\right)R^t + \gamma P^t + \gamma\|g^t - \hat{g}^t\|^2, \tag{38}$$

where we used Young's inequality (18) and Jensen's inequality. Subtracting $f^{\inf}$ from both sides of the above inequality, taking expectations, and using the notation $\delta^t = f(x^t) - f^{\inf}$, we get

$$\mathbb{E}\left[\delta^{t+1}\right] \le \mathbb{E}\left[\delta^t\right] - \frac{\gamma}{2}\mathbb{E}\left[\|\nabla f(x^t)\|^2\right] - \left(\frac{1}{2\gamma} - \frac{L_-}{2}\right)\mathbb{E}\left[R^t\right] + \gamma\,\mathbb{E}\left[P^t\right] + \gamma\,\mathbb{E}\left[\|g^t - \hat{g}^t\|^2\right]. \tag{39}$$

Further, Lemmas 10 and 11 provide the recursive bounds for the last two terms of (39):

$$\mathbb{E}\left[P^{t+1}\right] \le (1 - A^W)\,\mathbb{E}\left[P^t\right] + B^W L_+^2\,\mathbb{E}\left[R^t\right], \tag{40}$$
$$\mathbb{E}\left[\|g^{t+1} - \hat{g}^{t+1}\|^2\right] \le (1 - A^M)\,\mathbb{E}\left[\|g^t - \hat{g}^t\|^2\right] + 3B^M(2 - A^W)\,\mathbb{E}\left[P^t\right] + 3B^M(B^W + 1)L_+^2\,\mathbb{E}\left[R^t\right]. \tag{41}$$

Summing up (39) with a $\frac{\gamma}{A^M}$ multiple of (41), we obtain

$$\mathbb{E}\left[\delta^{t+1}\right] + \frac{\gamma}{A^M}\mathbb{E}\left[\|g^{t+1} - \hat{g}^{t+1}\|^2\right] \le \mathbb{E}\left[\delta^t\right] - \frac{\gamma}{2}\mathbb{E}\left[\|\nabla f(x^t)\|^2\right] - \left(\frac{1}{2\gamma} - \frac{L_-}{2}\right)\mathbb{E}\left[R^t\right] + \gamma\,\mathbb{E}\left[P^t\right] + \gamma\,\mathbb{E}\left[\|g^t - \hat{g}^t\|^2\right] + \frac{\gamma}{A^M}\left((1 - A^M)\,\mathbb{E}\left[\|g^t - \hat{g}^t\|^2\right] + 3B^M(2 - A^W)\,\mathbb{E}\left[P^t\right] + 3B^M(B^W + 1)L_+^2\,\mathbb{E}\left[R^t\right]\right) = \mathbb{E}\left[\delta^t\right] - \frac{\gamma}{2}\mathbb{E}\left[\|\nabla f(x^t)\|^2\right] + \frac{\gamma}{A^M}\mathbb{E}\left[\|g^t - \hat{g}^t\|^2\right] - \left(\frac{1}{2\gamma} - \frac{L_-}{2} - \frac{3\gamma B^M(B^W + 1)L_+^2}{A^M}\right)\mathbb{E}\left[R^t\right] + \gamma\left(1 + \frac{3B^M(2 - A^W)}{A^M}\right)\mathbb{E}\left[P^t\right].$$
Then, adding to the above inequality a $\frac{\gamma}{A^W}\left(1 + \frac{3B^M(2 - A^W)}{A^M}\right)$ multiple of (40), we get

$$\mathbb{E}\left[\Psi^{t+1}\right] = \mathbb{E}\left[\delta^{t+1}\right] + \frac{\gamma}{A^M}\mathbb{E}\left[\|g^{t+1} - \hat{g}^{t+1}\|^2\right] + \frac{\gamma}{A^W}\left(1 + \frac{3B^M(2 - A^W)}{A^M}\right)\mathbb{E}\left[P^{t+1}\right] \le \mathbb{E}\left[\delta^t\right] + \frac{\gamma}{A^M}\mathbb{E}\left[\|g^t - \hat{g}^t\|^2\right] + \frac{\gamma}{A^W}\left(1 + \frac{3B^M(2 - A^W)}{A^M}\right)\mathbb{E}\left[P^t\right] - \frac{\gamma}{2}\mathbb{E}\left[\|\nabla f(x^t)\|^2\right] - \left(\frac{1}{2\gamma} - \frac{L_-}{2} - \frac{3\gamma B^M(B^W + 1)L_+^2}{A^M} - \frac{\gamma B^W L_+^2}{A^W}\left(1 + \frac{3B^M(2 - A^W)}{A^M}\right)\right)\mathbb{E}\left[R^t\right]. \tag{42}$$

Thus, by Lemma 6 and the choice of the stepsize

$$\gamma \le \frac{1}{L_- + L_+\sqrt{\frac{6B^M(B^W + 1)}{A^M} + \frac{2B^W}{A^W}\left(1 + \frac{3B^M(2 - A^W)}{A^M}\right)}},$$

the last term in (42) is non-positive. Summing up the resulting inequalities for $t = 0, 1, \dots, T - 1$, we get

$$0 \le \mathbb{E}\left[\Psi^T\right] \le \Psi^0 - \frac{\gamma}{2}\sum_{t=0}^{T-1}\mathbb{E}\left[\|\nabla f(x^t)\|^2\right].$$

Multiplying both sides by $\frac{2}{\gamma T}$ and rearranging, we get

$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\left[\|\nabla f(x^t)\|^2\right] \le \frac{2\Psi^0}{\gamma T}. \qquad \square$$

### B.4 Convergence for general nonconvex functions

The results in the two subsequent subsections set $\mathcal{M}^W$ to be a 3PC compressor and $\mathcal{M}^M$ to be the identity. According to Lemma 2, Adaptive 3PC is a 3PC compressor. Thus, the convergence results from (Richtárik et al., 2022) are valid for the Adaptive 3PC compressor. This leads us to the following corollary.

**Corollary 3** (Corollary 5.6 of (Richtárik et al., 2022)). Let Assumptions 2, 3 and 5 hold. Let $\mathcal{M}^W$ and $\mathcal{M}^M$ in Algorithm 1 be Ada3PC and identity compressors, respectively, and choose the stepsize $\gamma = \frac{1}{L_- + L_+\sqrt{2B_{\max}/A_{\min}}}$. Then, for any $T \ge 1$,

$$\mathbb{E}\left[\|\nabla f(\hat{x}^T)\|^2\right] \le \frac{2\left(f(x^0) - f^{\inf}\right)\left(L_- + L_+\sqrt{\frac{2B_{\max}}{A_{\min}}}\right) + \frac{2}{A_{\min}}\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^{n}\|g_i^0 - \nabla f_i(x^0)\|^2\right]}{T}.$$

That is, to achieve $\mathbb{E}\left[\|\nabla f(\hat{x}^T)\|^2\right] \le \varepsilon^2$ for some $\varepsilon > 0$, Algorithm 1 requires

$$T = \mathcal{O}\left(\frac{2\left(f(x^0) - f^{\inf}\right)\left(L_- + L_+\sqrt{\frac{2B_{\max}}{A_{\min}}}\right) + \frac{2}{A_{\min}}\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^{n}\|g_i^0 - \nabla f_i(x^0)\|^2\right]}{\varepsilon^2}\right)$$

iterations.

### B.5 Convergence for PŁ nonconvex functions

The setup here is the same as in the previous subsection, except that we add the following assumption.

**Assumption 6** (PŁ condition). The function $f : \mathbb{R}^d \to \mathbb{R}$ satisfies the Polyak-Łojasiewicz (PŁ) condition with parameter $\mu > 0$, i.e.,

$$\|\nabla f(x)\|^2 \ge 2\mu\left(f(x) - f^*\right) \qquad \forall x \in \mathbb{R}^d,$$

where $x^* := \arg\min_{x \in \mathbb{R}^d} f(x)$ and $f^* := f(x^*)$.

**Corollary 4** (Corollary 5.9 of (Richtárik et al., 2022)). Let Assumptions 2, 3, 5 and 6 hold. Let $\mathcal{M}^W$ and $\mathcal{M}^M$ in Algorithm 1 be Ada3PC and identity compressors, respectively, and choose the stepsize $\gamma$ as in (Richtárik et al., 2022, Corollary 5.9). Then, to achieve $\mathbb{E}\left[f(x^T)\right] - f^* \le \varepsilon$ for some $\varepsilon > 0$, the method requires

$$T = \mathcal{O}\left(\max\left\{\frac{L_- + L_+\sqrt{2B_{\max}/A_{\min}}}{\mu}, \frac{1}{A_{\min}}\right\}\log\frac{f(x^0) - f^{\inf} + \frac{2\gamma}{A_{\min}}\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^{n}\|g_i^0 - \nabla f_i(x^0)\|^2\right]}{\varepsilon}\right)$$

iterations.

## C Experimental details and extra experiments

All simulations are implemented in Python 3.8 and run on an Intel(R) Xeon(R) Gold 6230R CPU cluster with 48 nodes. We fine-tune the stepsize of each considered algorithm with $(2^0, 2^1, \dots, 2^8)$ multiples of the corresponding theoretical stepsize. As the contractive compressor we use the Top-$K$ operator. For EF21 and CLAG we use the Top-1 compressor, which is usually the best in practice for these methods. For AdaCGD we choose compressors varying from full compression (skipping communication) to zero compression of features (sending the full gradient). In order to provide fair comparisons, we choose the master compressor $\mathcal{M}^M$ to be the identity operator in these experiments. As the stopping criterion we use the communication cost of the algorithm.

We use the setup described in (Richtárik et al., 2022), namely logistic regression with a nonconvex regularizer:

$$\min_{x \in \mathbb{R}^d} \frac{1}{N}\sum_{i=1}^{N}\log\left(1 + e^{-y_i a_i^\top x}\right) + \lambda\sum_{j=1}^{d}\frac{x_j^2}{1 + x_j^2},$$

where $a_i \in \mathbb{R}^d$, $y_i \in \{-1, 1\}$ are the training samples and labels, and the regularization hyperparameter $\lambda > 0$ is chosen at the $\lambda = 0.1$ level. We solve this problem using the LIBSVM (Chang & Lin, 2011) datasets phishing, a1a, and a9a. Each dataset has been evenly split into $n = 20$ equal parts, where each part represents a separate client.
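For reference, the following is a minimal NumPy sketch of the objective above and its gradient as we read it; `A` is the $N \times d$ matrix of samples $a_i$, `y` is the label vector, and `lam` stands for $\lambda$. The function names and the random test data are ours for illustration, not the authors' code.

```python
import numpy as np

def loss(x, A, y, lam=0.1):
    """Logistic loss with the nonconvex regularizer lam * sum_j x_j^2 / (1 + x_j^2)."""
    margins = -y * (A @ x)
    return np.mean(np.logaddexp(0.0, margins)) + lam * np.sum(x**2 / (1 + x**2))

def grad(x, A, y, lam=0.1):
    """Gradient of the objective above."""
    margins = -y * (A @ x)
    sigma = np.exp(margins - np.logaddexp(0.0, margins))   # stable e^m / (1 + e^m)
    data_grad = -(A * (y * sigma)[:, None]).mean(axis=0)
    reg_grad = lam * 2 * x / (1 + x**2) ** 2
    return data_grad + reg_grad

# Quick check on synthetic data.
rng = np.random.default_rng(0)
A_mat = rng.normal(size=(100, 20))
y_vec = rng.choice([-1.0, 1.0], size=100)
print(loss(np.zeros(20), A_mat, y_vec), np.linalg.norm(grad(np.zeros(20), A_mat, y_vec)))
```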
Figures 3-6 compare AdaCGD with LAG, EF21, and their generalization CLAG. Figure 4 provides the comparison in the convex regime, i.e., $\lambda = 0$, on the phishing dataset. In these experiments, AdaCGD is shown to be comparable, and in some cases superior, to CLAG, and always superior to LAG. In other words, AdaCGD efficiently complements CLAG and other 3PC methods.

*Figure 3: Comparison of LAG, CLAG, EF21 and GD with AdaCGD on the phishing dataset.*

*Figure 4: Comparison of LAG, CLAG, EF21 and GD with AdaCGD on the phishing dataset in the convex regime, i.e., $\lambda = 0$.*

*Figure 5: Comparison of LAG, CLAG, EF21 and GD with AdaCGD on the a1a dataset.*

*Figure 6: Comparison of LAG, CLAG, EF21 and GD with AdaCGD on the a9a dataset.*

## D Limitations

The main limitations of this work are the assumptions we impose on the functions $f_i$ of problem (1). On the other hand, these assumptions govern the convergence rates we report: for example, we cannot show a linear rate for convex functions due to the fundamental lower bound (Nesterov et al., 2018). Another limitation comes from the analysis of the bidirectional 3PC algorithm (Theorem 7): we provide the analysis only for general nonconvex functions.