# Federated Learning under Arbitrary Communication Patterns

Dmitrii Avdyukhin¹  Shiva Prasad Kasiviswanathan²

¹Department of Computer Science, Indiana University, Bloomington, IN, USA. ²Amazon, Palo Alto, CA, USA. Correspondence to: Dmitrii Avdyukhin, Shiva Kasiviswanathan.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Abstract

Federated Learning is a distributed learning setting where the goal is to train a centralized model with training data distributed over a large number of heterogeneous clients, each with unreliable and relatively slow network connections. A common optimization approach used in federated learning is based on the idea of local SGD: each client runs some number of SGD steps locally and then the updated local models are averaged to form the updated global model on the coordinating server. In this paper, we investigate the performance of an asynchronous version of local SGD wherein the clients can communicate with the server at arbitrary time intervals. Our main result shows that for smooth strongly convex and smooth nonconvex functions we achieve convergence rates that match the synchronous version that requires all clients to communicate simultaneously.

1. Introduction

Federated learning (FL) is a distributed machine learning setting that aims to collaboratively train a model under the orchestration of a central server. Practical applications of FL range from cross-device scenarios, where a huge number of typically unreliable clients with small quantities of data per client participate, to cross-silo scenarios with smaller numbers of reliable clients, each possessing larger quantities of data (Kairouz et al., 2019). Typically, in a FL application the clients perform most of the computation, and a central parameter server updates the model parameters using the information returned by the clients. Without explicit sharing of data from clients, FL can mitigate some of the privacy risks associated with traditional distributed learning techniques.

The canonical federated learning problem involves learning a single, global statistical model from data stored on lots of remote devices. In particular, the goal is typically to minimize the following objective function:

min_x f(x),  where  f(x) = (1/N) Σ_{i=1}^{N} f^(i)(x),   (1)

where each f^(i) is based on data available on client i. Here, N is the number of clients. Federated learning brings in some unique characteristics in solving the optimization problem posed in (1), which also make federated learning distinct from traditional distributed learning.

(i) Communication is a critical bottleneck in federated settings. Federated networks are potentially comprised of a massive number of devices, and communication in the network can be slower than local computation by many orders of magnitude (Kairouz et al., 2019).

(ii) Devices frequently generate and collect data in a varied manner, e.g., mobile phone users may use language differently, which might affect the next-word prediction task. This means that the training data are non-identically distributed, that is, a device's local data cannot be regarded as samples drawn from the overall distribution.

(iii) One final difference is that, unlike traditional distributed learning systems, in the FL setting the server has no control over users' devices.
For example, when Wi-Fi access on a device is temporarily unavailable, the device may not communicate with the server for many rounds.

1.1. Our Model and Results

We propose a federated learning algorithm that tries to address all three challenges laid out above. In particular, we move away from two commonly used impractical assumptions: (a) identical data distribution across clients, and (b) that all clients can synchronize and communicate periodically or as demanded by the server.

Table 1: Comparison of our bounds with existing results (under similar assumptions) to reach asymptotically the same statistical term as the convergence rate of minibatch SGD. A larger Δ is preferable for reducing communication. The results of Basu et al. (2019) also consider additional gradient compression techniques, which are ignored here for a direct comparison.

| Assumption on f | Bound on Δ with sync. local SGD | Previous bound on Δ with async. client updates | Our bound on Δ with async. client updates |
|---|---|---|---|
| Smooth, strongly convex | O(√(T/N)) (Basu et al., 2019) | O((T/N)^{1/4}) (Basu et al., 2019) | O(√(T/N)) |
| Smooth, nonconvex | O(T^{1/4}/N^{3/4}) (Yu et al., 2019b; Basu et al., 2019) | O(T^{1/8}/N^{3/8}) (Basu et al., 2019) | O(T^{1/4}/N^{3/4}) |

With the goal of increasing the compute-to-communication ratio, a common idea in the federated/distributed setup is that, instead of keeping the iterates on different clients in sync, we allow them to evolve locally on each machine, independently from each other, and only average the sequences once every several iterations. Such a strategy is commonly referred to as local SGD (Mangasarian, 1995; Zinkevich et al., 2010; Coppola, 2015; Stich, 2018; Zhou & Cong, 2018), but is also known in the literature under various other names (such as parallel SGD) (Yu et al., 2019b). In the simplest version of synchronous local SGD, each client performs local SGD updates in parallel on the local data, and the server averages all clients' iterates after every Δ updates.[1]

[1] Local SGD is different from minibatch SGD, where the averaging happens after every iteration, but it is more closely related to minibatch SGD with Δ times larger batch sizes on each client.

In this paper, we use an asynchronous model for client communication, where all the client iterates evolve at the same rate but communicate with the server at arbitrary times decided individually by each client.[2] In this asynchronous model, variants of which were recently considered by (Stich, 2018; Basu et al., 2019; Stich & Karimireddy, 2019), each client takes the same number of steps per unit time according to a global clock. So the local iterations are in synchrony with respect to the global clock, but the asynchrony comes from the communication of the clients. As we will discuss, the only assumption we make on the client communication is that each client communicates to the server at least once every Δ ≥ 1 rounds. In this paper, we analyze the local SGD algorithm in this asynchronous communication setting.[3]

We consider both smooth strongly convex and smooth nonconvex objectives. Under a standard assumption of bounded second moment (Rakhlin et al., 2012; Yu et al., 2019b; Stich, 2018; Basu et al., 2019), we show that the convergence rate of our proposed local SGD with asynchronous updates matches that of synchronous local SGD where all clients communicate together every Δ rounds. This is the first result showing that local SGD, even with asynchronous updates from the clients, performs as well as synchronous local SGD in the heterogeneous (non-identical) data setting.
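To be precise about what this assumption requires, it is a simple property of the per-round sets of communicating clients (denoted C_t later in Section 2). The helper below is our own illustration of that property, not part of the paper:

```python
def respects_delta_gap(schedule, num_clients, delta):
    """Return True iff every client communicates at least once in every window of
    `delta` consecutive rounds of the given schedule (our formalization of the
    paper's communication assumption). `schedule[t - 1]` is the set of clients
    that communicate in round t; rounds are numbered 1, ..., T.
    """
    num_rounds = len(schedule)
    for i in range(num_clients):
        # Rounds in which client i talks to the server.
        rounds_i = [t for t, clients in enumerate(schedule, start=1) if i in clients]
        # Treat round 0 (start) and round T + 1 as sentinel communication points,
        # so leading and trailing silence is also limited.
        points = [0] + rounds_i + [num_rounds + 1]
        if any(b - a > delta for a, b in zip(points, points[1:])):
            return False
    return True
```

The fully synchronous setting (every client, every round) satisfies this with Δ = 1, while random client sampling in the style of FedAvg need not satisfy it for any fixed Δ (see the discussion of RANDOM(p) in Section 3.2).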
[2] This model also subsumes the standard synchronous model where all clients communicate in each round.

[3] Stich (2018) considered a different variant of asynchronous local SGD, where each client has an identical data distribution; however, the clients can evolve their computation at slightly different rates, resulting in delayed stochastic gradient updates.

Previous convergence bounds from (Basu et al., 2019) were considerably weaker under similar assumptions. In other words, while due to the asynchronous nature of communication the gradient information for some of the clients at the server might be stale, somewhat surprisingly our results show that as long as this period of staleness is bounded by some Δ, we get similar convergence behavior (under our assumptions) as synchronous local SGD where the communication happens every fixed Δ rounds.

For simplicity of discussion, in this section we ignore the dependence on various parameters such as the strong convexity, smoothness, variance bound, and gradient norm bound. Table 1 summarizes our main theoretical results. For smooth strongly convex functions, we show that an averaged iterate x̂_T satisfies (see Theorem 2.2 for a precise statement):

E[f(x̂_T) − f(x*)] = O(1/(NT) + Δ²/T²),

where x* is a minimizer of f. In particular, we can set Δ = O(√(T/N)) and reach asymptotically the same convergence rate as minibatch SGD, O(1/(NT)).[4] This matches the bound on Δ known for synchronous local SGD (Basu et al., 2019, Corollary 3) and improves the previously best known bound on Δ in our asynchronous update setting (Basu et al., 2019, Corollary 5) by a square factor, from (T/N)^{1/4} to √(T/N).

[4] It is desirable to have a larger Δ, as it translates into lower communication overhead.

For smooth nonconvex functions, we show that the iterates x_t satisfy (see Theorem 2.4 for a precise statement):

(1/T) Σ_{t=0}^{T−1} E[‖∇f(x_t)‖²] = O(1/√(NT) + NΔ²/T).

In particular, we can set Δ = O(T^{1/4}/N^{3/4}) and reach asymptotically the same convergence rate as minibatch SGD, O(1/√(NT)). Again, this matches the bound on Δ known for synchronous local SGD, see e.g., (Basu et al., 2019, Corollary 2) or (Yu et al., 2019b, Corollary 3). Similarly, this improves the previously best known bound on Δ in our asynchronous update setting (Basu et al., 2019, Corollary 4) by a square factor, from T^{1/8}/N^{3/8} to T^{1/4}/N^{3/4}.

Finally, we also empirically evaluate multiple instantiations of our scheme to demonstrate the various factors affecting the performance in practice.

Comparison to Federated Averaging. Federated Averaging (FedAvg) (McMahan et al., 2017; Konečný et al., 2016), one of the most widely used algorithms for FL applications, is a variant of local SGD. In the basic version of FedAvg, the participating devices (clients) are sampled randomly in each round, and only those devices perform SGD steps on their local data and send back the results to the server. Notice that even though related, our model, which is the asynchronous communication version of local SGD as also discussed in (Stich, 2018; Basu et al., 2019; Stich & Karimireddy, 2019), is different in that we assume all the clients perform local updates in each round.
Standard analyses of FedAvg (as defined above), in the non-identical data setting, rely on the assumption that the server gets access to a random subset of clients; if clients are unavailable, then either the assumptions are violated or we run into straggler issues (Kairouz et al., 2019). A common practical heuristic in this case is to over-sample and then take the first few responding clients, but this in fact constructs a biased set at the server (since more powerful clients are selected). Another point of distinction is that, unlike our model, FedAvg does not capture scenarios where clients communicate at their convenience.

1.2. Related Work

With the increasing popularity of federated learning there has been a lot of recent interest in understanding the convergence properties of local SGD. We refer the reader to recent excellent surveys (Kairouz et al., 2019; Li et al., 2020a) for a more comprehensive review of developments in federated learning algorithms. To emphasize the difference of our results from previous ones, we categorize the previous results into different (non-exclusive) groups. Also note that not all of these previous results had strong theoretical convergence guarantees, which is the focus of this paper.

Identical Data Distribution on Clients. A line of work has focused on analyzing local SGD under identical data distribution on clients (Zhou & Cong, 2017; Jiang & Agrawal, 2018; Wang & Joshi, 2018; Stich, 2018; Stich & Karimireddy, 2019; Haddadpour et al., 2019; Khaled et al., 2019b). If all the clients have identical data distributions, then that results in unbiasedness of gradients at every client, giving a slightly easier analysis. However, as discussed earlier, this is not a reasonable assumption for FL applications, where locally available data fail to represent the overall distribution. We make no assumptions on the local data distributions.

Synchronous/Random Client Communication with Non-identical Data Distributions. Another line of work has focused on synchronous local SGD with non-identical data distribution across clients, wherein all clients communicate their local parameters to the server every fixed Δ rounds (Yu et al., 2019b; Haddadpour & Mahdavi, 2019; Khaled et al., 2019b;a; Wang et al., 2019c; Basu et al., 2019; Li et al., 2019b). Again, as discussed, the requirement of full device (synchronous) participation is not generally practical in FL. Another variant is that a set of random clients communicate in each round (McMahan et al., 2017; Li et al., 2018; 2019a), which again is hard to ensure or verify in practice. We make no assumptions on the client communication patterns except that each client participates at least once in every Δ rounds. In particular, synchronous and random client communication can be thought of as special cases of our setup.

Extensions to SGD. We note that more sophisticated local stochastic gradient methods have also been considered, for example with momentum (Yu et al., 2019a; Wang et al., 2019b), with gradient compression (Jiang & Agrawal, 2018; Basu et al., 2019; Reisizadeh et al., 2020), and with various other variance-reduction methods (Liang et al., 2019; Sharma et al., 2019; Karimireddy et al., 2019). Our work is complementary to these approaches, and focuses on the vanilla version of local SGD commonly used in practice. The most relevant result to us is that of (Basu et al., 2019), who considered an asynchronous communication setting similar to that discussed in this paper.
They also study additional gradient compression techniques, which are not addressed here. For completeness, we include a discussion of the results from (Basu et al., 2019) in Section 2.1. Our convergence results are much tighter, for both the strongly convex and nonconvex cases, suggesting that the clients can communicate much less frequently to achieve the same convergence guarantee.

Preliminaries. Since f in (1) is an average of the f^(i)'s, we express it as f = avg_i(f^(i)) to simplify the presentation. Assume all functions f^(i) : R^d → R. We review some basic optimization concepts in Appendix A. In this paper, we assume that all f^(i)'s (and therefore f) are L-smooth. See Table 2 for a full list of notation.

Table 2: Notation used in the paper.

| Notation | Explanation |
|---|---|
| N | The number of clients |
| f^(i) | Function at the ith client |
| f | The objective function: f = (1/N) Σ_{i∈[N]} f^(i) |
| f_max | Bound on the objective value: f_max = f(x_0) − f(x*) |
| L | Smoothness parameter of f and f^(i), i ∈ [N]: ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖ |
| λ (Theorem 2.2) | Strong convexity parameter of f: f(x) ≥ f(y) + ⟨∇f(y), x − y⟩ + (λ/2)‖x − y‖² |
| Δ | Maximum gap between communications of a single client |
| ∇F^(i)(x, θ) | Stochastic gradient computed by the ith client at point x; θ is a parameter controlling randomness |
| E[·] | Expectation over the stochastic randomness θ_t^(i), i ∈ [N], t ∈ [T] |
| G_max | Maximum stochastic gradient norm: E[‖∇F^(i)‖²] ≤ G²_max |
| σ² | Variance of stochastic gradients: for all i, E_θ[‖∇F^(i)(x, θ) − ∇f^(i)(x)‖²] ≤ σ² |
| avg_i(...) | Average over all clients: (1/N) Σ_{i∈[N]} ... |
| x_t^(i) | Value stored by the ith client at the tth iteration |
| x_t | Value stored by the server at the tth iteration |
| y_t^(i) | The latest value communicated to the ith client from the server before the tth iteration |
| γ_t | Gradient step size (learning rate) |
| G_t^(i) | Stochastic gradient computed by the ith client at the tth iteration at point x_t^(i): G_t^(i) = ∇F^(i)(x_t^(i); θ_t^(i)) |
| z_t | A virtual sequence: z_{t+1} = z_t − γ_t avg_i(G_t^(i)) |
| C_t | The set of clients communicating at the tth iteration |
| ρ^(i)(t) (App. B) | Last communication round for machine i up to iteration t: ρ^(i)(t) = max({τ ≤ t : i ∈ C_τ} ∪ {0}) |

2. Our Algorithm

In Algorithm ASYNCCOMMSGD (Algorithm 1), we present our local SGD approach. We assume that there are N clients, labeled 1, . . . , N. For each client i, we maintain two parameter vectors: x_t^(i), the local parameter vector on the client, and y_t^(i), the server copy of the last (at round t or earlier) parameter vector received by the server from client i. The server maintains a global parameter vector x_t which accumulates local updates.

At each round t, each client performs a stochastic gradient descent step. Then a subset of clients C_{t+1} (possibly empty) send their updates to the server. The server aggregates them, updates the global parameter vector x_{t+1} and the copies y^(i), and sends it back to the clients in C_{t+1}. These clients then update their local parameter vector x_{t+1}^(i) to the global parameter vector x_{t+1}. A similar local SGD algorithm with asynchronous updates was also considered by Basu et al. (2019, Algorithm 2), with an additional gradient compression operator.

For constructing x_{t+1}, we average over the latest parameter vectors of all the clients currently available on the server. At first glance this might look problematic, as some of these updates might be stale, say if a client has not communicated recently. This is also different from a typical Federated Averaging scheme, where the averaging is done only over the parameter vectors of a random set of communicated clients.
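To make the round structure concrete, here is a minimal single-round sketch of this scheme in code. This is our own illustration, not the authors' implementation; the variable names (x_server, x_local, y_last, comm_set) are ours, and the full pseudocode appears in Algorithm 1 below. All vectors are assumed to be numpy arrays (or anything supporting +, -, scalar *, and .copy()).

```python
def async_comm_sgd_round(x_server, x_local, y_last, grads, comm_set, lr):
    """One round of local SGD with asynchronous communication (illustrative sketch).

    x_server : server model x_t
    x_local  : list of per-client local models x_t^(i)
    y_last   : list of per-client copies of the last model received from the server
    grads    : list of stochastic gradients G_t^(i), each evaluated at x_local[i]
    comm_set : indices of clients communicating this round (the set C_{t+1})
    lr       : step size gamma_t
    """
    n = len(x_local)

    # Every client takes one local SGD step.
    v = [x_local[i] - lr * grads[i] for i in range(n)]

    # Communicating clients send the change accumulated since their last sync.
    deltas = [v[i] - y_last[i] for i in comm_set]

    # The server folds the received deltas into the global model; this is
    # equivalent to averaging the latest parameter vectors it holds for all N clients.
    if deltas:
        x_server = x_server + sum(deltas) / n

    # Communicating clients adopt the new global model; the rest keep evolving locally.
    for i in range(n):
        if i in comm_set:
            x_local[i] = x_server.copy()
            y_last[i] = x_server.copy()
        else:
            x_local[i] = v[i]
    return x_server, x_local, y_last
```

Note that the aggregation divides by the total number of clients N (not by |C_{t+1}|), which is exactly the "average over the latest parameter vectors of all the clients" described above.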
However, this averaging is crucial for our analysis. In fact, a Federated Averaging scheme based on clients communicating randomly will not have good convergence rates if there are rounds in which only a small set of clients communicate, something that our analysis does not suffer from. Moreover, if some clients don't communicate at all, then it is impossible to find a minimizer of the objective, since each client could have a different data distribution. Therefore, it is crucial to have an upper bound (Δ) on the maximum delay between each client's update times.

Note that the set C_{t+1} doesn't need to be known to the server; for example, a round can start and end at a pre-specified global clock time, and all the clients that communicate within this time window then form the set C_{t+1}. This way the communication patterns are controlled by the clients and are not at the behest of the server. For example, in a FL setting, if a client has a connectivity issue, then the client could communicate back when the connectivity is restored. For simplicity of presentation, in Algorithm ASYNCCOMMSGD we assume that the server knows C_{t+1}.

Algorithm 1 ASYNCCOMMSGD

    parameters: {γ_t} step sizes, T the number of rounds, x_0 starting point,
                {C_t} for each t, which clients communicate at iteration t

    On each client i ∈ [N]:
        x_0^(i) ← x_0          // Local parameters on the client
        y_0^(i) ← x_0          // Last parameters received by the client
    for t = 0 ... T − 1 do
        Client Update:          // Run on each client i
            G_t^(i) ← stochastic gradient for f^(i) at x_t^(i)
            v_{t+1}^(i) ← x_t^(i) − γ_t G_t^(i)          // Local SGD step
            if i ∈ C_{t+1} then
                Send δ_{t+1}^(i) := v_{t+1}^(i) − y_t^(i) to the server
                Receive x_{t+1}
                x_{t+1}^(i) ← x_{t+1}
                y_{t+1}^(i) ← x_{t+1}
            else
                x_{t+1}^(i) ← v_{t+1}^(i)
                y_{t+1}^(i) ← y_t^(i)
            end if
        Server Update:          // Run on the server
            Receive δ_{t+1}^(i) from clients i ∈ C_{t+1}
            x_{t+1} ← x_t + (1/N) Σ_{i ∈ C_{t+1}} δ_{t+1}^(i)          // Aggregate updates
            Send x_{t+1} to clients i ∈ C_{t+1}
    end for

2.1. Convergence Analysis

In this section we present our main convergence results for local SGD with asynchronous updates, obtained by running Algorithm ASYNCCOMMSGD on smooth functions, both strongly convex and nonconvex. Missing details from this section are collected in Appendix B. We use the following standard assumptions.

i. Smoothness: All local functions f^(i) (i ∈ [N]) are L-smooth (see Definition 3, Appendix A).

ii. Bounded second moment: There exists a G_max > 0 such that E[‖∇F^(i)(x)‖²] ≤ G²_max for all x ∈ R^d, i ∈ [N], where ∇F^(i)(x) is an unbiased stochastic gradient of f^(i) at x. This is a standard assumption in the SGD literature (Rakhlin et al., 2012; Stich, 2018; Stich et al., 2018; Yu et al., 2019b; Basu et al., 2019), etc.[5] Relaxing this bounded gradient assumption, as achieved through different parameters in the recent distributed SGD/GD literature (see, e.g., (Wang & Joshi, 2018; Khaled et al., 2019b; Haddadpour et al., 2019; Yu et al., 2019a; Li et al., 2020b; Wang et al., 2019a; Li et al., 2018)), is an interesting open problem. The second moment assumption also implies a bound on the variance, E[‖∇F^(i)(x) − ∇f^(i)(x)‖²] ≤ σ² for all x ∈ R^d, i ∈ [N] (where σ² ≤ G²_max).

Let G_t^(i) = ∇F^(i)(x_t^(i)) be the stochastic gradient computed by the ith client at the t-th round. Recall that C_{t+1} is the set of clients communicating at the t-th round. Then our update equation on the clients has the following form:

x_{t+1}^(i) = x_t^(i) − γ_t G_t^(i)   if i ∉ C_{t+1},
x_{t+1}^(i) = x_{t+1}                 if i ∈ C_{t+1},

[5] A consequence of this assumption is that it also bounds the gradient difference:
for example, ‖∇f^(i)(x) − ∇f^(i′)(x)‖ ≤ 2G_max for any two clients i, i′ ∈ [N] and for all x ∈ R^d.

Here, x_{t+1} is the server model accumulating the updates communicated to the server. Let ρ^(i)(t) = max{τ ≤ t : i ∈ C_τ} be the last round up to t at which client i communicated with the server. Since the server has received updates from client i up to this round, we have

x_t = x_0 − avg_i( Σ_{τ=0}^{ρ^(i)(t)} γ_τ G_τ^(i) ).

To show convergence rates, we investigate ‖x_t − x*‖², where x* is a minimizer of f. Unfortunately, x_t has a rather complicated update equation. To address this issue, we define a virtual sequence {z_t}_{t∈N} in the following way.

Definition 1 (Virtual sequence)  z_t = x_0 − Σ_{τ=0}^{t−1} γ_τ avg_i(G_τ^(i)).

Similar virtual sequences have been utilized before in decentralized optimization under various contexts (Lian et al., 2017; Yuan et al., 2016; Nedić et al., 2018; Stich, 2018). We show (see Proposition 2.1) that the z_t are close to x_t and to the x_t^(i) for all clients i. The advantage of working with z_t is its simple update equation,

z_{t+1} = z_t − γ_t avg_i(G_t^(i)),

which makes the analysis cleaner. The following proposition bounds the distance between the virtual z_t and the local x_t^(i) in terms of the parameter Δ. This proposition and its proof are similar to (Stich, 2018, Lemma B.1).

Proposition 2.1 (Distance Bound)  Let {γ_t} be a nonincreasing sequence such that γ_t/γ_{t+Δ} ≤ 2. Then in Algorithm ASYNCCOMMSGD, for each client i ∈ [N]:

max{ E[‖z_t − x_t^(i)‖²], E[‖z_t − x_t‖²] } ≤ 72 γ_t² G²_max Δ².

Analysis for Strongly Convex Functions. We now assume that the f^(i) for i ∈ [N] are L-smooth functions and that f = avg_i(f^(i)) is λ-strongly convex (Definitions 2 and 3). In Theorem 2.2, we state the convergence theorem for strongly convex functions when using Algorithm ASYNCCOMMSGD. Instead of x_t, we consider a weighted average x̂_T = (1/S_T) Σ_{t=0}^{T−1} w_t x_t, where w_t = (Δ + t)² and S_T = Σ_{t=0}^{T−1} w_t. Therefore, the sequence {x̂_t}_{t∈[T]} can be easily computed from the sequence {x_t}_{t∈[T]}. This choice of w_t puts more weight on later rounds. A similar x̂_t was also considered in the previous related work of (Stich, 2018; Basu et al., 2019).

Our analysis starts by bounding E[‖z_{t+1} − x*‖²], where x* is the minimizer of f. Using properties of strong convexity and smoothness, along with Proposition 2.1, we establish

E[‖z_{t+1} − x*‖²] ≤ (1 − γ_t λ/2) E[‖z_t − x*‖²] − 2γ_t (f(z_t) − f(x*)) + γ_t² σ²/N + (300 γ_t³/λ) L² G²_max Δ².

This, along with a recurrence relation from (Stich et al., 2018), gives a bound in terms of ẑ_T = (1/S_T) Σ_{t=0}^{T−1} w_t z_t:

E[f(ẑ_T)] − f(x*) ≤ G²_max (Δ + 4L/λ)³ / (4λ S_T) + 2T(T + 2Δ + 8L/λ) σ² / (λ S_T N) + 104 T L² G²_max Δ² / (λ³ S_T).

Now using Proposition 2.1, which shows that the virtual sequence ẑ_T is close to x̂_T, yields the following result.

Theorem 2.2  Let the f^(i), i ∈ [N], be L-smooth functions and let f = avg_i(f^(i)) be an L-smooth, λ-strongly convex function. Let w_t = (Δ + t)², S_T = Σ_{t=0}^{T−1} w_t, and x̂_T = (1/S_T) Σ_{t=0}^{T−1} w_t x_t. After T rounds of Algorithm ASYNCCOMMSGD with γ_t = 8/(λ(t + Δ + 8L/λ)), we have

E[f(x̂_T)] − f(x*) = O( L G²_max (Δ + L/λ)³ / (λ² S_T) + L T (T + Δ + L/λ) σ² / (λ² S_T N) + L³ G²_max Δ² T / (λ⁴ S_T) ).

In particular, for a fixed Δ, we get a convergence rate of O(1/(NT) + Δ²/T²) (ignoring other terms). Using the fact that S_T = Ω(T³), we get the following result.

Corollary 2.3  Under the assumptions of Theorem 2.2, if T ≥ N, Δ ≤ √(T/N), and Δ ≥ L/λ, then

E[f(x̂_T)] − f(x*) = O( Lσ²/(λ²TN) + (LG²_max/(λ²TN)) (1 + L²/λ²) ).

In particular, in terms of T and N we recover the standard minibatch SGD convergence rate for strongly convex functions of O(1/(TN)).
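To spell out where Corollary 2.3 comes from (this is our own bookkeeping, with absolute constants hidden in the symbol ≲), substitute S_T = Ω(T³) and the conditions Δ ≤ √(T/N) ≤ T and Δ ≥ L/λ into the three terms of Theorem 2.2:

```latex
\frac{L G_{\max}^2 (\Delta + L/\lambda)^3}{\lambda^2 S_T}
  \;\lesssim\; \frac{L G_{\max}^2 \Delta^3}{\lambda^2 T^3}
  \;\le\; \frac{L G_{\max}^2}{\lambda^2\, T^{3/2} N^{3/2}}
  \;\le\; \frac{L G_{\max}^2}{\lambda^2 T N},
\\[6pt]
\frac{L T (T + \Delta + L/\lambda)\,\sigma^2}{\lambda^2 S_T N}
  \;\lesssim\; \frac{L \sigma^2}{\lambda^2 T N},
\qquad
\frac{L^3 G_{\max}^2 \Delta^2 T}{\lambda^4 S_T}
  \;\lesssim\; \frac{L^3 G_{\max}^2 \Delta^2}{\lambda^4 T^2}
  \;\le\; \frac{L^3 G_{\max}^2}{\lambda^4 T N}.
```

In terms of T and N alone, each term is O(1/(TN)); moreover, Δ of order √(T/N) is the largest choice for which the Δ-dependent terms stay below the 1/(TN) statistical term, which is where the strongly convex bound on Δ in Table 1 comes from.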
In other words, to achieve this convergence rate within T rounds, we need each client to communicate T/Δ = Ω(√(TN)) times. Compare this with the corresponding result from (Basu et al., 2019, Corollary 5) (we change notation to match ours and consider the case without compression):

E[f(x̂_T)] − f(x*) = O( G²_max Δ³/(λ²T³) + (T + Δ) σ²/(λ²T²N) + G²_max Δ⁴/(λ³T²) ).

In our case, the last term has a much better dependence on Δ (Δ² instead of Δ⁴). As a consequence, we can improve the bound on Δ from (T/N)^{1/4} to the square of that: √(T/N).

Analysis for Nonconvex Functions. We now assume that the f^(i) for i ∈ [N] are L-smooth functions (see Definition 3), but f is not necessarily convex. We use a fixed step size γ, and therefore the condition of Proposition 2.1 is always satisfied, and we can directly use the iterates x_t produced by Algorithm ASYNCCOMMSGD.

We again consider the virtual sequence z_t per Definition 1. In this case, our analysis is based on bounding (1/T) Σ_{t=0}^{T−1} E[‖∇f(z_t)‖²] by

4(f(z_0) − E[f(z_T)])/(γT) + 4γ (10γ L² G²_max Δ² + Lσ²/(2N)).

Then we bound ‖∇f(x_t)‖ in terms of ‖∇f(z_t)‖, again using Proposition 2.1, as

E[‖∇f(x_t)‖²] ≤ 2 E[‖∇f(z_t)‖²] + 36 γ² L² G²_max Δ².

The following theorem follows from these inequalities.

Theorem 2.4  Let f_max = f(x_0) − f(x*). After T rounds of Algorithm ASYNCCOMMSGD with step size 0 < γ ≤ 1/(18L), we have

(1/T) Σ_{t=0}^{T−1} E[‖∇f(x_t)‖²] = O( f_max/(γT) + γ² L² G²_max Δ² + γ L σ²/N ).

The next corollary follows by substituting a suitable γ and other parameters.

Corollary 2.5  Let f_max = f(x_0) − f(x*). In Theorem 2.4, using step size γ = √N/(L√T), we get

(1/T) Σ_{t=0}^{T−1} E[‖∇f(x_t)‖²] = O( L f_max/√(NT) + G²_max Δ² N/T + L σ²/√(NT) ).

If, in addition, T > N³ and Δ ≤ T^{1/4}/N^{3/4}, we get

(1/T) Σ_{t=0}^{T−1} E[‖∇f(x_t)‖²] = O( L f_max/√(NT) + G²_max/√(NT) + L σ²/√(NT) ).

In particular, for a fixed Δ, we get a convergence rate of O(1/√(NT) + N/T) (ignoring other terms). If T > N³ and Δ ≤ T^{1/4}/N^{3/4}, then we recover the standard minibatch SGD convergence rate of O(1/√(NT)). In other words, to achieve this convergence rate within T rounds, we need each client to communicate Ω(N^{3/4} T^{3/4}) times. Again, compare this with the corresponding result from (Basu et al., 2019, Corollary 4):

E[‖∇f(x̂_t)‖²] = O( f_max N G²_max Δ⁴/(σ²T) + σ √(f_max)/√(NT) ).
In our case, the second term has a much better dependence on Δ (Δ² instead of Δ⁴). As a consequence, we can improve the bound on Δ from T^{1/8}/N^{3/8} to the square of that: T^{1/4}/N^{3/4}.

Figure 1 (plots omitted): Panels (a)-(l) show test accuracy (%) as a function of total communication from clients for MNIST (left column), FASHION-MNIST (middle), and CIFAR-10 (right) at mixing rates 1, 1/2, 1/10, and 0; compared methods: RR(k = 2, Δ = 5), RR(k = 2, Δ = 1), RANDOM(p = 0.04), RANDOM(p = 0.2), FULLSGD(Δ = 5), FULLSGD(Δ = 1). Caption: For each mixing rate µ ∈ {1, 1/2, 1/10, 0}, we show the test accuracy as a function of total communication (the number of communicated client models). The left column corresponds to the MNIST dataset, the middle to FASHION-MNIST, and the right to CIFAR-10 (to improve the presentation, the results for CIFAR-10 with mixing rate 1 are slightly smoothed). We omit IMBALANCED COMMUNICATION here for clarity. The results suggest that a full synchronous update of all the clients to the server is unnecessary as long as the local data distributions are not completely disjoint.

Figure 2 (plots omitted): Results on CIFAR-10 for ResNet-34 with µ ∈ {1/2, 1/10, 0}; panels (a)-(c) show test accuracy (%) as a function of total communication from clients for IMBALANCED COMMUNICATION, RANDOM(p = 0.04), RANDOM(p = 0.2), and FULLSGD(Δ = 1). Caption: The conclusions are the same as in Figure 1 even with this bigger network. We omit FULLSGD(Δ = 5), FULLSGD(Δ = 25), RR(k = 2, Δ = 1), and RR(k = 2, Δ = 5) since their results are again similar to the considered algorithms, as also observed in Figure 1. To improve the presentation, the plots were slightly smoothed.

3. Experimental Evaluation

In this section, we demonstrate the effectiveness of local SGD with asynchronous updates as compared to regular SGD or synchronous local SGD where all communication takes place together.
Our focus will be on illustrating the dependence of model accuracy on the total communication. We also investigate the role of (non-)iidness of the clients' data distributions.

3.1. Datasets and Models

We perform our evaluation on the following datasets: MNIST, FASHION-MNIST, and CIFAR-10.[6] We use a single-machine simulation of FL computation with 10 clients. In this setup, the running time does not take into account the actual communication time and is hence not informative. Each of our datasets has 10 classes, and we split the data across the 10 clients in the following manner. Each client is associated with a class. We define a mixing rate µ which measures how identical the data distributions across different clients are: for each client, a (1 − µ) fraction of its data is selected from the class corresponding to the client, while a µ fraction is selected from a random class. In particular, for µ = 1 the data for all clients is identically distributed, and for µ = 0 each client holds only data corresponding to its class. We consider µ ∈ {0, 1/10, 1/2, 1} and show how µ affects convergence of the compared algorithms.

[6] Dataset and detailed network descriptions are given in Appendix C. The entire code is provided in the supplementary material.

MNIST and FASHION-MNIST. We use a one-layer neural network with softmax activation. At each round, clients process 10³ samples (within the round, the client locally performs minibatch gradient descent with batch size 20).

CIFAR-10. Here we use two networks. The first is a shallow convolutional neural network with 3 convolutional layers with ReLU activation and max pooling, and two dense layers with ReLU and softmax activations. The second network is the deep ResNet-34 (He et al., 2016) without batch normalization.

3.2. Compared Approaches

Since Algorithm 1 is quite general, it covers multiple client communication scenarios depending on the selection of C_t. Here, we select a few of them to compare (all of which are captured by our Algorithm 1) to understand the role of various factors involved in FL; a short illustrative sketch of these patterns as explicit choices of C_t is given at the beginning of Section 3.3.

Synchronous Local SGD. Every Δ rounds, all clients communicate with the server. Denoted FULLSGD(Δ). The case Δ = 1 corresponds to regular distributed SGD.

Round Robin. Every Δ rounds, k clients communicate with the server: first clients 1, . . . , k communicate, then clients k + 1, . . . , 2k, etc. In our experiments, we select k = 2, i.e., a 1/5 fraction of the clients communicates every Δ rounds. Denoted RR(k, Δ).

Random Communication. At each round, each client communicates with the server with probability p. This scheme is closely related to Federated Averaging, where the participating clients are sampled randomly (McMahan et al., 2017). In our experiments, we select p = 1/5 and p = 1/25. These approaches have the same expected amount of communication as RR(2, 1) and RR(2, 5) respectively, but it is not guaranteed that each client communicates at least once every Δ rounds. Denoted RANDOM(p).

Imbalanced Communication. Client i communicates every i rounds, i.e., client 1 communicates every round and client 10 communicates every 10 rounds, leading to very imbalanced communication.

Among these scenarios, Round Robin, Random Communication, and Imbalanced Communication are all examples of asynchronous communication scenarios.

3.3. Discussion of Results

We present our results in Figures 1 and 2.
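Before turning to the discussion, note that all of the compared patterns are just different choices of the sets C_t consumed by Algorithm 1. The sketch below is ours (illustrative only, not the experiment code from the supplementary material); it indexes clients 0, . . . , N−1 and rounds from 1.

```python
import random

def full_sgd(num_clients, num_rounds, delta):
    """FULLSGD(delta): every `delta` rounds, all clients communicate."""
    return [set(range(num_clients)) if t % delta == 0 else set()
            for t in range(1, num_rounds + 1)]

def round_robin(num_clients, num_rounds, k, delta):
    """RR(k, delta): every `delta` rounds, the next group of k clients communicates."""
    schedule, group = [], 0
    for t in range(1, num_rounds + 1):
        if t % delta == 0:
            start = (group * k) % num_clients
            schedule.append({(start + j) % num_clients for j in range(k)})
            group += 1
        else:
            schedule.append(set())
    return schedule

def random_comm(num_clients, num_rounds, p, seed=0):
    """RANDOM(p): each client communicates independently with probability p."""
    rng = random.Random(seed)
    return [{i for i in range(num_clients) if rng.random() < p}
            for _ in range(num_rounds)]

def imbalanced(num_clients, num_rounds):
    """IMBALANCED: client i (1-indexed) communicates every i rounds."""
    return [{i for i in range(num_clients) if t % (i + 1) == 0}
            for t in range(1, num_rounds + 1)]
```

By construction, FULLSGD(Δ) and RR(k, Δ) respect a fixed communication gap (Δ and (N/k)·Δ rounds respectively, when k divides N), IMBALANCED has a gap of N rounds, and RANDOM(p) bounds the gap only in expectation, a distinction the discussion below keeps returning to.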
For each mixing rate µ, we show accuracy as a function of the total communication, which is measured as the number of communicated local models. We note that, while the plots are cropped to match the lowest total communication among the algorithms, for most approaches the accuracy continues increasing. We make the following observations.

Synchronous vs. Asynchronous: Among our implemented methods, FULLSGD(1) has the largest communication cost per round; FULLSGD(5), RR(2, 1) and RANDOM(1/5) have approximately 1/5 of that communication per round; and RR(2, 5) and RANDOM(1/25) have approximately 1/25 of that communication per round. In Figure 1, approaches with similar communication per round achieve similar accuracy. For example, Figure 1e shows that FULLSGD(1) achieves 76% accuracy, approaches with 1/5 of the communication per round achieve 80%-81% accuracy, and approaches with 1/25 of the communication per round achieve 81.5%-82.5% accuracy. The fact that RR(2, 1) achieves similar guarantees to FULLSGD(5) supports our theoretical conclusion that local SGD with asynchronous communication (like RR) performs as well as synchronous local SGD (FULLSGD). The same conclusions also hold in the ResNet-34 experiments (Figure 2). With IMBALANCED COMMUNICATION, even though some machines communicate much more often than others, for µ = 1/2 (Figure 2a) it outperforms FULLSGD(1) (38% against 30%). However, it is noticeably outperformed by the RANDOM approaches, which have much lower communication per round while having a similar communication gap.

Role of Δ: FULLSGD(5), RR(2, 1) and RANDOM(1/5) have similar communication requirements; however, RANDOM(1/5) is slightly outperformed by the other two. For example, 80.9% accuracy for RANDOM(1/5) against 81.2% (for RR(2, 1)) and 82% (for FULLSGD(5)) in Figure 1b. Similarly, in the same plot, RANDOM(1/25) is outperformed by RR(2, 5) (82.4% against 83.4% accuracy). The gap becomes more prominent when µ decreases. For example, in Figure 1h, RANDOM(1/5) achieves 74.6% accuracy, while FULLSGD(5) and RR(2, 1) achieve 76% and 77.5% accuracy, respectively. The accuracy of RANDOM shows large fluctuations during training (see e.g. Figure 1k). We suspect that the reason for this behavior is that, unlike the other approaches, RANDOM doesn't always guarantee that each client communicates within Δ rounds (the guarantee only holds in expectation). When the data is non-iid, IMBALANCED COMMUNICATION has the worst accuracy-to-communication ratio (see Figure 2), which is expected: while its communication gap is 10, the average communication per round is reduced only by a factor of 3.5.

Role of (non-)iidness: Comparing Figures 1a and 1d, 1b and 1e, and 1c and 1f, we see that the difference between the cases µ = 1 and µ = 1/2 is minor. When further decreasing µ to 1/10, for RR(2, 5) we observe a 3% drop in accuracy for MNIST and FASHION-MNIST and an 11% drop for CIFAR-10. However, the largest accuracy drop for RR(2, 5) occurs when µ changes from 1/10 to 0: 8%, 12% and 20%, respectively. Note that when µ = 0 all the local distributions have support on different classes. In general, for µ ∈ {1, 1/2, 1/10}, approaches with less communication show better convergence, and the gap is more prominent at larger values of µ. The fact that this happens even when µ = 1/10 is interesting, suggesting that approaches like RR(2, 5) work well even when data distributions are far from identical.
However, when the data distribution is completely non-iid (µ = 0), we observe the opposite behavior: approaches with less communication achieve lower accuracy (e.g., 68% accuracy for RR(2, 5) on FASHION-MNIST) compared to approaches with full communication (e.g., 71% accuracy for FULLSGD(1) on FASHION-MNIST). Again, the conclusions are the same with ResNet-34 (Figure 2). Due to its large communication gap, IMBALANCED COMMUNICATION suffers the most when µ decreases: it reaches 38% when µ = 1/2 but drops to 20% when µ = 1/10. Somewhat surprisingly, it doesn't drop further when µ = 0 (18.5%).

4. Concluding Remarks

In this paper we demonstrated that we can significantly relax the communication requirements on the clients to capture practical scenarios and still achieve the standard convergence rates in a Federated Learning setting. We emphasize that the remaining assumption on the communication gap is necessary: when it is unbounded, learning algorithms can easily diverge.

One possible future direction is to eliminate the bounded gradient assumption E[‖∇F^(i)_t‖²] ≤ G²_max. One possible alternative is the assumption that the gradients of the local functions are not much different from that of the global function: ‖∇f^(i)(x) − ∇f(x)‖ < κ (Yu et al., 2019a; Li et al., 2020b; Wang et al., 2019a). Our experiments support the idea that deviation from the global data distribution is an important parameter. However, this condition on κ doesn't fully capture our observations, as our experiments show that accuracy improves dramatically even when only a small fraction of global data is present at every client, which is not enough to substantially decrease κ.

References

Basu, D., Data, D., Karakus, C., and Diggavi, S. Qsparse-local-SGD: Distributed SGD with quantization, sparsification, and local computations. arXiv preprint arXiv:1906.02367 (also in NeurIPS 2019), 2019.

Coppola, G. F. Iterative parameter mixing for distributed large-margin training of structured predictors for natural language processing. 2015.

Haddadpour, F. and Mahdavi, M. On the convergence of local descent methods in federated learning. arXiv preprint arXiv:1910.14425, 2019.

Haddadpour, F., Kamani, M. M., Mahdavi, M., and Cadambe, V. Local SGD with periodic averaging: Tighter analysis and adaptive synchronization. In Advances in Neural Information Processing Systems, pp. 11082-11094, 2019.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.

Jiang, P. and Agrawal, G. A linear speedup analysis of distributed deep learning with sparse and quantized communication. In Advances in Neural Information Processing Systems, pp. 2525-2536, 2018.

Kairouz, P., McMahan, H. B., Avent, B., Bellet, A., Bennis, M., Bhagoji, A. N., Bonawitz, K., Charles, Z., Cormode, G., Cummings, R., et al. Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977, 2019.

Karimireddy, S. P., Kale, S., Mohri, M., Reddi, S. J., Stich, S. U., and Suresh, A. T. SCAFFOLD: Stochastic controlled averaging for on-device federated learning. arXiv preprint arXiv:1910.06378, 2019.

Khaled, A., Mishchenko, K., and Richtárik, P. First analysis of local GD on heterogeneous data. arXiv preprint arXiv:1909.04715, 2019a.

Khaled, A., Mishchenko, K., and Richtárik, P. Tighter theory for local SGD on identical and heterogeneous data. arXiv preprint arXiv:1909, 2019b.
Konečný, J., McMahan, H. B., Yu, F. X., Richtárik, P., Suresh, A. T., and Bacon, D. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016.

Li, T., Sahu, A. K., Zaheer, M., Sanjabi, M., Talwalkar, A., and Smith, V. Federated optimization in heterogeneous networks. arXiv preprint arXiv:1812.06127, 2018.

Li, T., Sahu, A. K., Talwalkar, A., and Smith, V. Federated learning: Challenges, methods, and future directions. IEEE Signal Processing Magazine, 37(3):50-60, 2020a.

Li, X., Huang, K., Yang, W., Wang, S., and Zhang, Z. On the convergence of FedAvg on non-iid data. In International Conference on Learning Representations, 2019a.

Li, X., Yang, W., Wang, S., and Zhang, Z. Communication-efficient local decentralized SGD methods. arXiv preprint arXiv:1910.09126, 2019b.

Li, X., Yang, W., Wang, S., and Zhang, Z. Communication-efficient local decentralized SGD methods, 2020b.

Lian, X., Zhang, C., Zhang, H., Hsieh, C.-J., Zhang, W., and Liu, J. Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent. arXiv preprint arXiv:1705.09056, 2017.

Liang, X., Shen, S., Liu, J., Pan, Z., Chen, E., and Cheng, Y. Variance reduced local SGD with lower communication complexity. arXiv preprint arXiv:1912.12844, 2019.

Mangasarian, L. Parallel gradient distribution in unconstrained optimization. SIAM Journal on Control and Optimization, 33(6):1916-1925, 1995.

McMahan, B., Moore, E., Ramage, D., Hampson, S., and y Arcas, B. A. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pp. 1273-1282. PMLR, 2017.

Nedić, A., Olshevsky, A., and Rabbat, M. G. Network topology and communication-computation tradeoffs in decentralized optimization. Proceedings of the IEEE, 106(5):953-976, 2018.

Rakhlin, A., Shamir, O., and Sridharan, K. Making gradient descent optimal for strongly convex stochastic optimization. arXiv preprint arXiv:1109.5647, 2011.

Rakhlin, A., Shamir, O., and Sridharan, K. Making gradient descent optimal for strongly convex stochastic optimization. 2012.

Reisizadeh, A., Mokhtari, A., Hassani, H., Jadbabaie, A., and Pedarsani, R. FedPAQ: A communication-efficient federated learning method with periodic averaging and quantization. In International Conference on Artificial Intelligence and Statistics, pp. 2021-2031, 2020.

Sharma, P., Khanduri, P., Bulusu, S., Rajawat, K., and Varshney, P. K. Parallel restarted SPIDER: Communication efficient distributed nonconvex optimization with optimal computation complexity. arXiv preprint arXiv:1912.06036, 2019.

Stich, S. U. Local SGD converges fast and communicates little. arXiv preprint arXiv:1805.09767, 2018.

Stich, S. U. and Karimireddy, S. P. The error-feedback framework: Better rates for SGD with delayed gradients and compressed communication. arXiv preprint arXiv:1909.05350, 2019.

Stich, S. U., Cordonnier, J.-B., and Jaggi, M. Sparsified SGD with memory. In Advances in Neural Information Processing Systems, pp. 4447-4458, 2018.

Wang, J. and Joshi, G. Cooperative SGD: A unified framework for the design and analysis of communication-efficient SGD algorithms. arXiv preprint arXiv:1808.07576, 2018.

Wang, J., Sahu, A. K., Yang, Z., Joshi, G., and Kar, S. MATCHA: Speeding up decentralized SGD via matching decomposition sampling. arXiv preprint arXiv:1905.09435, 2019a.
Wang, J., Tantia, V., Ballas, N., and Rabbat, M. SlowMo: Improving communication-efficient distributed SGD with slow momentum. arXiv preprint arXiv:1910.00643, 2019b.

Wang, S., Tuor, T., Salonidis, T., Leung, K. K., Makaya, C., He, T., and Chan, K. Adaptive federated learning in resource constrained edge computing systems. IEEE Journal on Selected Areas in Communications, 37(6):1205-1221, 2019c.

Yu, H., Jin, R., and Yang, S. On the linear speedup analysis of communication efficient momentum SGD for distributed non-convex optimization. arXiv preprint arXiv:1905.03817, 2019a.

Yu, H., Yang, S., and Zhu, S. Parallel restarted SGD with faster convergence and less communication: Demystifying why model averaging works for deep learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 5693-5700, 2019b.

Yuan, K., Ling, Q., and Yin, W. On the convergence of decentralized gradient descent. SIAM Journal on Optimization, 26(3):1835-1854, 2016.

Zhou, F. and Cong, G. On the convergence properties of a k-step averaging stochastic gradient descent algorithm for nonconvex optimization. arXiv preprint arXiv:1708.01012, 2017.

Zhou, F. and Cong, G. On the convergence properties of a k-step averaging stochastic gradient descent algorithm for nonconvex optimization. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pp. 3219-3227, 2018.

Zinkevich, M., Weimer, M., Smola, A. J., and Li, L. Parallelized stochastic gradient descent. In NIPS, volume 4, pp. 4. Citeseer, 2010.