# Optimal Client Sampling for Federated Learning

Published in Transactions on Machine Learning Research (08/2022)

Wenlin Chen (wc337@cam.ac.uk), Department of Engineering, University of Cambridge, Cambridge, CB2 1PZ, UK, and Department of Empirical Inference, Max Planck Institute for Intelligent Systems, Tübingen, 72076, Germany

Samuel Horváth (samuel.horvath@mbzuai.ac.ae), Department of Machine Learning, Mohamed bin Zayed University of Artificial Intelligence, Masdar City, Abu Dhabi, UAE

Peter Richtárik (peter.richtarik@kaust.edu.sa), Computer, Electrical and Mathematical Science and Engineering Division, King Abdullah University of Science and Technology, Thuwal, 23955-6900, Saudi Arabia

Reviewed on OpenReview: https://openreview.net/forum?id=8GvRCWKHIL

It is well understood that client-master communication can be a primary bottleneck in federated learning (FL). In this work, we address this issue with a novel client subsampling scheme, in which we restrict the number of clients allowed to communicate their updates back to the master node. In each communication round, all participating clients compute their updates, but only the ones with important updates communicate back to the master. We show that importance can be measured using only the norm of the update, and we give a formula for optimal client participation. This formula minimizes the distance between the full update, where all clients participate, and our limited update, where the number of participating clients is restricted. In addition, we provide a simple algorithm that approximates the optimal formula for client participation; it allows for secure aggregation and stateless clients, and thus does not compromise client privacy.
We show both theoretically and empirically that for Distributed SGD (DSGD) and Federated Averaging (FedAvg), the performance of our approach can be close to full participation and is superior to the baseline where participating clients are sampled uniformly. Moreover, our approach is orthogonal to and compatible with existing methods for reducing communication overhead, such as local methods and communication compression methods.

1 Introduction

We consider the standard cross-device federated learning (FL) setting (Kairouz et al., 2019), where the objective is of the form

min_{x ∈ R^d} f(x) := ∑_{i=1}^n w_i f_i(x), (1)

where x ∈ R^d represents the parameters of a statistical model we aim to find, n is the total number of clients, each f_i : R^d → R is a continuously differentiable local loss function which depends on the data distribution D_i owned by client i via f_i(x) = E_{ξ∼D_i}[f(x, ξ)], and w_i ≥ 0 are client weights such that ∑_{i=1}^n w_i = 1. We assume the classical FL setup in which a central master (server) orchestrates the training by securely aggregating updates (Du & Atallah, 2001; Goryczka & Xiong, 2015; Bonawitz et al., 2017; So et al., 2021), i.e., the master only has access to the sum of the clients' updates without seeing the raw data.

1.1 Motivation: Communication Bottleneck in Federated Learning

It is well understood that communication cost can be a primary bottleneck in cross-device FL, since typical clients are mobile phones or various IoT devices with limited bandwidth and availability for connection (Van Berkel, 2009; Huang et al., 2013). Indeed, wireless links and other end-user internet connections typically operate at lower rates than intra-datacenter or inter-datacenter links and can be potentially expensive and unreliable. Moreover, the capacity of the aggregating master and other FL system considerations impose direct or indirect constraints on the number of clients allowed to participate in each communication round.
These considerations have led to significant interest in reducing the communication bandwidth of FL systems.

Local Methods. One of the most popular strategies is to reduce the frequency of communication and put more emphasis on computation. This is usually achieved by asking the devices to perform multiple local steps before communicating their updates. A prototype method in this category is the Federated Averaging (FedAvg) algorithm (McMahan et al., 2017), an adaptation of local-update SGD to the parallel setting, where each client runs some number of SGD steps locally before the local updates are averaged to form the global update for the global model on the master. The original work was heuristic, offering no theoretical guarantees, which motivated the community to try to understand the method and various existing and new variants theoretically (Stich, 2019; Lin et al., 2018; Karimireddy et al., 2019; Stich & Karimireddy, 2020; Khaled et al., 2020; Hanzely & Richtárik, 2020).

Communication Compression Methods. Another popular approach is to reduce the size of the object (typically gradients) communicated from clients to the master. This approach is referred to as gradient/communication compression. In this approach, instead of transmitting the full-dimensional gradient/update vector g ∈ R^d, one transmits a compressed vector C(g), where C : R^d → R^d is a (possibly random) operator chosen such that C(g) can be represented using fewer bits, for instance by using a limited bit representation (quantization) or by enforcing sparsity (sparsification). A particularly popular class of quantization operators is based on random dithering (Goodall, 1951; Roberts, 1962); see Alistarh et al. (2017); Wen et al. (2017); Zhang et al. (2017); Ramezani-Kebrya et al. (2019). A new variant of random dithering developed in Horváth et al. (2019) offers an exponential improvement over standard dithering.
Sparse vectors can be obtained by random sparsification techniques that randomly mask the input vectors and preserve a constant number of coordinates (Wangni et al., 2018; Konečný & Richtárik, 2018; Stich et al., 2018; Mishchenko et al., 2019; Vogels et al., 2019). There is also a line of work (Horváth et al., 2019; Basu et al., 2019) which proposes combining sparsification and quantization to obtain a more aggressive combined effect.

Client Sampling/Selection Methods. In the situation where partial participation is desired and a budget on the number of participating clients is applied, a careful selection of the participating clients can lead to better communication complexity, and hence faster training. In other words, some clients will have more informative updates than others in any given communication round, and thus the training procedure will benefit from capitalizing on this fact by ignoring some of the less valuable updates (see Figure 1). We refer the reader to Section 4.1 for a discussion of existing client sampling methods in FL and their limitations.

1.2 Contributions

We address the communication bandwidth issues arising in FL by designing a principled optimal client sampling scheme with client privacy and system practicality in mind. We show that the ideas presented in previous works on efficient sampling (Horváth & Richtárik, 2019) and sparsification (Wang et al., 2018; Wangni et al., 2018) can be adapted to be compatible with FL and can be used to construct a principled optimal client sampling scheme capable of identifying the most informative clients in any given communication round. Our contributions can be summarized as follows:

Figure 1: Optimal client sampling: in each communication round, all participating clients compute their updates, but only the ones with important updates communicate back to the master.
Inspired by Horváth & Richtárik (2019), we propose an adaptive partial participation strategy for reducing communication in FL. This strategy relies on a careful selection of the clients that are allowed to communicate their updates back to the master in any given communication round, which then translates into a reduction in the number of communicated bits. We obtain this strategy by properly applying the sampling procedure from Horváth & Richtárik (2019) to the FL framework. Specifically, building upon the importance sampling results in Horváth & Richtárik (2019, Lemma 1), we obtain an adaptive client sampling procedure that is optimal in the sense that it minimizes the variance of the master update for any budget m on the number of participating clients; this generalizes the theoretical results in Zhao & Zhang (2015), which only apply to m = 1.

Inspired by the greedy algorithm from Wangni et al. (2018, Algorithm 3), which was originally designed for gradient sparsification, we obtain an approximation to our optimal sampling strategy which only requires aggregation, fulfilling two core privacy requirements of FL: to our knowledge, our method is the first principled importance client sampling strategy that is compatible with both secure aggregation and stateless clients.

Our optimal sampling method is orthogonal to and hence compatible with existing approaches to communication reduction such as communication compression and/or local updates (cf. Section 3.2).

We provide convergence guarantees for our approach with Distributed SGD (DSGD) and Federated Averaging (FedAvg), relaxing a number of strong assumptions employed in prior works. We show both theoretically and empirically that the performance of our approach is superior to uniform sampling and can be close to full participation.
We show both theoretically and empirically that our approach allows for larger learning rates than the baseline which performs uniform client sampling, resulting in better communication complexity and hence faster convergence.

1.3 Organization of the Paper

Section 2 describes the proposed optimal client sampling strategy for reducing the communication bottleneck in federated learning. Section 3 provides convergence analyses for DSGD and FedAvg with our optimal client sampling scheme in both convex and non-convex settings. Section 4 reviews prior works that are closely or broadly related to our proposed method. Section 5 empirically evaluates our optimal client sampling method on standard federated datasets. Section 6 summarizes the paper and lists some directions for future work.

2 Smart Client Sampling for Reducing Communication

This section describes the proposed optimal client sampling strategy for reducing the communication bottleneck in federated learning. Before proceeding with our theory, we provide intuition by discussing the problem setting and introducing the arbitrary sampling paradigm. In FL, each client i participating in round k computes an update vector U_i^k ∈ R^d. For simplicity and ease of exposition, we assume that all clients i ∈ [n] := {1, 2, ..., n} are available in each round.¹ In our framework, only a subset of clients communicates their updates to the master node in each communication round in order to reduce the number of transmitted bits. To analyze this framework, we consider a general partial participation setup (Horváth & Richtárik, 2020), in which the subset of participating clients is determined by an arbitrary random set-valued mapping S (i.e., a "sampling") with values in 2^[n]. A sampling S is uniquely defined by assigning probabilities to all 2^n subsets of [n].
With each sampling S we associate a probability matrix P ∈ R^{n×n} defined by P_{ij} := Prob({i, j} ⊆ S). The probability vector associated with S is the vector composed of the diagonal entries of P: p = (p_1, ..., p_n) ∈ R^n, where p_i := Prob(i ∈ S). We say that S is proper if p_i > 0 for all i. It is easy to show that b := E[|S|] = Trace(P) = ∑_{i=1}^n p_i, and hence b can be seen as the expected number of clients participating in each communication round. Given parameters p_1, ..., p_n ∈ [0, 1], consider a random set S ⊆ [n] generated as follows: for each i ∈ [n], we include i in S with probability p_i, independently of the other clients. This is called independent sampling, since the event i ∈ S is independent of the event j ∈ S for any i ≠ j. While our client sampling strategy can be adapted to essentially any underlying learning method, we give details here for DSGD as an illustrative example, where the master update in each communication round is of the form

x^{k+1} = x^k − η^k G^k with G^k := ∑_{i∈S^k} (w_i / p_i^k) U_i^k, (2)

where S^k is the random subset of participating clients and U_i^k = g_i^k is an unbiased estimator of ∇f_i(x^k). The scaling factor 1/p_i^k is necessary in order to obtain an unbiased estimator of the true update, i.e., E_{S^k}[G^k] = ∑_{i=1}^n w_i U_i^k.

2.1 Optimal Client Sampling

A simple observation is that the variance of our gradient estimator G^k can be decomposed into

E[‖G^k − ∇f(x^k)‖²] = E[‖G^k − ∑_{i=1}^n w_i U_i^k‖²] + E[‖∑_{i=1}^n w_i U_i^k − ∇f(x^k)‖²], (3)

where the second term on the right-hand side is independent of the sampling procedure, and the first term is zero if every client sends its update (i.e., if p_i^k = 1 for all i). In order to obtain meaningful results, we restrict the expected number of clients communicating in each round by bounding b^k := ∑_{i=1}^n p_i^k by some positive integer m ≤ n. This raises the following question: What is the sampling procedure that minimizes (3) for any given m? To answer this question, we connect Equation (3) to previous works on importance sampling (Horváth & Richtárik, 2019) and gradient sparsification (Wangni et al., 2018; Wang et al., 2018).²
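The unbiasedness of the inverse-probability scaling in (2) can be checked numerically. The sketch below uses hypothetical client updates, weights, and inclusion probabilities, and verifies by Monte Carlo that the mean of the independently sampled estimator G^k approaches the full update ∑_i w_i U_i^k:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, trials = 10, 5, 50_000
U = rng.normal(size=(n, d))          # hypothetical client updates U_i
w = np.full(n, 1.0 / n)              # client weights w_i, summing to 1
p = rng.uniform(0.2, 1.0, size=n)    # inclusion probabilities p_i

full = w @ U                         # target: sum_i w_i U_i

# Independent sampling: client i participates with probability p_i;
# its contribution is scaled by 1/p_i so the estimator stays unbiased.
mask = rng.random((trials, n)) < p   # one realized sampling S per row
est = ((mask * (w / p)) @ U).mean(axis=0)  # Monte Carlo average of G

assert np.allclose(est, full, atol=2e-2)
```

Dropping the 1/p_i factor would bias the estimator toward zero, since each client's term would then be included only p_i of the time without compensation.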
Despite differences in motivation, these works solve, up to scaling, an equivalent mathematical problem, based on which we answer the aforementioned question via the following technical lemma (see Appendix A for a proof):

Lemma 1 (Generalization of Horváth & Richtárik (2019, Lemma 1)). Let ζ_1, ζ_2, ..., ζ_n be vectors in R^d and w_1, w_2, ..., w_n be non-negative real numbers such that ∑_{i=1}^n w_i = 1. Define ζ := ∑_{i=1}^n w_i ζ_i. Let S be a proper sampling. If v ∈ R^n is such that

P − p p^⊤ ⪯ Diag(p_1 v_1, p_2 v_2, ..., p_n v_n), (4)

then

E[‖∑_{i∈S} (w_i / p_i) ζ_i − ζ‖²] ≤ ∑_{i=1}^n w_i² (v_i / p_i) ‖ζ_i‖², (5)

where the expectation is taken over S. Whenever (4) holds, it must be the case that v_i ≥ 1 − p_i.

¹This is not a limiting factor, as all presented theory can be easily extended to the case of partial participation with an arbitrary proper sampling distribution. See Appendix E for a proof sketch.
²Wangni et al. (2018) consider a slightly different problem, where they minimize the communication budget subject to constraints on the variance.

Algorithm 1 Optimal Client Sampling (OCS)
1: Input: expected batch size m
2: each client i computes a local update U_i^k (in parallel)
3: each client i sends the norm of its update Ũ_i^k = w_i ‖U_i^k‖ to the master (in parallel)
4: master computes the optimal probabilities p_i^k using equation (7)
5: master broadcasts p_i^k to all clients
6: each client i sends its update (w_i / p_i^k) U_i^k to the master with probability p_i^k (in parallel)

It turns out that, given probabilities {p_i}, among all samplings S satisfying p_i = Prob(i ∈ S), the independent sampling (i.e., P_{ij} = Prob({i, j} ⊆ S) = Prob(i ∈ S) Prob(j ∈ S) = p_i p_j for i ≠ j) minimizes the left-hand side of (5). This is due to two nice properties: (a) any independent sampling admits the optimal choice of v, i.e., v_i = 1 − p_i for all i; and (b) for independent sampling, (5) holds as an equality.
In the context of our method, these properties can be written as

E[‖∑_{i∈S^k} (w_i / p_i^k) U_i^k − ∑_{i=1}^n w_i U_i^k‖²] = ∑_{i=1}^n w_i² ((1 − p_i^k) / p_i^k) ‖U_i^k‖². (6)

It now only remains to find the parameters {p_i^k} defining the optimal independent sampling, i.e., the one that minimizes (6) subject to the constraints 0 ≤ p_i^k ≤ 1 and b^k := ∑_{i=1}^n p_i^k ≤ m. It turns out that this problem has the following closed-form solution (see Appendix B for a proof):

p_i^k = (m + l − n) Ũ_i^k / ∑_{j=1}^l Ũ_{(j)}^k, if i ∉ A^k; p_i^k = 1, if i ∈ A^k, (7)

where Ũ_i^k := w_i ‖U_i^k‖, Ũ_{(j)}^k is the j-th smallest value in {Ũ_i^k}_{i=1}^n, l is the largest integer for which 0 < m + l − n ≤ ∑_{i=1}^l Ũ_{(i)}^k / Ũ_{(l)}^k (note that this inequality always holds for l = n − m + 1), and A^k contains the indices i such that Ũ_i^k ≥ Ũ_{(l+1)}^k. We summarize this procedure in Algorithm 1. Intuitively, our method can be thought of as uniform sampling with m̃ ∈ [m, n] effective sampled clients, while only m clients are actually sampled in expectation; this indicates that it cannot be worse than uniform sampling and can be as good as full participation. The actual value of m̃ depends on the updates.

Remark 2 (Optimality). Optimizing the right-hand side of (5) does not guarantee that the resulting sampling is optimal with respect to the left-hand side of (5) in the general case. For this to hold, the sampling needs to be independent, which is not a very restrictive condition, especially considering that enforcing independent sampling across clients accommodates the privacy requirements of FL. In addition, since (5) is tight, our sampling is optimal if one is allowed to communicate only norms (i.e., one float per client) as extra information. We stress that requiring optimality with respect to the left-hand side of (5) in full generality is not practical, as it cannot be achieved without revealing, i.e., communicating, all clients' full updates to the master.
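The closed-form solution (7) can be made concrete with a small centralized sketch (in an actual deployment, only the norms reach the master, as in Algorithm 1). The helper name and the example norms below are illustrative; `u` holds the weighted norms Ũ_i^k = w_i ‖U_i^k‖:

```python
import numpy as np

def optimal_probs(u, m):
    """Closed-form probabilities of Eq. (7): u[i] = w_i * ||U_i^k||, budget m."""
    n = len(u)
    order = np.argsort(u)              # ascending: u[order[0]] is the smallest norm
    s = np.cumsum(u[order])            # s[l-1] = sum of the l smallest norms
    l = n                              # find the largest l with 0 < m+l-n <= s_l / u_(l)
    while l >= 1:
        c = m + l - n
        if 0 < c <= s[l - 1] / u[order[l - 1]]:
            break
        l -= 1
    p = np.ones(n)                     # indices in A^k keep p_i = 1
    small = order[:l]                  # the l smallest norms lie outside A^k
    p[small] = (m + l - n) * u[small] / s[l - 1]
    return p

u = np.array([0.1, 0.5, 0.2, 2.0, 0.05])
p = optimal_probs(u, m=2)
assert p[3] == 1.0                     # the dominant update is always communicated
assert abs(p.sum() - 2.0) < 1e-9       # expected number of sampled clients equals m
```

On this example the largest update gets probability 1 while the rest share the remaining budget proportionally to their norms, and the probabilities sum to m = 2 as required.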
2.2 Ensuring Compatibility with Secure Aggregation and Stateless Clients

In the case l = n, the optimal probabilities p_i^k = m Ũ_i^k / ∑_{j=1}^n Ũ_j^k can be computed easily: the master aggregates the norms of the updates and then sends the sum back to the clients. However, if l < n, in order to compute the optimal probabilities, the master would need to identify the norm of every update and perform partial sorting, which can be computationally expensive and also violates the client privacy requirements of FL, i.e., one cannot use the secure aggregation protocol, where the master only sees the sum of the updates. Therefore, we devise an algorithm for approximately solving this problem which only requires aggregation at the master node, without compromising the privacy of any client. The construction of this algorithm builds upon the greedy algorithm from Wangni et al. (2018, Algorithm 3), which was originally designed for gradient sparsification but solves, up to scaling, an equivalent mathematical problem.

Algorithm 2 Approximate Optimal Client Sampling (AOCS)
1: Input: expected batch size m, maximum number of iterations j_max
2: each client i computes an update U_i^k (in parallel)
3: each client i sends the norm of its update u_i^k = w_i ‖U_i^k‖ to the master (in parallel)
4: master aggregates u^k = ∑_{i=1}^n u_i^k
5: master broadcasts u^k to all clients
6: each client i computes p_i^k = min{m u_i^k / u^k, 1} (in parallel)
7: for j = 1, ..., j_max do
8:   each client i sends t_i^k = (1, p_i^k) to the master if p_i^k < 1; else sends t_i^k = (0, 0) (in parallel)
9:   master aggregates (I^k, P^k) = ∑_{i=1}^n t_i^k
10:  master computes C^k = (m − n + I^k) / P^k
11:  master broadcasts C^k to all clients
12:  each client i recalibrates p_i^k = min{C^k p_i^k, 1} if p_i^k < 1 (in parallel)
13:  if C^k ≤ 1 then
14:    break
15:  end if
16: end for
17: each client i sends its update (w_i / p_i^k) U_i^k to the master with probability p_i^k (in parallel)
We first set p̃_i^k = m Ũ_i^k / ∑_{j=1}^n Ũ_j^k and p_i^k = min{p̃_i^k, 1}. In the ideal situation where every p̃_i^k equals the optimal solution (7), this would be sufficient. However, due to the truncation, the expected number of sampled clients b^k = ∑_{i=1}^n p_i^k ≤ ∑_{i=1}^n m Ũ_i^k / ∑_{j=1}^n Ũ_j^k = m can be strictly less than m if p̃_i^k > 1 holds for at least one i. Hence, we employ an iterative procedure that closes this gap by rescaling the probabilities which are smaller than 1, as summarized in Algorithm 2. This algorithm is much easier to implement and computationally more efficient on parallel computing architectures. In addition, it only requires a secure aggregation procedure at the master, which is essential in privacy-preserving FL, and thus it is compatible with existing FL software and hardware.

Remark 3 (Extra communication in Algorithm 2). We acknowledge that Algorithm 2 brings extra communication costs, as it requires all clients to send the norms of their updates u_i^k and the probabilities p_i^k in each round. However, since these are single floats, this only costs O(j_max) extra floats per client. Picking j_max = O(1), this is negligible for large models of size d. We also acknowledge that engaging in multiple synchronous rounds of communication (as in Algorithm 2) can be a bottleneck (Huba et al., 2022). This is not an issue in our work, as we focus on reducing the total communication cost; however, Algorithm 2 may be less useful under other setups or metrics.

Remark 4 (Fairness). Based on our sampling strategy, it might be tempting to assume that the obtained solution could exhibit fairness issues. In our convergence analyses below, we show that this is not the case, as our proposed methods converge to the optimal solution of the original problem. Hence, as long as the original objective has no inherent issue with fairness, our method does not introduce any fairness issues.
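The rescaling loop of Algorithm 2 can be sketched centrally; in FL the quantities u^k, I^k, and P^k would be obtained via secure aggregation rather than computed in one place, and the helper below is illustrative (it assumes m < n and strictly positive norms):

```python
import numpy as np

def approx_optimal_probs(u, m, j_max=10):
    """Sketch of Algorithm 2 (AOCS): u[i] = w_i * ||U_i^k||, budget m < n."""
    n = len(u)
    p = np.minimum(m * u / u.sum(), 1.0)       # line 6: truncated initial guess
    for _ in range(j_max):
        active = p < 1.0                        # clients reporting t_i = (1, p_i)
        I, P = int(active.sum()), p[active].sum()
        if I == 0:                              # all probabilities saturated at 1
            break
        C = (m - n + I) / P                     # line 10: rescaling factor C^k
        p[active] = np.minimum(C * p[active], 1.0)
        if C <= 1.0:                            # expected sample size reached
            break
    return p

u = np.array([0.1, 0.5, 0.2, 2.0, 0.05])
p = approx_optimal_probs(u, m=2)
assert abs(p.sum() - 2.0) < 1e-6               # mass lost to truncation is restored
```

On this input the loop converges in two iterations: the first rescaling restores the probability mass lost when the largest client's probability was clipped to 1, and the second yields C^k = 1, triggering the stopping condition.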
Besides, our algorithm can be used in conjunction with other, more fair objectives, e.g., Tilted ERM (Li et al., 2021), if needed.

3 Convergence Guarantees

This section provides convergence analyses for DSGD and FedAvg with our optimal client sampling scheme in both convex and non-convex settings. We compare the convergence results of our scheme with those of full participation and of independent uniform sampling with sample size m. We match the forms of our convergence bounds to those of the existing bounds in the literature to make them directly comparable. We do not compare the sample complexities of these methods, as such comparisons would be difficult due to their dependence on the actual updates, which are unknown in advance and do not follow a specific distribution in general. We use standard assumptions (Karimi et al., 2016), assuming throughout that f has a unique minimizer x* with f* = f(x*) > −∞ and that the f_i's are L-smooth, i.e., have L-Lipschitz continuous gradients. We first define convex and L-smooth functions.

Definition 5 (Convexity). f : R^d → R is µ-strongly convex with µ > 0 if

f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (µ/2) ‖y − x‖², ∀x, y ∈ R^d. (8)

f is convex if it satisfies (8) with µ = 0.

Definition 6 (Smoothness). f : R^d → R is L-smooth if

‖∇f(x) − ∇f(y)‖ ≤ L ‖x − y‖, ∀x, y ∈ R^d. (9)

We now state standard assumptions on the gradient oracles for DSGD and FedAvg.

Assumption 7 (Gradient oracle for DSGD). The stochastic gradient estimator g_i^k = ∇f_i(x^k) + ξ_i^k of the local gradient ∇f_i(x^k), for each round k and all i = 1, ..., n, satisfies

E[ξ_i^k] = 0 (10)

and

E[‖ξ_i^k‖² | x^k] ≤ M ‖∇f_i(x^k)‖² + σ², for some M ≥ 0. (11)

This further implies that E[∑_{i=1}^n w_i g_i^k | x^k] = ∇f(x^k).

Assumption 8 (Gradient oracle for FedAvg). The stochastic gradient estimator g_i(y_{i,r}^k) = ∇f_i(y_{i,r}^k) + ξ_{i,r}^k of the local gradient ∇f_i(y_{i,r}^k), for each round k, each local step r = 0, ..., R, and all i = 1, ...
, n, satisfies

E[ξ_{i,r}^k] = 0 (12)

and

E[‖ξ_{i,r}^k‖² | y_{i,r}^k] ≤ M ‖∇f_i(y_{i,r}^k)‖² + σ², for some M ≥ 0, (13)

where y_{i,0}^k = x^k and y_{i,r}^k = y_{i,r−1}^k − η_l g_i(y_{i,r−1}^k) for r = 1, ..., R.

For non-convex objectives, one can construct counter-examples that diverge for both DSGD and FedAvg if the sampling variance is not bounded. Therefore, we employ the following standard assumption on the local gradients to bound the sampling variance.³

Assumption 9 (Similarity among local gradients). The gradients of the local loss functions f_i satisfy

∑_{i=1}^n w_i ‖∇f_i(x) − ∇f(x)‖² ≤ ρ, for some ρ ≥ 0. (14)

³This assumption is not required for convex objectives, as one can then show that the sampling variance is bounded using smoothness and convexity.

Remark 10 (Interpretation of Assumption 9). Some works employ a more restrictive assumption which requires ‖∇f_i(x) − ∇f(x)‖² ≤ ρ for all i, from which Assumption 9 can be derived, since ∑_{i=1}^n w_i = 1. Therefore, Assumption 9 can be seen as an assumption on the similarity among local gradients. Furthermore, this assumption does not require the w_i's to be lower-bounded, as clients with w_i = 0 will never be sampled and can thus be removed from the objective.

We now define some important quantities for our convergence analyses.

Definition 11 (The improvement factor). We define the improvement factor of optimal client sampling over uniform sampling:

α^k := E[‖∑_{i∈S^k} (w_i / p_i^k) U_i^k − ∑_{i=1}^n w_i U_i^k‖²] / E[‖∑_{i∈U^k} (w_i / p_i^U) U_i^k − ∑_{i=1}^n w_i U_i^k‖²], (15)

where S^k is our optimal sampling with p_i^k defined in (7), and U^k is an independent uniform sampling with p_i^U = m/n. By construction, 0 ≤ α^k ≤ 1, as S^k minimizes the variance term (see Appendix B for a proof). Note that α^k can reach zero in the case where there are at most m non-zero updates. If α^k = 0, our method performs as if all updates were communicated. In the worst case α^k = 1, our method performs as if we picked m updates uniformly at random, and one could not do better in theory due to the structure of the updates U_i^k.
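The zero-variance edge case can be checked numerically: with at most m non-zero updates, formula (7) assigns probability 1 to every non-zero update, so the variance (6) under optimal sampling vanishes and α^k = 0. A small sketch with hypothetical weighted norms:

```python
import numpy as np

def sampling_variance(u, p):
    # Variance term (6): sum_i (w_i ||U_i||)^2 * (1 - p_i) / p_i
    return float(np.sum(u**2 * (1.0 - p) / p))

n, m = 8, 3
u = np.array([0.7, 0.0, 0.0, 1.2, 0.0, 0.4, 0.0, 0.0])  # only m non-zero updates

p_uni = np.full(n, m / n)            # uniform sampling, p_i = m/n
# Eq. (7) puts p_i = 1 on the non-zero updates; zero-norm clients
# contribute nothing regardless of their probability, so we leave them at m/n
# here purely to avoid a 0/0 in the variance formula.
p_opt = np.where(u > 0, 1.0, m / n)

alpha = sampling_variance(u, p_opt) / sampling_variance(u, p_uni)
assert alpha == 0.0                  # optimal sampling matches full participation here
```

Under uniform sampling the same configuration has strictly positive variance, since informative clients are dropped with probability 1 − m/n.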
The actual value of α^k will depend on the updates U_i^k. We also define the relative improvement factor:

γ^k := m / (α^k (n − m) + m) ∈ [m/n, 1], k = 0, ..., K − 1. (16)

Definition 12 (Simplified notation). For simplicity of notation, we define the following quantities, which will be useful for our convergence analyses:

W := max_{i∈[n]} {w_i}, Z_i := f_i(x*) − f_i*, r^k := ‖x^k − x*‖, (17)

where f_i* is the value of f_i at its own minimizer, Z_i represents the mismatch between the local and global minimizers, and r^k captures the distance between the current point and the minimizer of f.

We are now ready to proceed with our convergence analyses. In the following subsections, we provide convergence analyses of specific methods for solving the optimization problem (1). The proofs of the theorems are deferred to Appendices C and D.

3.1 Distributed SGD (DSGD) with Optimal Client Sampling

This subsection presents convergence analyses for DSGD (2) with optimal client sampling in both convex and non-convex settings.

Theorem 13 (DSGD, strongly convex). Let f_i be L-smooth and convex for i = 1, ..., n, and let f be µ-strongly convex. Suppose that Assumption 7 holds. Choose η^k ∈ (0, γ^k / ((1 + W M) L)]. Define

β_1 := ∑_{i=1}^n w_i² (2L(1 + M) Z_i + σ²), (18)

β_2 := ∑_{i=1}^n w_i² Z_i. (19)

The iterates of DSGD with optimal client sampling (7) satisfy

E[(r^{k+1})²] ≤ (1 − µη^k) E[(r^k)²] + (η^k)² β_1.

Remark 14 (Interpretation of Theorem 13). We first look at the best- and worst-case scenarios. In the best case, γ^k = 1 for all k. This implies that there is no loss of speed compared to the method with full participation. This is indeed confirmed by our theory, as the obtained recursion recovers the best-known rate of DSGD in the full participation regime (Gower et al., 2019, Theorem 3.1). To provide better intuition, we include a full derivation for this case. To match their (stronger) assumptions, we let M = 0 and w_i = 1/n.
In full participation, we have γ^k = 1 for all k. Taking the same step size η for all k then leads to

E[(r^{k+1})²] ≤ (1 − µη) E[(r^k)²] + η² β_1.

Applying this inequality recursively yields

E[(r^K)²] ≤ (1 − µη)^K E[(r^0)²] + η β_1 / µ,

which is equivalent to the result in Gower et al. (2019, Theorem 3.1). Similarly, in the worst case we have γ^k = m/n for all k, which corresponds to uniform sampling with sample size m, and our recursion recovers the best-known rate of DSGD in this regime. This is expected, as α^k = 1 in (15) implies that every update U_i^k is equally informative, and thus it is theoretically impossible to obtain a better rate than that of uniform sampling in the worst-case scenario. In the general scenario, our recursion sits somewhere between full and uniform partial participation, where the actual position is determined by the γ^k's, which capture the distribution of the updates (here, gradients) across the clients. For instance, with more γ^k's tending to 1, we are closer to the full participation regime; similarly, with more γ^k's tending to m/n, we are closer to the rate of uniform partial participation.

Theorem 15 (DSGD, non-convex). Let f_i be L-smooth for i = 1, ..., n. Suppose that Assumptions 7 and 9 hold. Let η^k be the step size and define

β^k := (L / (2γ^k)) ((1 + M − γ^k) W ρ + ∑_{i=1}^n w_i² σ²).

The iterates of DSGD with optimal client sampling (7) satisfy

E[f(x^{k+1})] ≤ E[f(x^k)] − η^k (1 − ((1 + M) L / (2γ^k)) η^k) E[‖∇f(x^k)‖²] + (η^k)² β^k. (24)

Remark 16 (Interpretation of Theorem 15). The recursion (24) recovers the standard form of the one-step convergence result for DSGD in the non-convex setting. As with the previous results, this convergence bound sits between the best-known rates for full participation and uniform sampling (Bottou et al., 2018, Theorem 4.8).

3.2 Federated Averaging (FedAvg) with Optimal Client Sampling

Pseudo-code adapting the standard FedAvg algorithm to our framework is provided in Algorithm 3.
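Before the formal statements, the end-to-end control flow of FedAvg with optimal client sampling can be sketched on a toy problem. The quadratic local objectives, noise-free local gradients, and the truncated proportional probability rule (a simple stand-in for Algorithm 1/2) below are all illustrative assumptions, not the paper's experimental setup:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, R, m = 10, 2, 5, 5
eta_l, eta_g = 0.1, 1.0
a = rng.normal(size=(n, d))          # client i holds f_i(x) = ||x - a_i||^2 / 2
w = np.full(n, 1.0 / n)              # uniform client weights
x = 5.0 * np.ones(d)                 # initial global model
x_star = a.mean(axis=0)              # minimizer of f(x) = sum_i w_i f_i(x)

for k in range(60):
    delta = np.empty((n, d))
    for i in range(n):               # R local gradient steps per client
        y = x.copy()
        for _ in range(R):
            y -= eta_l * (y - a[i])  # grad f_i(y) = y - a_i (noise-free here)
        delta[i] = x - y             # local update Delta y_i = x - y_{i,R}
    u = w * np.linalg.norm(delta, axis=1)
    p = np.minimum(m * u / u.sum(), 1.0)   # truncated proportional probabilities
    mask = rng.random(n) < p               # independent client participation
    G = ((mask * w / p)[:, None] * delta).sum(axis=0)
    x = x - eta_g * G                      # master step with the unbiased aggregate

assert np.linalg.norm(x - x_star) < 1.0
```

Despite only about half the clients reporting back per round in expectation, the inverse-probability scaling keeps the aggregated step unbiased, and the global model contracts toward the minimizer of the weighted objective.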
This subsection presents convergence analyses for FedAvg with optimal client sampling in both convex and non-convex settings.

Algorithm 3 FedAvg with Optimal Client Sampling
1: Input: initial global model x^1, global and local step sizes η_g^k, η_l^k
2: for each round k = 1, ..., K do
3:   master broadcasts x^k to all clients i ∈ [n]
4:   for each client i ∈ [n] (in parallel) do
5:     initialize local model y_{i,0}^k ← x^k
6:     for r = 1, ..., R do
7:       compute mini-batch gradient g_i(y_{i,r−1}^k)
8:       update y_{i,r}^k ← y_{i,r−1}^k − η_l^k g_i(y_{i,r−1}^k)
9:     end for
10:    compute U_i^k := Δy_i^k = x^k − y_{i,R}^k
11:    compute p_i^k using Algorithm 1 or 2
12:    send (w_i / p_i^k) Δy_i^k to the master with probability p_i^k
13:  end for
14:  master computes Δx^k = ∑_{i∈S^k} (w_i / p_i^k) Δy_i^k
15:  master updates the global model x^{k+1} ← x^k − η_g^k Δx^k
16: end for

Theorem 17 (FedAvg, strongly convex). Let f_i be L-smooth and µ-strongly convex for i = 1, ..., n. Suppose that Assumption 8 holds. Let η^k := R η_l^k η_g^k be the effective step size, with η_g^k ≥ (∑_{i=1}^n w_i²)^{1/2}. Choose η^k ∈ (0, (1/8) min{1/(L(2 + M/R)), γ^k/((1 + W(1 + M/R)) L)}], and define

β_1^k := 2σ² ∑_{i=1}^n w_i² + (4L (M/R + 1) / γ^k) ∑_{i=1}^n w_i² Z_i, (25)

β_2 := 72L² (1 + M) ∑_{i=1}^n w_i Z_i. (26)

The iterates of FedAvg (R ≥ 2) with optimal client sampling (7) satisfy

(3/8) E[f(x^k) − f*] ≤ (1/η^k) E[(r^k)²] − (1/η^k) E[(r^{k+1})²] + η^k β_1^k + (η^k)² β_2. (27)

Theorem 18 (FedAvg, non-convex). Let f_i be L-smooth for all i = 1, ..., n. Suppose that Assumptions 8 and 9 hold. Let η^k := R η_l^k η_g^k be the effective step size, with η_g^k ≥ (∑_{i=1}^n w_i²)^{1/2}. Choose η^k ∈ (0, 1/(8L(2 + M/R))]. The iterates of FedAvg (R ≥ 2) with optimal client sampling (7) satisfy

E[f(x^{k+1})] ≤ E[f(x^k)] − (3η^k/8) E[‖∇f(x^k)‖²] + η^k ρ/8 + (η^k)² β^k. (29)

Remark 19 (Interpretation of Theorems 17 and 18). The convergence guarantees from Theorems 17 and 18 sit somewhere between those for full and uniform partial participation. The actual position is again determined by the distribution of the updates, which are linked to the γ^k's.
In the edge cases, i.e., γ^k = 1 (best case) or γ^k = m/n (worst case), we recover the state-of-the-art complexity guarantees provided in Karimireddy et al. (2019, Theorem I) in both regimes. Note that our results are slightly more general, as Karimireddy et al. (2019) assume M = 0 and w_i = 1/n.

4 Related Work

This section reviews prior works that are closely or broadly related to our proposed method.

4.1 Importance Client Sampling in Federated Learning

Several recent works have studied efficient importance-based client sampling methods in FL (Cho et al., 2020; Nguyen et al., 2020; Ribero & Vikalo, 2020; Lai et al., 2021; Luo et al., 2022). Unfortunately, none of these methods is principled, as they rely on heuristics, historical losses, or partial information, which can be seen as proxies for our optimal client sampling. Furthermore, they violate at least one of the core privacy requirements of FL (secure aggregation and/or stateless clients). Specifically, the client selection strategy proposed by Lai et al. (2021) is based on a heuristic of the system and statistical utility of clients, which reveals the identity of clients; Ribero & Vikalo (2020) propose to model the progression of the model parameters by an Ornstein-Uhlenbeck process based on partial information, where the master needs to process the raw update from each client; and the method of Cho et al. (2020) biases client selection towards clients with higher local losses, which reveals the state of each individual client. In contrast, our proposed method is the first principled optimal client sampling strategy in the sense that it minimizes the variance of the master update while remaining compatible with the core privacy requirements of FL.
We note that the client sampling/selection techniques mentioned in this section could be made compatible with our framework presented in Section 2, but they would not lead to the optimal method, as they are only proxies for optimal sampling.

4.2 Importance Sampling in Stochastic Optimization
Importance sampling methods for optimization have been studied extensively in the last few years in several contexts, including convex optimization and deep learning. LASVM, developed in Bordes et al. (2005), is an online algorithm that uses importance sampling to train kernelized support vector machines. The first importance sampling for randomized coordinate descent methods was proposed in the seminal paper of Nesterov (2012). It was shown by Richtárik & Takáč (2014) that the proposed sampling is optimal. Later, several extensions and improvements followed, e.g., Shalev-Shwartz & Zhang (2014); Lin et al. (2014); Fercoq & Richtárik (2015); Qu et al. (2015); Allen-Zhu et al. (2016); Stich et al. (2017). Another branch of work studies sample complexity. In Needell et al. (2014); Zhao & Zhang (2015), the authors make a connection with the variance of the gradient estimates of SGD and show that the optimal sampling distribution is proportional to the per-sample gradient norm. However, obtaining this distribution is as expensive as computing the full gradient, and thus it is not practical. For simpler problems, one can sample proportionally to the norms of the inputs, which can be linked to the Lipschitz constants of the per-sample loss functions for linear and logistic regression. For instance, it was shown by Horváth & Richtárik (2019) that a static optimal sampling can be constructed even for mini-batches, with probabilities proportional to the per-sample Lipschitz constants, under the assumption that these constants are known.
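The optimality of gradient-norm-proportional sampling mentioned above can be seen in a toy experiment (our own illustration, not from the cited works): for a single-sample unbiased estimator of a mean of nonnegative scalars, choosing probabilities proportional to the magnitudes drives the estimator variance to zero, while uniform probabilities leave it large when the magnitudes are heterogeneous.

```python
import numpy as np

# Toy illustration: estimate g_bar = (1/n) sum_i g_i with one sample.
# Drawing index i with probability q_i and returning g_i / (n * q_i) is
# unbiased for any valid q; q_i proportional to g_i is variance-optimal.
rng = np.random.default_rng(0)
g = np.abs(rng.normal(size=100)) * np.arange(1, 101)  # heterogeneous magnitudes
n = len(g)
target = g.mean()

def estimator_variance(q):
    vals = g / (n * q)                              # the one-sample estimates
    return float(np.sum(q * (vals - target) ** 2))  # exact variance

q_uniform = np.full(n, 1.0 / n)
q_importance = g / g.sum()                          # proportional to g_i

# With q_i proportional to g_i, every estimate equals the target exactly,
# so the variance collapses to (numerically) zero.
assert estimator_variance(q_importance) < 1e-9 < estimator_variance(q_uniform)
```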
Unfortunately, importance measures such as the smoothness of the gradient are often hard to compute or estimate for more complicated models such as those arising in deep learning, where most importance sampling schemes are based on heuristics. For instance, a manually designed sampling scheme was proposed in Bengio et al. (2009). It was inspired by the perceived way that human children learn; in practice, they provide the network with examples of increasing difficulty in an arbitrary manner. In a diametrically opposite approach, it is common in deep embedding learning to sample hard examples, because of the plethora of easy non-informative ones (Schroff et al., 2015; Simo-Serra et al., 2015). Other approaches use the history of losses for previously seen samples to create the sampling distribution, and sample either proportionally to the loss or based on the loss ranking (Schaul et al., 2015; Loshchilov & Hutter, 2015). Katharopoulos & Fleuret (2018) propose to sample based on the gradient norm of a small uniformly sampled subset of samples. Although our proposed optimal sampling method adapts and extends the importance sampling results of Horváth & Richtárik (2019) to the distributed setting of FL, it does not suffer from any of the limitations discussed above, since the motivation of our work is to reduce communication rather than computation. In particular, our method allows for any budget m < n on the number of participating clients, which generalizes the theoretical results of Zhao & Zhang (2015), which only apply to the case m = 1.

5 Experiments
This section empirically evaluates our optimal client sampling method on standard federated datasets from LEAF (Caldas et al., 2018).
Figure 2: Distributions of the three modified Federated EMNIST training sets, plotting the number of clients against the number of training images a client holds. Dataset 1 (32,906 training images in total): 10% of clients hold 82% of the training images. Dataset 2 (29,906 training images in total): 20% of clients hold 90% of the training images. Dataset 3 (27,599 training images in total): 50% of clients hold 98% of the training images.

We compare our method with 1) full participation, where all available clients participate in each round, and 2) the baseline, where participating clients are sampled uniformly from the available clients in each round. We chose not to compare with other client sampling methods, as such comparisons would be unfair: they violate the privacy requirements of FL, and our method is the only importance client sampling strategy that is deployable to real-world FL systems (cf. Section 4.1). We simulate the cross-device FL distributed setting and train our models using TensorFlow Federated (TFF). We conduct our evaluations using FedAvg with Algorithm 2, as it supports stateless clients and secure aggregation4. We extend the TFF implementation of FedAvg to fit our framework. For all three methods, we report validation accuracy and (local) training loss as a function of the number of communication rounds and the number of bits communicated from clients to the master5. Each figure displays the mean performance with standard deviation over 5 independent runs for each of the three compared methods.
For a fair comparison, we use the same random seed for all three methods in a single run and vary random seeds across different runs. Detailed experimental settings and extra results can be found in Appendices F.1 and F.2. Our code together with the datasets can be found at https://github.com/SamuelHorvath/FL-optimal-client-sampling.

5.2 Federated EMNIST Dataset
We first evaluate our method on the Federated EMNIST (FEMNIST) image dataset for image classification. Since it is a well-balanced dataset with data of similar quality on each client, we modify its training set by removing some images from some clients, in order to better simulate the conditions in which our proposed method brings significant theoretical improvements. As a result, we produce three unbalanced training sets6, as summarized in Figure 2. We use the same CNN model as the one used in (McMahan et al., 2017). For validation, we use the unchanged EMNIST validation set, which consists of 40,832 images. In each communication round, n = 32 clients are sampled uniformly from the client pool, each of which then performs several SGD steps on its local training images for 1 epoch with batch size 20. For partial participation, the expected number of clients allowed to communicate their updates back to the master is set to m ∈ {3, 6}. We use vanilla SGD optimizers with constant step sizes for both clients and the master, with η_g = 1 and η_l tuned on a holdout set. For full participation and optimal sampling, it turns out that η_l = 2^{-3} is the optimal local step size for all three datasets. For uniform sampling, the optimal is η_l = 2^{-5} for Dataset 1 and η_l = 2^{-4} for Datasets 2 and 3. We set j_max = 4 and include the extra communication costs in our results. The main results are shown in Figures 3, 4 and 5.

4 We compared the results of using Algorithms 1 and 2 as a subroutine for all experiments.
Their results are virtually identical, so we only show results for Algorithm 2 and argue that the performance loss caused by its approximation is negligible.
5 The communication from the master to clients is not considered a bottleneck and is thus not included in the results. This is a standard consideration for distributed systems, as one-to-many communication primitives (i.e., from the master to clients) are several orders of magnitude faster than many-to-one communication primitives (i.e., from clients to the master). This gap is further exacerbated in FL due to the large number of clients and slow client connections.
6 The aim of creating various unbalanced datasets is to show that optimal sampling has larger performance gains over uniform sampling on more unbalanced datasets, since the α^k's (defined in Equation (15)) are more likely to be close to zero in this case. These datasets are created using the following procedure. Let s ∈ (0, 1) and a, b ∈ N+ with a < b. For a given client with n_c examples, we keep this client unchanged if n_c ≤ a or n_c ≥ b; otherwise, we remove this client from the dataset with probability s, or keep only a randomly sampled examples in this client with probability 1 − s.

Figure 3: (FEMNIST Dataset 1, n = 32) Validation accuracy and (local) training loss as a function of the number of communication rounds and the number of bits communicated from clients to the master, comparing optimal and uniform sampling with m ∈ {3, 6} and full participation (m = 32).
Figure 4: (FEMNIST Dataset 2, n = 32) Validation accuracy and (local) training loss as a function of the number of communication rounds and the number of bits communicated from clients to the master, comparing optimal and uniform sampling with m ∈ {3, 6} and full participation (m = 32).

Figure 5: (FEMNIST Dataset 3, n = 32) Validation accuracy and (local) training loss as a function of the number of communication rounds and the number of bits communicated from clients to the master, comparing optimal and uniform sampling with m ∈ {3, 6} and full participation (m = 32).

5.3 Shakespeare Dataset
We also evaluate our method on the Shakespeare text dataset for next-character prediction. Unlike in the FEMNIST experiments, we do not change the number of examples held by each client in this dataset. The vocabulary set for this task consists of 86 unique characters. The dataset contains 715 clients, each corresponding to a character in Shakespeare's plays. We divide the text into batches such that each batch contains 8 example sequences of length 5. We use a two-hidden-layer GRU model with 256 units in each hidden layer.
We set n ∈ {32, 128}, m ∈ {2, 4, 6, 12}, j_max = 4, and run several SGD steps for 1 epoch on each client's local dataset in every communication round. We use vanilla SGD optimizers with constant step sizes, with η_g = 1 and η_l tuned on a holdout set. For full participation and optimal sampling, it turns out that the optimal is η_l = 2^{-2}. For uniform sampling, the optimal is η_l = 2^{-3}. The main results are shown in Figures 6 and 7.

5.4 Discussions
As predicted by our theory, the performance of FedAvg with our proposed optimal client sampling strategy lies between that with full participation and that with uniform partial participation. For all datasets, the optimal sampling strategy performs slightly worse than, but is still competitive with, the full participation strategy in terms of the number of communication rounds: it almost reaches the performance of full participation while less than 10% of the available clients communicate their updates back to the master (in the cases m = 2, 3). As we increase the expected number m of sampled clients, the performance of optimal sampling improves accordingly, which is consistent with our theory (e.g., Theorem 18) and with the observations from Yang et al. (2021), and quickly becomes almost identical to that of full participation. Note that the uniform sampling strategy performs significantly worse, which indicates that a careful choice of sampling probabilities can go a long way towards closing the gap between the performance of naive uniform sampling and full participation.
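The uplink bit counts reported in this section can be approximated with simple accounting (our own sketch, assuming 32-bit floats per coordinate and ignoring the small extra cost of communicating the update norms): the per-round uplink cost scales with the expected number of participating clients, so a budget of m = 3 instead of n = 32 cuts the per-round client-to-master traffic by roughly an order of magnitude.

```python
# Hypothetical uplink accounting: each of the (expected) participating
# clients sends a d-dimensional update at bits_per_param bits per coordinate.
def uplink_bits_per_round(expected_clients, d, bits_per_param=32):
    return expected_clients * d * bits_per_param

d = 10**6  # hypothetical model size, not a figure from the paper
ratio = uplink_bits_per_round(32, d) / uplink_bits_per_round(3, d)
assert ratio > 10.0  # m = 3 sends over 10x fewer uplink bits than m = n = 32
```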
Figure 6: (Shakespeare Dataset, n = 32) Validation accuracy and (local) training loss as a function of the number of communication rounds and the number of bits communicated from clients to the master, comparing optimal and uniform sampling with m ∈ {2, 4, 6, 12} and full participation (m = 32).

Figure 7: (Shakespeare Dataset, n = 128) Validation accuracy and (local) training loss as a function of the number of communication rounds and the number of bits communicated from clients to the master, comparing optimal and uniform sampling with m ∈ {2, 4, 6, 12} and full participation (m = 128).

Also, it can be seen that the performance of our optimal client sampling strategy with m = 6 and m = 12 matches that of full participation in the cases n = 32 and n = 128, respectively, in terms of the number of communication rounds.
We therefore conjecture that m = O(√n) is sufficient for our optimal client sampling strategy to obtain validation accuracy identical to that of full participation in terms of the number of communication rounds. More importantly, and this was the main motivation of our work, our optimal sampling strategy is significantly better than both the uniform sampling and full participation strategies when we compare validation accuracy as a function of the number of bits communicated from clients to the master. For instance, on FEMNIST Dataset 1 (Figure 3), while our optimal sampling approach with m = 3 reaches around 85% validation accuracy after 2^6 × 10^8 communicated bits, neither the full participation strategy nor the uniform sampling strategy with m = 3 is able to exceed 40% validation accuracy within the same communication budget. Indeed, to reach the same 85% validation accuracy, the full participation approach needs to communicate more than 2^9 × 10^8 bits, i.e., 8× more, and the uniform sampling approach needs to communicate about the same number of bits as full participation, or even more. The results for FEMNIST Datasets 2 and 3 and for the Shakespeare dataset are of a similar qualitative nature, showing that these conclusions are robust across the datasets considered. Finally, it is also worth noting that the empirical results from Sections 5.2 and 5.3 confirm that our optimal sampling strategy allows for larger step sizes than uniform sampling, as the hyperparameter search returns larger step sizes η_l for optimal sampling than for uniform sampling. In Appendix G, we present an additional experiment on the Federated CIFAR100 dataset from LEAF. The Federated CIFAR100 dataset is a balanced dataset, where every client holds the same number of training images. In this setting, letting all clients perform 1 epoch of local training means that all clients take the same number of local steps in each round.
We show that our optimal client sampling scheme still achieves better performance than uniform sampling on this balanced dataset.

6 Conclusion and Future Work
In this work, we have proposed a principled optimal client sampling strategy to address the communication bottleneck issue of federated learning. Our optimal client sampling probabilities can be computed using a closed-form formula by aggregating only the norms of the updates. Furthermore, our method is the first principled importance client sampling strategy that is compatible with stateless clients and secure aggregation. We have obtained convergence guarantees for our method with DSGD and FedAvg under relaxed assumptions, and have performed empirical evaluations of our method on federated datasets from the LEAF database. The empirical results show that our method is superior to uniform sampling and close to full participation, which corroborates our theoretical analysis. We believe that our proposed optimal client sampling scheme will be useful in reducing communication costs in real-world FL systems. Some directions for future work are as follows. A straightforward extension would be to combine our proposed optimal sampling approach with communication compression methods to further reduce the sizes of the communicated updates. In settings where the communication latency is high, our proposed method may not be effective in reducing the real communication time; it would be interesting to extend our optimal client sampling strategy to take into account the constraints of local clients (e.g., computational speed, network bandwidth, and communication latency).

Acknowledgments
We thank Jakub Konečný for helpful discussions and comments. Most of the work was done when WC was a research intern at KAUST and when SH was a PhD student at KAUST.

References
Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic.
QSGD: Communication-efficient SGD via gradient quantization and encoding. Advances in Neural Information Processing Systems, 30, 2017.
Zeyuan Allen-Zhu, Zheng Qu, Peter Richtárik, and Yang Yuan. Even faster accelerated coordinate descent using non-uniform sampling. In International Conference on Machine Learning, pp. 1110–1119, 2016.
Debraj Basu, Deepesh Data, Can Karakus, and Suhas Diggavi. Qsparse-local-SGD: Distributed SGD with quantization, sparsification and local computations. In Advances in Neural Information Processing Systems, pp. 14668–14679, 2019.
Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41–48, 2009.
Keith Bonawitz, Vladimir Ivanov, Ben Kreuter, Antonio Marcedone, H Brendan McMahan, Sarvar Patel, Daniel Ramage, Aaron Segal, and Karn Seth. Practical secure aggregation for privacy-preserving machine learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pp. 1175–1191, 2017.
Antoine Bordes, Seyda Ertekin, Jason Weston, and Léon Bottou. Fast kernel classifiers with online and active learning. Journal of Machine Learning Research, 6(Sep):1579–1619, 2005.
Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018.
Sebastian Caldas, Sai Meher Karthik Duddu, Peter Wu, Tian Li, Jakub Konečný, H Brendan McMahan, Virginia Smith, and Ameet Talwalkar. LEAF: A benchmark for federated settings. arXiv preprint arXiv:1812.01097, 2018.
Yae Jee Cho, Jianyu Wang, and Gauri Joshi. Client selection in federated learning: Convergence analysis and power-of-choice selection strategies. arXiv preprint arXiv:2010.01243, 2020.
Wenliang Du and Mikhail J Atallah. Secure multi-party computation problems and their applications: a review and open problems. In Workshop on New Security Paradigms, 2001.
Olivier Fercoq and Peter Richtárik. Accelerated, parallel, and proximal coordinate descent. SIAM Journal on Optimization, 25(4):1997–2023, 2015.
WM Goodall. Television by pulse code modulation. Bell System Technical Journal, 30(1):33–49, 1951.
Slawomir Goryczka and Li Xiong. A comprehensive comparison of multiparty secure additions with differential privacy. IEEE Transactions on Dependable and Secure Computing, 14:463–477, 2015.
Robert Mansel Gower, Nicolas Loizou, Xun Qian, Alibek Sailanbayev, Egor Shulgin, and Peter Richtárik. SGD: General analysis and improved rates. Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, 2019.
Filip Hanzely and Peter Richtárik. Federated learning of a mixture of global and local models. arXiv preprint arXiv:2002.05516, 2020.
Samuel Horváth and Peter Richtárik. Nonconvex variance reduced optimization with arbitrary sampling. Proceedings of the 36th International Conference on Machine Learning, 2019.
Samuel Horváth and Peter Richtárik. A better alternative to error feedback for communication-efficient distributed learning. arXiv preprint arXiv:2006.11077, 2020.
Samuel Horváth, Chen-Yu Ho, Ľudovit Horváth, Atal Narayan Sahu, Marco Canini, and Peter Richtárik. Natural compression for distributed deep learning. arXiv preprint arXiv:1905.10988, 2019.
Junxian Huang, Feng Qian, Yihua Guo, Yuanyuan Zhou, Qiang Xu, Z Morley Mao, Subhabrata Sen, and Oliver Spatscheck. An in-depth study of LTE: Effect of network protocol and application behavior on performance. SIGCOMM Computer Communication Review, 43:363–374, 2013.
Dzmitry Huba, John Nguyen, Kshitiz Malik, Ruiyu Zhu, Mike Rabbat, Ashkan Yousefpour, Carole-Jean Wu, Hongyuan Zhan, Pavel Ustinov, Harish Srinivas, et al. Papaya: Practical, private, and scalable federated learning. Proceedings of Machine Learning and Systems, 4:814–832, 2022.
Peter Kairouz, H Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Keith Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977, 2019.
Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 795–811. Springer, 2016.
Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank J Reddi, Sebastian U Stich, and Ananda Theertha Suresh. SCAFFOLD: Stochastic controlled averaging for on-device federated learning. arXiv preprint arXiv:1910.06378, 2019.
Angelos Katharopoulos and François Fleuret. Not all samples are created equal: Deep learning with importance sampling. arXiv preprint arXiv:1803.00942, 2018.
Ahmed Khaled, Konstantin Mishchenko, and Peter Richtárik. Tighter theory for local SGD on identical and heterogeneous data. In The 23rd International Conference on Artificial Intelligence and Statistics (AISTATS 2020), 2020.
Jakub Konečný and Peter Richtárik. Randomized distributed mean estimation: Accuracy vs. communication. Frontiers in Applied Mathematics and Statistics, 4:62, 2018.
Fan Lai, Xiangfeng Zhu, Harsha V Madhyastha, and Mosharaf Chowdhury. Oort: Efficient federated learning via guided participant selection. In 15th USENIX Symposium on Operating Systems Design and Implementation (OSDI 21), pp. 19–35, 2021.
Tian Li, Ahmad Beirami, Maziar Sanjabi, and Virginia Smith. Tilted empirical risk minimization. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=K5YasWXZT3O.
Qihang Lin, Zhaosong Lu, and Lin Xiao. An accelerated proximal coordinate gradient method. In Advances in Neural Information Processing Systems, pp. 3059–3067, 2014.
Tao Lin, Sebastian U Stich, Kumar Kshitij Patel, and Martin Jaggi. Don't use large mini-batches, use local SGD. arXiv preprint arXiv:1808.07217, 2018.
Ilya Loshchilov and Frank Hutter. Online batch selection for faster training of neural networks. arXiv preprint arXiv:1511.06343, 2015.
Bing Luo, Wenli Xiao, Shiqiang Wang, Jianwei Huang, and Leandros Tassiulas. Tackling system and statistical heterogeneity for federated learning with adaptive client sampling. In IEEE INFOCOM 2022 - IEEE Conference on Computer Communications, pp. 1739–1748. IEEE, 2022.
Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pp. 1273–1282, 2017.
Konstantin Mishchenko, Filip Hanzely, and Peter Richtárik. 99% of parallel optimization is inevitably a waste of time. arXiv preprint arXiv:1901.09437, 2019.
Deanna Needell, Rachel Ward, and Nati Srebro. Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm. In Advances in Neural Information Processing Systems, pp. 1017–1025, 2014.
Yu Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.
Hung T Nguyen, Vikash Sehwag, Seyyedali Hosseinalipour, Christopher G Brinton, Mung Chiang, and H Vincent Poor. Fast-convergent federated learning. IEEE Journal on Selected Areas in Communications, 39(1):201–218, 2020.
Zheng Qu, Peter Richtárik, and Tong Zhang. Quartz: Randomized dual coordinate ascent with arbitrary sampling. In Advances in Neural Information Processing Systems 28, pp. 865–873, 2015.
Ali Ramezani-Kebrya, Fartash Faghri, and Daniel M Roy. NUQSGD: Improved communication efficiency for data-parallel SGD via nonuniform quantization. arXiv preprint arXiv:1908.06077, 2019.
Monica Ribero and Haris Vikalo.
Communication-efficient federated learning via optimal client sampling. arXiv preprint arXiv:2007.15197, 2020.
Peter Richtárik and Martin Takáč. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 144(1-2):1–38, 2014.
Lawrence Roberts. Picture coding using pseudo-random noise. IRE Transactions on Information Theory, 8(2):145–154, 1962.
Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015.
Florian Schroff, Dmitry Kalenichenko, and James Philbin. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823, 2015.
Shai Shalev-Shwartz and Tong Zhang. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. In International Conference on Machine Learning, pp. 64–72, 2014.
Edgar Simo-Serra, Eduard Trulls, Luis Ferraz, Iasonas Kokkinos, Pascal Fua, and Francesc Moreno-Noguer. Discriminative learning of deep convolutional feature point descriptors. In Proceedings of the IEEE International Conference on Computer Vision, pp. 118–126, 2015.
Jinhyun So, Başak Güler, and A Salman Avestimehr. Turbo-aggregate: Breaking the quadratic aggregation barrier in secure federated learning. IEEE Journal on Selected Areas in Information Theory, 2(1):479–489, 2021.
Sebastian U Stich. Local SGD converges fast and communicates little. ICLR 2019 - International Conference on Learning Representations, 2019.
Sebastian U Stich and Sai Praneeth Karimireddy. The error-feedback framework: Better rates for SGD with delayed gradients and compressed communication. ICLR 2020 - International Conference on Learning Representations, 2020.
Sebastian U Stich, Anant Raj, and Martin Jaggi. Safe adaptive importance sampling.
In Advances in Neural Information Processing Systems, pp. 4381–4391, 2017.
Sebastian U Stich, Jean-Baptiste Cordonnier, and Martin Jaggi. Sparsified SGD with memory. In Advances in Neural Information Processing Systems, pp. 4447–4458, 2018.
CH Van Berkel. Multi-core for mobile phones. In Conference on Design, Automation and Test in Europe, 2009.
Thijs Vogels, Sai Praneeth Karimireddy, and Martin Jaggi. PowerSGD: Practical low-rank gradient compression for distributed optimization. In Advances in Neural Information Processing Systems, pp. 14236–14245, 2019.
Hongyi Wang, Scott Sievert, Shengchao Liu, Zachary Charles, Dimitris Papailiopoulos, and Stephen Wright. ATOMO: Communication-efficient learning via atomic sparsification. Advances in Neural Information Processing Systems, 31, 2018.
Jianqiao Wangni, Jialei Wang, Ji Liu, and Tong Zhang. Gradient sparsification for communication-efficient distributed optimization. In Advances in Neural Information Processing Systems, pp. 1299–1309, 2018.
Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. TernGrad: Ternary gradients to reduce communication in distributed deep learning. In Advances in Neural Information Processing Systems, pp. 1509–1519, 2017.
Haibo Yang, Minghong Fang, and Jia Liu. Achieving linear speedup with partial worker participation in non-IID federated learning. arXiv preprint arXiv:2101.11203, 2021.
Hantian Zhang, Jerry Li, Kaan Kara, Dan Alistarh, Ji Liu, and Ce Zhang. ZipML: Training linear models with end-to-end low precision, and a little bit of deep learning. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 4035–4043. JMLR.org, 2017.
Peilin Zhao and Tong Zhang. Stochastic optimization with importance sampling for regularized loss minimization. In International Conference on Machine Learning, pp. 1–9, 2015.

A Proof of Lemma 1

Proof.
Our proof technique can be seen as an extended version of that in (Horváth & Richtárik, 2019). Let 1_{i∈S} = 1 if i ∈ S and 1_{i∈S} = 0 otherwise. Likewise, let 1_{i,j∈S} = 1 if i, j ∈ S and 1_{i,j∈S} = 0 otherwise. Note that E[1_{i∈S}] = p_i and E[1_{i,j∈S}] = p_{ij}. Next, let us compute the mean of X := Σ_{i∈S} (w_i ζ_i / p_i):

E[X] = Σ_{i=1}^n (w_i ζ_i / p_i) E[1_{i∈S}] = Σ_{i=1}^n w_i ζ_i = ζ.   (30)

Let A = [a_1, . . . , a_n] ∈ R^{d×n}, where a_i = w_i ζ_i / p_i, and let e be the vector of all ones in R^n. We now write the variance of X in a form which will be convenient for establishing a bound:

E[||X − E[X]||^2] = E[||X||^2] − ||E[X]||^2 = Σ_{i,j} p_{ij} a_i^T a_j − Σ_{i,j} p_i p_j a_i^T a_j = Σ_{i,j} (p_{ij} − p_i p_j) a_i^T a_j = e^T ((P − pp^T) ∘ A^T A) e.

Since, by assumption, we have P − pp^T ⪯ Diag(p ∘ v), we can further bound

e^T ((P − pp^T) ∘ A^T A) e ≤ e^T (Diag(p ∘ v) ∘ A^T A) e = Σ_{i=1}^n p_i v_i ||a_i||^2.

To obtain (5), it remains to combine this with (30). The inequality v_i ≥ 1 − p_i follows by comparing the diagonal elements of the two matrices in (4). Consider now the independent sampling. Clearly,

P − pp^T = Diag(p_1(1 − p_1), . . . , p_n(1 − p_n)) = Diag(p_1 v_1, . . . , p_n v_n),

which implies v_i = 1 − p_i.

B The Improvement Factor for Optimal Client Sampling

By Lemma 1, the independent sampling (which operates by independently flipping a coin and including element i in S with probability p_i) is optimal. In addition, for independent sampling, (5) holds with equality. Thus, letting Ũ_i^k := w_i U_i^k, we have

E[|| Σ_{i∈S^k} (w_i / p_i^k) U_i^k − Σ_{i=1}^n w_i U_i^k ||^2] = Σ_{i=1}^n ((1 − p_i^k) / p_i^k) ||Ũ_i^k||^2.   (31)

The optimal probabilities are obtained by minimizing (31) w.r.t. {p_i^k}_{i=1}^n subject to the constraints 0 ≤ p_i^k ≤ 1 and m ≥ b^k := Σ_{i=1}^n p_i^k.

Lemma 20. The optimization problem

min_{{p_i^k}_{i=1}^n} α_{S^k}({p_i^k}_{i=1}^n)  s.t.  0 ≤ p_i^k ≤ 1, i = 1, . . . , n, and m ≥ Σ_{i=1}^n p_i^k   (32)

has the following closed-form solution:

p_i^k = (m + l − n) ||Ũ_i^k|| / Σ_{j=1}^{l} ||Ũ_{(j)}^k||  if i ∉ A^k,  and  p_i^k = 1  if i ∈ A^k,   (33)

where ||Ũ_{(j)}^k|| denotes the j-th smallest value among ||Ũ_1^k||, ||Ũ_2^k||, . . .
, U k n , l is the largest integer for which 0 < m + l n Pl i=1 U k (i) U k (l) (note that this inequality at least holds for l = n m + 1), and Ak contains indices i such that U k i U k (l+1) . Proof. This proof uses an argument similar to that in the proof of Lemma 2 in Horváth & Richtárik (2019). We first show that (33) is the solution to the following optimization problem: min {pk i }n i=1 ΩSk({pk i }n i=1) := E s.t. 0 pk i 1, i = 1, , n and m The Lagrangian of this optimization problem is given by L({pk i }n i=1, {λi}n i=1, {ui}n i=1, y) = ΩSk({pk i }n i=1) i=1 ui(1 pi) y Since all constraints are linear and the support of {pk i }n i=1 is convex, the KKT conditions hold. Therefore, the solution (33) can be deduced from the KKT conditions. Now, notice that αSk({pk i }n i=1) and ΩSk({pk i }n i=1) are equal up to a constant E h Pn i=1 U k i 2i : αSk({pk i }n i=1) = ΩSk({pk i }n i=1) E This indicates that (33) is also the solution to the original optimization problem (32). Plugging the optimal probabilities obtained in (33) into (31) gives With m U k (n) Pn i=1 U k i , we have 1 m Pn i=1 U k i 2 Pn i=1 U k i 2 Published in Transactions on Machine Learning Research (08/2022) For independent uniform sampling U k U (p U i = m n for all i), we have wi p U i Uk i i=1 wi Uk i Putting them together gives the improvement factor: αk := α Sk αU k = E P pk i Uk i Pn i=1 wi Uk i 2 p U i Uk i Pn i=1 wi Uk i 2 E h Pn i=1 U k i 2i n E h Pn i=1 U k i 2i 1. The upper bound is attained when all U k i are identical. Note that the lower bound 0 can also be attained in the case where the number of non-zero updates is at most m. These considerations are discussed in the main paper. C DSGD with Optimal Client Sampling C.1 Proof of Theorem 13 Proof. L-smoothness of fi and the assumption on the gradient imply that the inequality E h gk i 2i 2L(1 + M)(fi(xk) fi(x ) + Zi) + σ2 holds for all k 0. 
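To make the closed-form solution (33) and the variance identity (31) concrete, the following sketch (our own illustrative code, not part of the paper's artifact; all function and variable names are ours) computes the optimal probabilities via the $l$-search of Lemma 20 for scalar stand-ins $u_i = w_i\|U^k_i\|$, and checks by Monte Carlo that independent sampling with those probabilities attains the variance $\sum_i \frac{1-p_i}{p_i}u_i^2$ from (31).

```python
import numpy as np

def optimal_probabilities(u, m):
    """Closed-form probabilities of Lemma 20 for importance values
    u_i = w_i * ||U_i|| (assumed strictly positive here) and budget m < n."""
    u = np.asarray(u, dtype=float)
    n = len(u)
    order = np.argsort(u)                  # ascending sort of the values
    s = u[order]
    # largest l with 0 < m + l - n <= (sum of l smallest) / (l-th smallest);
    # Lemma 20 guarantees this holds at least for l = n - m + 1
    l = n
    while not (0 < m + l - n <= s[:l].sum() / s[l - 1]):
        l -= 1
    p = np.ones(n)                         # the n - l largest values get p_i = 1
    p[order[:l]] = (m + l - n) * s[:l] / s[:l].sum()
    return p

def sampling_variance(u, p, trials=200_000):
    """Monte Carlo estimate of E[(sum_{i in S} u_i/p_i - sum_i u_i)^2]
    under independent sampling with inclusion probabilities p."""
    rng = np.random.default_rng(0)
    mask = rng.random((trials, len(u))) < p    # independent coin flips
    est = (mask * (u / p)).sum(axis=1)
    return np.mean((est - u.sum()) ** 2)

u = np.array([0.2, 0.5, 1.0, 4.0])
m = 2
p = optimal_probabilities(u, m)            # the largest value gets p = 1
assert abs(p.sum() - m) < 1e-9 and p.max() <= 1.0
closed = ((1 - p) / p * u ** 2).sum()      # closed-form variance from (31)
mc = sampling_variance(u, p)
print(p, closed, mc)                       # mc should be close to closed
```

On this toy input the search settles at $l = 3$: the three smallest values receive probabilities proportional to their magnitudes, the largest receives probability $1$, and the probabilities sum to $m$ exactly.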
We first take expectations over $x^{k+1}$ conditioned on $x^k$ and over the sampling $S^k$. Writing $r^k := x^k - x^\star$ and $G^k := \sum_{i\in S^k}\frac{w_i}{p^k_i}\,g^k_i$, we have
\[
\mathbb{E}\big[\|r^{k+1}\|^2\big]
= \|r^k\|^2 - 2\eta^k\,\mathbb{E}\big[\langle G^k, r^k\rangle\big] + (\eta^k)^2\,\mathbb{E}\big[\|G^k\|^2\big]
= \|r^k\|^2 - 2\eta^k\langle\nabla f(x^k), r^k\rangle + (\eta^k)^2\,\mathbb{E}\big[\|G^k\|^2\big]
\le (1 - \mu\eta^k)\|r^k\|^2 - 2\eta^k\big(f(x^k) - f^\star\big) + (\eta^k)^2\,\mathbb{E}\big[\|G^k\|^2\big],
\]
where the last step uses $\mu$-strong convexity. We decompose $\mathbb{E}[\|G^k\|^2]$ into the variance due to the sampling $S^k$ and the second moment of the full update. For the sampling variance, Lemma 1 and the definition of $\alpha^k$ give
\[
\mathbb{E}\bigg[\Big\|G^k - \sum_{i=1}^n w_ig^k_i\Big\|^2\bigg]
\le \alpha^k\,\frac{n-m}{m}\sum_{i=1}^n w_i^2\,\mathbb{E}\big[\|g^k_i\|^2\big]
= \alpha^k\,\frac{n-m}{m}\sum_{i=1}^n w_i^2\Big(\mathbb{E}\big[\|\xi^k_i\|^2\big] + \|\nabla f_i(x^k)\|^2\Big)
\le \alpha^k\,\frac{n-m}{m}\sum_{i=1}^n w_i^2\Big(2L(1+M)\big(f_i(x^k) - f_i(x^\star) + Z_i\big) + \sigma^2\Big)
\le \alpha^k\,\frac{n-m}{m}\bigg(2WL(1+M)\big(f(x^k) - f^\star\big) + \sum_{i=1}^n w_i^2\big(2L(1+M)Z_i + \sigma^2\big)\bigg).
\]
For the full update, independence and zero mean of the $\xi^k_i$ yield
\[
\mathbb{E}\bigg[\Big\|\sum_{i=1}^n w_ig^k_i\Big\|^2\bigg]
= \sum_{i=1}^n \mathbb{E}\big[\|w_ig^k_i - w_i\nabla f_i(x^k)\|^2\big] + \|\nabla f(x^k)\|^2
= \sum_{i=1}^n w_i^2\,\mathbb{E}\big[\|\xi^k_i\|^2\big] + \|\nabla f(x^k)\|^2
\le \sum_{i=1}^n w_i^2\big(2LM(f_i(x^k) - f_i^\star) + \sigma^2\big) + 2L\big(f(x^k) - f^\star\big)
= 2L(1 + WM)\big(f(x^k) - f^\star\big) + \sum_{i=1}^n w_i^2\big(2LMZ_i + \sigma^2\big).
\]
Therefore, we obtain
\[
\mathbb{E}\big[\|r^{k+1}\|^2\big]
\le (1-\mu\eta^k)\|r^k\|^2 - 2\eta^k\big(f(x^k) - f^\star\big)
+ (\eta^k)^2\bigg(2L(1+WM)\big(f(x^k) - f^\star\big) + \sum_{i=1}^n w_i^2\big(2LMZ_i + \sigma^2\big)\bigg)
+ (\eta^k)^2\,\alpha^k\,\frac{n-m}{m}\bigg(2WL(1+M)\big(f(x^k) - f^\star\big) + \sum_{i=1}^n w_i^2\big(2L(1+M)Z_i + \sigma^2\big)\bigg)
\]
\[
\le (1-\mu\eta^k)\|r^k\|^2 - 2\eta^k\bigg(1 - \eta^k\,\frac{(\alpha^k(n-m)+m)(1+WM)L}{m}\bigg)\big(f(x^k) - f^\star\big)
+ (\eta^k)^2\,\frac{\alpha^k(n-m)+m}{m}\sum_{i=1}^n w_i^2\big(2L(1+M)Z_i + \sigma^2\big) - (\eta^k)^2\,2L\sum_{i=1}^n w_i^2Z_i.
\]
Now choose any $0 < \eta^k \le \frac{m}{(\alpha^k(n-m)+m)(1+WM)L}$ and define
\[
\beta_1 := \sum_{i=1}^n w_i^2\big(2L(1+M)Z_i + \sigma^2\big), \qquad \beta_2 := 2L\sum_{i=1}^n w_i^2Z_i, \qquad \gamma^k := \frac{m}{\alpha^k(n-m)+m} \in \Big[\frac{m}{n}, 1\Big].
\]
Taking full expectation yields the desired result:
\[
\mathbb{E}\big[\|r^{k+1}\|^2\big] \le (1-\mu\eta^k)\,\mathbb{E}\big[\|r^k\|^2\big] + (\eta^k)^2\Big(\frac{\beta_1}{\gamma^k} - \beta_2\Big).
\]

C.2 Proof of Theorem 15

Proof. Using equation (2), we have
\[
f(x^{k+1}) = f(x^k - \eta^kG^k) = f(x^k) - \eta^k\langle G^k, \nabla f(x^k)\rangle + \frac{(\eta^k)^2}{2}\,\big\langle G^k, \nabla^2f(z^k)\,G^k\big\rangle
\]
for some $z^k \in \mathbb{R}^d$. Since all the $f_i$'s are $L$-smooth, $f$ is also $L$-smooth. Therefore, we have $-LI \preceq \nabla^2f(x) \preceq LI$ for all $x \in \mathbb{R}^d$. Combining this with the fact that $G^k$ is an unbiased estimator of $\nabla f(x^k)$, we have
\[
\mathbb{E}\big[f(x^{k+1})\big] \le f(x^k) - \eta^k\|\nabla f(x^k)\|^2 + \frac{(\eta^k)^2L}{2}\,\mathbb{E}\big[\|G^k\|^2\big], \tag{34}
\]
where the expectations are conditioned on $x^k$. In Appendix C.1, we already obtained the upper bound for the last term in equation (34):
\[
\mathbb{E}\big[\|G^k\|^2\big]
\le (1+M)\,\alpha^k\,\frac{n-m}{m}\sum_{i=1}^n w_i^2\|\nabla f_i(x^k)\|^2 + \alpha^k\,\frac{n-m}{m}\sum_{i=1}^n w_i^2\sigma^2 + \sum_{i=1}^n w_i^2\big(M\|\nabla f_i(x^k)\|^2 + \sigma^2\big) + \|\nabla f(x^k)\|^2
\le (1+M)\Big(\alpha^k\,\frac{n-m}{m} + 1\Big)\sum_{i=1}^n w_i^2\|\nabla f_i(x^k)\|^2 + \Big(\alpha^k\,\frac{n-m}{m} + 1\Big)\sum_{i=1}^n w_i^2\sigma^2 + \|\nabla f(x^k)\|^2.
\]
By Assumption 9, we further bound
\[
\sum_{i=1}^n w_i^2\|\nabla f_i(x^k)\|^2 \le W\sum_{i=1}^n w_i\|\nabla f_i(x^k)\|^2
= W\bigg(\sum_{i=1}^n w_i\|\nabla f_i(x^k) - \nabla f(x^k)\|^2 + \|\nabla f(x^k)\|^2\bigg)
\le W\big(\rho + \|\nabla f(x^k)\|^2\big).
\]
Combining the inequalities above and taking full expectation yields equation (24).

D FedAvg with Optimal Client Sampling

Lemma 21 (Karimireddy et al., 2019). For any $L$-smooth and $\mu$-strongly convex function $h : \mathbb{R}^d \to \mathbb{R}$ and any $x, y, z \in \mathbb{R}^d$, the following inequality holds:
\[
\langle\nabla h(x), z - y\rangle \ge h(z) - h(y) + \frac{\mu}{4}\|y - z\|^2 - L\|z - x\|^2. \tag{35}
\]
Proof. For any given $x$, $y$, and $z$, the two inequalities below follow from the smoothness and strong convexity of the function $h$:
\[
\langle\nabla h(x), z - x\rangle \ge h(z) - h(x) - \frac{L}{2}\|z - x\|^2, \qquad
\langle\nabla h(x), x - y\rangle \ge h(x) - h(y) + \frac{\mu}{2}\|x - y\|^2.
\]
Further, applying the relaxed triangle inequality gives
\[
\|x - y\|^2 \ge \frac{1}{2}\|y - z\|^2 - \|z - x\|^2.
\]
Combining all these inequalities together, we have
\[
\langle\nabla h(x), z - y\rangle \ge h(z) - h(y) + \frac{\mu}{4}\|y - z\|^2 - \frac{L + \mu}{2}\|z - x\|^2.
\]
The lemma follows since $L \ge \mu$.

D.1 Proof of Theorem 17

Proof. The master update during round $k$ can be written as (the superscript $k$ is dropped from here onward)
\[
\Delta x = -\frac{\eta}{R}\sum_{i\in S}\sum_r \frac{w_i}{p_i}\,g_i(y_{i,r-1}) \qquad\text{and}\qquad
\mathbb{E}\big[\eta_g\Delta x\big] = -\frac{\eta}{R}\sum_{i,r} w_i\,\mathbb{E}\big[\nabla f_i(y_{i,r-1})\big].
\]
Summations are always over $i \in [n]$ and $r \in [R]$ unless stated otherwise. Taking expectations over $\Delta x$ conditioned on the results prior to round $k$ and over the sampling $S$ gives
\[
\mathbb{E}\big[\|x + \eta_g\Delta x - x^\star\|^2\big] = \|x - x^\star\|^2 \underbrace{- \frac{2\eta}{R}\sum_{i,r} w_i\,\mathbb{E}\big[\langle\nabla f_i(y_{i,r-1}), x - x^\star\rangle\big]}_{A_1} + \underbrace{\mathbb{E}\big[\|\eta_g\Delta x\|^2\big]}_{A_2}.
\]
Applying Lemma 21 with $h = w_if_i$, $x = y_{i,r-1}$, $y = x^\star$ and $z = x$ gives
\[
w_i\langle\nabla f_i(y_{i,r-1}), x - x^\star\rangle \ge w_if_i(x) - w_if_i(x^\star) + \frac{w_i\mu}{4}\|x - x^\star\|^2 - w_iL\|x - y_{i,r-1}\|^2,
\]
so that
\[
A_1 \le -2\eta\Big(f(x) - f^\star + \frac{\mu}{4}\|x - x^\star\|^2\Big) + 2L\eta\mathcal{E},
\]
where $\mathcal{E}$ is the drift caused by the local updates on the clients:
\[
\mathcal{E} := \frac{1}{R}\sum_{i,r} w_i\,\mathbb{E}\big[\|x - y_{i,r-1}\|^2\big]. \tag{36}
\]
Bounding $A_2$, we obtain
\[
\frac{1}{\eta^2}A_2 = \mathbb{E}\bigg[\Big\|\frac{1}{R}\sum_{i\in S}\sum_r \frac{w_i}{p_i}\,g_i(y_{i,r-1})\Big\|^2\bigg]
\le \alpha\,\frac{n-m}{m}\sum_i w_i^2\,\mathbb{E}\bigg[\Big\|\frac{1}{R}\sum_r g_i(y_{i,r-1})\Big\|^2\bigg] + \mathbb{E}\bigg[\Big\|\frac{1}{R}\sum_{i,r} w_i\,g_i(y_{i,r-1})\Big\|^2\bigg].
\]
Using independence, zero mean, and the bounded second moment of the random variables $\xi_{i,r}$, we obtain
\[
\frac{1}{\eta^2}A_2 \le \alpha\,\frac{n-m}{m}\sum_i w_i^2\bigg(\frac{1}{R^2}\sum_r \mathbb{E}\big[\|\xi_{i,r-1}\|^2\big] + \mathbb{E}\bigg[\Big\|\frac{1}{R}\sum_r \nabla f_i(y_{i,r-1})\Big\|^2\bigg]\bigg)
+ \sum_i w_i^2\,\frac{1}{R^2}\sum_r \mathbb{E}\big[\|\xi_{i,r-1}\|^2\big] + \mathbb{E}\bigg[\Big\|\frac{1}{R}\sum_{i,r} w_i\,\nabla f_i(y_{i,r-1})\Big\|^2\bigg].
\]
By Jensen's inequality and the splitting $\nabla f_i(y_{i,r-1}) = \big(\nabla f_i(y_{i,r-1}) - \nabla f_i(x)\big) + \nabla f_i(x)$, the remaining gradient terms are bounded as
\[
\mathbb{E}\bigg[\Big\|\frac{1}{R}\sum_r \nabla f_i(y_{i,r-1})\Big\|^2\bigg] \le \frac{1}{R}\sum_r \mathbb{E}\big[\|\nabla f_i(y_{i,r-1})\|^2\big] \le \frac{2}{R}\sum_r \mathbb{E}\big[\|\nabla f_i(y_{i,r-1}) - \nabla f_i(x)\|^2\big] + 2\,\mathbb{E}\big[\|\nabla f_i(x)\|^2\big]
\]
and
\[
\mathbb{E}\bigg[\Big\|\frac{1}{R}\sum_{i,r} w_i\,\nabla f_i(y_{i,r-1})\Big\|^2\bigg] \le 2\,\mathbb{E}\bigg[\Big\|\frac{1}{R}\sum_{i,r} w_i\big(\nabla f_i(y_{i,r-1}) - \nabla f_i(x)\big)\Big\|^2\bigg] + 2\,\mathbb{E}\big[\|\nabla f(x)\|^2\big].
\]
Combining the smoothness of the $f_i$'s, the definition of $\mathcal{E}$, and Jensen's inequality, and writing $\gamma := \frac{m}{\alpha(n-m)+m}$, the bounds above collect into
\[
\frac{1}{\eta^2}A_2 \le \frac{\sigma^2}{\gamma R}\sum_i w_i^2 + 2L^2\Big(\frac{1-W}{\gamma} + W\Big(\frac{M}{R} + 1\Big) + 1\Big)\mathcal{E} + 4L\Big(\frac{1-W}{\gamma} + W\Big(\frac{M}{R} + 1\Big) + 1\Big)\big(f(x) - f^\star\big).
\]
Putting these bounds on $A_1$ and $A_2$ together and using the fact that $1 \le W \le 1/\gamma$ yields
\[
\mathbb{E}\big[\|x + \eta_g\Delta x - x^\star\|^2\big] \le \Big(1 - \frac{\mu\eta}{2}\Big)\|x - x^\star\|^2 - 2\eta\Big(1 - 2L\eta\Big(\frac{1-W}{\gamma} + W\Big(\frac{M}{R} + 1\Big) + 1\Big)\Big)\big(f(x) - f^\star\big) + \frac{\eta^2\sigma^2}{\gamma R}\sum_i w_i^2 + \Big(1 + \eta L\Big(\frac{1-W}{\gamma} + W\Big(\frac{M}{R} + 1\Big)\Big)\Big)2L\eta\mathcal{E}.
\]
Let $\eta \le \frac{\gamma}{8(1 + W(1 + M/R))L}$; then
\[
2L\eta\Big(\frac{1-W}{\gamma} + W\Big(\frac{M}{R} + 1\Big) + 1\Big) \le \frac{1}{2},
\]
which in turn yields
\[
\mathbb{E}\big[\|x + \eta_g\Delta x - x^\star\|^2\big] \le \Big(1 - \frac{\mu\eta}{2}\Big)\|x - x^\star\|^2 - \eta\big(f(x) - f^\star\big) + \frac{\eta^2\sigma^2}{\gamma R}\sum_i w_i^2 + \Big(1 + \eta L\Big(\frac{1-W}{\gamma} + W\Big(\frac{M}{R} + 1\Big)\Big)\Big)2L\eta\mathcal{E}. \tag{37}
\]
Next, we need to bound the drift $\mathcal{E}$. For $R \ge 2$, we have
\[
\mathbb{E}\big[\|y_{i,r} - x\|^2\big] = \mathbb{E}\big[\|y_{i,r-1} - x - \eta_l\,g_i(y_{i,r-1})\|^2\big]
\le \mathbb{E}\big[\|y_{i,r-1} - x - \eta_l\nabla f_i(y_{i,r-1})\|^2\big] + \eta_l^2\big(M\,\mathbb{E}\big[\|\nabla f_i(y_{i,r-1})\|^2\big] + \sigma^2\big)
\le \Big(1 + \frac{1}{R-1}\Big)\mathbb{E}\big[\|y_{i,r-1} - x\|^2\big] + (R + M)\,\eta_l^2\,\mathbb{E}\big[\|\nabla f_i(y_{i,r-1})\|^2\big] + \eta_l^2\sigma^2
= \Big(1 + \frac{1}{R-1}\Big)\mathbb{E}\big[\|y_{i,r-1} - x\|^2\big] + \Big(1 + \frac{M}{R}\Big)\frac{\eta^2}{R\eta_g^2}\,\mathbb{E}\big[\|\nabla f_i(y_{i,r-1})\|^2\big] + \frac{\eta^2\sigma^2}{R^2\eta_g^2},
\]
where the last equality uses $\eta_l = \frac{\eta}{R\eta_g}$. Splitting $\nabla f_i(y_{i,r-1})$ into $\big(\nabla f_i(y_{i,r-1}) - \nabla f_i(x)\big) + \nabla f_i(x)$ and using smoothness gives
\[
\mathbb{E}\big[\|y_{i,r} - x\|^2\big] \le \bigg(1 + \frac{1}{R-1} + 2L^2\Big(1 + \frac{M}{R}\Big)\frac{\eta^2}{R\eta_g^2}\bigg)\mathbb{E}\big[\|y_{i,r-1} - x\|^2\big] + 2\Big(1 + \frac{M}{R}\Big)\frac{\eta^2}{R\eta_g^2}\,\|\nabla f_i(x)\|^2 + \frac{\eta^2\sigma^2}{R^2\eta_g^2}.
\]
If we further restrict $\eta \le \frac{1}{8L(2 + M/R)}$, then for any $\eta_g \ge 1$ we have
\[
2L^2\Big(1 + \frac{M}{R}\Big)\frac{\eta^2}{R\eta_g^2} \le \frac{1}{32R} \le \frac{1}{32(R-1)},
\]
and therefore
\[
\mathbb{E}\big[\|y_{i,r} - x\|^2\big] \le \Big(1 + \frac{33}{32(R-1)}\Big)\mathbb{E}\big[\|y_{i,r-1} - x\|^2\big] + 2\Big(1 + \frac{M}{R}\Big)\frac{\eta^2}{R\eta_g^2}\,\|\nabla f_i(x)\|^2 + \frac{\eta^2\sigma^2}{R^2\eta_g^2}.
\]
Unrolling this recursion over $r \le R$ yields
\[
\mathbb{E}\big[\|y_{i,r} - x\|^2\big] \le 8\Big(1 + \frac{M}{R}\Big)\frac{\eta^2}{\eta_g^2}\,\|\nabla f_i(x)\|^2 + \frac{8\eta^2\sigma^2}{R\eta_g^2}.
\]
Hence, the drift is bounded by
\[
\mathcal{E} \le 8\Big(1 + \frac{M}{R}\Big)\frac{\eta^2}{\eta_g^2}\sum_i w_i\|\nabla f_i(x)\|^2 + \frac{8\eta^2\sigma^2}{R\eta_g^2}
\le 16\Big(1 + \frac{M}{R}\Big)\frac{\eta^2L}{\eta_g^2}\sum_i w_i\big(f_i(x) - f_i^\star\big) + \frac{8\eta^2\sigma^2}{R\eta_g^2}
\le 32\Big(1 + \frac{M}{R}\Big)\eta^2L\big(f(x) - f^\star\big) + 32\Big(1 + \frac{M}{R}\Big)\eta^2L\sum_i w_iZ_i + \frac{8\eta^2\sigma^2}{R\eta_g^2}
\le 4\eta\big(f(x) - f^\star\big) + 32\Big(1 + \frac{M}{R}\Big)\eta^2L\sum_i w_iZ_i + \frac{8\eta^2\sigma^2}{R\eta_g^2},
\]
where the last step uses $8\eta L(1 + M/R) \le 1$. Due to the upper bound on the step size $\eta \le \frac{1}{8L(2 + M/R)}$, we also have the inequalities
\[
1 + \eta L\Big(\frac{1-W}{\gamma} + W\Big(\frac{M}{R} + 1\Big)\Big) \le 8 \qquad\text{and}\qquad 8\eta L \le 1. \tag{38}
\]
Plugging these into (37), we obtain
\[
\mathbb{E}\big[\|x + \eta_g\Delta x - x^\star\|^2\big] \le \Big(1 - \frac{\mu\eta}{2}\Big)\|x - x^\star\|^2 - \frac{3\eta}{8}\big(f(x) - f^\star\big) + \eta^3\,72L^2\Big(1 + \frac{M}{R}\Big)\sum_i w_iZ_i + \frac{\eta^2\sigma^2}{\gamma R}\sum_i w_i^2.
\]
Rearranging the terms in the last inequality, taking full expectation, and restoring the superscripts leads to
\[
\frac{3}{8}\,\mathbb{E}\big[f(x^k) - f^\star\big] \le \frac{1 - \mu\eta^k/2}{\eta^k}\,\mathbb{E}\big[\|x^k - x^\star\|^2\big] - \frac{1}{\eta^k}\,\mathbb{E}\big[\|x^{k+1} - x^\star\|^2\big] + (\eta^k)^2\,72L^2\Big(1 + \frac{M}{R}\Big)\sum_i w_iZ_i + \frac{\eta^k\sigma^2}{\gamma R}\sum_i w_i^2.
\]
Plugging the assumed lower bound on $\eta_g$ into the right-hand side of the above inequality completes the proof.
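Lemma 21's three-point inequality, which drives the bound on $A_1$ above, is easy to check numerically on a quadratic test function. The sketch below is our own illustrative aid (the test function and all names are ours): $h(x) = \frac{1}{2}x^\top Ax$ with the spectrum of $A$ in $[\mu, L]$ is $L$-smooth and $\mu$-strongly convex, so (35) must hold for every triple $(x, y, z)$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Quadratic h(x) = 0.5 * x^T A x with eigenvalues in [mu, L]:
# h is then L-smooth and mu-strongly convex.
d, mu, L = 5, 0.5, 4.0
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
A = Q @ np.diag(np.linspace(mu, L, d)) @ Q.T

h = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

# Check <grad h(x), z - y> >= h(z) - h(y) + mu/4 ||y - z||^2 - L ||z - x||^2
violations = 0
for _ in range(1000):
    x, y, z = rng.standard_normal((3, d))
    lhs = grad(x) @ (z - y)
    rhs = h(z) - h(y) + mu / 4 * np.sum((y - z) ** 2) - L * np.sum((z - x) ** 2)
    violations += lhs < rhs - 1e-9   # small slack for floating-point error
print(violations)  # 0: inequality (35) holds on all sampled triples
```

Since the lemma is proved for all $L$-smooth, $\mu$-strongly convex functions, no violations should ever occur; the small numerical slack only guards against floating-point round-off.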
D.2 Proof of Theorem 18

Proof. We drop the superscript $k$ and write the master update during round $k$ as
\[
\Delta x = -\frac{\eta}{R}\sum_{i\in S}\sum_r \frac{w_i}{p_i}\,g_i(y_{i,r-1}) =: -\eta\,\mathcal{G}.
\]
Summations are always over $i \in [n]$ and $r \in [R]$ unless stated otherwise. Taking expectations conditioned on $x$ and using a similar argument as in the proof in Appendix C.2, we have
\[
\mathbb{E}\big[f(x + \eta_g\Delta x)\big] \le f(x) - \eta\big\langle\nabla f(x), \mathbb{E}[\mathcal{G}]\big\rangle + \frac{\eta^2L}{2}\,\mathbb{E}\big[\|\mathcal{G}\|^2\big]
= f(x) - \eta\|\nabla f(x)\|^2 + \eta\big\langle\nabla f(x), \nabla f(x) - \mathbb{E}[\mathcal{G}]\big\rangle + \frac{\eta^2L}{2}\,\mathbb{E}\big[\|\mathcal{G}\|^2\big]
\le f(x) - \frac{\eta}{2}\|\nabla f(x)\|^2 + \frac{\eta}{2}\,\big\|\nabla f(x) - \mathbb{E}[\mathcal{G}]\big\|^2 + \frac{\eta^2L}{2}\,\mathbb{E}\big[\|\mathcal{G}\|^2\big],
\]
where the last inequality follows since $\langle a, b\rangle \le \frac{1}{2}\|a\|^2 + \frac{1}{2}\|b\|^2$ for all $a, b \in \mathbb{R}^d$. Since the $f_i$'s are $L$-smooth, by the (relaxed) triangle inequality we have
\[
\frac{\eta}{2}\,\big\|\nabla f(x) - \mathbb{E}[\mathcal{G}]\big\|^2
= \frac{\eta}{2}\,\Big\|\frac{1}{R}\sum_{i,r} w_i\big(\nabla f_i(x) - \mathbb{E}[\nabla f_i(y_{i,r-1})]\big)\Big\|^2
\le \frac{\eta L^2}{2}\cdot\frac{1}{R}\sum_{i,r} w_i\,\mathbb{E}\big[\|x - y_{i,r-1}\|^2\big]
= \frac{\eta L^2}{2}\,\mathcal{E},
\]
where $\mathcal{E}$ is the drift caused by the local updates on the clients, as defined in (36). In Appendix D.1, we already obtained the upper bound for $\mathbb{E}\big[\|\mathcal{G}\|^2\big] = \frac{1}{\eta^2}A_2$:
\[
\mathbb{E}\big[\|\mathcal{G}\|^2\big] \le \frac{\sigma^2}{\gamma R}\sum_i w_i^2 + \frac{2W}{\gamma}\Big(\frac{M}{R} + 1\Big)\sum_i w_i\,\mathbb{E}\big[\|\nabla f_i(x)\|^2\big] + 2L^2\mathcal{E} + 2\|\nabla f(x)\|^2.
\]
Together with Assumption 9, which gives
\[
\sum_i w_i\|\nabla f_i(x)\|^2 = \|\nabla f(x)\|^2 + \sum_i w_i\|\nabla f_i(x) - \nabla f(x)\|^2 \le \|\nabla f(x)\|^2 + \rho,
\]
this yields
\[
\mathbb{E}\big[\|\mathcal{G}\|^2\big] \le \frac{\sigma^2}{\gamma R}\sum_i w_i^2 + \frac{2W}{\gamma}\Big(\frac{M}{R} + 1\Big)\big(\|\nabla f(x)\|^2 + \rho\big) + 2L^2\mathcal{E} + 2\|\nabla f(x)\|^2.
\]
Combining the above inequalities and applying inequality (38) gives
\[
\mathbb{E}\big[f(x + \eta_g\Delta x)\big] \le f(x) - \frac{3\eta}{8}\|\nabla f(x)\|^2 + \frac{\eta\rho}{8} + \frac{\eta^2\sigma^2L}{2\gamma R}\sum_i w_i^2 + \frac{5\eta L^2}{4}\,\mathcal{E}.
\]
In Appendix D.1, we also obtained the upper bound for the drift:
\[
\mathcal{E} \le 8\Big(1 + \frac{M}{R}\Big)\frac{\eta^2}{\eta_g^2}\sum_i w_i\|\nabla f_i(x)\|^2 + \frac{8\eta^2\sigma^2}{R\eta_g^2}
\le 8\Big(1 + \frac{M}{R}\Big)\eta^2\big(\|\nabla f(x)\|^2 + \rho\big) + \frac{8\eta^2\sigma^2}{R\eta_g^2}.
\]
Since $8\eta L \le 8\eta L(1 + M/R) \le 1$, we have
\[
\frac{5\eta L^2}{4}\,\mathcal{E} \le 10\eta^3L^2\Big(1 + \frac{M}{R}\Big)\big(\|\nabla f(x)\|^2 + \rho\big) + \frac{10\eta^3L^2\sigma^2}{R\eta_g^2}
\le \frac{5\eta^2L}{4}\big(\|\nabla f(x)\|^2 + \rho\big) + \frac{5\eta^2L\sigma^2}{4R\eta_g^2}.
\]
This further simplifies the iterate to
\[
\mathbb{E}\big[f(x + \eta_g\Delta x)\big] \le f(x) - \eta\Big(\frac{3}{8} - \frac{3}{4}\eta L\Big)\|\nabla f(x)\|^2 + \frac{\eta\rho}{8}\,(1 + 2\eta L) + \eta^2\sigma^2L\Big(\frac{5\gamma}{4\eta_g^2} + \frac{1}{2R}\sum_i w_i^2\Big)\frac{1}{\gamma R}.
\]
Applying the assumed lower bound on $\eta_g$ and taking full expectations completes the proof:
\[
\mathbb{E}\big[f(x + \eta_g\Delta x)\big] - \mathbb{E}\big[f(x)\big] \le -\eta\Big(\frac{3}{8} - \frac{3}{4}\eta L\Big)\mathbb{E}\big[\|\nabla f(x)\|^2\big] + \eta\,\frac{\rho}{8} + \eta^2\Big(\frac{\rho}{4} + \sigma^2\Big)L.
\]

E A Sketch of Results on Partial Participation

This section discusses how our analysis can be extended to the case where not all clients are available to participate in each round.
As an illustrative example, we consider Distributed SGD (DSGD), i.e., $U^k_i = g^k_i$. If not all clients are available to participate in each communication round, we will assume that there is a known distribution of client availability $\mathcal{Q}$ such that in each communication round $k$ a subset $Q^k \sim \mathcal{Q}$ of clients is available to participate. We denote by $q_i$ the probability that client $i$ is available in the current round, i.e., $q_i = \mathrm{Prob}(i \in Q^k)$. Under this setting, we can apply the tower property of expectation twice and obtain the following variance decomposition:
\[
\mathbb{E}\big[\|G^k - \nabla f(x^k)\|^2\big] = \mathbb{E}\bigg[\Big\|G^k - \sum_{i=1}^n w_iU^k_i\Big\|^2\bigg] + \mathbb{E}\bigg[\Big\|\sum_{i=1}^n w_iU^k_i - \nabla f(x^k)\Big\|^2\bigg], \tag{39}
\]
where we update the definition of $G^k$ to
\[
G^k := \sum_{i\in S^k}\frac{w_i}{q_ip^k_i}\,U^k_i. \tag{40}
\]
Note that $S^k \subseteq Q^k$, as we can only sample from available clients. Furthermore, in the particular case where all clients are always available, the above equations become identical to the ones we present in the main paper. Upper-bounding equation (39) in a manner analogous to our analysis in Appendices C and D would complete the proof of convergence for these settings.

F Experimental Details

F.1 Federated EMNIST Dataset

We detail the hyper-parameters used in the experiments on the FEMNIST datasets. For each experiment, we run 151 communication rounds, reporting (local) training loss every round and validation accuracy every

Figure 8: (FEMNIST Dataset 1, n = 32) current best validation accuracy as a function of the number of communication rounds and the number of bits communicated from clients to the master.
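The reweighted estimator (40) from Appendix E can be checked for unbiasedness with a short simulation. The sketch below is our own illustrative code (not the paper's); it assumes, for concreteness, that clients become available independently with probabilities $q_i$ and are then sampled independently with probabilities $p_i$ among the available ones, and verifies that $\mathbb{E}[G^k] = \sum_i w_iU^k_i$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, trials = 6, 4, 400_000
w = np.full(n, 1 / n)
q = rng.uniform(0.4, 0.9, n)       # availability probabilities (assumed independent)
p = rng.uniform(0.3, 0.8, n)       # sampling probabilities among available clients
U = rng.standard_normal((n, d))    # fixed client updates U_i^k

full = w @ U                       # target: sum_i w_i U_i
avail = rng.random((trials, n)) < q                # Q^k: available clients
sampled = avail & (rng.random((trials, n)) < p)    # S^k is a subset of Q^k
# G^k = sum_{i in S^k} w_i / (q_i p_i) U_i, as in equation (40)
G = (sampled * (w / (q * p))) @ U
err = np.linalg.norm(G.mean(axis=0) - full)
print(err)  # should be small: only Monte Carlo error remains
```

The inclusion probability of client $i$ is $q_ip_i$, so the $\frac{1}{q_ip_i}$ reweighting makes the estimator unbiased; the empirical mean over many rounds matches $\sum_i w_iU^k_i$ up to sampling noise.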
Figure 9: (FEMNIST Dataset 2, n = 32) current best validation accuracy as a function of the number of communication rounds and the number of bits communicated from clients to the master.

Figure 10: (FEMNIST Dataset 3, n = 32) current best validation accuracy as a function of the number of communication rounds and the number of bits communicated from clients to the master.

Figure 11: (Shakespeare Dataset, n = 32) current best validation accuracy as a function of the number of communication rounds and the number of bits communicated from clients to the master.
Figure 12: (Shakespeare Dataset, n = 128) current best validation accuracy as a function of the number of communication rounds and the number of bits communicated from clients to the master.

5 rounds. In each round, $n = 32$ clients are sampled from the client pool, each of which then performs SGD for 1 epoch on its local training images with batch size 20. For partial participation, the expected number of clients allowed to communicate their updates back to the master is set to $m \in \{3, 6\}$. We use vanilla SGD and constant step sizes for all experiments, where we set $\eta_g = 1$ and tune $\eta_l$ over the set of values $\{2^{-1}, 2^{-2}, 2^{-3}, 2^{-4}, 2^{-5}\}$. If the optimal step size hits a boundary value, we try one more step size by extending that boundary, and repeat this until the optimal step size is no longer a boundary value. For full participation and optimal sampling, it turns out that $\eta_l = 2^{-3}$ is the optimal local step size for all three datasets. For uniform sampling, the optimal is $\eta_l = 2^{-5}$ for Dataset 1 and $\eta_l = 2^{-4}$ for Datasets 2 and 3. For the extra communications in Algorithm 2, we set $j_{\max} = 4$. We also present some additional figures of the experimental results. Figures 8, 9 and 10 show the current best validation accuracy as a function of the number of communication rounds and the number of bits communicated from clients to the master on Datasets 1, 2 and 3, respectively.

F.2 Shakespeare Dataset

We detail the hyper-parameters used in the experiments on the Shakespeare dataset.
For each experiment, we run 151 communication rounds, reporting (local) training loss every round and validation accuracy every 5 rounds. In each round, $n \in \{32, 128\}$ clients are sampled from the client pool, each of which then performs SGD for 1 epoch on its local training data with batch size 8 (each batch contains 8 example sequences of length 5). For partial participation, the expected number of clients allowed to communicate their updates back to the master is set to $m \in \{2, 4, 6, 12\}$. We use vanilla SGD and constant step sizes for all experiments, where we set $\eta_g = 1$ and tune $\eta_l$ over the set of values $\{2^{-1}, 2^{-2}, 2^{-3}, 2^{-4}, 2^{-5}\}$. If the optimal step size hits a boundary value, we try one more step size by extending that boundary, and repeat this until the optimal step size is no longer a boundary value. For full participation and optimal sampling, it turns out that $\eta_l = 2^{-2}$ is the optimal local step size. For uniform sampling, the optimal is $\eta_l = 2^{-3}$. For the extra communications in Algorithm 2, we set $j_{\max} = 4$. We also present additional figures of the experimental results. Figures 11 and 12 show the current best validation accuracy as a function of the number of communication rounds and the number of bits communicated from clients to the master for the cases $n = 32$ and $n = 128$, respectively.

G Additional Experiment on Federated CIFAR100 Dataset

We evaluate our method on the Federated CIFAR100 image dataset for image classification. The Federated CIFAR100 dataset is a balanced dataset, where every client holds the same number of training images. In each communication round, $n = 32$ clients are sampled uniformly from the client pool, each of which then performs several SGD steps on its local training images for 1 epoch with batch size 20. This means that all clients perform the same number of local steps in each round. For partial participation, the expected number of clients allowed to communicate their updates back to the master is set to $m = 3$.
We use vanilla SGD optimizers with constant step sizes for both the clients and the master, with $\eta_g = 1$ and $\eta_l$ tuned on a holdout set.

Figure 13: (CIFAR100 Dataset, n = 32) Validation accuracy and (local) training loss as a function of the number of communication rounds and the number of bits communicated from clients to the master.

For full participation and optimal sampling, it turns out that $\eta_l = 1\times10^{-3}$ is the optimal local step size. For uniform sampling, the optimal is $\eta_l = 3\times10^{-4}$. We set $j_{\max} = 4$ and include the extra communication costs in our results. The main results are shown in Figure 13. It can be seen that our optimal client sampling scheme achieves better performance than uniform sampling even on this balanced dataset. The performance gains of our method over uniform sampling come from the fact that the norms of the updates from some clients are larger than those from other clients, even though all clients run the same number of local steps in each round.
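The boundary-extending grid search for the local step size, used in all the experiments of Appendices F and G, can be sketched as follows. This is our own illustrative reconstruction of the procedure described above; `validate` is a hypothetical stand-in for training with a given $\eta_l$ and returning validation accuracy.

```python
import math

def tune_step_size(validate, exponents=range(-5, 0)):
    """Grid-search eta_l = 2**e over `exponents`; if the best value sits on a
    boundary of the grid, extend that boundary by one step and repeat until
    the optimum is interior."""
    grid = sorted(exponents)
    while True:
        best = max(grid, key=lambda e: validate(2.0 ** e))
        if best == grid[0]:
            grid = [grid[0] - 1] + grid      # extend the lower boundary
        elif best == grid[-1]:
            grid = grid + [grid[-1] + 1]     # extend the upper boundary
        else:
            return 2.0 ** best

# Toy check: accuracy peaks at eta_l = 2**2, outside the initial grid
# {2^-1, ..., 2^-5}, so the search keeps extending the upper boundary
# until the peak is interior to the grid.
acc = lambda eta: -(math.log2(eta) - 2.0) ** 2
print(tune_step_size(acc))  # 4.0
```

On the toy objective, the initial grid's best exponent is always the upper boundary, so the grid grows one step at a time until the exponent $2$ becomes an interior point and $\eta_l = 4.0$ is returned.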