# buffered_asynchronous_sgd_for_byzantine_learning__23cad17f.pdf Journal of Machine Learning Research 24 (2023) 1-62 Submitted 2/22; Published 5/23 Buffered Asynchronous SGD for Byzantine Learning Yi-Rui Yang yangyr@smail.nju.edu.cn Wu-Jun Li liwujun@nju.edu.cn National Key Laboratory for Novel Software Technology Department of Computer Science and Technology Nanjing University, Nanjing 210023, China Editor: Sathiya Keerthi Distributed learning has become a hot research topic due to its wide application in clusterbased large-scale learning, federated learning, edge computing, and so on. Most traditional distributed learning methods typically assume no failure or attack. However, many unexpected cases, such as communication failure and even malicious attack, may happen in real applications. Hence, Byzantine learning (BL), which refers to distributed learning with failure or attack, has recently attracted much attention. Most existing BL methods are synchronous, which are impractical in some applications due to heterogeneous or offline workers. In these cases, asynchronous BL (ABL) is usually preferred. In this paper, we propose a novel method, called buffered asynchronous stochastic gradient descent (BASGD), for ABL. To the best of our knowledge, BASGD is the first ABL method that can resist non-omniscient attacks without storing any instances on the server. Furthermore, we also propose an improved variant of BASGD, called BASGD with momentum (BASGDm), by introducing local momentum into BASGD. Compared with those methods which need to store instances on server, BASGD and BASGDm have a wider scope of application. Both BASGD and BASGDm are compatible with various aggregation rules. Moreover, both BASGD and BASGDm are proved to be convergent and able to resist failure or attack. Empirical results show that our methods significantly outperform existing ABL baselines when there exists failure or attack on workers. Keywords: distributed machine learning, momentum, asynchronous Byzantine learning, buffer, stochastic gradient descent 1. Introduction Due to the wide application in cluster-based large-scale learning, federated learning (Konevcn y et al., 2016; Kairouz et al., 2021), edge computing (Shi et al., 2016), and so on, distributed learning has recently become a hot research topic (Zinkevich et al., 2010; Yang, 2013; Jaggi et al., 2014; Shamir et al., 2014; Zhang and Kwok, 2014; Ma et al., 2015; Lee et al., 2017; Lian et al., 2017; Zhao et al., 2017; Sun et al., 2018; Wangni et al., 2018; Zhao et al., 2018; Zhou et al., 2018; Yu et al., 2019a,b; Haddadpour et al., 2019; Assran et al., 2020; Nokleby et al., 2020). Most traditional distributed learning methods are based on stochastic gradient descent (SGD) and its variants (Bottou, 2010; Xiao, 2010; Duchi et al., 2011; Johnson and Zhang, 2013; Shalev-Shwartz and Zhang, 2013; Zhang et al., 2013; Lin et al., 2014; Schmidt . Corresponding author. c 2023 Yi-Rui Yang and Wu-Jun Li. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v24/22-0184.html. Yang and Li et al., 2017; Zheng et al., 2017; Zhao et al., 2018; Duan et al., 2020; Zhao et al., 2021), and typically assume no failure or attack. However, in distributed learning applications with multiple networked machines (nodes), different kinds of hardware or software failure may happen (Lamport et al., 2019; Wang et al., 2020; Kairouz et al., 2021). Representative failure includes bit-flipping in the communication media and the memory of some workers (Xie et al., 2019). In this case, small failure on some machines (workers) might cause a distributed learning method to fail. In addition, malicious attack should not be neglected in an open network where the manager (or server) generally has not much control over the workers, such as in the cases of edge computing and federated learning. Malicious workers may behave arbitrarily or even adversarially. Hence, Byzantine learning (BL), which refers to distributed learning with failure or attack, has attracted much attention (Diakonikolas et al., 2017; Chen et al., 2017; Damaskinos et al., 2018; Baruch et al., 2019; Diakonikolas and Kane, 2019; Wu et al., 2020; Karimireddy et al., 2022). BL methods can be divided into two main categories: synchronous BL (SBL) methods and asynchronous BL (ABL) methods. In SBL methods, the learning information, such as the gradient in SGD, of all workers will be aggregated in a synchronous way. On the contrary, in ABL methods the learning information of workers will be aggregated in an asynchronous way. Existing SBL methods mainly take three different ways to achieve resilience against Byzantine workers which refer to those workers with failure or attack. The first way is to filter the suspicious learning information (gradients) before averaging. Representative examples include Byzantine SGD (Alistarh et al., 2018) and Zeno (Xie et al., 2019). The second way is to replace the simple averaging aggregation operation with some more robust aggregation operations, such as median & trimmed-mean (Yin et al., 2018), geometric median (Chen et al., 2017), and centered-clipping (Karimireddy et al., 2021). Krum (Blanchard et al., 2017), RSA (Li et al., 2019), Byzantine PGD (Yin et al., 2019) and Sign SGD (Seide et al., 2014; Bernstein et al., 2019; Sohn et al., 2020) also take this way. The third way is based on redundancy. In this kind of methods such as DRACO (Chen et al., 2018), DETOX (Rajput et al., 2019), Byz Shield (Konstantinidis and Ramamoorthy, 2021), Byzantine resilience is achieved by assigning computation task of each gradient to several nodes. In these methods, the manager (or server) may have access to the exact true gradient despite the existence of Byzantine workers. However, methods based on redundancy usually have higher computation cost and storage cost than methods that take the other two ways. Some recent works on SBL also reveal that using history information can strengthen the Byzantine resilience (Allen-Zhu et al., 2020; El-Mhamdi et al., 2021b; Karimireddy et al., 2021). The advantage of SBL methods is that they are relatively simple and easy to be implemented, but SBL methods will result in slow convergence when there exist heterogeneous workers. Furthermore, in some applications like federated learning and edge computing, synchronization cannot even be performed most of the time due to the offline workers (clients or edge servers). Hence, ABL methods are preferred in these cases. To the best of our knowledge, there exist only a few ABL methods. Kardam (Damaskinos et al., 2018) introduces two filters to drop out suspicious learning information (gradients), which can still achieve good performance when the communication delay is heavy. However, when in the face of malicious attack, some work (Xie et al., 2020b) finds that Kardam also drops out most correct gradients in order to filter all faulty (failure) gradients. Hence, Buffered Asynchronous SGD for Byzantine Learning Kardam cannot resist malicious attack. Zeno++ (Xie et al., 2020b) and Sageflow (Park et al., 2021) need to store some training instances on the server. In some practical applications like federated learning (Kairouz et al., 2021), storing data on server will increase the risk of privacy leakage or even face legal risk. There are also existing works (El-Mhamdi et al., 2021a) that study ABL under the decentralized framework. As far as we know, under the Parameter Server framework where the server has no access to any training instances, there does not exist any ABL method that can resist malicious attack. Moreover, in some recently proposed attacks (Xie et al., 2020a; Baruch et al., 2019), attackers are assumed to have access to all the information on other workers and use the information for attack. This type of attacks are called omniscient attacks, while the others are called non-omniscient attacks. As far as we know, there does not exist any ABL method that can resist the two omniscient attacks Fall of Empires (Xie et al., 2020a) and A Little is Enough (Baruch et al., 2019). In this paper, we propose a novel method called buffered asynchronous stochastic gradient descent (BASGD) and an improved variant of BASGD called BASGD with momentum (BASGDm) for ABL. The main contributions are listed as follows: To the best of our knowledge, BASGD is the first ABL method that can resist nonomniscient attacks without storing any instances on the server. Compared with those methods which need to store instances on the server, BASGD has a wider scope of application. An improved variant of BASGD, called BASGD with momentum (BASGDm), is further proposed by introducing local momentum into BASGD. As far as we know, BASGDm is the first ABL method that can resist the two omniscient attacks Fall of Empires and A Little is Enough . Both BASGD and BASGDm are compatible with various aggregation rules. Moreover, both BASGD and BASGDm are proved to be convergent and able to resist failure or attack. Empirical results show that our methods significantly outperform existing ABL baselines when there exists failure or attack on workers. 2. Preliminary In this section, we present the preliminary of this paper, including the distributed learning framework used in this paper and the definition of Byzantine worker. 2.1 Distributed Learning Framework Many machine learning models, such as logistic regression and deep neural networks, can be formulated as the following finite sum optimization problem: min w Rd F(w) = 1 i=1 f(w; zi), (1) Yang and Li where w is the parameter to learn, d is the dimension of parameter, n is the number of training instances, f(w; zi) is the empirical loss on the instance zi. The goal of distributed learning is to solve the problem in (1) by designing learning algorithms based on multiple networked machines. Although there have appeared many distributed learning frameworks, in this paper we focus on the widely used Parameter Server (PS) framework (Li et al., 2014). In a PS framework, there are several workers and one or more servers. Each worker can only communicate with server(s). There may exist more than one server in a PS framework, but for the problem of this paper, servers can be logically conceived as a unity. Without loss of generality, we will assume there is only one server in this paper. Training instances are disjointedly distributed across m workers. Let Dk denote the index set of training instances on worker k, we have m k=1Dk = {1, 2, . . . , n} and Dk Dk = if k = k . In this paper, we assume that the server has no access to any training instances. If two instances have the same value, they are still deemed as two distinct instances. Namely, zi may equal zi (i = i ). One popular asynchronous method to solve the problem in (1) under the PS framework is ASGD (Dean et al., 2012) (see Appendix A for details). In this paper, we assume each worker samples one instance for gradient computation each time. The analysis of the mini-batch case is similar. In PS-based ASGD, the server is responsible for updating and maintaining the latest parameter. The number of iterations that the server has already executed is used as the global logical clock of the server. In the beginning, iteration number t = 0. Each time an SGD step is executed, t will increase by 1 immediately. The parameter after t iterations is denoted as wt. If the server sends parameters to worker k at iteration t , some SGD steps may have been executed before the server receives gradient from worker k next time at iteration t. Thus, we define the delay of worker k at iteration t as τ t k = t t . Worker k is heavily delayed at iteration t if τ t k > τmax, where τmax is a pre-defined non-negative constant. In the analysis of this work, we will mainly consider the partially asynchronous setting (Bertsekas et al., 1989), where a limited number of heavily delayed workers at each iteration are assumed. When there is no confusion, we will omit the word partially in the following text. 2.2 Byzantine Worker For workers that have sent gradients (one or more) to the server at iteration t, we call worker k loyal worker if it has finished all the tasks without any fault and each sent gradient is correctly received by the server. Otherwise, worker k is called Byzantine worker. If worker k is a Byzantine worker, it means the received gradient from worker k is not credible, which can be an arbitrary value. Formally, we denote the gradient computed by worker k at iteration t as gt k. Then, we have: ( f(wt ; zi), if worker k is loyal at iteration t; , if worker k is Byzantine at iteration t, where 0 t t, and i is randomly sampled from Dk. represents an arbitrary value. Our definition of Byzantine worker is consistent with most previous works (Blanchard et al., Buffered Asynchronous SGD for Byzantine Learning Figure 1: An example of buffers. Each circle represents a worker, and the number is the worker ID. There are 15 workers and 5 buffers. The gradient received from worker s is stored in buffer {s mod 5}. 2017; Xie et al., 2019, 2020b). Either accidental failure or malicious attack will result in Byzantine workers. We would also like to clarify that, in vanilla ASGD, there is at most one gradient sent from any worker k at each iteration. However, in the new method called BASGD that we will present in Section 3, there are possibly multiple gradients sent from any worker k at each iteration. 3. Buffered Asynchronous SGD In synchronous BL, gradients from all workers are received at each iteration. We can compare the gradients with each other, and then filter suspicious ones, or use more robust aggregation rules such as median and trimmed-mean for updating. However, in asynchronous BL, only one gradient is received at a time. Without any training instances stored on the server, it is difficult for the server to identify whether a received gradient is credible or not. In order to deal with this problem in asynchronous BL, we propose a novel method called buffered asynchronous SGD (BASGD). BASGD introduces B buffers (0 < B m) on the server, and the gradient used for updating parameters will be aggregated from these buffers. The detail of the learning procedure of BASGD is presented in Algorithm 1. In this section, we will first introduce the three key components of BASGD: buffer, aggregation function, and mapping table. At the end of this section, we will also introduce an improved variant of BASGD which is called buffered asynchronous SGD with momentum (BASGDm). In BASGD, the m workers do the same job as that in ASGD, while the updating rule on the server is modified. More specifically, there are B buffers (0 < B m) on server. When a gradient g from worker s is received, it will be temporarily stored in buffer b, where b = s mod B, as illustrated in Figure 1. Only when each buffer has stored at least one gradient, a new SGD step will be executed. Please note that no matter whether an SGD step Yang and Li Algorithm 1 Buffered Asynchronous SGD (BASGD) Server: Input: learning rate η, reassignment interval , buffer number B, aggregation function: Aggr( ); Initialization: model parameter w0; Set hb 0 and N0 b 0 for all b = 0, . . . , B 1; Initialize mapping table βs s (s = 0, 1, . . . , m 1); Send initial w0 to all workers; Set t 0, and start the timer; repeat Wait until receiving g from some worker s; Choose buffer: b βs mod B; Let Nt b Nt b + 1, and hb (Nt b 1)hb+g if Nt b > 0 for each b {0, . . . , B 1} then Aggregate: Gt = Aggr([h0, . . . , h B 1]); Execute SGD step: wt+1 wt η Gt; Zero out buffers: hb 0, Nt b 0 (b = 0, . . . , B 1); Set t t + 1, and restart the timer; end if if the timer has exceeded seconds then Zero out buffers: hb 0, Nt b 0 (b = 0, . . . , B 1); Modify the mapping table {βs}m 1 s=0 for buffer reassignment, and restart the timer; end if Send the latest parameters back to worker s, no matter whether an SGD step is executed or not. until stop criterion is satisfied Notify all workers to stop; Worker k: (k = 0, 1, ..., m 1) repeat Wait until receiving the latest parameter w from server; Randomly sample an index i from Dk; Compute f(w; zi); Send f(w; zi) to server; until receive server s notification to stop is executed or not, the server will immediately send the latest parameters back to the worker after receiving a gradient. Hence, BASGD introduces no barrier and is an asynchronous algorithm. For each buffer b, more than one gradient may have been received at iteration t. We will store the average of these gradients (denoted by hb) in buffer b. Assume that there are already (N 1) gradients g1, g2, . . . , g N 1 which should be stored in buffer b, and Buffered Asynchronous SGD for Byzantine Learning hb(old) = 1 N 1 PN 1 i=1 gi. When the N-th gradient g N is received, the new average value is: hb(new) = 1 i=1 gi = N 1 N hb(old) + 1 This is the updating rule for each buffer b when a gradient is received. We use Nt b to denote the total number of gradients stored in buffer b at the t-th iteration. After the parameter w is updated, all buffers will be zeroed out at once. With the benefit of buffers, the server has access to B candidate gradients when updating model parameters. Thus, a more reliable (robust) gradient can be aggregated from the B gradients of buffers, if a proper aggregation function Aggr( ) is chosen. Please note that from the perspective of workers, BASGD is fully asynchronous, since a worker will immediately receive the latest parameter from the server after sending a gradient to the server, without waiting for other workers. Meanwhile, from the perspective of the server, BASGD is semi-asynchronous because the server will not update the model until all buffers are filled. Actually, it is a necessity to limit the updating frequency in ABL when the server has no instances. If the server always updates the model when receiving a gradient, it will be easily foiled when Byzantine workers send gradients much more frequently than others. A similar conclusion has been proved in previous works (Damaskinos et al., 2018). 3.2 Aggregation Function When an SGD step is ready to be executed, there are B buffers providing candidate gradients. An aggregation function is needed to get the final gradient for updating. A naive way is to take the mean of all candidate gradients. However, the mean value is sensitive to outliers which are common in BL. For designing proper aggregation functions, we first define the q-Byzantine Robust (q-BR) condition to quantitatively describe the Byzantine resilience ability of an aggregation function. Definition 1 (q-Byzantine Robust). For an aggregation function Aggr( ): Aggr([h0, . . . , h B 1]) = G, where G = [G1, . . . , Gd]T and hb = [hb1, . . . , hbd]T , b {0, . . . , B 1}, we call Aggr( ) q-Byzantine Robust (q Z, 0 < q < B/2), if it satisfies the following two properties: (a) Aggr([h0 + h , . . . , h B 1 + h ]) = Aggr([h0, . . . , h B 1]) + h , h0, . . . , h B 1, h Rd; (b) mins S{hsj} Gj maxs S{hsj}, j [d], S {0, . . . , B 1} with |S| = B q. Intuitively, property (a) in Definition 1 says that if all candidate vectors hi are added by a same vector h , the aggregated gradient will also be added by h . Property (b) says that for each coordinate j, the aggregated value Gj will be between the (q + 1)-th smallest value and the (q + 1)-th largest value among the j-th coordinates of all candidate vectors. Thus, the gradient aggregated by a q-BR function is insensitive to at least q outliers. We can find that the q-BR condition gets stronger when q increases. Namely, if Aggr( ) is q-BR, then for any 0 < q < q, Aggr( ) is also q -BR. Remark 2. When B > 1, the mean function is not q-Byzantine Robust for any q > 0. We illustrate this with a one-dimension example. Let h0, . . . , h B 2 [0, 1] and h B 1 = 10 B. Then 1 B PB 1 b=0 hb h B 1 B = 10. Thus, the mean is larger than any of the first B 1 values. Yang and Li We show that the following two widely-used aggregation functions are both q-BR. Definition 3 (Coordinate-wise median (Yin et al., 2018)). For candidate vectors h0, h1, . . ., h B 1 Rd, hb = [hb1, hb2, . . . , hbd]T , b = 0, . . . , B 1. Coordinate-wise median is defined as: Med([h0, . . . , h B 1]) = [Med(h 1), . . . , Med(h d)]T , where Med(h j) is the scalar median of the j-th coordinates, j = 1, 2, . . . , d. Definition 4 (Coordinate-wise q-trimmed-mean (Yin et al., 2018)). For any positive interger q < B/2 and candidate vectors h0, h1, . . . , h B 1 Rd, hb = [hb1, hb2, . . . , hbd]T , b = 0, . . . , B 1. Coordinate-wise q-trimmed-mean is defined as: Trm([h0, . . . , h B 1]) = [Trm(h 1), . . . , Trm(h d)]T , where Trm(h j) = 1 B 2q P b Mj hbj is the scalar q-trimmed-mean. Mj is the subset of {hbj}B 1 b=0 obtained by removing the q largest elements and q smallest elements. In the following content, coordinate-wise median and coordinate-wise q-trimmed-mean are also called median and trmean, respectively. Proposition 5. Coordinate-wise q-trmean is q-BR. Coordinate-wise median is B 1 Here, x represents the maximum integer that is not larger than x. According to Proposition 5, both median and trmean are proper choices for aggregation function in BASGD. The proof can be found in Appendix B. We also define another class of aggregation functions called (δmax, A1, A2)-effective aggregation functions in Definition 6 and Definition 7. The definition of (δmax, A1, A2)- effective aggregation function can be deemed as a bridge across synchronous Byzantine learning and asynchronous Byzantine learning. Definition 6 (Stable aggregation function). Aggregation function Aggr( ) is called stable provided that h0, . . . , h B 1, h0, . . . , h B 1 Rd, we have: Aggr([h0, . . . , h B 1]) Aggr( h0, . . . , h B 1) hb hb 2 ! 1 Definition 6 says that if Aggr( ) is a stable aggregation function, when there is a disturbance on buffers, the disturbance on the aggregated result by Aggr( ) will not be larger than the disturbance on buffers in L2-norm. Definition 7 (Effective aggregation function). When the fraction of Byzantine workers is not larger than δmax, stable aggregation function Aggr( ) is called a (δmax, A1, A2)-effective aggregation function, provided that it satisfies the following two properties for all wt Rd in cases without delay (τ t k = 0, t = 0, 1, . . . , T 1): (a) E[ F(wt)T Gt syn | wt] F(wt) 2 A1; (b) E[ Gt syn 2 | wt] (A2)2; where A1, A2 R+ are two non-negative constants, Gt syn is the aggregated result of Aggr( ) at the t-th iteration in cases without delay. Buffered Asynchronous SGD for Byzantine Learning More specifically, Gt syn can be the aggregated gradient or momentum. In the conference version of BASGD (Yang and Li, 2021), Gt syn is the aggregated gradient. We change the statement to make it compatible with the BASGDm method (please refer to Section 3.4). The two properties in Definition 7 are for synchronous cases mainly because we would like to use this definition to extend existing theoretical results for synchronous Byzantine learning methods to those for asynchronous cases. Please see Section 4 for the detailed results. For different aggregation functions, constants A1 and A2 may differ. A1 and A2 are related to loss function F( ), distribution of instances, buffer number B and maximum Byzantine worker fraction δmax. Inequalities (a) and (b) in Definition 7 are two important properties in convergence proof of synchronous Byzantine learning methods. As revealed in (Yang et al., 2020), there are many existing aggregation rules for Byzantine learning. We find that most of them satisfy Definition 7. For example, Krum, median, and trimmed-mean have already been proved to satisfy these two properties (Blanchard et al., 2017; Yin et al., 2018). Sign SGD (Bernstein et al., 2019) can be seen as a combination of 1-bit quantization and median aggregation, while median satisfies the properties in Definition 7. The q-BR property in Definition 1 is relatively easy to check, while the definition of (δmax, A1, A2)-effective aggregation allows us to extend existing theoretical results for synchronous cases to those for the asynchronous cases, which we mainly focus on in this work. Besides the two types of aggregation rules presented in Definition 1 and Definition 7, we also introduce the definition of (δmax, c)-robust aggregation function in Definition 8. Definition 8 ((δmax, c)-robust aggregation function). Aggregation function Aggr( ) is called (δmax, c)-robust provided that for any B independent random vectors h0, . . . , h B 1 Rd and any set H {0, 1, . . . , B 1} with 1 |H| B = δ δmax where δmax < 1 2, we have: Aggr([h0, . . . , h B 1]) 1 where constant ρ 0 satisfies that E hb hb 2 ρ2 for any fixed b, b H. Definition 8 has been used in previous works (Karimireddy et al., 2021) to theoretically prove that using momentum can enhance the resilience against Byzantine attacks for i.i.d. cases in synchronous BL. Moreover, it has been proved that the aggregation error O(δρ2) is theoretically optimal (Karimireddy et al., 2021). We will also introduce momentum to BASGD in Section 3.4 and theoretically prove that using momentum can also enhance the resilience against Byzantine attacks for i.i.d. cases in asynchronous BL in Section 4. Meanwhile, please note that too large B could lower the updating frequency and damage the performance, while too small B may harm the Byzantine resilience. Thus, a moderate B is usually preferred. From another perspective, the choice of B can be viewed as a trade-offbetween efficiency and Byzantine resilience. In practical applications, practitioners are suggested to first determine the maximum fraction of Byzantine workers δmax that the system can tolerate, and then set B to make the aggregation function resilient to up to a fraction of δmax Byzantine workers. Yang and Li Figure 2: An example of buffer reassignment. The white circles represent active workers, and the grey circles represent unresponsive workers. Before reassignment, buffer 0 is a straggler. After reassignment, there is at least one active worker corresponding to each buffer. 3.3 Mapping Table At each iteration of BASGD, buffer b needs at least one gradient for aggregation. In the worst case, all the workers corresponding to buffer b may be unresponsive. In this case, buffer b will become the straggler, and slow down the whole learning process. To deal with this problem, we introduce the mapping table for buffer reassignment. We call a worker active worker if it has responded at the current iteration. If the SGD step has not been executed for seconds, the server immediately zeroes out stored gradients in all buffers, equally reassigns active workers to each buffer, and then continues the learning procedure. Hyper-parameter is called reassignment interval. Figure 2 illustrates an example of reassignment. The grey circles represent unresponsive workers. After reassignment, there is at least one active worker corresponding to each buffer. Specifically, we introduce a mapping table {βs}m 1 s=0 for buffer reassignment. Initially, βs = s ( s = 0, 1, . . . , m 1). When reassigning buffers, the server only needs to modify the mapping table {βs}m 1 s=0 , and then stores worker s s gradients in buffer {βs mod B}, instead of buffer {s mod B}. Please note that the server only needs to modify the mapping table for buffer reassignment, and there is no need to notify workers. In addition, a timer is used on the server for indicating when to reassign buffers. The timer is started at the beginning of BASGD and is restarted immediately after each SGD Buffered Asynchronous SGD for Byzantine Learning step or buffer reassignment. When the timer exceeds seconds, buffers will be zeroed out and then reassigned. Hyper-parameter should be set properly. If is too small, buffers will be zeroed out too frequently, which may slow down the learning process. If is too large, straggler buffers could not be eliminated in time. In practical applications, practitioners can collect statistics about workers time cost on computing gradients (or momentums), and then properly set to make that more than B workers are usually able to finish the computation and send messages in seconds. 3.4 Buffered Asynchronous SGD with Momentum As previous works have revealed, history information can greatly help to resist Byzantine attacks (El-Mhamdi et al., 2021b; Allen-Zhu et al., 2020; Karimireddy et al., 2021). Therefore, we introduce momentum into BASGD and obtain the method called buffered asynchronous SGD with momentum (BASGDm). In BASGDm, the algorithm of the server is exactly the same as that in BASGD. The only difference is that each worker maintains a local momentum, and sends local momentums to the server instead of gradients. The detail of BASGDm is illustrated in Algorithm 2. With the benefit of momentum, BASGDm can achieve stronger Byzantine resilience. In particular, BASGDm has a significantly better empirical performance than BASGD, as we will show in Section 5. Meanwhile, we have noticed that the work in (Nguyen et al., 2022) proposes the method Fed Buff, which also adopts buffered asynchronous aggregation. However, the motivation of BASGD (BASGDm) and Fed Buffsignificantly differ from each other. Fed Buffmainly focuses on privacy preservation in federated learning while BASGD and BASGDm are for asynchronous Byzantine learning. 4. Convergence In this section, we theoretically prove the convergence and resilience of BASGD and BASGDm against failure or attack. We will introduce four main theorems in this section. The first theorem is for BASGD with q-BR aggregation functions. The second and the third theorems are for BASGD and BASGDm with (δmax, A1, A2)-effective aggregation functions, respectively. The last theorem is for BASGDm with (δmax, c)-robust aggregation functions in i.i.d. cases. Furthermore, the last theorem also shows the effectiveness of using local momentum in asynchronous Byzantine learning. Here we only present the results. Proof details are in Appendix B. We first make the following assumptions, which have been widely used in stochastic optimization. Assumption 1 (Lower bound). Global loss function F(w) is bounded below: F R, F(w) F , w Rd. Assumption 2 (Bounded bias). For any loyal worker k, it can use locally stored training instances to obtain an estimation of the global gradient with bounded bias κ: κ R+, Ei Dk[ f(w; zi)] F(w) κ, w Rd. Assumption 3 (Bounded gradient). Global loss function F(w) has a bounded gradient: D R+, F(w) D, w Rd. Yang and Li Algorithm 2 Buffered Asynchronous SGD with Momentum (BASGDm) Server: Input: learning rate η, momentum hyper-parameter µ (0 µ < 1), reassignment interval , buffer number B, aggregation function: Aggr( ); Initialization: model parameter w0; Set hb 0 and N0 b 0 for all b = 0, . . . , B 1; Initialize mapping table βs s (s = 0, 1, . . . , m 1); Send initial w0 to all workers; Set t 0, and start the timer; repeat Wait until receiving u from some worker s; Choose buffer: b βs mod B; Let Nt b Nt b + 1, and hb (Nt b 1)hb+u if Nt b > 0 for each b {0, . . . , B 1} then Aggregate: Gt = Aggr([h0, . . . , h B 1]); Execute SGD step: wt+1 wt η Gt; Zero out buffers: hb 0, Nt b 0 (b = 0, . . . , B 1); Set t t + 1, and restart the timer; end if if the timer has exceeded seconds then Zero out buffers: hb 0, Nt b 0 (b = 0, . . . , B 1); Modify the mapping table {βs}m 1 s=0 for buffer reassignment, and restart the timer; end if Send the latest parameters back to worker s, no matter whether an SGD step is executed or not. until stop criterion is satisfied Notify all workers to stop; Worker k: (k = 0, 1, ..., m 1) Initialization: initial momentum u 0; repeat Wait until receiving the latest parameter w from server; Randomly sample an index i from Dk; Compute stochastic gradient f(w; zi); Update local momentum u f(w; zi), at the first iteration; µ u + (1 µ) f(w; zi), otherwise; Send u to server; until receive server s notification to stop Assumption 4 (Bounded variance). For any loyal worker k, the local stochastic gradient has a bounded variance: σ R+, Ei Dk|| f(w; zi) Ei Dk[ f(w; zi)]||2 σ2, w Rd. Assumption 5 (L-smoothness). Global loss function F(w) is differentiable and L-smooth: || F(w) F(w )|| L||w w ||, w, w Rd. Buffered Asynchronous SGD for Byzantine Learning Compared with the case of synchronous Byzantine learning, the threat of Byzantine attacks could be enlarged in asynchronous settings due to the bias caused by asynchrony. The bounded gradient assumption is mainly used to provide upper bounds for the bias of stochastic gradients (in BASGD) or local momentums (in BASGDm) caused by the combined effect of asynchrony and Byzantine attacks. Although F(w) is bounded, Byzantine workers are allowed to send vectors with arbitrary values. Then we first analyze the convergence of BASGD with q-BR aggregation functions. Let N(t) be the (q + 1)-th smallest value in {Nt b}b {0,...,B 1}, where Nt b is the number of gradients (or momentums) stored in buffer b at the t-th iteration. We use r to denote the total number of heavily delayed workers and Byzantine workers, and define the constant ΘB,q,r = (B r) (B q 1)(q r + 1) , which will appear in Lemma 9 and Lemma 10. Lemma 9. If Aggr( ) is q-BR and the total number of heavily delayed workers and Byzantine workers is not larger than r (r q), under Assumptions 3 and 4, we have: E[||Gt||2 | wt] ΘB,q,rd (D2 + σ2/N(t)). Lemma 10. If Aggr( ) is q-BR, and the total number of heavily delayed workers and Byzantine workers is not larger than r (r q), under Assumptions 2, 3, 4 and 5, we have: ||E[Gt F(wt) | wt]|| ΘB,q,rd(τmax L [ΘB,q,rd(D2 + σ2/N(t))] 1 2 + σ + κ). Theorem 11. Let D = 1 T PT 1 t=0 (D2 + σ2/N(t)) 1 2 . If the total number of Byzantine workers and heavily delayed workers at each iteration is not larger than r, Aggr( ) is q-BR where q = r, under Assumptions 1, 2, 3, 4 and 5, we have the following result for BASGD with learning rate η = O( 1 L PT 1 t=0 E[|| F(wt)||2] T O L[F(w0) F ] 2(1 δmax)rd D 2(1 δmax)r Ddσ δmax + 2(1 δmax)r Ddκ 2(1 δmax) 3 2 r 3 2 LD Dd 3 2 τmax (δmax) 3 2 where δmax = q Please note that the convergence rate of vanilla ASGD is O(1/T 1 2 ). Hence, Theorem 11 indicates that BASGD has a theoretical convergence rate as fast as vanilla ASGD, with an extra constant variance. The term O(2(1 δmax)r Ddσ/δmax) is caused by the aggregation function, which can be deemed as a sacrifice for Byzantine resilience. The term O(2(1 δmax)r Ddκ/δmax) is caused by the differences in training instances among different workers. In independent and identically distributed (i.i.d.) cases, κ = 0 and the term vanishes. The term O(2 2(1 δmax) 3 2 r 3 2 LD Dd 3 2 τmax/δ 3 2max) is caused by the delay, and related to parameter τmax. The term is also related to the buffer size since δmax = q Yang and Li Nt b increases, N(t) may increase, and thus D will decrease. Namely, a larger buffer size will result in smaller D. In addition, the factor (1 δmax)r/δmax or (1 δmax) 3 2 r 3 2 /δ 3 2max decreases as δmax increases, and increases as r increases. Then, we present the convergence results for BASGD and BASGDm with (δmax, A1, A2)- effective aggregation functions (please refer to Definition 7) in Theorem 12 and Theorem 13, respectively. Theorem 12. In BASGD, if the total number of Byzantine workers and heavily delayed workers at each iteration is not larger than r, Aggr( ) is a (δmax, A1, A2)-effective aggregation function, B = r/δmax + 1 and the learning rate η = O( 1 LT ) satisfies that 2η2L2τ 2 max(B r) < 1, under Assumption 1, 3, 4 and 5, we have the following result for general asynchronous cases: PT 1 t=0 E[ F(wt) 2] L 1 2 [F(w0) F ] (r δmaxr + 1) 1 2 L 1 2 τmax DA2 1 2max T 1 2 L 1 2 (A2)2 (r δmaxr + 1)L 3 2 (A2)2τ 2 max δmax T 3 2 Theorem 12 indicates that if Aggr( ) makes a synchronous BL method converge (i.e., satisfies Definition 7), BASGD converges when using Aggr( ) as aggregation function. Hence, BASGD can also be seen as a technique of asynchronization. That is to say, new asynchronous methods can be obtained from synchronous ones when using BASGD. The factor r δmaxr+1 δmax equals ( 1 δmax δmax r + 1 δmax ), which decreases as δmax increases and increases as r increases. The extra constant term A1 is caused by gradient bias. When there is no Byzantine or heavily delayed workers (r = 0) and instances are i.i.d. across workers, letting B = 1 and Aggr([h0, . . . , h B 1]) = Aggr(h0) = h0, BASGD degenerates to vanilla ASGD. In this case, there is no gradient bias (A1 = 0), and BASGD has a convergence rate of O(1/T 1 2 ). Similarly, we have the following theoretical results for BASGDm. Theorem 13. In BASGDm, if the total number of Byzantine workers and heavily delayed workers at each iteration is not larger than r, Aggr( ) is a (δmax, A1, A2)-effective aggregation function, B = r/δmax + 1 and the learning rate η = O( 1 LT ) satisfies that 2η2L2τ 2 max(1 µ)2 < 1, under Assumption 1, 3, 4 and 5, we have the following result for general asynchronous cases: PT 1 t=0 E[ F(wt) 2] L 1 2 [F(w0) F ] (r δmaxr + 1) 1 2 L 1 2 τmax DA2(1 µ) 1 2max T 1 2 L 1 2 (A2)2 (r δmaxr + 1)L 3 2 (A2)2τ 2 max(1 µ)2 Please note that when momentum hyper-parameter µ = 0, BASGDm degenerates to BASGD. In this case, 1 µ = 1, and Theorem 13 is exactly the same as Theorem 12. From this perspective, Theorem 13 can be deemed as a more general version of Theorem 12. In addition, we would also like to point out that the factor (1 µ) in Theorem 13 does not mean Buffered Asynchronous SGD for Byzantine Learning that a larger µ will lead to a tighter upper bound since constants A1 and A2 are dependent on momentum hyper-parameter µ. In fact, the influence of momentum hyper-parameter is a complex problem, which has been studied for decades (Qian, 1999). Since it is not the focus of this work, we are not going to further discuss this problem here. In general cases, Theorem 12 and Theorem 13 guarantee BASGD and BASGDm to find a point such that the squared L2-norm of its gradient is not larger than a positive number close to A1 in expectation, respectively. Please note that Assumption 3 already guarantees that the gradient s squared L2-norm is not larger than D2. We introduce Proposition 14 to show that A1 is guaranteed to be smaller than D2 under a mild condition. Proposition 14. Assume that Aggr( ) is a (δmax, A1, A2)-effective aggregation function, and Gt syn is aggregated by Aggr( ) in synchronous setting. If E[ Gt syn F(wt) | wt] D < D, wt Rd, we have A1 D D < D2. The aggregated result Gt syn can be viewed as a robust estimator of F(wt) used for updating. Since F(wt) D, F(wt) locates in a ball centered at the origin with radius D. E[ Gt syn F(wt) | wt] D < D means that the bias of Gt syn is not larger than the radius D (D < D), which is a mild condition for Aggr( ). In Theorem 11, Theorem 12 and Theorem 13, there exist constant variance terms, which will not decrease during the training. Recent works (Karimireddy et al., 2021) have shown that when the aggregation function is (δmax, c)-robust (please refer to Definition 8) and the momentum hyper-parameter is properly set, synchronous Byzantine learning methods can reach the convergence rate of O(1/T 1 2 ) (without constant term) in i.i.d. cases where the bias κ = 0 in Assumption 2. We show that BASGDm can also achieve a similar convergence rate when κ = 0. The detailed results are presented in Theorem 15 as follows. Theorem 15. Let λ = 1 µ. When Aggr( ) is a (δmax, c)-robust aggregation function, the total number of Byzantine workers and heavily delayed workers at each iteration is not larger than Bδmax, and learning rate η 1 L, under Assumption 1, 2, 3, 4 and 5, we have the following result for BASGDm when κ = 0 (i.i.d. case): PT 1 t=0 E|| F(wt)||2 T 2[F(w0) F ] ηT + 2(4cδ + 1)(τmax + 1)σ2 where δ is the fraction of Byzantine workers and heavily delayed workers together and ζ = 2(4cδ + 1)λ2σ2 + 8(4cδ + 1)2 h 4 λ + 2 4 2λ + 4λ 2 + 2λ 2i η2L2(τmax + 1)2D2. The right-hand side (RHS) of inequality (2) depends on η and λ, where λ = 1 µ and µ is the momentum hyper-parameter. Then we discuss the convergence results for different settings of λ. Firstly, for BASGD without momentum, we have that µ = 0 and λ = 1 µ = 1. In this case, for any η, the term ξ in the RHS of (2) is always larger than 2(4cδ + 1)σ2 and thus cannot converge to 0. It is consistent with the results in previous works (Karimireddy et al., 2021) that the constant term is inevitable without using history information such as local momentum when there are Byzantine workers. As we will detailedly show in Proposition 16 below, a better choice for the momentum hyper-parameter in BASGDm is that λ = ηL. Yang and Li Proposition 16. Under the same conditions in Theorem 15, when λ = ηL 1 and F(w0) F LT(4cδ+1)[σ2+8(4cδ+1)(τmax+1)2D2], 1 L , we have PT 1 t=0 E|| F(wt)||2 T 4L 1 2 [F(w0) F ] 1 2 (4cδ + 1) 1 2 [σ2 + 8(4cδ + 1)(τmax + 1)2D2] 1 2 + 14L 3 4 [F(w0) F ] 3 4 (4cδ + 1) 1 2 (τmax + 1) 1 2 D 1 2 + 6L[F(w0) F ] + 2(4cδ + 1)(τmax + 1)σ2 Proposition 16 shows that the ABL method BASGDm can converge to a stationary point with the convergence rate of O(1/ T), which is the same as that in SBL (Karimireddy et al., 2021). To the best of our knowledge, this is the first theoretical result indicating that using momentum can also enhance the resilience against Byzantine attacks in ABL. In addition, we would like to point out that using local momentum on workers does not help to reduce the bias in non-i.i.d. cases. Theorem 15, Proposition 15 and the corresponding theoretical results in previous works (Karimireddy et al., 2021) for SBL are all based on the i.i.d. assumption. As far as we know, how to enhance Byzantine resilience for non-i.i.d. cases in asynchronous settings is still a challenging open problem. It is beyond the scope of this work and we leave it for future work. Then we discuss the convergence rate with respect to L, T, and τmax. Proposition 16 shows that BASGDm can achieve the convergence rate of O(L 1 2 /T 1 2 ) + O(L 1 2 τmax/T 1 2 ) in i.i.d. cases. Meanwhile, it is shown in previous works (Liu and Zhang, 2021) that vanilla ASGD can achieve the convergence rate of O(L 1 2 /T 1 2 ) + O(Lτmax/T). Compared to vanilla ASGD, the convergence result of BASGDm is more sensitive to τmax. We would like to point out that in ABL, the bias caused by asynchrony will leave more room for Byzantine attacks. Thus, it is reasonable that the convergence result of BASGDm under attacks is slightly looser than that of vanilla ASGD without attack. In addition, it remains uncertain whether the dependence on the staleness parameter τmax in Proposition 16 is tight. To the best of our knowledge, there are almost no works revealing the tightness of τmax in ABL. Finally, we would like to summarize the theoretical results presented in this section. Theorem 11 is for BASGD with q-BR aggregation functions, the definition of which is relatively easy to check. Theorem 12 and Theorem 13 are for BASGD and BASGDm with (δmax, A1, A2)-effective aggregation functions, respectively. The two theorems can be deemed as a bridge across ABL and SBL. The three theorems above (Theorem 11, Theorem 12 and Theorem 13) are for general non-i.i.d. cases, and constant error terms appear in the three theorems. As far as we know, it is still an open problem whether the constant error terms can be removed for ABL methods in general non-i.i.d. cases. Theorem 15 and Proposition 16 theoretically show the effectiveness of using local momentum for ABL methods in i.i.d. cases, which is consistent with previous works on SBL (Karimireddy et al., 2021). 5. Experiment In this section, we empirically evaluate the performance of BASGD (BASGDm) and baselines in both image classification (IC) and natural language processing (NLP) applications. Our Buffered Asynchronous SGD for Byzantine Learning experiments are conducted on a distributed platform with dockers. Each docker is bound to an NVIDIA Tesla V100 (32G) GPU. We choose 30 dockers as workers and an extra docker as server1. All algorithms are implemented with Py Torch 1.3. 5.1 Experimental Setting The performance of decentralized methods (El-Mhamdi et al., 2021a) will depend on the network topology, which makes it hard to conduct a fair comparison. Thus, we mainly consider methods under the PS framework in our experiments. Moreover, the server has no access to any training instances. The ABL methods Zeno++ (Xie et al., 2020b) and Sageflow (Park et al., 2021) need to store some instances on the server, and thus not applicable in the settings of this work. Hence, we compare BASGD (BASGDm) with baselines ASGD (ASGDm) and Kardam in our experiments. We set dampening function Λ(τ) = 1 1+τ for Kardam as suggested in (Damaskinos et al., 2018), and set momentum hyper-parameter µ = 0.9 for BASGDm and ASGDm in each experiment. Byzantine attacks. We will compare BASGD and BASGDm with baselines under the following different attack settings. No attack: In this setting, each worker will strictly follow the method, compute and send the gradient (or momentum) without error. Random disturbance attack (RD-attack): Byzantine workers with RD-attack will replace the true gradient (or momentum) g with g RD = g + grnd, where grnd is a random vector sampled from the normal distribution N(0, σatkg 2 I). Here, σatk is a parameter and I is the d-by-d identity matrix. We set σatk = 0.2 in our experiments. RD-attack can be seen as an accidental failure with expectation 0. Negative gradient attack (NG-attack): Byzantine workers with NG-attack will replace the true gradient (or momentum) g with g NG = katk g, where katk R+ is a parameter. We set katk = 10 in our experiments. NG-attack is a typical kind of malicious attack. In some previous works, this type of attack is also called bit-flipping attack (Xie et al., 2020b; Karimireddy et al., 2021). Fall of Empires (Fo E) attack (Xie et al., 2020a): Byzantine workers with Fo E attack will replace the gradient (or momentum) g with g Fo E = ϵ |L| P i L gi, where L is the index set of loyal workers and gi is the gradient (or momentum) computed by the i-th worker at the same iteration. We set hyper-parameter ϵ = 6 for Fo E attack in the experiments of this work. Fo E is a type of omniscient attack originally proposed in synchronous settings, which requires the gradients (or momentums) computed by loyal workers at the same iteration as omniscient knowledge. Thus, Fo E cannot be directly adopted in asynchronous settings. To deal with this problem, we use the last sent gradient (or momentum) from each loyal worker as the omniscient knowledge for Fo E. 1. In the conference version (Yang and Li, 2021), we set 8 workers in the NLP experiment. To make the settings more consistent with that of the IC experiment, we also set the worker number to 30 for the NLP experiment in this journal version. Yang and Li A Little is Enough (ALIE) attack (Baruch et al., 2019): Byzantine workers with ALIE attack will replace the gradient (or momentum) g with g ALIE, where ( g ALIE)j = meanj zmax stdj. The sub-index ( )j denotes the j-th coordinate of the vector. The scalars meanj and stdj are the mean and standard error of the j-th coordinate of loyal workers gradients (or momentums) at the same iteration, respectively. zmax = Φ 1(m m/2+1 m r ), where Φ 1( ) is the inverse of the standard normal cumulative distribution function, m is the number of workers, and r is the number of Byzantine workers. ALIE is also a type of omniscient attack originally proposed in synchronous settings. Similarly, to make it compatible with asynchronous settings, we use the last sent gradient (or momentum) from each loyal worker as the omniscient knowledge for ALIE. In real-world applications, it is usually hard to adopt the two types of omniscient attacks (Fo E and ALIE) due to the lack of omniscient knowledge. However, we still compare the performance of different methods under these two attacks to evaluate resilience ability. Aggregation rules. In the experiments, BASGD and BASGDm are evaluated with the following aggregation rules. Coordinate-wise q-trimmed-mean (trmean): Please refer to Definition 4. Coordinate-wise median (median): Please refer to Definition 3. Since median can be deemed as a special case of trmean, we only report the results of BASGD and BASGDm with median in the case of no attack2. Geometric median (geo Med) (Chen et al., 2017): The geometric median of B vectors h0, . . . , h B 1 Rd is defined as: geo Med([h0, . . . , h B 1]) = arg min h Rd The optimization problem defined in the right-hand side of (3) has a unique solution when vectors {h0, . . . , h B 1} do not lie in a line. However, geo Med usually does not have a closed-form solution. We use Weiszfeld s algorithm (Pillutla et al., 2019) to compute it and set the iteration number in Weiszfeld s algorithm to be 5. Centered clipping (CC) (Karimireddy et al., 2021): The CC aggregation result of vectors {h0, . . . , h B 1} is given by the following iteration formula: hl+1 = hl + 1 b=0 (hb hl) min 1, R hb hl 2 We set initial point h0 to be the last aggregation result for quicker convergence as suggested in (Karimireddy et al., 2021). The iteration number is set to be 5 in the IC task and 50 in the NLP task. Clipping size R is set to be 0.5. 2. In the conference version (Yang and Li, 2021), we report the results of BASGD with median in all cases. In this journal version, we evaluate BASGD (BASGDm) with two more aggregation rules (geometric median and centered clipping). Due to limited space in each single figure, we do not report the results of BASGD (BASGDm) with median for better readability in this journal version. The performance of median is similar to that of other aggregation rules. Buffered Asynchronous SGD for Byzantine Learning Table 1: Wall-clock-time of running 160 epochs for different methods (in seconds) Method ASGD BASGDm (B = 10) Kardam w/ trmean w/ geo Med w/ CC γ = 2 γ = 10 Wall-clock-time 1172.30 1191.01 1287.07 1289.32 1522.05 1535.22 To simulate an unstable network environment where asynchronous methods are usually preferred, each worker is manually set to have a delay, which is kdel times the computing time. The training set is randomly and equally distributed to different workers. For space saving, we will only present the average top-1 test accuracy (in IC) or average perplexity (in NLP) in this section. Average training loss w.r.t. epochs in the IC experiment can be found in Appendix C, which is consistent with the average top-1 test accuracy results presented in this section. Unless otherwise stated, for BASGD and BASGDm, the reassignment interval is set to be 1 second in the IC experiment and 5 seconds in the NLP experiment. 5.2 Image Classification Experiment In this part, we will empirically compare the performance of BASGD (BASGDm) and existing asynchronous methods ASGD (ASGDm) and Kardam in image classification tasks. In the experiment, algorithms are evaluated on CIFAR-10 (Krizhevsky et al., 2009) with deep learning model Res Net-20 (He et al., 2016). Cross-entropy is used as the loss function. kdel is randomly sampled from truncated standard normal distribution within [0, + ). As suggested in (He et al., 2016), learning rate η is set to 0.1 initially for each algorithm, and multiplied by 0.1 at the 80-th epoch and the 120-th epoch respectively. The weight decay is set to 10 4. We run each algorithm for 160 epochs. The batch size is set to 25. Firstly, we compare the performance of different methods when there are no Byzantine workers. Experimental results of BASGD and BASGDm are illustrated in Figure 3 and Figure 4, respectively. The solid line represents that the method does not use momentum while the dotted line represents that the method utilizes local momentum. ASGD (ASGDm) achieves the best performance. BASGD (BASGDm) (B > 1) and Kardam have similar convergence rates to ASGD (ASGDm), but both sacrifice a little accuracy. Furthermore, the performance of BASGD (BASGDm) gets worse when the buffer number B increases, which is consistent with the theoretical results. Please note that ASGD (ASGDm) is a degenerated case of BASGD (BASGDm) when B = 1 and Aggr(h1) = h1. Hence, BASGD (BASGDm) can achieve the same performance as ASGD (ASGDm) when there is no failure or attack. The wall clock time of running 160 epochs is reported in Table 1. The time cost of BASGDm is slightly larger than that of ASGD, while Kardam takes the most time. Then, for each type of attack, we compare the performance of BASGD (BASGDm) and Kardam by conducting two experiments in which there are 3 and 6 Byzantine workers, respectively3. We respectively set 10 and 15 buffers for BASGD (BASGDm) in these two 3. In the conference version (Yang and Li, 2021), we also report the experimental results of ASGD under attacks. However, due to limited space in figures, we do not report the results of ASGD and ASGDm in this journal version for better readability since ASGD and ASGDm are not Byzantine-resilient and achieve low accuracy under attack. Yang and Li 0 20 40 60 80 100 120 140 160 Epoch Average Top-1 Accuracy ASGD BASGD with median (B=5) BASGD with median (B=10) BASGD with median (B=15) BASGD with median (B=30) Kardam ( =2) Kardam ( =10) (a) BASGD with median 0 20 40 60 80 100 120 140 160 Epoch Average Top-1 Accuracy ASGD BASGD with trmean (B=5, q=1) BASGD with trmean (B=10, q=3) BASGD with trmean (B=15, q=5) BASGD with trmean (B=30, q=10) Kardam ( =2) Kardam ( =10) (b) BASGD with trmean 0 20 40 60 80 100 120 140 160 Epoch Average Top-1 Accuracy ASGD BASGD with geo Med (B=5) BASGD with geo Med (B=10) BASGD with geo Med (B=15) BASGD with geo Med (B=30) Kardam ( =2) Kardam ( =10) (c) BASGD with geo Med 0 20 40 60 80 100 120 140 160 Epoch Average Top-1 Accuracy ASGD BASGD with CC (B=5) BASGD with CC (B=10) BASGD with CC (B=15) BASGD with CC (B=30) Kardam ( =2) Kardam ( =10) (d) BASGD with CC Figure 3: Average top-1 test accuracy w.r.t. epochs of methods BASGD, ASGD, and Kardam when there are no Byzantine workers. experiments. The experimental results of the methods under two types of non-omniscient attacks (RD-attack and NG-attack) are presented in Figure 5. We can find that BASGD (BASGDm) significantly outperforms Kardam under these two types of non-omniscient attacks. Under the less harmful RD-attack, although Kardam still converges, it suffers a significant loss in accuracy. Under NG-attack, Kardam cannot converge even if we have tried different values of assumed Byzantine worker number for Kardam, which is denoted by the hyperparameter γ in this paper. Hence, Kardam cannot resist these two types of attacks. On the contrary, BASGD still has a relatively good performance under these two types of attacks. Moreover, we count the ratio of filtered gradients in Kardam, which is shown in Table 2. We can find that in order to filter Byzantine gradients, Kardam also filters approximately an equal ratio of loyal gradients. It explains why Kardam performs poorly under the attack. Buffered Asynchronous SGD for Byzantine Learning 0 20 40 60 80 100 120 140 160 Epoch Average Top-1 Accuracy ASGDm BASGDm with median (B=5) BASGDm with median (B=10) BASGDm with median (B=15) BASGDm with median (B=30) Kardam ( =2) Kardam ( =10) (a) BASGDm with median 0 20 40 60 80 100 120 140 160 Epoch Average Top-1 Accuracy ASGDm BASGDm with trmean (B=5, q=1) BASGDm with trmean (B=10, q=3) BASGDm with trmean (B=15, q=5) BASGDm with trmean (B=30, q=10) Kardam ( =2) Kardam ( =10) (b) BASGDm with trmean 0 20 40 60 80 100 120 140 160 Epoch Average Top-1 Accuracy ASGDm BASGDm with geo Med (B=5) BASGDm with geo Med (B=10) BASGDm with geo Med (B=15) BASGDm with geo Med (B=30) Kardam ( =2) Kardam ( =10) (c) BASGDm with geo Med 0 20 40 60 80 100 120 140 160 Epoch Average Top-1 Accuracy ASGDm BASGDm with CC (B=5) BASGDm with CC (B=10) BASGDm with CC (B=15) BASGDm with CC (B=30) Kardam ( =2) Kardam ( =10) (d) BASGDm with CC Figure 4: Average top-1 test accuracy w.r.t. epochs of methods BASGDm, ASGDm, and Kardam when there are no Byzantine workers. Table 2: Filtered ratio in Kardam under NG-attack in IC task (3 Byzantine workers) Term By Frequency Filter By Lipschitz Filter In total Loyal Grads (γ = 3) 10.15% (31202/307530) 40.97% (126000/307530) 51.12% Byzt Grads (γ = 3) 10.77% (3681/34170) 40.31% (13773/34170) 51.08% Loyal Grads (γ = 8) 28.28% (86957/307530) 28.26% (86893/307530) 56.53% Byzt Grads (γ = 8) 28.38% (9699/34170) 28.06% (9588/34170) 56.44% Loyal Grads (γ = 14) 85.13% (261789/307530) 3.94% (12117/307530) 89.07% Byzt Grads (γ = 14) 84.83% (28985/34170) 4.26% (1455/34170) 89.08% We also compare the performance of different methods under omniscient attacks (Fo E attack and ALIE attack), the results of which are shown in Figure 6. BASGDm can Yang and Li 0 20 40 60 80 100 120 140 160 Epoch Average Top-1 Accuracy BASGD with CC BASGDm with CC BASGD with trmean (q=3) BASGDm with trmean (q=3) BASGD with geo Med BASGDm with geo Med Kardam ( =3) Kardam ( =6) (a) 3 Byzantine workers with RD-attack 0 20 40 60 80 100 120 140 160 Epoch Average Top-1 Accuracy BASGD with CC BASGDm with CC BASGD with trmean (q=3) BASGDm with trmean (q=3) BASGD with geo Med BASGDm with geo Med Kardam ( =3) Kardam ( =8) Kardam ( =14) (b) 3 Byzantine workers with NG-attack 0 20 40 60 80 100 120 140 160 Epoch Average Top-1 Accuracy BASGD with CC BASGDm with CC BASGD with trmean (q=6) BASGDm with trmean (q=6) BASGD with geo Med BASGDm with geo Med Kardam ( =6) Kardam ( =10) (c) 6 Byzantine workers with RD-attack 0 20 40 60 80 100 120 140 160 Epoch Average Top-1 Accuracy BASGD with CC BASGDm with CC BASGD with trmean (q=6) BASGDm with trmean (q=6) BASGD with geo Med BASGDm with geo Med Kardam ( =6) Kardam ( =10) Kardam ( =14) (d) 6 Byzantine workers with NG-attack Figure 5: Average top-1 test accuracy w.r.t. epochs under non-omniscient attacks. B = 10 for BASGD (BASGDm) when there are 3 Byzantine workers and B = 15 for BASGD (BASGDm) when there are 6 Byzantine workers. significantly outperform other methods in each case, except for the case of 3 Byzantine workers with ALIE attack. When there are 3 Byzantine workers with ALIE attack, all the methods have a comparable performance to each other. The main reason is that the Byzantine attack is not strong enough in this case. In addition, the performance of BASGDm is considerably better than BASGD. This reveals that using history information (such as momentum) can strengthen the resilience and improve the performance in Byzantine-resilient machine learning, which is consistent with previous works (Allen-Zhu et al., 2020; El-Mhamdi et al., 2021b; Karimireddy et al., 2021). Moreover, although the performance of BASGDm with different aggregation rules (trmean, geo Med, and CC) slightly differ, all of them are better than those of BASGD and Kardam. In addition, we evaluate the effect of the buffer reassignment interval . Specifically, we will compare the performance of BASGDm when is set to 0.01s, 0.1s, 1s, 10s, and Buffered Asynchronous SGD for Byzantine Learning 0 20 40 60 80 100 120 140 160 Epoch Average Top-1 Accuracy BASGD with CC BASGDm with CC BASGD with trmean BASGDm with trmean BASGD with geo Med BASGDm with geo Med Kardam ( =3) Kardam ( =6) (a) 3 Byzantine workers with Fo E attack 0 20 40 60 80 100 120 140 160 Epoch Average Top-1 Accuracy BASGD with CC BASGDm with CC BASGD with trmean BASGDm with trmean BASGD with geo Med BASGDm with geo Med Kardam ( =3) Kardam ( =6) (b) 3 Byzantine workers with ALIE attack 0 20 40 60 80 100 120 140 160 Epoch Average Top-1 Accuracy BASGD with CC BASGDm with CC BASGD with trmean BASGDm with trmean BASGD with geo Med BASGDm with geo Med Kardam ( =10) Kardam ( =14) (c) 6 Byzantine workers with Fo E attack 0 20 40 60 80 100 120 140 160 Epoch Average Top-1 Accuracy BASGD with CC BASGDm with CC BASGD with trmean BASGDm with trmean BASGD with geo Med BASGDm with geo Med Kardam ( =6) Kardam ( =10) (d) 6 Byzantine workers with ALIE attack Figure 6: Average top-1 test accuracy w.r.t. epochs under omniscient attacks. B = 10 for BASGD (BASGDm) when there are 3 Byzantine workers and B = 15 for BASGD (BASGDm) when there are 6 Byzantine workers. 100s. In this experiment, there are 3 Byzantine workers under omniscient attacks (Fo E and ALIE). We set B = 10 for BASGDm. Since the buffer reassignment technique is used to deal with the straggler buffer in the extreme case, we set 3 extra workers to be stragglers, which take 10 times longer to finish local computation than the other workers. We also evaluate the performance of Kardam and synchronous SGD with momentum (SSGDm) in this setting. For both SSGDm and BASGDm, the aggregation rule is set to be trmean and the momentum hyper-parameter µ is set to 0.9. The average top-1 test accuracy w.r.t. wall-clock time of different methods is illustrated in Figure 7. As we can see, the value of buffer reassignment interval significantly affects the performance of BASGDm. When is set too small ( = 0.01s), buffers will be zeroed out too frequently and the global model updating is hardly executed on the server. When is set too large ( = 100s), the straggler buffer will not be promptly eliminated by buffer Yang and Li 0 100 200 300 400 500 600 700 Wall-Clock Time (s) Average Top-1 Accuracy SSGDm BASGDm (B=10, =0.01s) BASGDm (B=10, =0.1s) BASGDm (B=10, =1s) BASGDm (B=10, =10s) BASGDm (B=10, =100s) Kardam ( =6) (a) Under Fo E attack 0 100 200 300 400 500 600 700 Wall-Clock Time (s) Average Top-1 Accuracy SSGDm BASGDm (B=10, =0.01s) BASGDm (B=10, =0.1s) BASGDm (B=10, =1s) BASGDm (B=10, =10s) BASGDm (B=10, =100s) Kardam ( =6) (b) Under ALIE attack Figure 7: Average top-1 test accuracy w.r.t. wall-clock time when there are 3 Byzantine workers under omniscient attacks. In synchronous SGD with momentum (SSGDm) and BASGDm, the aggregation rules are set to be trmean. B is set to 10 for BASGDm. reassignment. Meanwhile, BASGDm performs stably and outperforms SSGDm and Kardam when ranges from 0.1s to 10s. The empirical results about hyper-parameter is consistent with our discussion in Section 3. 5.3 Natural Language Processing Experiment In this part, we will empirically compare the methods on natural language processing (NLP) tasks. In our NLP experiment, the methods are evaluated on the Wiki Text-2 dataset with an LSTM (Hochreiter and Schmidhuber, 1997) network. We only use the training set and test set, while the validation set is not used in our experiment. For LSTM, we adopt 2 layers with 100 units in each layer. Word embedding size is set to 100, and the sequence length is set to 35. Gradient clipping size is set to 0.25. Cross-entropy is used as the loss function. We run each algorithm for 40 epochs. Initial learning rate η is chosen from {1, 2, 5, 10, 20} and is divided by 4 at the 21-st epoch and the 31-st epoch. The best result is adopted as the final one. kdel is randomly sampled from a standard exponential distribution. Similarly, each method is evaluated under RD-attack, NG-attack, Fo E attack, and ALIE attack. The average perplexity is reported in Figure 8. As illustrated in Figure 8(a) and Figure 8(b), under the two types of non-omniscient attacks (RD-attack and NG-attack), BASGD (BASGDm) can outperform Kardam, no matter which of the three aggregation rules is used. Moreover, the curves representing Kardam do not appear in Figure 8(b) because Kardam diverges under NG-attack and the perplexity explodes. We would also like to clarify that the performance of CC can get further improved by tuning the clipping size hyper-parameter more finely in different settings. However, it requires much computing cost and is beyond the scope of this work. Therefore, we fix clipping size R = 0.5, and this can already make BASGDm with CC outperform Buffered Asynchronous SGD for Byzantine Learning 0 5 10 15 20 25 30 35 40 Epoch Average Perplexity BASGD with CC BASGDm with CC BASGD with trmean (q=3) BASGDm with trmean (q=3) BASGD with geo Med BASGDm with geo Med Kardam ( =3) Kardam ( =8) Kardam ( =14) (a) Under RD-attack 0 5 10 15 20 25 30 35 40 Epoch Average Perplexity BASGD with CC BASGDm with CC BASGD with trmean (q=3) BASGDm with trmean (q=3) BASGD with geo Med BASGDm with geo Med Kardam ( =3) Kardam ( =8) Kardam ( =14) (b) Under NG-attack 0 5 10 15 20 25 30 35 40 Epoch Average Perplexity BASGD with CC BASGDm with CC BASGD with trmean (q=3) BASGDm with trmean (q=3) BASGD with geo Med BASGDm with geo Med Kardam ( =3) Kardam ( =8) Kardam ( =14) (c) Under Fo E attack 0 5 10 15 20 25 30 35 40 Epoch Average Perplexity BASGD with CC BASGDm with CC BASGD with trmean (q=3) BASGDm with trmean (q=3) BASGD with geo Med BASGDm with geo Med Kardam ( =3) Kardam ( =8) Kardam ( =14) (d) Under ALIE attack Figure 8: Average perplexity w.r.t. epochs (3 Byzantine workers, B = 15 for BASGD and BASGDm). In subfigure (b), the curves representing Kardam do not appear because Kardam diverges in this case and the average perplexity explodes. Kardam. Theoretically, the best performance of CC can not be worse than geo Med since CC is equivalent to geo Med when the clipping size R is small enough (please see Appendix B.10 for the proof). As illustrated in Figure 8(c) and Figure 8(d), under Fo E attack and ALIE attack, BASGD can outperform Kardam except for the case of using trmean as aggregation rule. BASGD with trmean performs slightly worse than Kardam. A possible reason is that trmean is sensitive to model dimensions. On the contrary, by using momentum, BASGDm with any aggregation rule can always outperform Kardam. The experimental results in this section have shown that BASGD (BASGDm) can outperform asynchronous Byzantine learning baselines under different settings. Moreover, BASGD (BASGDm) is compatible with various aggregation rules, such as trmean, geo Med, Yang and Li and CC. With the benefit of local momentum, BASGDm gets even stronger Byzantine resilience than BASGD, especially under the omniscient attacks Fo E and ALIE. 6. Conclusion In this paper, we propose a novel method called BASGD. To the best of our knowledge, BASGD is the first ABL method that can resist non-omniscient attacks without storing any instances on the server. Compared with those methods which need to store instances on the server, BASGD has a wider scope of application. An improved variant of BASGD, called BASGD with momentum (BASGDm), is further proposed by introducing local momentum into BASGD. As far as we know, BASGDm is the first ABL method that can resist the two omniscient attacks Fall of Empires and A Little is Enough . Both BASGD and BASGDm are compatible with various aggregation rules. Moreover, both BASGD and BASGDm are proved to be convergent and able to resist failure or attack. Empirical results show that our methods significantly outperform existing ABL baselines when there exists failure or attack on workers. Furthermore, both the theoretical results and the empirical results show the advantages of using local momentum in BASGDm. Acknowledgments This work is supported by National Key R&D Program of China (No. 2020YFA0713900), NSFC Project (No. 61921006, No. 62192783), and Fundamental Research Funds for the Central Universities (No. 020214380108). Buffered Asynchronous SGD for Byzantine Learning Appendix A. Asynchronous SGD (ASGD) One popular asynchronous method to solve the problem in (1) under the PS framework is ASGD (Dean et al., 2012), which is presented in Algorithm 3. Algorithm 3 Asynchronous SGD (ASGD) Server: Initialization: initial parameter w0, learning rate η; Send initial w0 to all workers; for t = 0 to tmax 1 do Wait until a new gradient gt k is received from arbitrary worker k; Execute SGD step: wt+1 wt η gt k; Send wt+1 back to worker k; end for Notify all workers to stop; Worker k: (k = 0, 1, ..., m 1) repeat Wait until receiving the latest parameter w from server; Randomly sample an index i from Dk and compute f(w; zi); Send f(w; zi) to server; until receive server s notification to stop Appendix B. Proof Details B.1 Proof of Proposition 5 Proof Firstly, we prove that coordinate-wise q-trimmed-mean is q-BR. It is not hard to check that trmean satisfies the property (a) in the definition of q-BR, then we prove that it also satisfies property (b). Without loss of generality, we assume h1j, . . . , h Bj are already in descending order. By definition, Trm(h j) is the average value of Mj, which is obtained by removing q largest values and q smallest values of {hij}B i=1. Therefore, h(q+1)j = max x Mj{x} Trm(h j) min x Mj{x} = h(n q)j. For any S {0, . . . , B 1} with |S| = B q, by Pigeonhole Principle, S includes at least one of h1j, . . . , h(q+1)j, and includes at least one of h(n q)j, . . . , h Bj. Therefore, max s S {hsj} h(q+1)j; min s S {hsj} h(n q)j. Combining these two inequalities, we have: max s S {hsj} Trm(h j) min s S {hsj}. Thus, coordinate-wise q-trimmed-mean is q-BR. By definition, coordinate-wise median can be seen as B 1 2 -trimmed-mean, and thus is B 1 Yang and Li B.2 Proof of Lemma 9 To begin with, we will introduce a lemma to estimate the ordered statistics. Lemma 17. X1, . . . , XM are non-negative, independent and identically distributed (i.i.d.) random variables sampled from distribution D, and have limited expectation E[X]. Denote the K-th largest value in {X1, . . . , XM} as X(K), then E[X(K)] CM,K E[X], where ( M, K = 1; M!(K 1)K 1(M K)M K (K 1)!(M K)!(M 1)M 1 , 1 < K < M Proof Denote the Probability Density Function (PDF) and Cumulative Density Function (CDF) of D as p(x) and P(x), respectively. Then the PDF of X(K) is: p(K)(x) = M! (K 1)!(M K)![1 P(x)]K 1P(x)M Kp(x). E[X(K)] = Z + 0 x p(K)(x)dx M! (K 1)!(M K)! [1 P(x)]K 1P(x)M K xp(x)dx M! (K 1)!(M K)! (K 1)K 1(M K)M K = M!(K 1)K 1(M K)M K (K 1)!(M K)!(M 1)M 1 E[X]. Inequality (a) is derived based on [1 P(x)]K 1P(x)M K (K 1)K 1(M K)M K (M 1)M 1 , which is obtained by the following process: Let θ(x) = (1 x)K 1x M K, x [0, 1]. Then θ (x) = (1 x)K 2x M K 1[(M K)(1 x) (K 1)x]. Let θ (x) = 0. Solving the equation, we obtain x = M K M 1 , 0 or 1. Also, we have θ(0) = θ(1) = 0, and θ(M K M 1 ) = (K 1)K 1(M K)M K Then we have maxx [0,1] θ(x) = θ(M K M 1 ) = (K 1)K 1(M K)M K Thus, [1 P(x)]K 1P(x)M K = θ(P(x)) (K 1)K 1(M K)M K Proposition 18. B, q, r Z+, 0 r q < B CB r,q r+1 ΘB,q,r = (B r) (B q 1)(q r + 1) . Proof By Stirling s approximation, we have: 2πn nne n n! e n nne n, n Z+. Buffered Asynchronous SGD for Byzantine Learning nn e n e n, n Z+. (5) By definition of CM,k, CM,K = M!(K 1)K 1(M K)M K (K 1)!(M K)!(M 1)M 1 =M (M 1)! (M 1)M 1 (K 1)K 1 (K 1)! (M K)M K M 1 e (M 1)] e K 1 p 2π(K 1) e M K p (M K)(K 1) , where the inequality uses Inequality (5). Case (i). When r < q, CB r,q r+1 e (B q 1)(q r) (B q 1)(q r + 1) . Case (ii). When r = q, by definition of CM,K, we have: CB r,q r+1 = CB q,1 = B q = (B r) (B q 1)(q r + 1) . In conclusion, when r q, we have: CB r,q r+1 (B r) (B q 1)(q r + 1) . When B and q are fixed, the upper bound of CB r,q r+1 will increase when r (the number of Byzantine workers) increases. Namely, the upper bound will be larger if there are more Byzantine workers. When B and r are fixed, q measures the Byzantine Robust degree of aggregation function Aggr( ). The factor [(B q 1)(q r)] 1 2 is monotonically decreasing with respect to q, when q < B 1+r 2 . Since r q < B 2 , the upper bound will decrease when q increases. Also, B q decreases when q increases. Namely, the upper bound will be smaller if Aggr( ) has a stronger q-BR property. In the worst case (q = r), the upper bound of CB r,q r+1 is linear to B. Even in the best case (r = 0, q = B 1 2 ), the denominator is about B 2 and the upper bound of CB r,q r+1 is linear to B. Thus, larger B might result in larger error. Hence, the buffer number is not supposed to be set too large. Yang and Li Now we prove Lemma 9. Proof E[||Gt||2 | wt] =E[||Aggr([h0, . . . , h B 1])||2 | wt] j=1 E[Aggr([h0, . . . , h B 1])2 j | wt], where Aggr([h0, . . . , h B 1])j represents the j-th coordinate of the aggregated gradient. We use Ht to denote the credible buffer index set, which is composed of the index of buffers, where the stored gradients are all from loyal workers. For each b Ht, hb has stored Nt b gradients at iteration t: g1, . . . , g Nt b, and we have: E[ hb 2|wt] =E[ hb E[hb|wt] 2|wt] + E[hb|wt] 2 l=1 (gl E[gl|wt]) Nt b + 1 (Nt b)2 l=1 E[gl|wt] Nt b + 1 (Nt b)2 Nt b l=1 E[gl|wt] 2 Inequality (a) is derived based on Assumption 4 and the fact that gi is mutually uncorrelated. Inequality (b) is derived by the following process: l=1 E[gl|wt] l=1 E[gl|wt] 2 + X 2 E[gl|wt]T E[g l|wt] l=1 E[gl|wt] 2 + X ( E[gl|wt] 2 + E[g l|wt] 2) Buffered Asynchronous SGD for Byzantine Learning l=1 E[gl|wt] 2 + (Nt b 1) l=1 E[gl|wt] 2 i=l E[gl|wt] 2. Inequality (c) is derived based on Assumption 3. Because there are no more than r Byzantine workers at iteration t, no more than r buffers contain Byzantine gradient. Thus, the credible buffer index set Ht has at least (B r) elements. In case that Ht has more than (B r) elements, we take the indices of the smallest (B q) elements in {hbj}b Ht to compose Ht j, and we have |Ht j| = B q. Note that Aggr( ) is q-BR, and by definition we have: min b Ht j {hbj} Aggr([h0, . . . , h B 1])j max b Ht j {hbj}. j=1 E[Aggr([h0, . . . , h B 1])2 j|wt] j=1 E[max b Ht j {h2 bj}|wt]. There are (B r) credible buffers, and we choose the smallest (B q) buffers to compose Ht j. Therefore, for all b Ht j, hbj is not larger than the (q r+1)-th largest one in {hbj}b Ht. Let N(t) be the (q + 1)-th smallest value in {Nt b}b {0,...,B 1}. Using Lemma 17, we have: E[max b Ht j {h2 bj}|wt] E[max b Ht j { hb 2}|wt] E[max b Ht j {D2 + σ2 =CB r,q r+1 (D2 + σ2 E[||Gt||2 | wt] j=1 E[max b Ht j {h2 bj}|wt] CB r,q r+1d (D2 + σ2 By Proposition 18, we have: E[||Gt||2 | wt] d (B r) (B q 1)(q r + 1) (D2 + σ2 Yang and Li B.3 Proof of Lemma 10 E[Gt F(wt) | wt] =E[Aggr([h0, . . . , h B 1]) F(wt) | wt] =E[Aggr([h0 F(wt), . . . , h B 1 F(wt)]) | wt], (6) where the second equation is derived based on Property (b) in the definition of q-BR. For each b Ht, hb has stored Nt b gradients at iteration t: g1, . . . , g Nt b, and we have: hb F(wt) = 1 k=1 gi F(wt) = 1 k=1 [ f(wtk; zik) F(wt)], where 0 t tk τmax, k = 1, 2, . . . , Nt b. Taking expectations on both sides, we have: E[||hb F(wt)|| |wt] k=1 ( f(wtk; zik) F(wt))|| |wt] k=1 E[|| f(wtk; zik) F(wt)|| |wt] k=1 {E[|| F(wtk) F(wt)|| |wt] + E[|| f(wtk; zik) E[ f(wtk; zik)]|| |wt] + E[||E[ f(wtk; zik)] F(wtk)|| |wt]}, where (a) is derived based on Triangle Inequality. The first part: E[|| F(wtk) F(wt)|| |wt] (b) L E[||wtk wt|| |wt] t =tk Gt || |wt] t =tk L E[||Gt || |wt] E[||Gt || |wt]2 Buffered Asynchronous SGD for Byzantine Learning E[||Gt ||2 |wt] CB r,q r+1d (D2 + σ2/N(t)) (d) τmax L q CB r,q r+1d (D2 + σ2/N(t)), where (b) is derived based on Assumption 5, (c) is derived based on Lemma 9, and (d) is derived based on t tk τmax. The second part: E[|| f(wtk; zik) E[ f(wtk; zik)]|| |wt] E[|| f(wtk; zik) E[ f(wtk; zik)]|| |wt]2 E[|| f(wtk; zik) E[ f(wtk; zik)]||2 |wt] where (e) is derived based on Assumption 4. By Assumption 2, we have the following estimation for the third part: E[||E[ f(wtk; zik)] F(wtk)|| |wt] κ. E[||hb F(wt)|| |wt] k=1 (τmax L q CB r,q r+1d (D2 + σ2/N(t)) + σ + κ) CB r,q r+1d (D2 + σ2/N(t)) + σ + κ. (7) Similar to the proof of Lemma 9, j [d], we have: min b Ht j {hbj F(wt)j} Aggr([h0 F(wt), . . . , h B 1 F(wt)])j max b Ht j {hbj F(wt)j}, where Ht j is composed by the indices of the smallest (B q) elements in {hbj F(wt)j}b Ht. Therefore, ||E[Aggr([h0 F(wt), . . . , h B 1 F(wt)]) | wt]|| j=1 ||E[Aggr([h0 F(wt), . . . , h B 1 F(wt)])j | wt]|| Yang and Li j=1 E[||Aggr([h0 F(wt), . . . , h B 1 F(wt)])j|| | wt] j=1 E[max b Ht j ||hbj F(wt)j|| | wt] j=1 CB r,q r+1E[||hbj F(wt)j|| |wt] j=1 CB r,q r+1E[||hb F(wt)|| |wt] j=1 CB r,q r+1 τmax L q CB r,q r+1d (D2 + σ2/N(t)) + σ + κ =CB r,q r+1d τmax L q CB r,q r+1d (D2 + σ2/N(t)) + σ + κ , (8) where (f) is derived based on definition of q-BR, (g) is derived based on Lemma 17, and (h) is derived based on Inequality (7). Combining Equation (6) and Inequality (8), we obtain: ||E[Gt F(wt) | wt]|| CB r,q r+1d τmax L q CB r,q r+1d (D2 + σ2/N(t)) + σ + κ . By Proposition (18), we have: ||E[Gt F(wt) | wt]|| d(B r) (B q 1)(q r + 1) (B q 1)(q r + 1) (D2 + σ2/N(t)) + σ + κ B.4 Proof of Theorem 11 E[F(wt+1) | wt] =E[F(wt η Gt) | wt] (a) E[F(wt) η F(wt)T Gt + L 2 η2||Gt||2 | wt] =F(wt) η E[ F(wt)T Gt | wt] + η2L 2 E[||Gt||2 | wt] =F(wt) η F(wt)T E[Gt | wt] + η2L 2 E[||Gt||2 | wt] =F(wt) η F(wt)T F(wt) + η2L 2 E[||Gt||2 | wt] Buffered Asynchronous SGD for Byzantine Learning η F(wt)T E[Gt F(wt) | wt] F(wt) η || F(wt)||2 + η2L 2 E[||Gt||2 | wt] + η || F(wt)|| ||E[Gt F(wt) | wt]||, where (a) is derived based on Assumption 5. Using Lemma 9 and Lemma 10, we have: E[F(wt+1) | wt] F(wt) η || F(wt)||2 + η2L 2 CB r,q r+1d (D2 + σ2/N(t)) + η CB r,q r+1d (τmax L q CB r,q r+1d (D2 + σ2/N(t)) + σ + κ) || F(wt)||. Also, by Assumption 3, || F(wt)|| D. Taking total expectation and using that || F(wt)|| D, we have: E[F(wt+1)] E[F(wt)] η E[|| F(wt)||2] + η2L 2 CB r,q r+1d (D2 + σ2/N(t)) + η CB r,q r+1Dd(τmax L q CB r,q r+1d (D2 + σ2/N(t)) + σ + κ). T PT 1 t=0 p D2 + σ2/N(t). By telescoping, we have: t=0 E[|| F(wt)||2] {F(w0) E[F(w T )]} + η2T L 2 CB r,q r+1d 1 t=0 (D2 + σ2/N(t)) + ηT CB r,q r+1Dd(τmax L D p CB r,q r+1d + σ + κ). Note that E[F(w T )] F , and let η = O 1 L PT 1 t=0 E[|| F(wt)||2] T O L[F(w0) F ] CB r,q r+1 Dd + O CB r,q r+1Dd (τmax L D p CB r,q r+1d + σ + κ) . Let δmax = q/B. When q = r, we have CB r,q r+1 (q/δmax r) p q/δmax r + 1 p (q/δmax q 1)(q r + 1) r/δmax r + 1 r/δmax r 1 (1 δmax)r δmax 2(1 δmax)r PT 1 t=0 E[|| F(wt)||2] T O L[F(w0) F ] 2(1 δmax)rd D Yang and Li 2(1 δmax)r Ddσ δmax + 2(1 δmax)r Ddκ 2(1 δmax) 3 2 r 3 2 LD Dd 3 2 τmax (δmax) 3 2 B.5 Proof of Theorem 12 Proof Let h b be the value of the b-th buffer, if all received loyal gradients were computed based on wt. Since Gt = Aggr([h0, . . . , h B 1]), we have: E[F(wt+1) | wt] =E[F(wt η Gt) | wt] (a) E[F(wt) η F(wt)T Gt + L 2 η2||Gt||2 | wt] =F(wt) η E[ F(wt)T Gt | wt] + η2L 2 E[||Gt||2 | wt], (9) where (a) is derived based on Assumption 5. Firstly, we estimate the value of E[ F(wt)T Gt | wt]. Since there are at most r Byzantine workers, at most r buffers may contain Byzantine gradients. Without loss of generality, suppose only the first r buffers may contain Byzantine gradients. Let Gt syn = Aggr([h0, . . . , hr 1, h r, . . . , h B 1]), where h0, . . . , hr 1 may contain Byzantine gradients and be arbitrary value, and h r, . . . , h B 1 each stores loyal gradients computed based on wt. Thus, E[ F(wt)T Gt syn | wt] F(wt) 2 A1, (10) E[ Gt syn 2 | wt] (A2)2. (11) Let α = 2η2L2τ 2 max(B r) < 1. We claim that E[ Gt Gt syn 2 | wt] (1 2αt+1 + α 1 α) (A2)2, and E[ Gt 2 | wt] (αt+1 + 2 1 α) (A2)2. Now we prove it by induction on t. Step 1. When t = 0, all gradients are computed according to w0, and we have G0 = G0 syn. Thus, E[ G0 G0 syn 2 | w0] = 0 (1 2α1 + α 1 α) (A2)2, E[ G0 2 | w0] = E[ G0 syn 2 | w0] (A2)2 (α1 + 2 1 α) (A2)2. Buffered Asynchronous SGD for Byzantine Learning E[ Gt Gt syn 2 | wt ] (1 2αt +1 + α 1 α) (A2)2, E[ Gt 2 | wt ] (αt +1 + 2 1 α) (A2)2, holds for all t = 0, 1, . . . , t 1 (induction hypothesis), then: E[ Gt Gt syn 2 | wt] =E[ Aggr([h0, . . . , hr 1, hr, . . . , h B 1]) Aggr([h0, . . . , hr 1, h r, . . . , h B 1]) 2 | wt] b=r hb h b 2 | wt] k=1 ( f(wtk; zik) f(wt; zik)) 2 | wt] k=1 f(wtk; zik) f(wt; zik) 2 | wt] k=1 L2 wtk wt 2 | wt] k=1 E[ wtk wt 2 | wt] t =tk η Gt 2 | wt] k=1 E[(t tk) t =tk Gt 2 | wt] k=1 [(t tk) t =tk (αt +1 + 2 1 α) (A2)2] k=1 [(t tk) t =tk (αt + 2 1 α) (A2)2] b=r (η2L2τ 2 max) (αt + 2 1 α) (A2)2 =(η2L2(B r)τ 2 max) (αt + 2 1 α) (A2)2 2α (αt + 2 1 α) (A2)2 2αt+1 + α 1 α) (A2)2, (12) Yang and Li where (b) is derived based on the definition of stable aggregation function, (c) is derived based on Cauchy s Inequality, (d) is derived based on Assumption 5, (e) is also derived based on Cauchy s Inequality, (f) is derived based on the induction hypothesis, (g) is derived based on that t tk τmax, and (h) is derived based on that α = 2η2L2τ 2 max(B r). Therefore, E[ Gt 2 | wt] =E[||Gt syn + (Gt Gt syn)||2 | wt] (i) 2 E[ Gt syn 2 | wt] + 2 E[||Gt Gt syn||2 | wt] (j) 2 (A2)2 + 2 E[||Gt Gt syn||2 | wt] (k) 2 (A2)2 + 2 (1 2αt+1 + α 1 α) (A2)2 =(αt+1 + 2 1 α) (A2)2, (13) where (i) is derived based on that x + y 2 2 x 2 + 2 y 2, x, y Rd, (j) is derived by the definition of (δmax, A1, A2)-effective aggregation function, and (k) is derived based on Inequality (12). By Inequality (12) and (13), the claimed property also holds for t = t. In conclusion, for all t = 0, 1, . . . , T 1, we have: E[ Gt Gt syn 2 | wt] (1 2αt+1 + α 1 α) (A2)2, (14) and E[ Gt 2 | wt] (αt+1 + 2 1 α) (A2)2. (15) Also, E[ Gt | wt]2 + V ar[ Gt | wt] = E[ Gt 2 | wt]. Therefore, E[ Gt | wt] = p E[ Gt | wt]2 αt+1 + 2 1 α A2. (16) η E[ F(wt)T Gt | wt] =η E[ F(wt)T Gt syn | wt] + η E[ F(wt)T (Gt Gt syn) | wt] (l) η ( F(wt) 2 A1) + η E[ F(wt)T (Gt Gt syn) | wt] η F(wt) 2 η A1 η F(wt) E[(Gt Gt syn) | wt] (m) η F(wt) 2 η A1 η D E[(Gt Gt syn) | wt] (n) η F(wt) 2 η A1 η D 1 2αt+1 + α 1 α A2, (17) where (l) is derived based on the definition of (δmax, A1, A2)-effective aggregation function, (m) is derived by Assumption 3, and (n) is derived based on Inequality (14). Buffered Asynchronous SGD for Byzantine Learning Combining Inequalities (9), (15), (17) and taking total expectation, we have: E[F(wt+1)] E[F(wt)] η E[ F(wt) 2] + η A1 + η D 1 2αt+1 + α 1 α A2 + 1 2η2L(αt+1 + 2 1 α) (A2)2. By telescoping, we have: t=0 E[ F(wt) 2] {F(w0) E[F(w T )]} + 1 2η2TL(α + 2 1 α) (A2)2 + ηTA1 + ηTD 1 2α + α 1 α A2. Divide both sides of the equation by ηT, and let η = O( 1 PT 1 t=0 E[ F(wt) 2] {F(w0) E[F(w T )]} 2ηL(α + 2 1 α) (A2)2 + A1 + D 1 2α + α 1 α A2 L[F(w0) F ] 2α + 1 1 α) (A2)2 T + A1 + α 1 2 [ 3 α 2(1 α)] 1 2 DA2. Since η = O( 1 LT ) and B = r/δmax + 1, we have that α = 2η2L2τ 2 max(B r) = O Lτ 2 max(r δmaxr + 1) Finally, it is obtained that: PT 1 t=0 E[ F(wt) 2] L [F(w0) F ] L(A2)2(1 + α) + O α 1 2 DA2 + A1 L 1 2 [F(w0) F ] (r δmaxr + 1) 1 2 L 1 2 τmax DA2 1 2max T 1 2 L 1 2 (A2)2 (r δmaxr + 1)L 3 2 (A2)2τ 2 max δmax T 3 2 Yang and Li B.6 Proof of Theorem 13 Proof The proof of this theorem is similar to that of Theorem 12. The main differences are the choices of the values α (in Theorem 12) and α (here in Theorem 13). For more readability, we still present the detailed proof processes here. Let h b be the value of the b-th buffer, if all received loyal gradients were computed based on wt. Since Gt = Aggr([h0, . . . , h B 1]), we have: E[F(wt+1) | wt] =E[F(wt η Gt) | wt] (a) E[F(wt) η F(wt)T Gt + L 2 η2||Gt||2 | wt] =F(wt) η E[ F(wt)T Gt | wt] + η2L 2 E[||Gt||2 | wt], (18) where (a) is derived based on Assumption 5. Firstly, we estimate the value of E[ F(wt)T Gt | wt]. Since there are at most r Byzantine workers, at most r buffers may contain Byzantine gradients. Without loss of generality, suppose only the first r buffers may contain Byzantine gradients. Let Gt syn = Aggr([h0, . . . , hr 1, h r, . . . , h B 1]), where h1, . . . , hr may contain Byzantine gradients and be arbitrary value, and h r, . . . , h B 1 each stores loyal gradients computed based on wt. Thus, E[ F(wt)T Gt syn | wt] F(wt) 2 A1, (19) E[ Gt syn 2 | wt] (A2)2. (20) Let α = 2η2L2τ 2 max(1 µ)2(B r) < 1. We claim that E[ Gt Gt syn 2 | wt] (1 2 αt+1 + α 1 α) (A2)2, and E[ Gt 2 | wt] ( αt+1 + 2 1 α) (A2)2. Now we prove it by induction on t. Step 1. When t = 0, all gradients are computed according to w0, and we have G0 = G0 syn. Thus, E[ G0 G0 syn 2 | w0] = 0 (1 2 α1 + α 1 α) (A2)2, E[ G0 2 | w0] = E[ G0 syn 2 | w0] (A2)2 ( α1 + 2 1 α) (A2)2. E[ Gt Gt syn 2 | wt ] (1 2 αt +1 + α 1 α) (A2)2, Buffered Asynchronous SGD for Byzantine Learning E[ Gt 2 | wt ] ( αt +1 + 2 1 α) (A2)2, holds for all t = 0, 1, . . . , t 1 (induction hypothesis), then: E[ Gt Gt syn 2 | wt] =E[ Aggr([h0, . . . , hr 1, hr, . . . , h B 1]) Aggr([h0, . . . , hr 1, h r, . . . , h B 1]) 2 | wt] b=r hb h b 2 | wt] k=1 (1 µ)( f(wtk; zik) f(wt; zik)) 2 | wt] k=1 (1 µ)( f(wtk; zik) f(wt; zik)) 2 | wt] k=1 (1 µ)2L2 wtk wt 2 | wt] k=1 E[ wtk wt 2 | wt] t =tk η Gt 2 | wt] k=1 E[(t tk) t =tk Gt 2 | wt] k=1 [(t tk) t =tk ( αt +1 + 2 1 α) (A2)2] k=1 [(t tk) t =tk ( αt + 2 1 α) (A2)2] b=r (η2L2(1 µ)2τ 2 max) ( αt + 2 1 α) (A2)2 =(η2L2(1 µ)2τ 2 max(B r)) ( αt + 2 1 α) (A2)2 2 α ( αt + 2 1 α) (A2)2 2 αt+1 + α 1 α) (A2)2, (21) where (b) is derived based on the definition of stable aggregation function, (c) is derived based on the worker momentum updating formula u µ u + (1 µ) f(w; zi), (d) is derived Yang and Li based on Cauchy s Inequality, (e) is derived based on Assumption 5, (f) is also derived based on Cauchy s Inequality, (g) is derived based on the induction hypothesis, (h) is derived based on that t tk τmax, and (i) is derived based on that α = 2η2L2τ 2 max(1 µ)2(B r). Therefore, E[ Gt 2 | wt] =E[||Gt syn + (Gt Gt syn)||2 | wt] (j) 2 E[ Gt syn 2 | wt] + 2 E[||Gt Gt syn||2 | wt] (k) 2 (A2)2 + 2 E[||Gt Gt syn||2 | wt] (l) 2 (A2)2 + 2 (1 2 αt+1 + α 1 α) (A2)2 =( αt+1 + 2 1 α) (A2)2, (22) where (j) is derived based on that x + y 2 2 x 2 + 2 y 2, x, y Rd, (k) is derived by the definition of (δmax, A1, A2)-effective aggregation function, and (l) is derived based on Inequality (21). By Inequality (21) and (22), the claimed property also holds for t = t. In conclusion, for all t = 0, 1, . . . , T 1, we have: E[ Gt Gt syn 2 | wt] (1 2 αt+1 + α 1 α) (A2)2, (23) and E[ Gt 2 | wt] ( αt+1 + 2 1 α) (A2)2. (24) Also, E[ Gt | wt]2 + V ar[ Gt | wt] = E[ Gt 2 | wt]. Therefore, E[ Gt | wt] = p E[ Gt | wt]2 αt+1 + 2 1 α A2. (25) η E[ F(wt)T Gt | wt] =η E[ F(wt)T Gt syn | wt] + η E[ F(wt)T (Gt Gt syn) | wt] (m) η ( F(wt) 2 A1) + η E[ F(wt)T (Gt Gt syn) | wt] η F(wt) 2 η A1 η F(wt) E[(Gt Gt syn) | wt] (n) η F(wt) 2 η A1 η D E[(Gt Gt syn) | wt] (p) η F(wt) 2 η A1 η D 1 2 αt+1 + α 1 α A2, (26) where (m) is derived based on the definition of (δmax, A1, A2)-effective aggregation function, (n) is derived by Assumption 3, and (p) is derived based on Inequality (23). Buffered Asynchronous SGD for Byzantine Learning Combining Inequalities (18), (24), (26) and taking total expectation, we have: E[F(wt+1)] E[F(wt)] η E[ F(wt) 2] + η A1 + η D 1 2 αt+1 + α 1 α A2 + 1 2η2L( αt+1 + 2 1 α) (A2)2. By telescoping, we have: t=0 E[ F(wt) 2] {F(w0) E[F(w T )]} + 1 2η2TL( α + 2 1 α) (A2)2 + ηTA1 + ηTD 1 2 α + α 1 α A2. Divide both sides of the equation by ηT, and let η = O( 1 PT 1 t=0 E[ F(wt) 2] {F(w0) E[F(w T )]} 2ηL( α + 2 1 α) (A2)2 + A1 + D 1 2 α + α 1 α A2 L[F(w0) F ] 2 α + 1 1 α) (A2)2 T + A1 + α 1 2 [ 3 α 2(1 α)] 1 2 DA2. Since η = O( 1 LT ) and B = r/δmax + 1, we have that α = 2η2L2τ 2 max(1 µ)2(B r) = O Lτ 2 max(1 µ)2(r δmaxr + 1) Finally, it is obtained that: PT 1 t=0 E[ F(wt) 2] L [F(w0) F ] L(A2)2(1 + α) + O α 1 2 DA2 + A1 L 1 2 [F(w0) F ] (r δmaxr + 1) 1 2 L 1 2 τmax DA2(1 µ) 1 2max T 1 2 L 1 2 (A2)2 (r δmaxr + 1)L 3 2 (A2)2τ 2 max(1 µ)2 Yang and Li B.7 Proof of Proposition 14 Proof Under the condition that wt Rd, E[ Gt syn F(wt) | wt] D < D, we have: E[ F(wt)T Gt syn | wt] = E[ F(wt)T [ F(wt) + (Gt syn F(wt))] | wt] = F(wt) 2 + E[ F(wt)T (Gt syn F(wt)) | wt] F(wt) 2 F(wt) E[ Gt syn F(wt) | wt] F(wt) 2 DD . Combining with the property (a) of (δmax, A1, A2)-effective aggregation function, we have A1 DD < D2. B.8 Proof of Theorem 15 Proof At the end of the t-th iteration, the server updates the global model by letting wt+1 = wt η Gt, where Gt = Aggr([h0, . . . , h B 1]). Please note that h0, . . . , h B 1 may differ for different t s. However, when it does not cause confusion, we will omit the superscript t for more readability. Let Ht = {b | hb does not contain Byzantine values} denote the set of credible buffer at the t-th iteration. Since there are at most r Byzantine workers, we have |Ht| B r. Let Gt = 1 |Ht| denote the mean value of credible buffers. At the beginning, we introduce the following lemma, which provides an upper bound for the difference between received local momentums and the global gradient. Lemma 19. Under the same conditions in Theorem 15, for each local momentum u stored in any credible buffer hb (b Ht) at the t-th iteration, we have: E[u F(wt)] 2 4(4cδ + 1) h 4 λ + 2 p 4 2λ + 4λ 2 + 2λ 2i η2L2(τmax + 1)2D2. (27) Proof In the proof below, all the mentioned local momentums and stochastic gradients are from a loyal worker k. For simplicity, we will omit the worker ID k when there is no confusion. Let tv (v = 0, 1, . . .) denote the iteration numbers corresponding to the model parameters that worker k received during the training process. Thus, we have 0 = t0 t1 . . . tv . . . and tv tv 1 τmax + 1. Therefore, the global model parameter is wtv when worker k sends local momentums for the v-th time. Let uk,v denote the v-th sent momentum from worker k. We are left to provide Buffered Asynchronous SGD for Byzantine Learning an upper bound for E[uk,v] E[ F(wtv)] 2. Using Assumption 2, Assumption 5 and the i.i.d.-ness, it is obtained that E[uk,v F(wtv)] 2 = µ E[uk,v 1] + (1 µ) E[ f(wtv 1; zik)] E[ F(wtv)] 2 = µ E[uk,v 1 F(wtv)] + (1 µ) E[ F(wtv 1) F(wtv)] 2 = (1 λ) E[uk,v 1 F(wtv)] + λ E[ F(wtv 1) F(wtv)] 2 (1 λ)2 E[uk,v 1 F(wtv)] 2 + λ2 E[ F(wtv 1) F(wtv)] 2 + 2λ(1 λ) E[uk,v 1 F(wtv)] E[ F(wtv 1) F(wtv)] . (28) By using Assumption 3, the L2-norm of each received momentum from loyal workers is bounded by D. Thus, for any b, b H, hb hb 2 4D2. Since tv tv 1 τmax, we have: E wtv 1 wtv 2 = E t=tv 1 η Gt t=tv 1 (Gt Gt) 2η2(τmax + 1)2cδ(4D2) + 2η2(τmax + 1)2D2 = 2(4cδ + 1)η2(τmax + 1)2D2. (29) The first term in the right-hand side (RHS) of inequality (28) is bounded by (1 λ)2 E[uk,v 1 F(wtv)] 2 =(1 λ)2 E[uk,v 1 F(wtv 1)] + E[ F(wtv 1) F(wtv)] 2 (1 λ)(1 + λ 2 ) E[uk,v 1 F(wtv 1)] 2 + (1 λ)(1 + 2 λ) E[ F(wtv 1) F(wtv)] 2 (1 λ)(1 + λ 2 ) E[uk,v 1 F(wtv 1)] 2 + (1 λ)(1 + 2 λ)E F(wtv 1) F(wtv) 2 2 ) E[uk,v 1 F(wtv 1)] 2 + 2L2 λ E wtv 1 wtv 2 2 ) E[uk,v 1 F(wtv 1)] 2 + 4(4cδ + 1)η2L2(τmax + 1)2D2 The second term in the RHS of inequality (28) is bounded by λ2 E[ F(wtv 1) F(wtv)] 2 λ2L2E wtv 1 wtv 2 2(4cδ+1)λ2η2L2(τmax+1)2D2. (31) Let constants Q = 2(4cδ + 1)η2L2(τmax + 1)2D2. Substituting (30) and (31) into (28), we have E[uk,v F(wtv)] 2 (1 λ 2 ) E[uk,v 1 F(wtv 1)] 2 + 2 + 2λQ 1 2 (1 λ 2 ) E[uk,v 1 F(wtv 1)] 2 + 2 Yang and Li Let ξv = E[uk,v F(wtv)] 0, we have (ξv 1)2 + 2 λQ + λ2Q + 2λ Q(ξv 1)2 + 2 By mathematical induction on v, then we prove that 4 + (4 2λ)λ2 Q, v N+. (34) Step 1. For v = 1, using Inequality (29), Assumption 2 and Assumption 5, we have ξ1 = E[uk,1 F(wt1)] = E[ f(wt0; zik) F(wt1)] = E[ F(wt0) F(wt1)] L E wt0 wt1 2(4cδ + 1)η2L2(τmax + 1)2D2 4 + (4 2λ)λ2 Step 2. Suppose that ξv 2 + Q holds for v. Then for v + 1, we have (ξv+1)2 1 λ λQ + λ2Q + 2λ 4 + (4 2λ)λ2 4 + (4 2λ)λ2 Since v u u t 4 + (4 2λ)λ2 Buffered Asynchronous SGD for Byzantine Learning 4 + (4 2λ)λ2 4 + (4 2λ)λ2 4 + (4 2λ)λ2 + 2 λ2 + 4 2λ + λ2 + 4 p 4 + (4 2λ)λ2 4 + (4 2λ)λ2 4λ + λ v u u t22 + 4 + (4 2λ)λ2 !2 + λ2 + 4 p 4 + (4 2λ)λ2 4 + (4 2λ)λ2 4 + (4 2λ)λ2 4 + (4 2λ)λ2 4 + (4 2λ)λ2 4 + (4 2λ)λ2 It indicates that the induction hypothesis also holds for v + 1. Consequently, for any positive integer v, we have 4 + (4 2λ)λ2 Recall that ξv = E[uk,v F(wtv)] and finally it is obtained that E[uk,v F(wtv)] 2 4 + (4 2λ)λ2 = 4(4cδ + 1) h 4 λ + 2 p 4 2λ + 4λ 2 + 2λ 2i η2L2(τmax + 1)2D2. (41) We have finished the proof of Lemma 19 and we will continue to prove Theorem 15. E[F(wt+1) | wt] =E[F(wt η Gt) | wt] (a) E F(wt) η F(wt)T Gt + L 2 η2||Gt||2 wt =E F(wt) η 1 F(wt) 2 + Gt 2 F(wt) Gt 2 + L 2 η2||Gt||2 wt Yang and Li 2 F(wt) 2 + η 2 F(wt) Gt 2 η(1 ηL) 2 ||Gt||2 wt (b) F(wt) η 2 F(wt) 2 + η 2 E h Gt F(wt) 2 wti (c) F(wt) η 2 F(wt) 2 + η E h Gt Gt 2 wti + η E h Gt F(wt) 2 wti , (42) where (a) is derived based on Assumption 5, (b) is derived based on that η 1 L, and (c) is derived based on that x + y 2 2 x 2 + 2 y 2. Take total expectation on both sides and it is obtained that E[F(wt+1)] E[F(wt)] η 2E F(wt) 2+η E Gt Gt 2+η E Gt F(wt) 2 . (43) Since the fraction of Byzantine workers is not larger than Bδ n , there are at most Bδ Byzantine workers. Thus, there are at most Bδ buffers that are not credible. It indicates that the fraction of credible buffers equals or is larger than 1 δ. Since Aggr( ) is a (δmax, c)-robust aggregation function and Gt = Aggr([h0, . . . , h B 1]), we have E Gt Gt 2 cδ max b,b Ht E hb hb 2 cδ max b,b Ht 2E hb F(wt) 2 + 2E hb F(wt) 2 4cδ max b Ht E hb F(wt) 2 . (44) In addition, since Gt = 1 |Ht| P b Ht hb, we have E Gt F(wt) 2 1 |Ht| b Ht E hb F(wt) 2 max b Ht E hb F(wt) 2 . (45) E[F(wt+1)] E[F(wt)] η 2E F(wt) 2 + η(4cδ + 1) max b Ht E hb F(wt) 2 . (46) For any b Ht, let ul (l = 1, 2, . . . , Nt b) denote the momentums received in hb. E hb F(wt) 2 l=1 ul F(wt) l=1 (ul E[ul]) E[ul] F(wt) 2 . (47) Buffered Asynchronous SGD for Byzantine Learning l {1, 2, . . . , Nt b}, suppose that ul is the momentum received from worker k for the v-th time. When t > τmax, we have v > 1. Thus, ul = uk,v = (1 λ)uk,v 1 + λ f(wtv 1; zik) where uk,v 1 denotes the stored momentum on worker k before f(wtv 1; zik) is computed and t τmax tv 1 t. By using Assumption 4, we have E ul E[ul] 2 = λ2 E f(wtv 1; zik) E[ f(wtv 1; zik)] 2 λ2σ2. (48) When 0 t τmax, it is uncertain whether v = 1 or not. If v > 1, it is already obtained that E ul E[ul] 2 λ2σ2 σ2. If v = 1, ul = uk,v = f(wtv 1; zik). Thus, E ul E[ul] 2 = E f(wtv 1; zik) E[ f(wtv 1; zik)] 2 σ2. (49) In summary, when 0 t τmax, we have E ul E[ul] 2 σ2. Therefore, l=1 (ul E[ul]) l=1 E ul E[ul] 2 σ2, if 0 t τmax; λ2σ2, if t > τmax. (50) Meanwhile, by Lemma 19, we have E[ul F(wt)] 2 4(4cδ + 1) h 4 λ + 2 p 4 2λ + 4λ 2 + 2λ 2i η2L2(τmax + 1)2D2. (51) Consequently, E hb F(wt) 2 σ2 + 4(4cδ + 1)(4 λ + 2 4 2λ + 4λ 2 + 2λ 2)η2L2(τmax + 1)2D2, t τmax; λ2σ2 + 4(4cδ + 1)(4 λ + 2 4 2λ + 4λ 2 + 2λ 2)η2L2(τmax + 1)2D2, t > τmax. (52) Substituting it into (46), it is obtained that if 0 t τmax, E[F(wt+1)] E[F(wt)] η 2E F(wt) 2 + η(4cδ + 1)σ2 + 4η(4cδ + 1)2 h 4 λ + 2 p 4 2λ + 4λ 2 + 2λ 2i η2L2(τmax + 1)2D2; and that if t > τmax, E[F(wt+1)] E[F(wt)] η 2E F(wt) 2 + η(4cδ + 1)λ2σ2 + 4η(4cδ + 1)2 h 4 λ + 2 p 4 2λ + 4λ 2 + 2λ 2i η2L2(τmax + 1)2D2. By taking summation over t, it is obtained that when T > τmax, E[F(w T )] F(w0) η t=0 E F(wt) 2 + η(4cδ + 1)σ2(τmax + 1 + λ2(T τmax 1)) Yang and Li + 4ηT(4cδ + 1)2 h 4 λ + 2 p 4 2λ + 4λ 2 + 2λ 2i η2L2(τmax + 1)2D2. Finally, by using E[F(w T )] F and T τmax 1 < T, we have PT 1 t=0 E|| F(wt)||2 T 2[F(w0) F ] ηT + 2(4cδ + 1)(τmax + 1)σ2 T + ζ, (56) where ζ = 2(4cδ + 1)λ2σ2 + 8(4cδ + 1)2 h 4 λ + 2 4 2λ + 4λ 2 + 2λ 2i η2L2(τmax + 1)2D2. B.9 Proof of Proposition 16 Proof Substituting λ = ηL and η q F(w0) F LT(4cδ+1)[σ2+8(4cδ+1)(τmax+1)2D2] into (2), it is obtained that PT 1 t=0 E|| F(wt)||2 2[F(w0) F ] ηT + 2(4cδ + 1)(τmax + 1)σ2 T + 2(4cδ + 1)λ2σ2 + 8(4cδ + 1)2 h 4 λ + 2 p 4 2λ + 4λ 2 + 2λ 2i η2L2(τmax + 1)2D2 2[F(w0) F ] ηT + 2(4cδ + 1)(τmax + 1)σ2 T + 2(4cδ + 1)λ2σ2 + 8(4cδ + 1)2 4 + 8λ 1 + 2λ 2 η2L2(τmax + 1)2D2 = 2[F(w0) F ] ηT + 2(4cδ + 1)[σ2 + 8(4cδ + 1)(τmax + 1)2D2]ηL + 2(4cδ + 1)(τmax + 1)σ2 + 64(4cδ + 1)2(τmax + 1)2D2η 3 2 L 3 2 + 32(4cδ + 1)2(τmax + 1)2D2η2L2 2L 1 2 [F(w0) F ] 1 2 (4cδ + 1) 1 2 [σ2 + 8(4cδ + 1)(τmax + 1)2D2] 1 2 T 1 2 , 2L[F(w0) F ] + 2L 1 2 [F(w0) F ] 1 2 (4cδ + 1) 1 2 [σ2 + 8(4cδ + 1)(τmax + 1)2D2] 1 2 + 2(4cδ + 1)(τmax + 1)σ2 + 64(4cδ + 1)2(τmax + 1)2D2 L[F(w0) F ] T(4cδ + 1)[σ2 + 8(4cδ + 1)(τmax + 1)2D2] + 32(4cδ + 1)2(τmax + 1)2D2 L[F(w0) F ] T(4cδ + 1)[σ2 + 8(4cδ + 1)(τmax + 1)2D2] 4L 1 2 [F(w0) F ] 1 2 (4cδ + 1) 1 2 [σ2 + 8(4cδ + 1)(τmax + 1)2D2] 1 2 Buffered Asynchronous SGD for Byzantine Learning + L 3 4 [F(w0) F ] 3 4 64(4cδ+1) 5 4 (τmax+1)2D2 [σ2+8(4cδ+1)(τmax+1)2D2] 3 4 + L[F(w0) F ] 2 + 32(4cδ+1)(τmax+1)2D2 σ2+8(4cδ+1)(τmax+1)2D2 + 2(4cδ + 1)(τmax + 1)σ2 4L 1 2 [F(w0) F ] 1 2 (4cδ + 1) 1 2 [σ2 + 8(4cδ + 1)(τmax + 1)2D2] 1 2 + 14L 3 4 [F(w0) F ] 3 4 (4cδ + 1) 1 2 (τmax + 1) 1 2 D 1 2 + 6L[F(w0) F ] + 2(4cδ + 1)(τmax + 1)σ2 B.10 Relation between Geometric Median and Centered Clipping Corollary 20. Aggregation rule centered clipping (CC) is equivalent to geometric median (geo Med) when clipping size R 0+. Proof The definition of CC is given by: hl+1 = hl + 1 b=0 (hb hl) min 1, R hb hl 2 When CC converges to h CC, it means that h CC = h CC + 1 b=0 (hb h CC) min 1, R hb h CC 2 Thus, we have: B 1 X b=0 (hb h CC) min 1, R hb h CC 2 When b {0, . . . , B 1}, R hb h CC 2 (since R 0+), we have min 1, R hb h CC 2 = R hb h CC 2 . (60) (hb h CC) hb h CC 2 = 0. (61) Yang and Li Considering that the function PB 1 b=0 h hb 2 is convex, we have: h CC = arg min h Rd = geo Med([h0, . . . , h B 1]). (63) Meanwhile, we have to point out that although CC is theoretically equivalent to geo Med when R is small enough, R is not supposed to be set too small in practical applications. Too small R will slow the convergence rate of CC. Buffered Asynchronous SGD for Byzantine Learning Appendix C. More Experimental Results Figure 9-10, Figure 11, and Figure 12 illustrate the average training loss w.r.t. epochs when under no attack, non-omniscient attacks, and omniscient attacks in the image classification task. Please note that in Figure 11 and Figure 12, some curves do not appear because the value of the loss function is extremely large due to the Byzantine attack. γ is the hyperparameter about the assumed number of Byzantine workers in Kardam. The experimental results further support the conclusions of this work. 0 20 40 60 80 100 120 140 160 Epoch Training Loss ASGD BASGD with median (B=5) BASGD with median (B=10) BASGD with median (B=15) BASGD with median (B=30) Kardam ( =2) Kardam ( =10) (a) BASGD with median 0 20 40 60 80 100 120 140 160 Epoch Training Loss ASGD BASGD with trmean (B=5, q=1) BASGD with trmean (B=10, q=3) BASGD with trmean (B=15, q=5) BASGD with trmean (B=30, q=10) Kardam ( =2) Kardam ( =10) (b) BASGD with trmean 0 20 40 60 80 100 120 140 160 Epoch Training Loss ASGD BASGD with geo Med (B=5) BASGD with geo Med (B=10) BASGD with geo Med (B=15) BASGD with geo Med (B=30) Kardam ( =2) Kardam ( =10) (c) BASGD with geo Med 0 20 40 60 80 100 120 140 160 Epoch Training Loss ASGD BASGD with CC (B=5) BASGD with CC (B=10) BASGD with CC (B=15) BASGD with CC (B=30) Kardam ( =2) Kardam ( =10) (d) BASGD with CC Figure 9: Average training loss w.r.t. epochs of methods BASGD, ASGD, and Kardam when there are no Byzantine workers. Yang and Li 0 20 40 60 80 100 120 140 160 Epoch Training Loss ASGDm BASGDm with median (B=5) BASGDm with median (B=10) BASGDm with median (B=15) BASGDm with median (B=30) Kardam ( =2) Kardam ( =10) (a) BASGDm with median 0 20 40 60 80 100 120 140 160 Epoch Training Loss ASGDm BASGDm with trmean (B=5, q=1) BASGDm with trmean (B=10, q=3) BASGDm with trmean (B=15, q=5) BASGDm with trmean (B=30, q=10) Kardam ( =2) Kardam ( =10) (b) BASGDm with trmean 0 20 40 60 80 100 120 140 160 Epoch Training Loss ASGDm BASGDm with geo Med (B=5) BASGDm with geo Med (B=10) BASGDm with geo Med (B=15) BASGDm with geo Med (B=30) Kardam ( =2) Kardam ( =10) (c) BASGDm with geo Med 0 20 40 60 80 100 120 140 160 Epoch Training Loss ASGDm BASGDm with CC (B=5) BASGDm with CC (B=10) BASGDm with CC (B=15) BASGDm with CC (B=30) Kardam ( =2) Kardam ( =10) (d) BASGDm with CC Figure 10: Average training loss w.r.t. epochs of methods BASGDm, ASGDm, and Kardam when there are no Byzantine workers. Buffered Asynchronous SGD for Byzantine Learning 0 20 40 60 80 100 120 140 160 Epoch Average Training Loss BASGD with CC BASGDm with CC BASGD with trmean (q=3) BASGDm with trmean (q=3) BASGD with geo Med BASGDm with geo Med Kardam ( =3) Kardam ( =6) (a) 3 Byzantine workers with RD-attack 0 20 40 60 80 100 120 140 160 Epoch Average Training Loss BASGD with CC BASGDm with CC BASGD with trmean (q=3) BASGDm with trmean (q=3) BASGD with geo Med BASGDm with geo Med Kardam ( =3) Kardam ( =8) Kardam ( =14) (b) 3 Byzantine workers with NG-attack 0 20 40 60 80 100 120 140 160 Epoch Average Training Loss BASGD with CC BASGDm with CC BASGD with trmean (q=6) BASGDm with trmean (q=6) BASGD with geo Med BASGDm with geo Med Kardam ( =6) Kardam ( =10) (c) 6 Byzantine workers with RD-attack 0 20 40 60 80 100 120 140 160 Epoch Average Training Loss BASGD with CC BASGDm with CC BASGD with trmean (q=6) BASGDm with trmean (q=6) BASGD with geo Med BASGDm with geo Med Kardam ( =6) Kardam ( =10) Kardam ( =14) (d) 6 Byzantine workers with NG-attack Figure 11: Average training loss w.r.t. epochs under non-omniscient attacks. B = 10 for BASGD (BASGDm) when there are 3 Byzantine workers and B = 15 for BASGD (BASGDm) when there are 6 Byzantine workers. Some curves do not appear in the figure, because the value of loss function is extremely large. Yang and Li 0 20 40 60 80 100 120 140 160 Epoch Average Training Loss BASGD with CC BASGDm with CC BASGD with trmean BASGDm with trmean BASGD with geo Med BASGDm with geo Med Kardam ( =3) Kardam ( =6) (a) 3 Byzantine workers with Fo E attack 0 20 40 60 80 100 120 140 160 Epoch Average Training Loss BASGD with CC BASGDm with CC BASGD with trmean BASGDm with trmean BASGD with geo Med BASGDm with geo Med Kardam ( =3) Kardam ( =6) (b) 3 Byzantine workers with ALIE attack 0 20 40 60 80 100 120 140 160 Epoch Average Training Loss BASGD with CC BASGDm with CC BASGD with trmean BASGDm with trmean BASGD with geo Med BASGDm with geo Med Kardam ( =10) Kardam ( =14) (c) 6 Byzantine workers with Fo E attack 0 20 40 60 80 100 120 140 160 Epoch Average Training Loss BASGD with CC BASGDm with CC BASGD with trmean BASGDm with trmean BASGD with geo Med BASGDm with geo Med Kardam ( =6) Kardam ( =10) (d) 6 Byzantine workers with ALIE attack Figure 12: Average training loss w.r.t. epochs under omniscient attacks. B = 10 for BASGD (BASGDm) when there are 3 Byzantine workers and B = 15 for BASGD (BASGDm) when there are 6 Byzantine workers. Some curves do not appear in the figure, because the value of loss function is extremely large. Buffered Asynchronous SGD for Byzantine Learning Dan Alistarh, Zeyuan Allen-Zhu, and Jerry Li. Byzantine stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 4613 4623, 2018. Zeyuan Allen-Zhu, Faeze Ebrahimian, Jerry Li, and Dan Alistarh. Byzantine-resilient non-convex stochastic gradient descent. ar Xiv preprint ar Xiv:2012.14368, 2020. Mahmoud Assran, Arda Aytekin, Hamid Reza Feyzmahdavian, Mikael Johansson, and Michael G Rabbat. Advances in asynchronous parallel and distributed optimization. Proceedings of the IEEE, 108(11):2013 2031, 2020. Gilad Baruch, Moran Baruch, and Yoav Goldberg. A little is enough: Circumventing defenses for distributed learning. In Advances in Neural Information Processing Systems, pages 8635 8645, 2019. Jeremy Bernstein, Jiawei Zhao, Kamyar Azizzadenesheli, and Anima Anandkumar. sign SGD with majority vote is communication efficient and fault tolerant. In Proceedings of the International Conference on Learning Representations, 2019. Dimitri Bertsekas, P Tsitsiklis, and N John. Parallel and distributed computation: Numeral methods. Prentice-Hall Inc., 1989. Peva Blanchard, Rachid Guerraoui, Julien Stainer, et al. Machine learning with adversaries: Byzantine tolerant gradient descent. In Advances in Neural Information Processing Systems, pages 119 129, 2017. L eon Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of the International Conference on Computational Statistics, pages 177 186. Springer, 2010. Lingjiao Chen, Hongyi Wang, Zachary Charles, and Dimitris Papailiopoulos. Draco: Byzantine-resilient distributed training via redundant gradients. In Proceedings of the International Conference on Machine Learning, pages 903 912, 2018. Yudong Chen, Lili Su, and Jiaming Xu. Distributed statistical machine learning in adversarial settings: Byzantine gradient descent. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 1(2):1 25, 2017. Georgios Damaskinos, Rachid Guerraoui, Rhicheek Patra, Mahsa Taziki, et al. Asynchronous Byzantine machine learning (the case of SGD). In Proceedings of the International Conference on Machine Learning, pages 1145 1154, 2018. Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Marc aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, et al. Large scale distributed deep networks. In Advances in Neural Information Processing Systems, pages 1223 1231, 2012. Ilias Diakonikolas and Daniel M Kane. Recent advances in algorithmic high-dimensional robust statistics. ar Xiv preprint ar Xiv:1911.05911, 2019. Yang and Li Ilias Diakonikolas, Gautam Kamath, Daniel M Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Being robust (in high dimensions) can be practical. In Proceedings of the International Conference on Machine Learning, pages 999 1008, 2017. Xiaomin Duan, Huafei Sun, and Linyu Peng. Application of gradient descent algorithms based on geodesic distances. Science China Information Sciences, 63:1 11, 2020. John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul): 2121 2159, 2011. El-Mahdi El-Mhamdi, Sadegh Farhadkhani, Rachid Guerraoui, Arsany Guirguis, Lˆe-Nguyˆen Hoang, and S ebastien Rouault. Collaborative learning in the jungle (decentralized, Byzantine, heterogeneous, asynchronous and nonconvex learning). In Advances in Neural Information Processing Systems, pages 25044 25057, 2021a. El-Mahdi El-Mhamdi, Rachid Guerraoui, and S ebastien Rouault. Distributed momentum for byzantine-resilient stochastic gradient descent. In Proceedings of the International Conference on Learning Representations, 2021b. Farzin Haddadpour, Mohammad Mahdi Kamani, Mehrdad Mahdavi, and Viveck Cadambe. Trading redundancy for communication: Speeding up distributed SGD for non-convex optimization. In Proceedings of the International Conference on Machine Learning, pages 2545 2554, 2019. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 770 778, 2016. Sepp Hochreiter and J urgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735 1780, 1997. Martin Jaggi, Virginia Smith, Martin Tak ac, Jonathan Terhorst, Sanjay Krishnan, Thomas Hofmann, and Michael I Jordan. Communication-efficient distributed dual coordinate ascent. In Advances in Neural Information Processing Systems, pages 3068 3076, 2014. Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315 323, 2013. Peter Kairouz, H Brendan Mc Mahan, Brendan Avent, Aur elien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. Advances and open problems in federated learning. Foundations and Trends R in Machine Learning, 14(1 2):1 210, 2021. Sai Praneeth Karimireddy, Lie He, and Martin Jaggi. Learning from history for Byzantine robust optimization. In Proceedings of the International Conference on Machine Learning, pages 5311 5319, 2021. Buffered Asynchronous SGD for Byzantine Learning Sai Praneeth Karimireddy, Lie He, and Martin Jaggi. Byzantine-robust learning on heterogeneous datasets via bucketing. In Proceedings of the International Conference on Learning Representations, 2022. Jakub Konevcn y, H Brendan Mc Mahan, Felix X Yu, Peter Richt arik, Ananda Theertha Suresh, and Dave Bacon. Federated learning: Strategies for improving communication efficiency. ar Xiv:1610.05492, 2016. Konstantinos Konstantinidis and Aditya Ramamoorthy. Byzshield: An efficient and robust system for distributed training. Proceedings of Machine Learning and Systems, 3:812 828, 2021. Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. Technical report, 2009. Leslie Lamport, Robert Shostak, and Marshall Pease. The Byzantine generals problem. In Concurrency: the works of leslie lamport, pages 203 226, 2019. Jason D Lee, Qihang Lin, Tengyu Ma, and Tianbao Yang. Distributed stochastic variance reduced gradient methods by sampling extra data with replacement. The Journal of Machine Learning Research, 18(1):4404 4446, 2017. Liping Li, Wei Xu, Tianyi Chen, Georgios B Giannakis, and Qing Ling. RSA: Byzantinerobust stochastic aggregation methods for distributed learning from heterogeneous datasets. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 1544 1551, 2019. Mu Li, David G Andersen, Alexander J Smola, and Kai Yu. Communication efficient distributed machine learning with the parameter server. In Advances in Neural Information Processing Systems, pages 19 27, 2014. Xiangru Lian, Ce Zhang, Huan Zhang, Cho-Jui Hsieh, Wei Zhang, and Ji Liu. Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 5330 5340, 2017. Qihang Lin, Zhaosong Lu, and Lin Xiao. An accelerated proximal coordinate gradient method. In Advances in Neural Information Processing Systems, pages 3059 3067, 2014. Ji Liu and Ce Zhang. Distributed learning systems with first-order methods. ar Xiv preprint ar Xiv:2104.05245, 2021. Chenxin Ma, Virginia Smith, Martin Jaggi, Michael Jordan, Peter Richt arik, and Martin Tak ac. Adding vs. averaging in distributed primal-dual optimization. In Proceedings of the International Conference on Machine Learning, pages 1973 1982, 2015. John Nguyen, Kshitiz Malik, Hongyuan Zhan, Ashkan Yousefpour, Mike Rabbat, Mani Malek, and Dzmitry Huba. Federated learning with buffered asynchronous aggregation. In Proceedings of the International Conference on Artificial Intelligence and Statistics, pages 3581 3607, 2022. Yang and Li Matthew Nokleby, Haroon Raja, and Waheed U Bajwa. Scaling-up distributed processing of data streams for machine learning. ar Xiv preprint ar Xiv:2005.08854, 2020. Jungwuk Park, Dong-Jun Han, Minseok Choi, and Jaekyun Moon. Sageflow: Robust federated learning against both stragglers and adversaries. In Advances in Neural Information Processing Systems, pages 840 851, 2021. Krishna Pillutla, Sham M Kakade, and Zaid Harchaoui. Robust aggregation for federated learning. ar Xiv preprint ar Xiv:1912.13445, 2019. Ning Qian. On the momentum term in gradient descent learning algorithms. Neural networks, 12(1):145 151, 1999. Shashank Rajput, Hongyi Wang, Zachary Charles, and Dimitris Papailiopoulos. Detox: A redundancy-based framework for faster and more robust gradient aggregation. In Advances in Neural Information Processing Systems, pages 10320 10330, 2019. Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1-2):83 112, 2017. Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech dnns. In Annual Conference of the International Speech Communication Association, 2014. Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14(Feb):567 599, 2013. Ohad Shamir, Nati Srebro, and Tong Zhang. Communication-efficient distributed optimization using an approximate newton-type method. In Proceedings of the International Conference on Machine Learning, pages 1000 1008, 2014. Weisong Shi, Jie Cao, Quan Zhang, Youhuizi Li, and Lanyu Xu. Edge computing: Vision and challenges. IEEE Internet of Things Journal, 3(5):637 646, 2016. Jy-yong Sohn, Dong-Jun Han, Beongjun Choi, and Jaekyun Moon. Election coding for distributed learning: Protecting signsgd against Byzantine attacks. In Advances in Neural Information Processing Systems, pages 14615 14625, 2020. Shizhao Sun, Wei Chen, Jiang Bian, Xiaoguang Liu, and Tie-Yan Liu. Slim-dp: a multi-agent system for communication-efficient distributed deep learning. In Proceedings of the 17th International Conference on Autonomous Agents and Multi Agent Systems, pages 721 729, 2018. Hongyi Wang, Kartik Sreenivasan, Shashank Rajput, Harit Vishwakarma, Saurabh Agarwal, Jy-yong Sohn, Kangwook Lee, and Dimitris Papailiopoulos. Attack of the tails: Yes, you really can backdoor federated learning. In Advances in Neural Information Processing Systems, pages 16070 16084, 2020. Buffered Asynchronous SGD for Byzantine Learning Jianqiao Wangni, Jialei Wang, Ji Liu, and Tong Zhang. Gradient sparsification for communication-efficient distributed optimization. In Advances in Neural Information Processing Systems, pages 1299 1309, 2018. Zhaoxian Wu, Qing Ling, Tianyi Chen, and Georgios B Giannakis. Federated variancereduced stochastic gradient descent with robustness to Byzantine attacks. IEEE Transactions on Signal Processing, 68:4583 4596, 2020. Lin Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11(Oct):2543 2596, 2010. Cong Xie, Sanmi Koyejo, and Indranil Gupta. Zeno: Distributed stochastic gradient descent with suspicion-based fault-tolerance. In Proceedings of the International Conference on Machine Learning, pages 6893 6901, 2019. Cong Xie, Oluwasanmi Koyejo, and Indranil Gupta. Fall of empires: Breaking Byzantinetolerant sgd by inner product manipulation. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, pages 261 270, 2020a. Cong Xie, Sanmi Koyejo, and Indranil Gupta. Zeno++: Robust fully asynchronous SGD. In Proceedings of the International Conference on Machine Learning, pages 10495 10503, 2020b. Tianbao Yang. Trading computation for communication: Distributed stochastic dual coordinate ascent. In Advances in Neural Information Processing Systems, pages 629 637, 2013. Yi-Rui Yang and Wu-Jun Li. BASGD: Buffered asynchronous SGD for Byzantine learning. In Proceedings of the International Conference on Machine Learning, pages 11751 11761, 2021. Zhixiong Yang, Arpita Gang, and Waheed U Bajwa. Adversary-resilient distributed and decentralized statistical inference and machine learning: An overview of recent advances under the Byzantine threat model. IEEE Signal Processing Magazine, 37(3):146 159, 2020. Dong Yin, Yudong Chen, Ramchandran Kannan, and Peter Bartlett. Byzantine-robust distributed learning: Towards optimal statistical rates. In Proceedings of the International Conference on Machine Learning, pages 5650 5659, 2018. Dong Yin, Yudong Chen, Ramchandran Kannan, and Peter Bartlett. Defending against saddle point attack in Byzantine-robust distributed learning. In Proceedings of the International Conference on Machine Learning, pages 7074 7084, 2019. Hao Yu, Rong Jin, and Sen Yang. On the linear speedup analysis of communication efficient momentum SGD for distributed non-convex optimization. In Proceedings of the International Conference on Machine Learning, pages 7184 7193, 2019a. Yang and Li Hao Yu, Sen Yang, and Shenghuo Zhu. Parallel restarted SGD with faster convergence and less communication: Demystifying why model averaging works for deep learning. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 5693 5700, 2019b. Lijun Zhang, Mehrdad Mahdavi, and Rong Jin. Linear convergence with condition number independent access of full gradients. In Advances in Neural Information Processing Systems, pages 980 988, 2013. Ruiliang Zhang and James Kwok. Asynchronous distributed admm for consensus optimization. In Proceedings of the International Conference on Machine Learning, pages 1701 1709, 2014. Shen-Yi Zhao, Ru Xiang, Ying-Hao Shi, Peng Gao, and Wu-Jun Li. SCOPE: scalable composite optimization for learning on spark. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 2928 2934, 2017. Shen-Yi Zhao, Gong-Duo Zhang, Ming-Wei Li, and Wu-Jun Li. Proximal SCOPE for distributed sparse learning. In Advances in Neural Information Processing Systems, pages 6551 6560, 2018. Shen-Yi Zhao, Yin-Peng Xie, and Wu-Jun Li. On the convergence and improvement of stochastic normalized gradient descent. Science China Information Sciences, 64:1 13, 2021. Shuxin Zheng, Qi Meng, Taifeng Wang, Wei Chen, Nenghai Yu, Zhi-Ming Ma, and Tie-Yan Liu. Asynchronous stochastic gradient descent with delay compensation. In Proceedings of the International Conference on Machine Learning, pages 4120 4129, 2017. Yi Zhou, Yingbin Liang, Yaoliang Yu, Wei Dai, and Eric P Xing. Distributed proximal gradient algorithm for partially asynchronous computer clusters. The Journal of Machine Learning Research, 19(1):733 764, 2018. Martin Zinkevich, Markus Weimer, Lihong Li, and Alex J Smola. Parallelized stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 2595 2603, 2010.