# Secure Distributed Training at Scale

Eduard Gorbunov*¹²³, Alexander Borzunov*⁴³, Michael Diskin⁴³, Max Ryabinin⁴³

*Equal contribution. ¹MIPT ²Mila – Quebec AI Institute ³Yandex ⁴HSE University. Correspondence to: Eduard Gorbunov, Alexander Borzunov. Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

**Abstract.** Many areas of deep learning benefit from using increasingly larger neural networks trained on public data, as is the case for pre-trained models for NLP and computer vision. Training such models requires a lot of computational resources (e.g., HPC clusters) that are not available to small research groups and independent researchers. One way to address this is for several smaller groups to pool their computational resources together and train a model that benefits all participants. Unfortunately, in this case, any participant can jeopardize the entire training run by sending incorrect updates, deliberately or by mistake. Training in the presence of such peers requires specialized distributed training algorithms with Byzantine tolerance. These algorithms often sacrifice efficiency by introducing redundant communication or passing all updates through a trusted server, making it infeasible to apply them to large-scale deep learning, where models can have billions of parameters. In this work, we propose a novel protocol for secure (Byzantine-tolerant) decentralized training that emphasizes communication efficiency.

## 1. Introduction

Many hard scientific problems were solved through collaboration between many nations, groups, and individuals. This is especially evident in the natural sciences, where researchers have formed multinational collaborations to run large-scale experiments and share compute infrastructure (Aad et al., 2012; Ruttley et al., 2017; Abbott et al., 2016). Projects like Folding@home (Beberg et al., 2009) and BOINC (Anderson, 2004) push this trend even further by recruiting volunteers who donate their compute to collectively run computational experiments at an unprecedented scale (Merritt, 2020).

Recently, similar techniques were proposed for deep learning. They aim to address the sheer computational complexity of many machine learning tasks, such as pretraining transformers for NLP (Devlin et al., 2019; Brown et al., 2020; Liu et al., 2019) or learning on huge datasets in vision (Sun et al., 2017; Kolesnikov et al., 2020; Goyal et al., 2021). Recent works (Kijsipongse et al., 2018; Ryabinin & Gusev, 2020; Atre et al., 2021; Diskin et al., 2021) propose several systems that share the computation across many volunteers donating the idle time of their computers to train large models on public datasets.

Despite their strengths, volunteer computing systems have so far seen limited practical applications (Kijsipongse et al., 2018). A major roadblock towards the global adoption of these techniques is trust in the reliability of each participant. In distributed training, all progress made by the collaboration can be undermined if a single peer sends incorrect outputs due to an error in computation (Smith, 2019) or malicious intent (Tolpegin et al., 2020). Prior art in decentralized optimization proposed several optimization algorithms that are resistant to such Byzantine faults.
However, most Byzantine-tolerant training protocols require either passing all updates through a trusted central server or exchanging additional messages that increase the network load by several times (Chen et al., 2018; Rajput et al., 2019). This is a major problem for large-scale distributed deep learning, where hundreds of peers must exchange updates for millions of parameters at regular intervals (Li et al., 2020; Sergeev & Balso, 2018; Shoeybi et al., 2019). Thus, in many practical scenarios, the computation and communication overhead of Byzantine-tolerant algorithms outweighs the benefits of collaborating with others.

In this work, we set out to solve this problem by proposing a novel Byzantine-tolerant distributed training protocol designed for large-scale deep learning workloads. Our approach combines the scalability and communication efficiency of modern distributed training techniques such as All-Reduce SGD (Sergeev & Balso, 2018) with resilience against Byzantine and Sybil attackers. To achieve this, we leverage cryptographic techniques to verify the integrity of training with a minimal overhead that does not depend on the model size. Our protocol does not require any specific peers to be trusted.

Our contributions can be summarized as follows:

- We propose a novel protocol for decentralized Byzantine-tolerant training on data available to all participants, where the extra communication cost does not depend on the number of parameters.
- We rigorously analyze this protocol and prove convergence bounds for convex and non-convex losses with Byzantine attackers. Furthermore, we derive accelerated convergence rates for the same task under realistic assumptions about model gradients.
- We propose a heuristic for resisting Sybil attacks from computationally constrained attackers, allowing us to accept new untrusted peers joining midway through training.
- We verify the effectiveness of our algorithm in controlled experiments¹ and actual large-scale training runs. Specifically, we start with ResNet-18 for CIFAR-10 classification and follow up with pretraining ALBERT-large in a setup where almost half of all peers are malicious.

¹ Source code for the experiments is available at https://github.com/yandex-research/btard

## 2. Related work

### 2.1. Distributed deep learning

Training modern neural networks often requires an amount of computation that is infeasible to achieve on any single machine, so one has to train such models on multiple machines using methods for distributed training. Most of these methods fall into two groups: in data-parallel training, each worker trains the entire model by sampling batches from the training data (Sergeev & Balso, 2018; Goyal et al., 2017); in contrast, model-parallel training allocates parts of the model on different workers (Huang et al., 2019; Narayanan et al., 2019; Shoeybi et al., 2019). In this study, we consider only the first group; notably, most model-parallel systems still rely on data parallelism between nodes at the same stage (Rajbhandari et al., 2020; Narayanan et al., 2021).

Usually, data-parallel training consists of two phases: first, each worker computes the gradients over its data; then, all workers aggregate the gradients and run an SGD step. The simplest aggregation strategy is known as Parameter Servers (PS) (Li, 2014; Dean et al., 2012; Recht et al., 2011): one of the servers stores and updates the model parameters, while all others iteratively compute the gradients, send them to the PS, and download the updated parameters.
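To make this workflow concrete, here is a minimal sketch of one synchronous PS step; the function name and the NumPy-based setting are our own illustration, not code from any of the cited systems.

```python
import numpy as np

def parameter_server_step(params, worker_grads, lr=0.1):
    """One synchronous PS step: the server averages the gradients pushed by
    the workers, applies an SGD update, and the workers then pull the new
    parameters. The server sends/receives O(d*n) data per step."""
    g = np.mean(worker_grads, axis=0)  # naive mean aggregation on the server
    return params - lr * g             # updated parameters, downloaded by all

# Example: d = 4 parameters, n = 3 workers
params = np.zeros(4)
grads = [np.ones(4), 2 * np.ones(4), 3 * np.ones(4)]
params = parameter_server_step(params, grads)  # -> [-0.2, -0.2, -0.2, -0.2]
```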
This strategy can be quite efficient with a small number of workers; as their number grows, the parameter server eventually becomes unable to handle the load. While gradient compression (Seide et al., 2014; Lin et al., 2018; Mishchenko et al., 2019; Koloskova et al., 2020; Gorbunov et al., 2020; 2021) and local updates (Zinkevich et al., 2010) partially address this issue, it still remains a bottleneck of such methods.

Figure 1. A scheme of Butterfly All-Reduce (Li et al., 2017). Each peer transfers only O(d) data when averaging a vector of size d.

In practice, most distributed training systems leverage All-Reduce (AR) (Goyal et al., 2017; Mikami et al., 2019; You et al., 2020), a family of collective communication protocols that allow servers to average their data and receive the result on each machine. The resulting method, named All-Reduce SGD (AR-SGD), runs AR on the local gradients of each peer to compute the global average. Usually, AR-SGD uses bandwidth-optimal versions of All-Reduce (Sergeev & Balso, 2018; Patarasuk & Yuan, 2009), such as Butterfly All-Reduce (see Figure 1). Depending on the exact algorithm, they require each peer to transfer only O(d) or O(d log n) data when averaging a vector of size d across n peers (unlike the PS-based approaches, where the PS transfers O(dn) data).

### 2.2. Byzantine-tolerant optimization

Standard distributed training methods are not robust against Byzantine attacks: in vanilla parallel SGD, one malicious worker can break the convergence of the whole method by shifting the mean of the resulting vector in an arbitrary way. Therefore, the research community invented special algorithms that can train models even in this setup.

**Parameter-server (PS) based approaches.** Most of the algorithms designed to be Byzantine-resilient rely on the existence of a trusted parameter server. In such approaches, the standard mean estimator, e.g., the one used in parallel SGD, is typically replaced with a more robust aggregation rule (Blanchard et al., 2017; Yin et al., 2018; Damaskinos et al., 2019; El Mhamdi et al., 2018; Pillutla et al., 2019). However, recent works show that this is not enough: they propose special types of Byzantine attacks (Baruch et al., 2019; Xie et al., 2020) and show that permutation-invariant algorithms cannot converge to any predefined accuracy of the solution (Karimireddy et al., 2020). Although there are several approaches aiming to circumvent this issue, most of them have significant limitations, such as no convergence analysis (Chen et al., 2018; Rajput et al., 2019; Rodríguez-Barroso et al., 2020; Xu & Lyu, 2020), too restrictive assumptions in the analysis (Alistarh et al., 2018; Allen-Zhu et al., 2021; Regatti et al., 2020), or the usage of variance-reduced estimators (Wu et al., 2020), which are known to converge slowly in deep learning applications (Defazio & Bottou, 2019). The only paper without such limitations is Karimireddy et al. (2020): it proposes a new aggregation rule called CENTEREDCLIP, applies it to SGD with client momentum, and proves convergence results for the obtained method in the non-convex case under reasonable assumptions. We provide more details on Byzantine-tolerant PS-based approaches in Appendix A.1.1.
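To illustrate the CENTEREDCLIP rule of Karimireddy et al. (2020), here is a minimal NumPy sketch of its fixed-point iteration; the stopping rule, default tolerances, and parameter names are our own choices rather than the reference implementation.

```python
import numpy as np

def centered_clip(gradients, tau, eps=1e-6, max_iters=1000):
    """Robust aggregation: iterate v <- v + mean_i (g_i - v) * min(1, tau/||g_i - v||)
    until the update is smaller than eps (a fixed-point iteration)."""
    v = np.mean(gradients, axis=0)  # starting point; any vector works
    for _ in range(max_iters):
        residuals = gradients - v                        # shape (n, d)
        norms = np.linalg.norm(residuals, axis=1)        # ||g_i - v||
        scale = np.minimum(1.0, tau / np.maximum(norms, 1e-12))
        step = np.mean(residuals * scale[:, None], axis=0)
        v = v + step
        if np.linalg.norm(step) < eps:
            break
    return v
```

With a large τ almost nothing is clipped and the rule behaves like the plain mean, while a small τ pushes it towards a geometric-median-like aggregate; this trade-off reappears in the experiments of Section 4.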
**Decentralized approaches** for Byzantine-tolerant optimization are studied only in a few papers. Unfortunately, the known approaches are not well-suited for distributed deep learning: they either rely on full gradient computations (Yang & Bajwa, 2019a;b), use redundant communications with multiple servers (El-Mhamdi et al., 2020), require peer-to-peer communication of full vectors at each step (Gupta et al., 2021; Gupta & Vaidya, 2021), which is not scalable, or provide convergence guarantees that are inferior to non-parallel SGD (Peng et al., 2021), which has prohibitively slow convergence on modern deep learning tasks. We defer further details to Appendix A.1.2.

### 2.3. Security in distributed systems

**Message propagation protocols.** In this work, we consider distributed systems relying exclusively on peer-to-peer connections (e.g., the ones working over the Internet). Several key stages of our algorithm require peers to broadcast small messages to all other peers. For the sake of consistency, if at least one honest peer receives a message, we expect all other honest peers to eventually receive it as well. A naive solution would be for all peers to relay each previously unseen message to all other peers. In this case, for n peers and a b-bit message, one all-to-all broadcast would require each peer to transfer O(n²b) data. To improve efficiency, we use GossipSub (Vyzovitis et al., 2020), which reduces this cost to O(nb) data per peer by relaying each message to only D carefully chosen neighbors, where D is a constant chosen based on latency requirements.

**Digital signatures.** Our approach relies on the fact that an attacker can neither impersonate an honest peer nor change the messages an honest peer broadcasts. To achieve that, we require all peers to declare their public keys and sign all their messages with digital signatures (Rivest et al., 1978).

**Multi-party random number generator.** To ensure that peers compute gradients honestly, our approach verifies a random subset of all computed gradients. Thus, we need to choose who is going to be checked in such a way that the attackers can neither predict nor influence the random draw. This can be done with a multi-party random number generator (MPRNG) based on the coin tossing protocol from Blum (1983). We explain the full algorithm in Appendix A.2. The algorithm requires each peer to only broadcast 3 scalars, so its communication cost is O(n) data per peer.
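As an illustration of the commit-and-reveal idea behind such a generator, the sketch below mimics a Blum-style coin toss among n peers in a single process; networking, signatures, and abort handling are elided, and all names are ours (the actual MPRNG is specified in Appendix A.2).

```python
import hashlib
import secrets

def commit(value: bytes, salt: bytes) -> str:
    # A binding commitment: the hash reveals nothing about `value`,
    # but the committer cannot change `value` after publishing the hash
    return hashlib.sha256(salt + value).hexdigest()

# Phase 1: every peer draws a local random value and broadcasts only a commitment.
n = 4
values = {i: (secrets.token_bytes(32), secrets.token_bytes(16)) for i in range(n)}
commitments = {i: commit(v, s) for i, (v, s) in values.items()}

# Phase 2: once all commitments are received, peers reveal (value, salt);
# everyone checks each reveal against the earlier commitment before accepting it.
shared = bytes(32)
for i, (value, salt) in values.items():
    assert commit(value, salt) == commitments[i], f"peer {i} equivocated"
    shared = bytes(a ^ b for a, b in zip(shared, value))  # XOR-combine all values

print(shared.hex())  # every honest peer obtains the same unpredictable output
```

Because each peer's value is fixed before any value is revealed, no minority of peers can steer the XOR of all values; the same shared output can then seed deterministic choices such as CHOOSEVALIDATORS in Algorithm 1 below.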
## 3. Method

We consider secure distributed training on public datasets, where each peer can access the entire training data and communicate with any other peer. In this scenario, multiple parties cooperate by combining their computational resources for a single large-scale training run. Specifically, we consider a data-parallel training setup with All-Reduce SGD (as described in Section 2.1), where peers aggregate their gradient vectors of size d. We describe our strategy in several stages: Section 3.1 outlines our approach for Byzantine-Tolerant All-Reduce (BTARD). In Section 3.2, we formulate the underlying optimization problem and derive its convergence bounds. In Section 3.3, we propose a heuristic for resisting Sybil attacks, allowing our system to accept new untrusted peers midway through training.

### 3.1. Byzantine-Tolerant All-Reduce

We assume that some workers can be malicious, i.e., they can arbitrarily deviate from our algorithm: for instance, send arbitrary vectors instead of stochastic gradients or violate the communication protocol. Such workers are called Byzantine nodes, or just Byzantines. We assume them to be omniscient (Karimireddy et al., 2020) (except for the honest nodes' private keys and the internals of MPRNG) and able to collude with each other. We denote the set of all good workers as $\mathcal{G}$ and the set of Byzantine workers as $\mathcal{B}$. We further assume that $\mathcal{B}$ is fixed throughout the optimization process and that less than half of the nodes are Byzantine: $|\mathcal{B}| \le \delta n$, where $\delta \in [0, 1/2)$. Finally, we assume that all workers have access to the data defining the objective function, sampling minibatches from the full dataset.²

We design our algorithm in such a way that all types of Byzantine faults have a limited effect and a chance of being discovered. To limit the damage over a single SGD step, we modify Butterfly All-Reduce³ (see Figure 1) with a robust aggregation technique known as CENTEREDCLIP (Karimireddy et al., 2020). We apply CENTEREDCLIP to each partition of the gradient vector instead of naive averaging. We denote this procedure as BUTTERFLYCLIP (see Algorithm 2 for its formal description).

² He et al. (2020) show that it is impossible to achieve any predefined accuracy of the solution without this assumption, i.e., in the heterogeneous case (see the discussion in Appendix E.2).

³ We choose Butterfly All-Reduce so that peers aggregate non-overlapping parts of the gradient vector. This helps to identify the attacker if the gradients are aggregated incorrectly. Jiang et al. (2020) report that Butterfly All-Reduce is near-optimal for distributed training over high-latency networks such as the Internet.

Figure 2. A scheme illustrating one step of Byzantine-Tolerant All-Reduce, a part of Algorithm 1 executed between consecutive SGD steps: outliers are clipped, and peers verify that CENTEREDCLIP was performed correctly using random projections. Here, t is the step number, $x_t$ is the model weights, and $\xi^t_i$ is a publicly known random seed for sampling a minibatch.

However, Byzantine peers can circumvent this limit by attacking over many iterations. To protect against this, BTARD periodically chooses random peers to serve as validators. The validators must recalculate the gradients of other peers and report any discrepancies instead of computing their own gradients. Since such tests are effective only if the attackers cannot predict whether they will be validated, we use a multi-party random number generator (as described in Section 2.3) to choose the validated peers. After each training step, peers use MPRNG to choose m validators and m peers to validate (each validator checks one peer). The Byzantines cannot predict safe iterations before they commit to an attack. Thus, more frequent attacks (with greater total damage) are more likely to be detected by an honest validator.

Since validators can also be malicious, BTARD uses a special ACCUSE procedure to detect false reports. Before each averaging round, peers broadcast hashes of their gradients using GossipSub⁴ (line 2 of Algorithm 2). Then, if validator i accuses peer j of modifying gradients, all other peers will be able to recalculate j's gradients and compare their hash against the broadcasted one. If the peers find that j's gradients are correct, peer i is banned instead (Hammurabi & Harper, 1904). This procedure is described in Algorithm 3.

⁴ We assume that peers declare their public key when joining and sign all broadcasted messages with the respective private key. Any peer broadcasting contradicting messages (e.g., different gradient hashes) should be banned, since it could break the eventual consistency (other peers may receive them in a different order).
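Putting the aggregation part of this subsection together, the following single-process sketch mimics the data flow of BUTTERFLYCLIP: each peer robustly aggregates one non-overlapping partition, and the full gradient is merged from all partitions. Hashing, broadcasts, and failure handling are omitted, and `centered_clip` is the earlier sketch from Section 2.2.

```python
import numpy as np

def butterfly_clip(peer_grads, tau):
    """Single-process simulation of BUTTERFLYCLIP (cf. Algorithm 2): peer j
    receives everyone's j-th partition, aggregates it with CENTEREDCLIP,
    and the merged result is assembled from the n aggregated partitions."""
    n = len(peer_grads)
    parts = [np.array_split(g, n) for g in peer_grads]   # parts[i][j] is g_i[j]
    aggregated = [
        centered_clip(np.stack([parts[i][j] for i in range(n)]), tau)
        for j in range(n)                                # peer j aggregates part j
    ]
    return np.concatenate(aggregated)                    # MERGE(ĝ_1, ..., ĝ_n)
```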
The resulting algorithm is resilient to attacks made through incorrect gradients. However, malicious peers may also harm training by violating the CENTEREDCLIP procedure for the portion of gradients they are aggregating. Fortunately, we can design a test through which peers can verify that a vector they received is indeed the output of CENTEREDCLIP. We need to view CENTEREDCLIP as a fixed-point iteration for the equation (see details in Appendix D.2):

$$\sum_{i=1}^{n} (g_i - x)\,\min\left\{1, \frac{\tau}{\|g_i - x\|}\right\} = 0. \tag{1}$$

The workers are not able to test whether (1) holds directly, since collecting $g_i$ would require sending O(dn) data per peer, defeating the purpose of our algorithm. Instead, workers should use the MPRNG output to sample a random direction $z$ in the space of model gradients. Then, each peer computes and broadcasts the inner product

$$s_i = \left\langle z,\ (g_i - x)\,\min\left\{1, \frac{\tau}{\|g_i - x\|}\right\}\right\rangle. \tag{2}$$

Finally, all peers can verify that $\sum_{i=1}^{n} s_i = 0$. Similarly to our previous use of MPRNG, all aggregators must broadcast the hashes of their aggregation results (line 6 of Alg. 2) before they learn $z$. This ensures that a malicious aggregator cannot modify the results in such a way that the difference would be orthogonal to $z$ (this and more complex attack vectors are analyzed in Appendices C and D.5).

We combine all these procedures in Algorithm 1 (see its scheme in Figure 2 and its detailed version in Alg. 6–7).

```
Algorithm 1  BTARD-SGD for peer i (informal)
Input: rank i, model x_0, seed ξ_i^0, step count T, peer count n
 1: for t = 0, ..., T−1 do
 2:     g_i = COMPUTEGRADIENTS(x_t, ξ_i^t)
 3:     ĝ = BUTTERFLYCLIP(i, g_i)
 4:     r_t = MPRNG()
 5:     z = GETRANDOMVECTOR(r_t)
 6:     for j = 1, ..., n do
 7:         // ĝ[j] is the aggregated part from peer j
 8:         Δ_i^j = (g_i[j] − ĝ[j]) · min{1, τ / ‖g_i[j] − ĝ[j]‖_2}
 9:         broadcast s_i^j = ⟨z[j], Δ_i^j⟩
10:     for j = 1, ..., n do
11:         // We know Δ_j^i from CENTEREDCLIP
12:         if s_j^i ≠ ⟨z[i], Δ_j^i⟩ then
13:             broadcast "s_j^i is wrong"   // invokes Alg. 3
14:         if Σ_t s_t^j ≠ 0 then
15:             // peer j lied that all s^j are correct
16:             broadcast "ĝ[j] is wrong"    // invokes Alg. 3
17:     x_{t+1} = SGDSTEP(x_t, ĝ)
18:     ξ_i^{t+1} = hash(r_t ‖ i)
19:     if i ∈ CHOOSEVALIDATORS(r_t) then
20:         j = CHOOSETARGET(r_t, i)
21:         VALIDATEPEER(j, x_t, ξ_j^t, c_j, h*_j, s*_j)
22:         // ...instead of computing gradients for step t+1
23: return x_T
```

Crucially, one step of Algorithm 1 requires each peer to (a) receive and send n partitions of the gradient vector (exactly as in Butterfly All-Reduce), (b) broadcast O(n) scalar values (all hashes, the inner products $s_i^j$, and the necessary accusations), and (c) run MPRNG once. According to Sections 2.1 and 2.3, the total communication cost of these procedures is O(d + n²) data per peer. This is close to the O(d) cost of the bandwidth-optimal versions of All-Reduce: the O(n²) extra cost is usually much smaller than O(d) for models that benefit from distributed training. The algorithm's synchronization points and computational overhead are reviewed in Appendix B.
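For illustration, the check of equations (1)–(2) can be sketched as follows; in the real protocol each peer computes only its own $s_i$ locally and the values are broadcast, whereas here everything happens in one process, and all names are ours.

```python
import numpy as np

def clipped_residuals(peer_grads, aggregate, tau):
    # Δ_i = (g_i − x) · min{1, τ / ||g_i − x||} at the reported output x
    res = peer_grads - aggregate
    norms = np.linalg.norm(res, axis=1, keepdims=True)
    return res * np.minimum(1.0, tau / np.maximum(norms, 1e-12))

def projection_test(peer_grads, aggregate, tau, seed, tol=1e-4):
    """Verify a CENTEREDCLIP output via a random projection: at the true
    fixed point the clipped residuals sum to zero, so the broadcast scalars
    s_i = <z, Δ_i> must also sum to (numerically) zero for a direction z
    that the aggregator could not have known when committing to its result."""
    z = np.random.default_rng(seed).standard_normal(aggregate.shape)
    s = clipped_residuals(peer_grads, aggregate, tau) @ z  # the scalars s_i
    return abs(s.sum()) < tol * np.linalg.norm(z)
```

The commit-before-z ordering is essential: an aggregator that already knew z could pick a corruption orthogonal to it, which is exactly the attack ruled out by broadcasting the hashes first.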
### 3.2. Convergence analysis

From the perspective of optimization theory, our task is the expectation minimization problem:

$$\min_{x \in Q \subseteq \mathbb{R}^d} \left\{ f(x) := \mathbb{E}_{\xi \sim \mathcal{D}}\left[ f(x, \xi) \right] \right\} \tag{3}$$

Here, the objective function f is smooth and uniformly lower bounded, $Q \subseteq \mathbb{R}^d$ is a closed convex set of admissible parameters, and ξ is the source of stochasticity, such as minibatch indices. We assume that the problem (3) can be solved in a distributed manner, i.e., one can use n workers calculating (mini-batched) stochastic gradients in parallel and communicating according to some protocol. We denote the set of workers as $[n] := \{1, 2, \ldots, n\} = \mathcal{G} \cup \mathcal{B}$.

There are many ways for Byzantines to affect the training. We can classify all of them into four categories: (a) gradient attacks, where Byzantines modify their $g_i^k$ but otherwise behave normally; (b) aggregation attacks, where a malicious aggregator returns a wrong $\hat{g}_i$ and relies on others to cover it up by misreporting $s_i$; (c) reputation attacks, such as slander via a false ACCUSE(i, j, ·); and (d) protocol violations, that is, any other deviations from the steps of Algorithm 1 (e.g., refusing to send data within a predefined timeout). We elaborate on each attack type in Appendix C.

For the purpose of this analysis, the latter two attacks can be repelled with an extra policy that allows an active worker to ELIMINATE any other worker at the cost of also being banned. If peer i encounters a protocol violation from peer j, it broadcasts a message asking to remove both peers i and j from training. The design of this policy ensures that every such message, whether sent by honest or Byzantine peers, eliminates at least 1 Byzantine peer and at most 1 honest peer (see details in Appendix D.3). Thus, if a Byzantine minority uses this against honest peers, it will only decrease their relative numbers: $(\delta n - 1)/(n - 2) < \delta$. For example, with n = 16 peers and δn = 7 Byzantines, one such elimination leaves 6 Byzantines among 14 peers, lowering their fraction from 7/16 ≈ 0.44 to 6/14 ≈ 0.43. This leaves us only with the attacks targeting the aggregated gradients.

```
Algorithm 2  BUTTERFLYCLIP for peer i
Input: rank i, gradients g_i ∈ R^d
1: g_i[1], ..., g_i[n] = SPLIT(g_i, n)
2: broadcast ∀j, h_i^j = hash(g_i[j])
3: send ∀j, g_i[j] → peer_j
4: receive ∀j, g_j[i] ← peer_j    // and verify against h_j^i
5: ĝ_i = CENTEREDCLIP(g_1[i], ..., g_n[i])
6: broadcast ĥ_i = hash(ĝ_i)
7: send ∀j, ĝ_i → peer_j
8: receive ∀j, ĝ_j ← peer_j       // and verify against ĥ_j
9: return MERGE(ĝ_1, ..., ĝ_n)
```

```
Algorithm 3  ACCUSE(i, j), invoked on all peers
Input: accuser i, target j
1: g_j = COMPUTEGRADIENTS(x_t, ξ_j^t)
2: if ∃k: (hash(g_j[k]) ≠ h_j^k or s_j^k ≠ ⟨z[k], Δ_j^k⟩) or Σ_{k=1}^n s_k^j ≠ 0 then
3:     BAN(peer_j)    // and everyone who covered it up
4: else
5:     BAN(peer_i)
```

We provide convergence guarantees for variants of BTARD-SGD with $Q = \mathbb{R}^d$ under different sets of assumptions about the function f and its stochastic gradients. Our first two setups assume that:

**Assumption 3.1.** There exist a constant $\sigma \ge 0$ and $s_0 \in [d]$ such that for any set of indices $S = (i_1, \ldots, i_s)$, $1 \le i_1 < i_2 < \ldots < i_s \le d$, $s \ge s_0$, the stochastic gradient $\nabla f(x, \xi)$ satisfies

$$\mathbb{E}\left[\nabla f(x, \xi)\right] = \nabla f(x), \qquad \mathbb{E}\left[\left\|\nabla_{[S]} f(x, \xi) - \nabla_{[S]} f(x)\right\|^2\right] \le \frac{s\sigma^2}{d}, \tag{4}$$

where $\nabla_{[S]} f(x, \xi) = (\nabla_{i_1} f(x, \xi), \ldots, \nabla_{i_s} f(x, \xi))^\top$, $\nabla_{[S]} f(x) = (\nabla_{i_1} f(x), \ldots, \nabla_{i_s} f(x))^\top$, and $\nabla_j f(x, \xi)$, $\nabla_j f(x)$ are the j-th components of $\nabla f(x, \xi)$ and $\nabla f(x)$ respectively.

Here, (4) is an extension of the classical uniformly bounded variance (UBV) assumption (Nemirovski et al., 2009; Ghadimi & Lan, 2012; 2013), ensuring that the noise in all subvectors of large enough dimension has a variance that depends on the ratio between the dimension s of the subvector and the dimension d of the full vector. For example, it holds when the noise is isotropic. Moreover, one can relax this assumption to the standard UBV assumption if the blocks for aggregation in BTARD are chosen uniformly at random (see Appendix E.3.1).
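As a quick sanity check that (4) indeed extends the classical assumption, taking the full index set $S = (1, \ldots, d)$ (so $s = d$) recovers the usual UBV bound:

$$\mathbb{E}\left[\left\|\nabla f(x, \xi) - \nabla f(x)\right\|^2\right] \;\le\; \frac{d}{d}\,\sigma^2 \;=\; \sigma^2.$$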
In order to further reduce the overhead from Verification 3 in the full Algorithm 6, we also assume that the stochastic gradient distributions have sub-quadratically decreasing tails (see Appendix E.3.1):

**Assumption 3.2.** There exist a constant $\sigma \ge 0$ and $s_0 \in [d]$ such that for any set of indices $S = (i_1, \ldots, i_s)$, $1 \le i_1 < i_2 < \ldots < i_s \le d$, $s \ge s_0$, and any $t > 0$, the stochastic gradient $\nabla f(x, \xi)$ satisfies a tail bound, decreasing sub-quadratically in t, on $\left\|\frac{1}{k}\sum_{i=1}^{k} \nabla_{[S]} f(x, \xi_i) - \nabla_{[S]} f(x)\right\|$ (see Appendix E.3.1 for the precise statement), where $\xi_1, \ldots, \xi_k$ are i.i.d. samples from $\mathcal{D}$, and $\nabla_{[S]} f(x, \xi)$, $\nabla_{[S]} f(x)$ are defined in As. 3.1.

Under these assumptions, we derive the following convergence bounds for strongly convex, generally convex, and non-convex objectives (see Table 1). The respective proofs and further details are deferred to Appendix E.3.

Table 1. Summary of complexity bounds for BTARD-SGD in different scenarios. By complexity we mean the number of iterations sufficient to find a point $\hat{x}$ such that $\mathbb{E}[\|\nabla f(\hat{x})\|^2] \le \varepsilon^2$ for non-convex problems and $\mathbb{E}[f(\hat{x}) - f(x^*)] \le \varepsilon$ for convex and µ-strongly convex problems (see Def. E.2), with $x^*$ being the solution. Notation: known $\lvert B^a_k\rvert$ — the exact number of attacking Byzantine workers at iteration k is known to each participant; L — smoothness constant (see Def. E.1); $\Delta_0 = f(x_0) - f_*$; $f_*$ — uniform lower bound for f; $\sigma^2$ — variance parameter from As. 3.1; n — the initial number of peers; b — the initial number of Byzantine workers; $\delta = b/n$; m — number of peers checked at each iteration; $R_0 = \|x_0 - x^*\|$.

| Assumptions | Non-convex | Convex | µ-strongly convex |
| --- | --- | --- | --- |
| As. 3.1 + As. 3.2 + known $\lvert B^a_k\rvert$ | $\frac{L\Delta_0}{\varepsilon^2} + \frac{L\Delta_0\sigma^2}{n\varepsilon^4} + \frac{n\delta\sigma^2}{m\varepsilon^2}$ | $\frac{LR_0^2}{\varepsilon} + \frac{\sigma^2 R_0^2}{n\varepsilon^2} + \frac{n\delta\sigma R_0}{m\varepsilon}$ | $\frac{L}{\mu}\log\frac{\mu R_0^2}{\varepsilon} + \frac{\sigma^2}{n\mu\varepsilon} + \frac{n\delta\sigma}{m\sqrt{\mu\varepsilon}}$ |
| As. 3.1 + As. 3.2 | $\frac{L\Delta_0}{\varepsilon^2} + \frac{L\Delta_0\sigma^2}{n\varepsilon^4} + \frac{n^2\delta\sigma^2}{m\varepsilon^2}$ | $\frac{LR_0^2}{\varepsilon} + \frac{\sigma^2 R_0^2}{n\varepsilon^2} + \frac{n^2\delta\sigma R_0}{m\varepsilon}$ | $\frac{L}{\mu}\log\frac{\mu R_0^2}{\varepsilon} + \frac{\sigma^2}{n\mu\varepsilon} + \frac{n^2\delta\sigma}{m\sqrt{\mu\varepsilon}}$ |

**Discussion of the convergence bounds.** Let us briefly discuss the main properties of the derived results. When δ = 0 (there are no Byzantine peers), we recover the tightest known rates for parallel SGD for strongly convex, generally convex, and non-convex objectives with both sets of assumptions. Next, we notice that in all complexity bounds in the known $\lvert B^a_k\rvert$ case, the term depending on the ratio of Byzantine workers δ (the third one in all bounds) has a better dependence on the accuracy ε of the solution than the classical variance term (the second one in all bounds). Therefore, for sufficiently small ε, the derived complexity bounds are the same as in the case when there are no Byzantine workers and parallel SGD is used. However, these bounds are obtained under the assumption that all participants know the exact number of attacking Byzantine workers at each iteration, which is not realistic but helps to better adjust the clipping parameter τ in CENTEREDCLIP.

As for the more general case, the third term is much worse than the corresponding term in the previous setup. Nevertheless, the term that depends on the ratio of Byzantine workers δ has the same dependence on ε as in the known $\lvert B^a_k\rvert$ case. This implies that for sufficiently small ε the derived complexity bounds are the same as in the case when there are no Byzantine workers and parallel SGD is used. We provide the complete formulations and proofs in Appendix E.3.

Finally, the derived convergence results are superior to the previous state-of-the-art ones, even in the PS setup, if ε is sufficiently small. For example, in the non-convex case, Karimireddy et al. (2020) show the $O(1/\varepsilon^2 + \sigma^2/(n\varepsilon^4) + \delta\sigma^2/\varepsilon^4)$⁵ complexity bound for a version of MOMENTUM-SGD that uses the CENTEREDCLIP aggregation rule when a PS is available.
When there is at least one Byzantine peer, the leading term in the above bound is $O(\delta\sigma^2/\varepsilon^4)$, since $\delta \ge 1/n$. In contrast, when ε is sufficiently small, i.e., $\varepsilon \le O\big(\sqrt{L\Delta_0 m/(n^3\delta\sigma^2)}\big)$ (see Table 1), the leading term in our bound is $O(\sigma^2/(n\varepsilon^4))$, which is better than $O(\delta\sigma^2/\varepsilon^4)$. However, it is worth mentioning the differences between the setups. Although we do not assume the existence of a PS, our algorithm and theoretical analysis rely on using part of the workers to check the computations of some other workers, and we allow bans of the participants. This is the key feature of our algorithm that allows us to obtain the improvement. See the detailed comparison with other works in Appendix A.1.

⁵ For simplicity, we omit numerical factors, logarithmic terms depending on the parameters of the problem, and factors quantifying the suboptimality of the starting point, i.e., $R_0 = \|x_0 - x^*\|$ and $f(x_0) - \inf_{x \in \mathbb{R}^d} f(x)$.

**Intuition behind the proofs.** First, we show that all possible violations of our protocol either lead to an instant ban of a Byzantine peer or (with some positive probability) to a ban during the checks of computations following each iteration of the algorithm (Appendix D.5). Next, we upper-bound the (expected squared) shifts that Byzantines can create at each iteration (Lemmas E.3 and E.4). Therefore, in expectation, Byzantines can deviate from the protocol only a finite number of times, and the power of their attacks is limited at each particular iteration. Using these results, we analyze BTARD-SGD as parallel SGD with shifted updates, where the shifts are bounded and exist for a finite number of steps (see Appendices E.3.3–E.3.5).

**Results for heavy-tailed problems.** So far, all our convergence results rely on As. 3.2, i.e., on the stochastic gradients not having too heavy tails. This assumption holds for many real-world neural networks. However, there are important NLP tasks, such as BERT training (Zhang et al., 2020), where the noise in the stochastic gradient has such heavy tails that As. 3.2 becomes unrealistic. The third and final setup in our analysis aims to address such heavy-tailed problems with BTARD-CLIPPED-SGD (Algorithm 9 in Appendix E.4). We analyze the method under the assumption that the α-th moments of the stochastic gradients are uniformly upper-bounded for some $\alpha \in (1, 2]$. We notice that for α < 2 this assumption allows the variance of the stochastic gradient to be unbounded. In this setting, we prove that BTARD-CLIPPED-SGD finds an ε-solution of the convex problem after $O\big(\varepsilon^{-\alpha/(\alpha-1)}\big(1 + (n\delta/m)^{\alpha/(\alpha-1)}\big)\big)$ iterations when the number of attacking Byzantine peers is known at each iteration, and after $O\big(\varepsilon^{-\alpha/(\alpha-1)}\big(1 + (n^2\delta^2/m)^{\alpha/(\alpha-1)}\big)\big)$ iterations otherwise. One can find the full statements and complete proofs of our results in Appendix E.

### 3.3. Resisting Sybil attacks

The algorithm described in Section 3.1 operates with a predefined list of peers that can only decrease in size. However, many real-world scenarios would benefit from new peers joining midway through training. Unfortunately, this exposes the system to Sybil attacks (Douceur, 2002), where a single computationally constrained attacker adopts multiple pseudonymous identities in order to establish a dishonest majority and break the algorithm. To handle this, one may augment BTARD with a heuristic protocol that dictates how new peers can join the experiment. A new participant must prove that it has honestly computed enough gradients over multiple consecutive iterations before it is allowed to actually contribute to the training. This ensures that the influence of Sybil attackers is proportional to their computing power (see details in Appendix F).
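The gist of this heuristic can be sketched as simple bookkeeping on the side of the existing peers; the probation length and all names below are illustrative assumptions of ours, and the actual joining protocol is specified in Appendix F.

```python
from dataclasses import dataclass

TRIAL_STEPS = 100  # assumed probation length; the paper does not fix a value here

@dataclass
class Newcomer:
    verified_steps: int = 0  # consecutive steps whose gradients passed validation

    def on_validated_step(self, passed: bool) -> None:
        # Any failed check resets the counter: the honest work must be contiguous
        self.verified_steps = self.verified_steps + 1 if passed else 0

    def may_contribute(self) -> bool:
        # Admit the peer into gradient aggregation only after the probation,
        # so each Sybil identity costs its owner real gradient computation
        return self.verified_steps >= TRIAL_STEPS
```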
## 4. Experiments

### 4.1. CIFAR10 classification

First, we evaluate our approach in controlled conditions. Our setup is a ResNet-18 (He et al., 2015) model trained to solve the CIFAR-10 classification task (Krizhevsky et al.). We train the model on 16 peers (each peer processes 8 samples per batch) using SGD with Nesterov (1983) momentum and a cosine annealing learning rate (Loshchilov & Hutter, 2017). We use a tuned setup achieving 93.5% test accuracy.

Our method has a hyperparameter τ responsible for the clipping strength in CENTEREDCLIP. We experiment with τ = 10 (weaker clipping) and τ = 1 (stronger clipping). These values were chosen based on the maximal standard deviation of the gradient parts averaged by the workers during normal training, so that almost no vectors are clipped in the weaker clipping scenario and almost half of the vectors are clipped in the stronger clipping scenario. We begin by using only 1 validator at each step. If a validator happens to be Byzantine, it never accuses its peers.

We compare our method to regular All-Reduce without clipping and to baselines that use a trusted parameter server: the original variant of CENTEREDCLIP (Karimireddy et al., 2020), and the coordinate-wise and geometric medians. Some other popular robust aggregation techniques are omitted because they were shown to be inferior in Karimireddy et al. (2020). We run all iterative algorithms (such as CENTEREDCLIP) to convergence with $\epsilon = 10^{-6}$, as we have found that limiting the number of iterations can significantly decrease the final model quality (see Figure 9 in Appendix I.1).

In addition to measuring training convergence, we evaluate our setup in the presence of malicious peers. To test pessimistic conditions, we pick a setting where 7 of 16 peers are Byzantine (see Appendix I.1 for a setup with 3 Byzantines). We experiment with the following attack types (a sketch of several of them follows the list):

- **SIGN FLIPPING:** each attacker sends the opposite of its true gradient.
- **RANDOM DIRECTION:** all attackers send large vectors pointed in a common random direction.
- **LABEL FLIPPING:** each attacker computes its gradient based on the cross-entropy loss with flipped labels. For CIFAR-10, we replace label $l \in \{0, \ldots, 9\}$ with $9 - l$.
- **DELAYED GRADIENT:** attackers send their real gradients delayed by 1000 steps.
- **INNER PRODUCT MANIPULATION (IPM):** attackers send the average of all honest peers' gradients multiplied by −ϵ. We test ϵ = 0.1 (Xie et al. (2020) demonstrate its efficiency against the coordinate-wise median and Krum) and ϵ = 0.6 (Allen-Zhu et al. (2021) report it as the most efficient attack against their SAFEGUARDSGD).
- **A LITTLE IS ENOUGH (ALIE):** attackers collude to move the coordinate-wise median while still sending values inside the population variance. Baruch et al. (2019) show that this attack is effective against Trimmed Mean (Yin et al., 2018) and Krum (Blanchard et al., 2017).
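For reference, here is a sketch of how several of these attacks can be simulated on top of honest gradients; λ and the function names follow the descriptions above, and the sign convention for IPM follows Xie et al. (2020).

```python
import numpy as np

LAMBDA = 1000.0  # amplification coefficient used for the first two attacks

def sign_flipping(true_grad):
    # Send the (amplified) opposite of the attacker's true gradient
    return -LAMBDA * true_grad

def random_direction(dim, shared_seed):
    # All attackers send large vectors along one agreed-upon random direction
    z = np.random.default_rng(shared_seed).standard_normal(dim)
    return LAMBDA * z / np.linalg.norm(z)

def inner_product_manipulation(honest_grads, eps=0.1):
    # IPM: send the honest peers' average scaled by -eps (Xie et al., 2020)
    return -eps * np.mean(honest_grads, axis=0)
```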
We further amplify the Byzantine gradients from the first two attacks by a large coefficient λ = 1000, so that they would dominate the aggregated gradient if no clipping were used. While in practice such attacks can be identified right away by their large gradient norms, we deliberately avoid doing that to test our clipping approach.

Figure 3. The ResNet-18 test accuracy under various attacks and robust aggregation techniques. Panels: training w/o attacks; sign flipping; random direction; delayed gradients; label flipping; IPM (ϵ = 0.1); IPM (ϵ = 0.6); ALIE; and ALIE with 2 validators. Curves: BTARD (ours) with τ = 1 and τ = 10, CClip with PS (τ = 1), geometric median, coordinate-wise median, and no defense.

Here, we make Byzantines behave honestly prior to step s = 1000, then simultaneously attack on each step until they are banned (i.e., attacks start at an early stage of training). In Appendix I.1, we also evaluate our method in the case of s = 10,000 (i.e., attacks starting closer to the convergence stage) and in the case of Byzantines attacking periodically, but we found no significant differences in its behavior. We repeat each experiment 5 times and report the mean and range (between the minimum and maximum) of the test accuracy during at least 2000 steps after all Byzantines are banned. In our experiments, this usually happened within 150 steps after s.

We observe that our method does not significantly worsen the speed of convergence compared to the All-Reduce baseline (see Figure 3, upper left). On average, the final test accuracy after 25,000 steps is only 0.6% worse for τ = 1 and even 0.1% better for τ = 10. The other tested defenses have a similar effect, except for the coordinate-wise median, which does not converge even without attacks.

Next, we find that BTARD with the stronger clipping and only 1 validator protects from all tested attack types except the ALIE attack, where we need 2 validators to guarantee recovery (see Figure 3). In general, the weaker clipping is most sensitive to the attacks with large magnitudes (sign flipping and random direction), while the stronger clipping is sensitive to the low-magnitude ALIE attack⁶. The other defenses fail to protect training from most attack types. We conclude that BTARD with the stronger clipping and 2 validators (i.e., 1/8 of the compute dedicated to validation) allows one to quickly recover the pre-attack accuracy after all tested attack types, even in the extreme case with 7 out of 16 peers being Byzantine.

⁶ This observation coincides with Baruch et al. (2019), who demonstrate that the ALIE attack is more harmful against median-based and clipping approaches than against mean aggregation without any defenses. Indeed, the stronger clipping makes CENTEREDCLIP closer to the geometric median, and the weaker clipping makes it closer to the mean (as explained in Appendix D.2).

### 4.2. Pre-training transformers
In this section, we choose a more compute-intensive and hyperparameter-sensitive model with an adaptive optimizer to demonstrate that our approach may be applied to the models commonly used in distributed training scenarios. Our setup is pre-training ALBERT-large (Lan et al., 2019) on the WikiText-103 dataset (Merity et al., 2017) using the LAMB optimizer (You et al., 2020) (see details in Appendix H). Since the original ALBERT setup uses gradient clipping, we use BTARD-CLIPPED-SGD (see Algorithm 9 in Appendix E.4). We train the model on 16 machines that jointly accumulate 4096 samples for every batch.

We evaluate the All-Reduce baseline without attacks, as well as BTARD with weaker and stronger clipping (larger and smaller τ, respectively) in the presence of attackers. In the latter case, we make 7 workers malicious, use 1 validator, and test two attack regions: s = 1000 and s = 5000. We omit the delayed gradient attack (due to its inefficiency), as well as the ALIE and IPM attacks (they require Byzantines to perform an extra All-Reduce round on each step, which is hard to do efficiently in a real multi-host setup).

Figure 4. The ALBERT-large training objective in the case of BTARD-CLIPPED-SGD (in the presence of various attacks) and the standard All-Reduce SGD (without attacks). Panels: sign flipping, label flipping, and random direction attacks starting at steps 1000 and 5000. Curves: BTARD (ours) with τ = 0.5 and τ = 0.125, and All-Reduce SGD w/o attacks.

As in the previous section, we observe that without attacks both τ values have no significant effect on the training progress, reaching a loss that is only 1.3% larger in the worst case. However, the stronger clipping shows faster recovery from the tested attacks (see Figure 4). Crucially, while some attacks significantly increase the loss function, the model recovers much faster than it would take to reach the pre-attack loss when training from scratch. We also report the computation overhead of BTARD-SGD in this setup in Appendix I.2 and conduct experiments with 64 machines and the most efficient attacks in Appendix I.3, confirming that BTARD remains efficient at a larger scale.

## 5. Conclusion

In this work, we formulated BTARD-SGD, a Byzantine-tolerant training strategy for large neural networks. We verified its robustness and efficiency through theoretical analysis and large-scale distributed training experiments. Our research opens new opportunities in many deep learning applications, making it possible to train large neural networks in a cooperative manner. Small research groups can host open cooperative training projects where the training hardware is crowdsourced by volunteers around the world, or a group of small companies can compete with larger corporations by combining their compute clusters. While these applications also require engineering effort to become practical, our algorithm ensures that they can run securely without the need to screen every potential participant.

## Acknowledgments

This work was partially supported by a grant for research centers in the field of artificial intelligence, provided by the Analytical Center for the Government of the Russian Federation in accordance with the subsidy agreement (agreement identifier 000000D730321P5Q0002) and the agreement with the Moscow Institute of Physics and Technology dated November 1, 2021 No. 70-2021-00138.
We thank Sai Praneeth Karimireddy for useful discussions and suggestions, Lie He for providing the code with CENTEREDCLIP, William Cappelletti for pointing out several relevant papers, Gennady Pekhimenko for his technical expertise and infrastructure for the distributed training experiments, and Dmitrii Emelianenko for helpful discussions.

## References

Aad, G., Abajyan, T., and Collaboration, T. A. Observation of a new particle in the search for the Standard Model Higgs boson with the ATLAS detector at the LHC. Physics Letters B, 716:1–29, 09 2012.

Abbott, B., Collaboration, L. S., and Collaboration, V. Observation of Gravitational Waves from a Binary Black Hole Merger. Physical Review Letters, 116, 02 2016. doi: 10.1103/PhysRevLett.116.061102.

Alistarh, D., Allen-Zhu, Z., and Li, J. Byzantine stochastic gradient descent. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 4618–4628, 2018.

Allen-Zhu, Z., Ebrahimianghazani, F., Li, J., and Alistarh, D. Byzantine-resilient non-convex stochastic gradient descent. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=PbEHqvFtcS.

Anderson, D. P. BOINC: A system for public-resource computing and storage. In Fifth IEEE/ACM International Workshop on Grid Computing, pp. 4–10. IEEE, 2004.

Atre, M., Jha, B., and Rao, A. Distributed deep learning using volunteer computing-like paradigm, 2021.

Balakrishnan, H., Kaashoek, M. F., Karger, D., Morris, R., and Stoica, I. Looking up data in P2P systems. Communications of the ACM, 46(2):43–48, 2003.

Baruch, G., Baruch, M., and Goldberg, Y. A little is enough: Circumventing defenses for distributed learning. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/ec1c59141046cd1866bbbcdfb6ae31d4-Paper.pdf.

Beberg, A. L., Ensign, D., Jayachandran, G., Khaliq, S., and Pande, V. Folding@home: Lessons from eight years of volunteer distributed computing. 2009 IEEE International Symposium on Parallel & Distributed Processing, pp. 1–8, 2009.

Ben-Ameur, W., Bianchi, P., and Jakubowicz, J. Robust consensus in distributed networks using total variation. arXiv preprint arXiv:1309.7264, 2013.

Blanchard, P., El Mhamdi, E. M., Guerraoui, R., and Stainer, J. Machine learning with adversaries: Byzantine tolerant gradient descent. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 118–128, 2017.

Blum, M. Coin flipping by telephone: a protocol for solving impossible problems. ACM SIGACT News, 15(1):23–27, 1983.

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. Language models are few-shot learners. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M. F., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 1877–1901. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
Bulusu, S., Khanduri, P., Sharma, P., and Varshney, P. K. On distributed stochastic gradient descent for nonconvex functions in the presence of Byzantines. In ICASSP 2020 – 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3137–3141. IEEE, 2020.

Chen, L., Wang, H., Charles, Z., and Papailiopoulos, D. Draco: Byzantine-resilient distributed training via redundant gradients. In International Conference on Machine Learning, pp. 903–912. PMLR, 2018.

Cleve, R. Limits on the security of coin flips when half the processors are faulty. In Proceedings of the Eighteenth Annual ACM Symposium on Theory of Computing, pp. 364–369, 1986.

Damaskinos, G., El Mhamdi, E. M., Guerraoui, R., Guirguis, A. H. A., and Rouault, S. L. A. Aggregathor: Byzantine machine learning via robust gradient aggregation. In The Conference on Systems and Machine Learning (SysML), 2019.

Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Mao, M., Ranzato, M. A., Senior, A., Tucker, P., Yang, K., Le, Q., and Ng, A. Large scale distributed deep networks. In Pereira, F., Burges, C. J. C., Bottou, L., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems, volume 25, pp. 1223–1231. Curran Associates, Inc., 2012. URL https://proceedings.neurips.cc/paper/2012/file/6aca97005c68f1206823815f66102863-Paper.pdf.

Defazio, A. and Bottou, L. On the ineffectiveness of variance reduced optimization for deep learning. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/84d2004bf28a2095230e8e14993d398d-Paper.pdf.

Defazio, A., Bach, F., and Lacoste-Julien, S. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, 2014.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT, 2019.

Diskin, M., Bukhtiyarov, A., Ryabinin, M., Saulnier, L., Lhoest, Q., Sinitsin, A., Popov, D., Pyrkin, D., Kashirin, M., Borzunov, A., del Moral, A. V., Mazur, D., Kobelev, I., Jernite, Y., Wolf, T., and Pekhimenko, G. Distributed deep learning in open collaborations. CoRR, abs/2106.10207, 2021. URL https://arxiv.org/abs/2106.10207.

Douceur, J. R. The Sybil attack. In IPTPS, 2002.

El Mhamdi, E. M., Guerraoui, R., and Rouault, S. The hidden vulnerability of distributed learning in Byzantium. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 3521–3530. PMLR, 10–15 Jul 2018. URL http://proceedings.mlr.press/v80/mhamdi18a.html.

El-Mhamdi, E.-M., Guerraoui, R., Guirguis, A., Hoang, L. N., and Rouault, S. Genuinely distributed Byzantine machine learning. In Proceedings of the 39th Symposium on Principles of Distributed Computing, pp. 355–364, 2020.

Ghadimi, S. and Lan, G. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization I: A generic algorithmic framework. SIAM Journal on Optimization, 22(4):1469–1492, 2012.

Ghadimi, S. and Lan, G. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
Gorbunov, E., Kovalev, D., Makarenko, D., and Richtárik, P. Linearly converging error compensated SGD. Advances in Neural Information Processing Systems, 33, 2020.

Gorbunov, E., Burlachenko, K., Li, Z., and Richtárik, P. MARINA: Faster non-convex distributed learning with compression. arXiv preprint arXiv:2102.07845, 2021.

Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and He, K. Accurate, large minibatch SGD: Training ImageNet in 1 hour, 2017.

Goyal, P., Duval, Q., Reizenstein, J., Leavitt, M., Xu, M., Lefaudeux, B., Singh, M., Reis, V., Caron, M., Bojanowski, P., Joulin, A., and Misra, I. VISSL, 2021. URL https://github.com/facebookresearch/vissl.

Gupta, N. and Vaidya, N. H. Byzantine fault-tolerance in peer-to-peer distributed gradient-descent. arXiv preprint arXiv:2101.12316, 2021.

Gupta, N., Doan, T. T., and Vaidya, N. H. Byzantine fault-tolerance in decentralized optimization under 2f-redundancy. In 2021 American Control Conference (ACC), pp. 3632–3637. IEEE, 2021.

Hammurabi, K. o. B. and Harper, R. F. The Code of Hammurabi, King of Babylon: About 2250 BC: Autographed Text, Transliteration, Translation, Glossary Index of Subjects, Lists of Proper Names, Signs, Numerals... University of Chicago Press, 1904. URL https://books.google.ru/books?id=jeLz_BYUoeQC&pg=PA11. Page 11, §1.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2015.

He, L., Karimireddy, S. P., and Jaggi, M. Byzantine-robust learning on heterogeneous datasets via resampling. arXiv preprint arXiv:2006.09365v3, 2020.

Huang, Y., Cheng, Y., Chen, D., Lee, H., Ngiam, J., Le, Q. V., and Chen, Z. GPipe: Efficient training of giant neural networks using pipeline parallelism. ArXiv, abs/1811.06965, 2019.

Jiang, Y., Zhu, Y., Lan, C., Yi, B., Cui, Y., and Guo, C. A unified architecture for accelerating distributed DNN training in heterogeneous GPU/CPU clusters. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), pp. 463–479. USENIX Association, November 2020. ISBN 978-1-939133-19-9. URL https://www.usenix.org/conference/osdi20/presentation/jiang.

Karimireddy, S. P., He, L., and Jaggi, M. Learning from history for Byzantine robust optimization. arXiv preprint arXiv:2012.10333v3, 2020.

Kijsipongse, E., Piyatumrong, A., and U-ruekolan, S. A hybrid GPU cluster and volunteer computing platform for scalable deep learning. The Journal of Supercomputing, 04 2018. doi: 10.1007/s11227-018-2375-9.

Kolesnikov, A., Beyer, L., Zhai, X., Puigcerver, J., Yung, J., Gelly, S., and Houlsby, N. Big Transfer (BiT): General visual representation learning. In ECCV, 2020.

Koloskova, A., Lin, T., Stich, S. U., and Jaggi, M. Decentralized deep learning with arbitrary communication compression. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=SkgGCkrKvH.

Krizhevsky, A., Nair, V., and Hinton, G. CIFAR-10 (Canadian Institute for Advanced Research). URL http://www.cs.toronto.edu/~kriz/cifar.html.

Lan, Z.-Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., and Soricut, R. ALBERT: A lite BERT for self-supervised learning of language representations. ArXiv, abs/1909.11942, 2019.

Li, L., Xu, W., Chen, T., Giannakis, G. B., and Ling, Q. RSA: Byzantine-robust stochastic aggregation methods for distributed learning from heterogeneous datasets. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 1544–1551, 2019.
Li, M. Scaling distributed machine learning with the parameter server. In Proceedings of the 2014 International Conference on Big Data Science and Computing, BigDataScience '14, New York, NY, USA, 2014. Association for Computing Machinery. ISBN 9781450328913. doi: 10.1145/2640087.2644155. URL https://doi.org/10.1145/2640087.2644155.

Li, S., Zhao, Y., Varma, R., Salpekar, O., Noordhuis, P., Li, T., Paszke, A., Smith, J., Vaughan, B., Damania, P., and Chintala, S. PyTorch distributed: Experiences on accelerating data parallel training. Proc. VLDB Endow., 13(12):3005–3018, August 2020. ISSN 2150-8097. doi: 10.14778/3415478.3415530. URL https://doi.org/10.14778/3415478.3415530.

Li, Z., Davis, J., and Jarvis, S. An efficient task-based all-reduce for machine learning applications. In Proceedings of the Machine Learning on HPC Environments, MLHPC '17, New York, NY, USA, 2017. Association for Computing Machinery. ISBN 9781450351379. doi: 10.1145/3146347.3146350. URL https://doi.org/10.1145/3146347.3146350.

Lin, Y., Han, S., Mao, H., Wang, Y., and Dally, W. J. Deep Gradient Compression: Reducing the communication bandwidth for distributed training. In The International Conference on Learning Representations, 2018.

Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. RoBERTa: A robustly optimized BERT pretraining approach. ArXiv, abs/1907.11692, 2019.

Loshchilov, I. and Hutter, F. SGDR: Stochastic gradient descent with warm restarts. In International Conference on Learning Representations (ICLR) 2017 Conference Track, April 2017.

Lyu, L., Yu, H., Ma, X., Sun, L., Zhao, J., Yang, Q., and Yu, P. S. Privacy and robustness in federated learning: Attacks and defenses. arXiv preprint arXiv:2012.06337, 2020.

Maymounkov, P. and Mazieres, D. Kademlia: A peer-to-peer information system based on the XOR metric. In International Workshop on Peer-to-Peer Systems, pp. 53–65. Springer, 2002.

McMahan, B., Moore, E., Ramage, D., Hampson, S., and y Arcas, B. A. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pp. 1273–1282, 2017.

Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer sentinel mixture models. ArXiv, abs/1609.07843, 2017.

Merritt, R. Folding@home gets 1.5+ Exaflops to Fight COVID-19, 04 2020. https://blogs.nvidia.com/blog/2020/04/01/foldingathome-exaflop-coronavirus/ (accessed on Apr 29, 2021).

Mikami, H., Suganuma, H., U-chupala, P., Tanaka, Y., and Kageyama, Y. Massively distributed SGD: ImageNet/ResNet-50 training in a flash, 2019.

Mishchenko, K., Gorbunov, E., Takáč, M., and Richtárik, P. Distributed learning with compressed gradient differences. arXiv preprint arXiv:1901.09269, 2019.

Narayanan, D., Harlap, A., Phanishayee, A., Seshadri, V., Devanur, N. R., Ganger, G. R., Gibbons, P. B., and Zaharia, M. PipeDream: Generalized pipeline parallelism for DNN training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, SOSP '19, pp. 1–15, New York, NY, USA, 2019. Association for Computing Machinery. ISBN 9781450368735. doi: 10.1145/3341301.3359646. URL https://doi.org/10.1145/3341301.3359646.

Narayanan, D., Shoeybi, M., Casper, J., LeGresley, P., Patwary, M., Korthikanti, V., Vainbrand, D., Kashinkunti, P., Bernauer, J., Catanzaro, B., et al. Efficient large-scale language model training on GPU clusters. arXiv preprint arXiv:2104.04473, 2021.
Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.

Nesterov, Y. A method for solving the convex programming problem with convergence rate O(1/k²). Proceedings of the USSR Academy of Sciences, 269:543–547, 1983.

Nesterov, Y. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2003.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pp. 8024–8035, 2019.

Patarasuk, P. and Yuan, X. Bandwidth optimal all-reduce algorithms for clusters of workstations. J. Parallel Distrib. Comput., 69(2):117–124, February 2009. ISSN 0743-7315. doi: 10.1016/j.jpdc.2008.09.002. URL https://doi.org/10.1016/j.jpdc.2008.09.002.

Peng, J., Li, W., and Ling, Q. Byzantine-robust decentralized stochastic optimization over static and time-varying networks. Signal Processing, 183:108020, 2021.

Pillutla, K., Kakade, S. M., and Harchaoui, Z. Robust aggregation for federated learning. arXiv preprint arXiv:1912.13445, 2019.

Rabin, T. and Ben-Or, M. Verifiable secret sharing and multiparty protocols with honest majority. In Proceedings of the Twenty-First Annual ACM Symposium on Theory of Computing, pp. 73–85, 1989.

Rajbhandari, S., Rasley, J., Ruwase, O., and He, Y. ZeRO: Memory optimizations toward training trillion parameter models. SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–16, 2020.

Rajput, S., Wang, H., Charles, Z., and Papailiopoulos, D. Detox: A redundancy-based framework for faster and more robust gradient aggregation. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/415185ea244ea2b2bedeb0449b926802-Paper.pdf.

Recht, B., Re, C., Wright, S., and Niu, F. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, pp. 693–701, 2011.

Regatti, J., Chen, H., and Gupta, A. ByGARS: Byzantine SGD with arbitrary number of attackers. arXiv preprint arXiv:2006.13421, 2020.

Rivest, R. L., Shamir, A., and Adleman, L. A method for obtaining digital signatures and public-key cryptosystems. Communications of the ACM, 21(2):120–126, 1978.

Rodríguez-Barroso, N., Martínez-Cámara, E., Luzón, M., Seco, G. G., Veganzones, M. Á., and Herrera, F. Dynamic federated learning model for identifying adversarial clients. arXiv preprint arXiv:2007.15030, 2020.

Rowstron, A. and Druschel, P. Pastry: Scalable, decentralized object location, and routing for large-scale peer-to-peer systems. In IFIP/ACM International Conference on Distributed Systems Platforms and Open Distributed Processing, pp. 329–350. Springer, 2001.

Ruttley, T., Robinson, J., and Gerstenmaier, W. The International Space Station: Collaboration, utilization, and commercialization. Social Science Quarterly, 98:1160–1174, 12 2017. doi: 10.1111/ssqu.12469.
(eds.), Advances in Neural Information Processing Systems, volume 33, pp. 3659 3672. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/ paper/2020/file/ 25ddc0f8c9d3e22e03d3076f98d83cb2Paper.pdf. Seide, F., Fu, H., Droppo, J., Li, G., and Yu, D. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech dnns. In Fifteenth Annual Conference of the International Speech Communication Association, 2014. Sennrich, R., Haddow, B., and Birch, A. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1715 1725, Berlin, Germany, August 2016. Association for Computational Linguistics. doi: 10.18653/ v1/P16-1162. URL https://www.aclweb.org/ anthology/P16-1162. Sergeev, A. and Balso, M. D. Horovod: fast and easy distributed deep learning in tensorflow, 2018. Shoeybi, M., Patwary, M., Puri, R., Le Gresley, P., Casper, J., and Catanzaro, B. Megatron-lm: Training multi-billion parameter language models using gpu model parallelism. ar Xiv preprint ar Xiv:1909.08053, 2019. Secure Distributed Training at Scale Smith, B. Flakey amd/ati gpus, including rx 5700 xt, cross validating, polluting the database, 2019. URL https://setiathome.berkeley.edu/ forum_thread.php?id=84508. Accessed: 202105-20. Sun, C., Shrivastava, A., Singh, S., and Gupta, A. Revisiting unreasonable effectiveness of data in deep learning era. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Oct 2017. Tolpegin, V., Truex, S., Gursoy, M. E., and Liu, L. Data poisoning attacks against federated learning systems. In ESORICS, 2020. Trifa, Z. and Khemakhem, M. Sybil nodes as a mitigation strategy against sybil attack. Procedia Computer Science, 32:1135 1140, 2014. Urdaneta, G., Pierre, G., and Steen, M. V. A survey of dht security techniques. ACM Computing Surveys (CSUR), 43(2):1 49, 2011. Vyzovitis, D., Napora, Y., Mc Cormick, D., Dias, D., and Psaras, Y. Gossip Sub: Attack-resilient message propagation in the Filecoin and ETH2.0 networks. ar Xiv preprint ar Xiv:2007.02754, 2020. Wang, L. and Kangasharju, J. Real-world sybil attacks in bittorrent mainline dht. In 2012 IEEE Global Communications Conference (GLOBECOM), pp. 826 832. IEEE, 2012. Wu, Z., Ling, Q., Chen, T., and Giannakis, G. B. Federated variance-reduced stochastic gradient descent with robustness to byzantine attacks. IEEE Transactions on Signal Processing, 68:4583 4596, 2020. Xie, C., Koyejo, O., and Gupta, I. Fall of empires: Breaking byzantine-tolerant sgd by inner product manipulation. In Uncertainty in Artificial Intelligence, pp. 261 270. PMLR, 2020. Xu, X. and Lyu, L. Towards building a robust and fair federated learning system. ar Xiv preprint ar Xiv:2011.10464, 2020. Yang, Z. and Bajwa, W. U. Bridge: Byzantineresilient decentralized gradient descent. ar Xiv preprint ar Xiv:1908.08098, 2019a. Yang, Z. and Bajwa, W. U. Byrdie: Byzantine-resilient distributed coordinate descent for decentralized learning. IEEE Transactions on Signal and Information Processing over Networks, 5(4):611 627, 2019b. Yin, D., Chen, Y., Kannan, R., and Bartlett, P. Byzantinerobust distributed learning: Towards optimal statistical rates. In International Conference on Machine Learning, pp. 5650 5659. PMLR, 2018. You, Y., Li, J., Reddi, S., Hseu, J., Kumar, S., Bhojanapalli, S., Song, X., Demmel, J., Keutzer, K., and Hsieh, C.-J. 
You, Y., Li, J., Reddi, S., Hseu, J., Kumar, S., Bhojanapalli, S., Song, X., Demmel, J., Keutzer, K., and Hsieh, C.-J. Large batch optimization for deep learning: Training BERT in 76 minutes. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=Syx4wnEtvH.

Zhang, E., Liu, F.-H., Lai, Q., Jin, G., and Li, Y. Efficient multi-party private set intersection against malicious adversaries. In Proceedings of the 2019 ACM SIGSAC Conference on Cloud Computing Security Workshop, pp. 93–104, 2019.

Zhang, J., Karimireddy, S. P., Veit, A., Kim, S., Reddi, S., Kumar, S., and Sra, S. Why are adaptive methods good for attention models? In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M. F., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 15383–15393. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/b05b57f6add810d3b7490866d74c0053-Paper.pdf.

Zhao, B., Huang, L., Stribling, J., Rhea, S., Joseph, A., and Kubiatowicz, J. Tapestry: A resilient global-scale overlay for service deployment. IEEE Journal on Selected Areas in Communications, 22, July 2003. doi: 10.1109/JSAC.2003.818784.

Zinkevich, M., Weimer, M., Li, L., and Smola, A. Parallelized stochastic gradient descent. In Lafferty, J., Williams, C., Shawe-Taylor, J., Zemel, R., and Culotta, A. (eds.), Advances in Neural Information Processing Systems, volume 23, pp. 2595–2603. Curran Associates, Inc., 2010. URL https://proceedings.neurips.cc/paper/2010/file/abea47ba24142ed16b7d8fbf2c740e0d-Paper.pdf.

Table of contents

1 Introduction
2 Related work
2.1 Distributed deep learning
2.2 Byzantine-tolerant optimization
2.3 Security in distributed systems
3 Method
3.1 Byzantine-Tolerant All-Reduce
3.2 Convergence analysis
3.3 Resisting Sybil attacks
4 Experiments
4.1 CIFAR10 classification
4.2 Pre-training transformers
5 Conclusion
A Additional related work
A.1 Byzantine-tolerant optimization: additional details
A.1.1 Parameter-server (PS) based approaches
A.1.2 Decentralized approaches
A.2 Multi-party random number generators: additional details
B Synchronization points and computation overhead of BTARD-SGD
C Overview of attack vectors
D Detailed algorithm description
D.1 Basic building blocks
D.2 CenteredClip and verification of its results
D.3 Protocols for banning Byzantine peers
D.4 ButterflyClip
D.5 Byzantine-tolerant All-Reduce and its verification procedures
D.6 BTARD-SGD training loop
E Convergence analysis: missing proofs and extra details
E.1 Preliminaries
E.2 Impossibility of Byzantine-tolerant learning in heterogeneous case
E.3 Convergence guarantees for BTARD-SGD
E.3.1 On Assumptions 3.1 and 3.2
E.3.2 Quality of the aggregation
E.3.3 Non-convex case
E.3.4 Convex case
E.3.5 Strongly convex case: Restarted-BTARD-SGD
E.4 Convergence guarantees for BTARD-Clipped-SGD
E.4.1 Quality of the aggregation
E.4.2 Convex case
E.4.3 Strongly convex case: Restarted-BTARD-Clipped-SGD
F Resisting Sybil attacks
G Secure distributed hash tables
H Details of the ALBERT experiment setup
I Additional experiments
I.1 Extra evaluations on the CIFAR10 classification task
I.2 Evaluating computation overhead in terms of wall time
I.3 Experiments at a larger scale (64 machines)

A. Additional related work

A.1. Byzantine-tolerant optimization: additional details

In this section, we provide extra details on the related work discussed in Section 2.2. A summary of the complexity results is presented in Table 2.

A.1.1. PARAMETER-SERVER (PS) BASED APPROACHES

A large number of papers on Byzantine-tolerant optimization aim to robustify parallel SGD in the case when a trusted parameter server (PS) is available.
Since, in classical parallel SGD, even one Byzantine worker can break the convergence of the whole method by shifting the mean of the resulting vector in an arbitrary way, it is natural to replace the averaging of the vectors received from the workers with a more robust aggregation rule, e.g., Krum (Blanchard et al., 2017), the coordinate-wise median or trimmed median (Yin et al., 2018), Multi-Krum (Damaskinos et al., 2019), Bulyan (El Mhamdi et al., 2018), or the geometric median (Pillutla et al., 2019). However, all these methods were shown to be brittle and not robust to special types of Byzantine attacks (Baruch et al., 2019; Xie et al., 2020; Karimireddy et al., 2020). Moreover, Karimireddy et al. (2020) show that no permutation-invariant algorithm can converge to an arbitrary predefined accuracy, meaning that simply applying some aggregation rule on top of SGD does not lead to Byzantine tolerance.

There are several approaches to circumvent this issue. Alistarh et al. (2018) propose BYZANTINESGD and prove convergence results for convex problems. Allen-Zhu et al. (2021) extend this approach to handle non-convex problems as well. In both papers, the key idea is to apply concentration properties of sums depending on the stochastic gradients together with iterative removal of Byzantine peers. However, the theoretical guarantees from Alistarh et al. (2018); Allen-Zhu et al. (2021) rely on the restrictive assumption that the noise in the stochastic gradients is uniformly bounded with probability 1. Bulusu et al. (2020) propose a similar approach to the one from (Allen-Zhu et al., 2021) but analyze their method under more restrictive assumptions (boundedness of the gradient). Next, Wu et al. (2020) propose a Byzantine-tolerant version of parallel SAGA (Defazio et al., 2014), a variance-reduced version of SGD with the geometric median as its aggregation rule, called BYRD-SAGA, and prove its convergence for strongly convex objectives. However, the authors do not establish convergence of BYRD-SAGA to an arbitrary predefined accuracy. Moreover, variance-reduced methods are known to converge slowly in deep learning applications (Defazio & Bottou, 2019), which limits the practical utility of BYRD-SAGA. Finally, Karimireddy et al. (2020) propose a new aggregation rule called CENTEREDCLIP, apply it to SGD with client momentum, and prove convergence results for the obtained method in the non-convex case under reasonable assumptions.

Alternative lines of work achieve Byzantine-tolerant optimization through redundant computations (Chen et al., 2018; Rajput et al., 2019) or reputation-based approaches (Rodríguez-Barroso et al., 2020; Regatti et al., 2020; Xu & Lyu, 2020). Unfortunately, these papers either do not contain theoretical (non-asymptotic) convergence results for the proposed methods or rely on overly restrictive assumptions in the analysis. See more references in the recent survey by Lyu et al. (2020).

A.1.2. DECENTRALIZED APPROACHES

Byzantine-tolerant optimization methods for decentralized communication architectures have been studied in only a handful of papers. Yang & Bajwa (2019a;b) consider a specific scenario where workers compute full gradients, local loss functions on peers are heterogeneous, and the trimmed coordinate-wise median is used as an aggregation rule. In this setup, the authors prove convergence in the strongly convex case to some accuracy depending on the heterogeneity level of the local loss functions, which is natural in the presence of Byzantine peers.
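As a concrete illustration of the robust aggregation rules discussed above, the coordinate-wise median and trimmed mean can be written in a few lines. The sketch below is our own minimal NumPy version (not the code of any cited paper; the function names are ours):

```python
import numpy as np

def coordinate_wise_median(updates: np.ndarray) -> np.ndarray:
    """Aggregate an (n_peers, d) matrix of updates by taking the median
    of every coordinate independently."""
    return np.median(updates, axis=0)

def coordinate_wise_trimmed_mean(updates: np.ndarray, trim: int) -> np.ndarray:
    """For each coordinate, drop the `trim` largest and `trim` smallest values
    across peers, then average the rest (in the spirit of Yin et al., 2018)."""
    n = updates.shape[0]
    assert 2 * trim < n, "cannot trim more values than there are peers"
    sorted_updates = np.sort(updates, axis=0)          # sort each coordinate
    return sorted_updates[trim:n - trim].mean(axis=0)  # average the middle part

# Example: 6 honest updates around 1.0 plus 2 Byzantine outliers.
rng = np.random.default_rng(0)
honest = 1.0 + 0.1 * rng.standard_normal((6, 4))
byzantine = np.full((2, 4), 1e6)
updates = np.vstack([honest, byzantine])
print(coordinate_wise_median(updates))                 # stays near 1.0
print(coordinate_wise_trimmed_mean(updates, trim=2))   # outliers removed
```

Both rules ignore extreme values per coordinate, which is exactly what makes them (partially) resistant to mean-shifting attacks, and also what makes them statistically suboptimal compared to the clipping-based rule used later in this paper.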
However, the results of Yang & Bajwa (2019a;b) are not applicable to a wide range of practically important problems where stochastic gradients have to be used. This issue was partially resolved by Peng et al. (2021), who propose a version of GOSSIP SGD applied to an equivalent reformulation of the original problem based on TV-regularization (Ben-Ameur et al., 2013). However, the established convergence results in the strongly convex case do not show any benefit from communicating with other workers in the homogeneous data regime that appears in large-batch training of deep learning models. Li et al. (2019) use the same idea for a parameter-server architecture. Next, there are approaches requiring peer-to-peer communication of full vectors at each step (Gupta et al., 2021; Gupta & Vaidya, 2021), which does not scale.

Table 2. Summary of the complexity results for Parameter-Server (PS) based and distributed Byzantine-tolerant optimization. The columns "Non-convex", "Convex", and "Strongly convex" contain the complexity bounds for L-smooth non-convex, convex, and µ-strongly convex problems, respectively. By complexity we mean the number of iterations sufficient to find a point x̂ such that E[‖∇f(x̂)‖²] ≤ ε² for non-convex problems and E[f(x̂) − f(x*)] ≤ ε for convex and µ-strongly convex problems (see Def. E.2), with x* being the solution. For simplicity, we omit numerical factors, logarithmic terms depending on the parameters of the problem, and factors quantifying the suboptimality of the starting point, i.e., R₀ = ‖x⁰ − x*‖ and f(x⁰) − inf_{x∈ℝᵈ} f(x). Notation: δ = |B|/n; m = number of peers checked at each iteration. The results from Yang & Bajwa (2019a;b) are not included since they rely on full-gradient computations.

Work | Non-PS? | Non-convex | Convex | Strongly convex
(Alistarh et al., 2018)(1),(2) | no | – | 1/ε + σ²/ε² | 1/µ + σ²/(µε)
(Allen-Zhu et al., 2021)(1),(3) | no | 1/(nε⁴) + δ²/ε⁴ | – | –
(Wu et al., 2020)(4) | no | – | – | L² (5)
(Karimireddy et al., 2020)(6) | no | 1/ε² + σ² | – | –
(Peng et al., 2021)(6),(7) | yes | – | – | 1/(µε) + nσ²/(µ²ε) + λ²dN²/(µ²ε)
This work(8) | yes | 1/ε² + σ²/(mε²) | 1/ε + σ²·δσ/(mε) | 1/µ + σ²·δσ/(m·µε)
This work(9) | yes | 1/ε² + σ²/(nε⁴) + n²δσ²/(mε²) | 1/ε + σ²/(mε) | 1/µ + σ²/(nµε) + n²δσ/(m·µε)
This work(10) | yes | – | ε^(−α/(α−1)) | G²Λ₁·(µε)^(−α/(2(α−1)))
This work(11) | yes | – | ε^(−α/(α−1)) | G²Λ₂·(µε)^(−α/(2(α−1)))

(1) The results are proven under the uniformly bounded noise assumption: ‖∇f(x, ξ) − ∇f(x)‖ ≤ σ for all x and ξ. High-probability guarantees are established, i.e., it is proven that with probability at least 1 − β the algorithms from (Alistarh et al., 2018) find x̂ such that f(x̂) − f(x*) ≤ ε, and the algorithms from (Allen-Zhu et al., 2021) find x̂ such that ‖∇f(x̂)‖ ≤ ε.

(2) Dependencies on β are logarithmic and, therefore, omitted. The optimization problems are assumed to be defined on a bounded set, and the rates depend on the diameter of this set.

(3) The results are derived for the case σ = 1. Allen-Zhu et al. (2021) also derive convergence guarantees for finding second-order stationary points.

(4) Wu et al. (2020) consider the finite-sum case of (3), i.e., f(x) = (1/N) Σ_{j=1}^N f(x, j). The results are derived under the uniformly bounded variance (UBV) assumption: E_j[‖∇f(x, j) − ∇f(x)‖²] ≤ σ² for all x ∈ ℝᵈ, where j is sampled uniformly at random from {1, ..., N}. Wu et al. (2020) also derive convergence guarantees under the ζ-bounded dissimilarity assumption, i.e., when f(x) = (1/|G|) Σ_{i∈G} f_i(x), f_i(x) = (1/N) Σ_{j=1}^N f_i(x, j) for all i ∈ G, and (1/|G|) Σ_{i∈G} ‖∇f_i(x) − ∇f(x)‖² ≤ ζ².
(5) This result is obtained from the main result of (Wu et al., 2020) and states that the method from (Wu et al., 2020) finds x̂ such that f(x̂) − f(x*) ≤ ε only for ε ≥ σ²/µ² · (1/2 − δ)^(−2), which can be large.

(6) The result is derived under the UBV assumption, i.e., E_{ξ∼D}[‖∇f(x, ξ) − ∇f(x)‖²] ≤ σ² for all x ∈ ℝᵈ.

(7) Peng et al. (2021) consider the case when peers are allowed to communicate with their neighbors, defined via some communication graph. The result establishes the total number of iterations/communication rounds needed to find x̂ such that E[‖x̂ − x*‖²] ≤ ε for ε ≥ λ²d Σ_{i∈G} |B_i|², where λ ≥ 0 is any non-negative number and B_i is the set of Byzantine peers neighboring the i-th peer. In the complexity result, we use the notation N² = Σ_{i∈G}(|G_i|² + |B_i|²), where G_i is the set of good neighbors of the i-th peer. When λ = 0, the workers do not communicate at all. Moreover, Peng et al. (2021) analyze the case of heterogeneous local functions, composite optimization problems, and the time-varying setup, but in that case λ is lower bounded by a strictly positive quantity depending on the heterogeneity level and on the minimal non-zero singular value of the node-edge incidence matrix, i.e., an arbitrary predefined accuracy cannot be achieved.

(8) The results are derived for BTARD-SGD (in the strongly convex case, for RESTARTED-BTARD-SGD) under Assumptions 3.1 and 3.2 in the case when the exact number of attacking Byzantine workers at iteration k is known to each participant. See Theorems E.2, E.4, and E.6.

(9) The results are derived for BTARD-SGD (in the strongly convex case, for RESTARTED-BTARD-SGD) under Assumptions 3.1 and 3.2. See Theorems E.3, E.5, and E.7.

(10) The results are derived for BTARD-CLIPPED-SGD (in the strongly convex case, for RESTARTED-BTARD-CLIPPED-SGD) under Assumption E.1 without any additional assumptions on the tails of the distribution. Moreover, it is assumed that the exact number of attacking Byzantine workers at iteration k is known to each participant. See Theorems E.8 and E.10. In the complexity results, we use the notation Λ₁ = 1 + nδ/m.

(11) The results are derived for BTARD-CLIPPED-SGD (in the strongly convex case, for RESTARTED-BTARD-CLIPPED-SGD) under Assumption E.1 without any additional assumptions on the tails of the distribution. See Theorems E.9 and E.11. In the complexity results, we use the notation Λ₂ = 1 + n²δ/m.

Finally, El-Mhamdi et al. (2020) propose an algorithm based on the usage of multiple servers. The authors assume that both workers and servers can be Byzantine, which is a realistic scenario. However, their approach requires the workers to send their gradients to all servers at each iteration and to receive parameters from all servers as well, which leads to a significant communication overhead in practice. Moreover, El-Mhamdi et al. (2020) do not provide non-asymptotic convergence rates, making it problematic to conduct an in-depth comparison with existing works, including ours. Therefore, it is unclear whether the usage of multiple servers speeds up training or just adds communication and computation overhead. In contrast, our results do benefit from the communication between workers.
First of all, as one can see from Table 2, the terms depending on the fraction δ of Byzantine peers in our complexity bounds for BTARD-SGD and RESTARTED-BTARD-SGD (the third terms) have a better dependence on the target accuracy ε than the corresponding terms in the complexity bounds from all previous works (even those relying on the existence of a PS). Moreover, for sufficiently small ε, these terms in our complexity results are smaller than the second terms, which correspond to the main term in the complexity of parallel SGD. That is, BTARD-SGD/RESTARTED-BTARD-SGD applied to a problem with Byzantine peers has convergence guarantees that are no worse than the corresponding guarantees for parallel SGD applied to the problem without any Byzantine workers. In such regimes, our theoretical convergence results outperform even those derived for PS-based algorithms. We notice that Assumptions 3.1 and 3.2, used in the analysis of BTARD-SGD/RESTARTED-BTARD-SGD, are slightly stronger than the uniformly bounded variance assumption used in (Wu et al., 2020; Karimireddy et al., 2020; Peng et al., 2021). However, as we explain in Appendix E.3.1, our analysis allows relaxing Assumption 3.1 to the uniformly bounded variance assumption, and Assumption 3.2 is reasonable for many practically important problems. Finally, we also propose and analyze BTARD-CLIPPED-SGD and RESTARTED-BTARD-CLIPPED-SGD under Assumption E.1, which may hold even when the variance of the stochastic gradient is unbounded. To the best of our knowledge, this is the first time in the literature on Byzantine-tolerant optimization that complexity results are obtained without assuming boundedness of the stochastic gradient's variance.

A.2. Multi-party random number generators: additional details

Many distributed systems may benefit from multi-party random number generators (MPRNG), in which a group of malicious peers has little influence (bias) on the generator output. MPRNGs are usually based on multi-party coin tossing protocols, such as the protocol from Blum (1983). As an example, an MPRNG allows choosing a participant winning a lottery, or choosing a peer whose calculations are going to be validated by other peers to detect possible cheating. While Blum (1983) formally introduces a protocol for one bit and two parties, its generalization to multiple bits and parties (as necessary for an MPRNG) is straightforward assuming the presence of a broadcast channel. This modification is widely known in the literature and is described, e.g., in Zhang et al. (2019). According to this generalization, peers should execute the following protocol to obtain k random bits (see the scheme in Figure 5 and the code sketch at the end of this subsection):

1. Each peer generates its own random string x_i made of k bits.
2. Each peer broadcasts the commitment h_i = h(i || x_i || s_i), where || denotes concatenation, h(x) is a common cryptographic hash function, i is the peer's unique identifier (known to other peers), and s_i is a large random string.
3. Peers wait until all of them finish broadcasting the commitments. After that, no peer can alter its x_i to influence the protocol output (otherwise, peers would notice that the new value x'_i does not match the commitment).
4. Each peer reveals its random string by broadcasting x_i and s_i.
5. Each peer verifies that all other peers revealed values x_j and s_j that match their commitments h_j = h(j || x_j || s_j).
6. If a peer detects that peer j aborted the procedure or that its commitment does not match its revealed values, it concludes that peer j cannot be trusted.
Since other peers read the same broadcast channel, all of them reach the same conclusion. In this case, the system repeats the protocol.
7. If peers do not detect any mismatches, they calculate the protocol output x = x₁ ⊕ ... ⊕ xₙ, where ⊕ denotes the bitwise XOR operation.

Figure 5. A scheme of the MPRNG based on the generalization of Blum (1983). Here, || denotes concatenation, ⊕ denotes bitwise XOR, and h(x) is a common cryptographic hash function. The hashed values include the peer identifier i to protect from replay attacks and a large random string s_i to resist dictionary attacks.

In this protocol, the commitments include the peer identifier i to protect from replay attacks (when an attacker repeats someone else's message) and the large random string s_i to resist dictionary attacks (when an attacker reverses the hash function using a large dictionary of its values). While there exist MPRNGs (Rabin & Ben-Or, 1989) with a negligible bias for the case when more than half of the parties are honest (assuming the presence of a broadcast channel), Cleve (1986) proves that it is impossible to reach a negligible bias in the case of a dishonest majority, which may occur in practice due to Sybil attacks. However, we note that the bias in Blum (1983) (and its modification above) appears only when an attacker learns the result earlier than other peers and forces the protocol to be repeated. If we use the MPRNG to choose a peer to be checked for cheating, we may ban all peers that aborted the procedure and restart from scratch without them, thereby eliminating the bias problem.
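The commit-then-reveal scheme above is straightforward to express in code. The following is our own minimal single-process illustration of steps 1-7 (an idealized broadcast channel is assumed, all names are ours, and the real protocol additionally signs every message):

```python
import hashlib
import os

K_BYTES = 32  # k = 256 random bits per peer

def commitment(peer_id: int, x: bytes, s: bytes) -> bytes:
    # h_i = h(i || x_i || s_i): the identifier i resists replay attacks,
    # the large random salt s_i resists dictionary attacks.
    return hashlib.sha256(peer_id.to_bytes(4, "big") + x + s).digest()

class Peer:
    def __init__(self, peer_id: int):
        self.id = peer_id
        self.x = os.urandom(K_BYTES)  # private random string x_i
        self.s = os.urandom(K_BYTES)  # large random salt s_i

def mprng_round(peers):
    # Phase 1: everyone broadcasts commitments; only after *all* commitments
    # are received may the strings be revealed (step 3).
    commitments = {p.id: commitment(p.id, p.x, p.s) for p in peers}
    # Phase 2: everyone reveals (x_i, s_i); a mismatch identifies a cheater
    # that must be banned before the protocol is restarted (step 6).
    for p in peers:
        if commitment(p.id, p.x, p.s) != commitments[p.id]:
            raise RuntimeError(f"peer {p.id} broke its commitment")
    # Step 7: the output is the bitwise XOR of all revealed strings.
    out = bytes(K_BYTES)
    for p in peers:
        out = bytes(a ^ b for a, b in zip(out, p.x))
    return out

print(mprng_round([Peer(i) for i in range(4)]).hex())
```

The XOR at the end is what gives the scheme its strength: as long as at least one contributor's string is uniformly random and independent of the others, the output is uniformly random as well.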
B. Synchronization points and computation overhead of BTARD-SGD

Synchronization points. An important aspect of BTARD performance is synchronization. A naive implementation of Algorithm 1 would have many global synchronization barriers per step: one for aggregating gradients, another for choosing a random direction z, yet another for electing validators, etc. These frequent synchronizations could undermine the practical training performance of BTARD in high-latency networks, such as when training over the Internet. Fortunately, it is possible to reduce the number of synchronizations by bundling them together. For instance, peers use a single MPRNG round both for sampling z and for electing validators. Furthermore, this MPRNG round and the subsequent checks can be performed in the background, while a peer accumulates gradients for the next step. The only restriction is that this shared MPRNG round must be performed after all peers declare their checksums for that round. With these optimizations, BTARD-SGD requires only two points of synchronization per round: the first occurs right before gradient aggregation, and the second is in a background task that performs verifications. Finally, there is a non-regular need for synchronization when one peer accuses another of being Byzantine. However, as we elaborated earlier, each accusation results in at least one Byzantine peer being banned, so this additional cost is incurred only a limited number of times over the training run.

Computation overhead. In terms of computation, BTARD-SGD introduces two main sources of overhead: the validators and CENTEREDCLIP. As we have shown empirically in Section 4 and Appendix I, both BTARD-SGD and BTARD-CLIPPED-SGD can withstand even attacks with 7 out of 16 peers being Byzantine using only 1-2 validators randomly chosen from 16 peers. As such, the computation overhead for validation is no more than 1/8 of the total compute. As for CENTEREDCLIP, our algorithm executes the same amount of computation as the original CENTEREDCLIP (Karimireddy et al., 2020), except that the extra load is now distributed evenly across all peers. We provide an empirical evaluation of this overhead in Appendix I.2. Finally, we note that generating the shared vector z from a scalar seed r_t (as defined in Algorithm 1) has a negligible cost and can be done with any standard pseudo-random number generator. For instance, generating z for ALBERT-large (the setup from Section 4.2) takes about 1.2 ms on the same T4 GPU that we use in our experiments.
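Since every peer derives z from the same MPRNG output, the vector itself never has to be transmitted. The sketch below shows this step in PyTorch; it is our own illustration of the idea (the exact generator used in the authors' implementation may differ):

```python
import torch

def get_random_vector(seed: int, dim: int, device: str = "cpu") -> torch.Tensor:
    """Deterministically expand a scalar MPRNG seed into a shared unit vector z.

    Every peer calls this with the same seed, so all peers obtain an
    identical z without any extra communication."""
    generator = torch.Generator(device=device).manual_seed(seed)
    z = torch.randn(dim, generator=generator, device=device)
    return z / torch.norm(z)  # a normalized Gaussian is uniform on the sphere

# ALBERT-large has ~18M parameters; generating z is a single randn call.
z = get_random_vector(seed=42, dim=18_000_000)
assert torch.allclose(z, get_random_vector(seed=42, dim=18_000_000))
```

Normalizing a Gaussian sample is the standard way to obtain a direction uniform on the unit Euclidean sphere, which is the distribution the verification procedures in Appendix D rely on.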
C. Overview of attack vectors

In Section 3.2, we outlined the four types of Byzantine attacks that can affect BTARD-SGD. Here, we analyze each of these types in detail and list the attacks that fit each type.

Gradient attacks. This attack vector encompasses all attacks where Byzantine peers replace their true gradients with something else but otherwise act normally. With this attack, b Byzantine peers can collectively shift the output of CENTEREDCLIP by up to τb/n in any chosen direction. However, since Byzantine peers must commit the hashes of their incorrect gradients, every honest validator can accuse one of these peers with probability b/n.

Aggregation attacks. A similar but opposite attack type can be attempted when a Byzantine peer performs gradient aggregation. Instead of honestly computing CENTEREDCLIP, an attacker may modify the returned vector to incorporate the same kinds of changes as in gradient attacks (see above). This time, the maximum difference that can be introduced through such attacks is larger, but it only affects the b/n fraction of vector coordinates that are aggregated by Byzantine peers. Done naively, such attacks are detected via the gradient checksum and the offenders are banned (see L15-17 in Algorithm 1). To ensure that this check passes, Byzantine peers can misreport their s_i^j in such a way that Σ_i s_i^j = 0. However, since the actual s_i^j depend only on g_i^k and ĝ^k, these values can be verified by the chosen validators and, in case of a mismatch, reported via ACCUSE. We rigorously prove this in Appendix D.5. Furthermore, if an honest validator finds that a certain peer has broadcast an incorrect s_i^j, the validator can simultaneously accuse the corresponding Byzantine aggregator j that should have reported the incorrect s_i^j (see L12-14 in Algorithm 1).

Reputation abuse. Since BTARD-SGD provides means by which benign participants can ban Byzantine attackers, it is important to ensure that the same means cannot be exploited by Byzantine peers to eliminate benign ones or otherwise abuse the system. Three potential attack vectors fit this description:
1. falsely accusing a benign peer;
2. persistently calling the ACCUSE procedure to slow down training;
3. automatically approving gradients without actual validation.
In BTARD-SGD, we protect against slander (issues 1 and 2) by the design of the ACCUSE protocol, under which a peer that initiates false allegations is itself banned. As such, Byzantine peers can invoke the ACCUSE protocol only a limited number of times before they are all permanently banned. In turn, attack vector 3 is more effective: if one Byzantine peer is chosen as a validator for another, it can report successful validation without negative consequences for either of them. However, since all validators are chosen through the MPRNG, an attacker has no way of predicting whether its validator will be benign or Byzantine. Thus, any malicious activity always has a chance of being caught by an honest validator.

Protocol violations. Finally, a Byzantine attacker can deviate from the protocol prescribed by BTARD-SGD in simpler ways, for instance:
1. not committing the hash of its gradient when required by Algorithm 6;
2. not sending data to a particular peer when required (or sending data twice);
3. deliberately broadcasting a hash that mismatches the subsequently sent data;
4. sending metadata (e.g., the gradient norm) that is inconsistent with a previously sent gradient part;
5. sending s_i that is inconsistent with a previously sent gradient;
6. not validating when chosen as a validator, validating when not chosen, or validating a different peer than the one chosen by BTARD-SGD.

For protocol deviations that are visible to all benign participants, such as (1) or (6), benign peers can ban the offender instantaneously. However, this is not the case for attacks such as (2), where the deviation is visible to only one or a few peers. As described earlier in Section 3.2, we address this issue with a special procedure that allows any peer to ban any other peer at the cost of also being banned. Thus, if an attacker sends inconsistent gradients, norms, or inner products to only one benign peer, that peer can still get the attacker banned even though it would not be able to call ACCUSE. Protection from attacks (3), (4), and (5) also relies on this mutual elimination procedure: if an attacker sends provably incorrect data to a benign peer, that peer immediately triggers it. The only exception to this rule is one Byzantine peer sending incorrect data to another Byzantine peer: this behavior is neither punishable nor, in itself, harmful. In turn, the mutuality of the elimination procedure prevents its misuse by Byzantine peers: an attacker that decides to ban someone through this procedure is banned as well.

D. Detailed algorithm description

In this section, we provide more formal versions of the BTARD (Alg. 6) and BTARD-SGD (Alg. 7) algorithms, as well as auxiliary subroutines and further details. We describe our approach in a bottom-up manner.

D.1. Basic building blocks

We begin with a glossary of basic functions used in the algorithms:

- broadcast m: broadcast the message m to all other peers using GossipSub (Vyzovitis et al., 2020) and receive the respective messages of other peers. m should be signed by the sender's private key (Rivest et al., 1978) before sending. A receiver should ignore messages with an invalid signature and ban a peer upon receiving two contradicting messages signed by it (e.g., two different hashes for the same iteration and the same stage of the algorithm).
- SPLIT(v, n): split a vector v of size d into n parts. The first d mod n parts are of size ⌈d/n⌉ and the remaining parts have size ⌊d/n⌋.
- MERGE(v₁, ..., vₙ): concatenate the vectors v₁, ..., vₙ into one.
- BAN(peer_j): add peer j to a local blocklist, ignore any subsequent messages from that peer, and continue training without it. Note that honest peers do not need to explicitly coordinate their decisions to ban someone, because these decisions are made using the broadcasted data only.
- CHECKCOMPUTATIONS(j) (or VALIDATEPEER): run COMPUTEGRADIENTS(x_t, ξ_t^j) and compare the result against the c_j, h*_j, s*_j broadcast by that peer. If there is a mismatch, call ACCUSE.

D.2. CenteredClip and verification of its results

An important building block of BTARD is CENTEREDCLIP, a robust aggregation rule proposed by Karimireddy et al. (2020). Unlike a number of other aggregation rules, such as the coordinate-wise median, Krum, and the geometric median, CENTEREDCLIP is provably robust against Byzantine attacks (see Theorem III from Karimireddy et al. (2020) and Lemma E.1). Let G be the set of good peers, B be the set of Byzantine workers, and, for simplicity, let [n] = G ∪ B, |B| = δn ≤ δ₀n < n/2. Assume that we have n random vectors x₁, ..., xₙ such that E[x_i] = E[x_j] = x and E[‖x_i − x_j‖²] ≤ σ² for all i, j ∈ G, while for all i ∈ B the vectors x_i can be arbitrary. CENTEREDCLIP is an iterative procedure generating a sequence {v^l}_{l≥0} satisfying

v^{l+1} = v^l + (1/n) Σ_{i=1}^n (x_i − v^l) · min{1, τ_l / ‖x_i − v^l‖},   (CenteredClip)

where the clipping radius is set according to

τ_l² = (1 − δ) B_l²/3 + σ²/(3δ),  B²_{l+1} = 6.45δ B_l² + 5σ².   (5)

Intuitively, CENTEREDCLIP behaves like the mean for all points within the sphere of radius τ and like the median for "outliers". In turn, choosing different values of τ allows one to smoothly interpolate between the mean (τ → ∞) and the geometric median (τ → 0) aggregation rules. The goal of this procedure is natural: to find a good enough approximation x̂ of x̄ = (1/|G|) Σ_{i∈G} x_i. Karimireddy et al. (2020) show⁷ that, for δ ≤ 0.1, the sequence {v^l}_{l≥0} generated by CENTEREDCLIP satisfies

E[‖v^l − x̄‖²] ≤ (9.7δ)^l · 3E[‖v⁰ − x̄‖²] + 4000δσ².   (6)

Moreover, Karimireddy et al. (2020) prove that for any aggregation rule producing x̂ and any given δ₀, σ there exist a set of vectors x₁, ..., xₙ and a partition [n] = G ∪ B such that E[‖x̂ − x̄‖²] = Ω(δσ²). Therefore, CENTEREDCLIP can be seen as an optimal aggregation rule up to numerical constants. The usage of CENTEREDCLIP helps the good peer i produce a good enough approximation of the ideal average of the i-th parts of the stochastic gradients among the good peers in BTARD. Moreover, since δ ≤ 0.1, we have 6.45δ ≤ 0.645, implying that B_l² → B²_∞ ∝ σ² when l → ∞ and τ_l → τ_∞ ∝ √(σ²/δ). These limits can be easily computed from (5). Next, for l → ∞, CENTEREDCLIP converges to the solution of the following equation:

(1/n) Σ_{i=1}^n (x_i − v) · min{1, τ / ‖x_i − v‖} = 0.   (7)

In other words, for large enough l, CENTEREDCLIP approximates the fixed-point iteration process of solving (7). This property plays a key role in Verification 2 of BTARD.

⁷ In fact, Karimireddy et al. (2020) derive this result for a two-staged version of CENTEREDCLIP. One can derive a similar result for the original CENTEREDCLIP under the assumption that for all i, j ∈ G we have E[‖x_i − x_j‖⁴] ≤ σ⁴.
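To make the iteration above concrete, here is a minimal NumPy sketch of the CENTEREDCLIP update rule. It is our own illustration with a fixed clipping radius τ for simplicity; the actual procedure follows the schedule τ_l from (5) and runs on GPU tensors:

```python
import numpy as np

def centered_clip(xs: np.ndarray, tau: float, n_iters: int = 20) -> np.ndarray:
    """Iterate v_{l+1} = v_l + (1/n) * sum_i (x_i - v_l) * min(1, tau/||x_i - v_l||).

    Inputs within distance tau of the current estimate contribute like a mean;
    far-away inputs (outliers) are clipped, so they behave like in a median.
    xs has shape (n_peers, d)."""
    v = xs.mean(axis=0)  # v_0: start from the plain average
    for _ in range(n_iters):
        diffs = xs - v                                    # x_i - v_l, shape (n, d)
        norms = np.linalg.norm(diffs, axis=1)             # ||x_i - v_l||
        weights = np.minimum(1.0, tau / np.maximum(norms, 1e-12))
        v = v + (weights[:, None] * diffs).mean(axis=0)
    return v

# With tau -> inf this recovers the mean; with tau -> 0 it moves toward the
# geometric median, matching the interpolation described above.
```

After enough iterations, the returned v approximately satisfies the fixed-point equation (7), which is exactly the condition that Verification 2 later tests for.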
D.3. Protocols for banning Byzantine peers

ACCUSE and ELIMINATE are the two protocols by which peers ban Byzantine attackers from training. The ACCUSE protocol is invoked only if the malicious activity of the target peer can be proven to others. We detail the exact mechanism in Algorithm 4, which is a formal version of Algorithm 3 from Section 3.1. In contrast, ELIMINATE is a mechanism that allows any peer i to ban any other peer j from training without proof, but at the cost of peer i also being banned. We described this protocol earlier as a countermeasure against protocol violations (see Appendix C).

Both ACCUSE(i, j) and ELIMINATE(i, j) imply that peer i uses the broadcast channel to declare its intent to ban peer j. Since the broadcast channel does not guarantee the order in which these messages are received, peers should collect all of them during a training step and process them at the end of the step in a specific order (e.g., sorted by (type, public_key_i, public_key_j), where type ∈ {ACCUSE, ELIMINATE} and ACCUSE < ELIMINATE). If processing one of the messages results in banning peer p, further messages involving p are ignored regardless of p's role. This way, it is impossible for a Byzantine peer to eliminate more than one honest peer along with itself. Peers reach consensus because their decisions on banning someone are based solely on the messages from the broadcast channel (sorted in the common order) and on calculations with identical results.

Algorithm 4 ACCUSE(i, j), the formal version of Algorithm 3
Input: accuser i, target j, peer count n, all values exchanged in Algorithm 6
1: Recalculate g_j^k = COMPUTEGRADIENTS(x^k, ξ_j^k)
2: Split g_j into n parts: g_j = (g_j(1)ᵀ, ..., g_j(n)ᵀ)ᵀ, g_j(l) ∈ ℝ^{d_l} for all l ∈ [n]
3: for l = 1, ..., n do
4:   if hash(g_j^k) ≠ c_j^k or hash(g_j^k(l)) ≠ h_j^l then
5:     BAN(peer_j)   // for a gradient attack
6:   Δ_l^j = (g_l(j) − ĝ(j)) · min{1, τ / ‖g_l(j) − ĝ(j)‖₂}
7:   if ‖g_j(l) − ĝ(l)‖₂ ≠ norm_{jl} or ⟨Δ_l^j, z_j⟩ ≠ s_l^j or Σ_{l=1}^n s_l^j ≠ 0 then
8:     BAN(peer_j)   // for an aggregation attack
9: for o = 1, ..., n do
10:   if peer o approved norm_{jo} or s_j^o then
11:     BAN(peer_o)  // for covering up the j-th peer's aggregation attack

D.4. ButterflyClip

Algorithm 5 provides details on the peer-to-peer communication conducted during a BTARD aggregation step; it was outlined earlier in Algorithm 2 from Section 3.1. For simplicity, we assume (here and below) that workers run each line in a synchronous manner (e.g., wait for all peers to broadcast hash(g_i) before communicating the actual gradients). In practice, this restriction can be lifted in favor of asynchronous steps with several explicit synchronization barriers, but that would further complicate the pseudo-code.

Algorithm 5 BUTTERFLYCLIP for peer i, the formal version of Algorithm 2
Input: rank i, gradients g_i ∈ ℝᵈ
1: Split g_i into n parts: g_i = (g_i(1)ᵀ, ..., g_i(n)ᵀ)ᵀ, g_i(j) ∈ ℝ^{d_j} for all j ∈ [n]
2: for j = 1, ..., n do
3:   broadcast c_i(j) = hash(g_i(j))
4: Send g_i(j) to peer j for all j ≠ i
5: Receive g_j(i) from peer j for all j ≠ i
6: for j = 1, ..., n do
7:   if hash(g_j(i)) ≠ c_j(i) then
8:     ELIMINATE(i, j)   // signed with peer_i's private key
9: ĝ(i) = CENTEREDCLIP(g_1(i), g_2(i), ..., g_n(i))
10: broadcast ĉ(i) = hash(ĝ(i))
11: Send ĝ(i) to each worker
12: Receive ĝ(j) for all j ≠ i from the other workers
13: for j = 1, ..., n do
14:   if hash(ĝ(j)) ≠ ĉ(j) then
15:     ELIMINATE(i, j)  // signed with peer_i's private key
16: return MERGE(ĝ(1), ..., ĝ(n))
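For intuition, the following single-process simulation (our own sketch; networking, signatures, and the ELIMINATE calls are omitted) mirrors the data flow of Algorithm 5: commit to part hashes first, then exchange shards, then run per-shard CENTEREDCLIP and MERGE:

```python
import hashlib
import numpy as np

def centered_clip(xs, tau, n_iters=20):  # same update rule as in the D.2 sketch
    v = xs.mean(axis=0)
    for _ in range(n_iters):
        d = xs - v
        w = np.minimum(1.0, tau / np.maximum(np.linalg.norm(d, axis=1), 1e-12))
        v = v + (w[:, None] * d).mean(axis=0)
    return v

def split(v: np.ndarray, n: int):
    """SPLIT(v, n): first (d mod n) parts get ceil(d/n) elements, the rest floor(d/n)."""
    return np.array_split(v, n)

def sha256(part: np.ndarray) -> str:
    return hashlib.sha256(part.tobytes()).hexdigest()

def butterfly_clip(gradients, tau: float) -> np.ndarray:
    """Simulate one aggregation: peer j robustly averages only the j-th part of
    everyone's gradient, then all parts are re-assembled (MERGE)."""
    n = len(gradients)
    parts = [split(g, n) for g in gradients]  # parts[i][j]: peer i's j-th part
    # Each peer commits to hash(g_i(j)) for every part before seeing anyone
    # else's data, so it cannot adapt its gradient afterwards.
    commitments = [[sha256(parts[i][j]) for j in range(n)] for i in range(n)]
    aggregated = []
    for j in range(n):  # peer j aggregates the j-th shard from all peers
        shard = np.stack([parts[i][j] for i in range(n)])
        assert all(sha256(parts[i][j]) == commitments[i][j] for i in range(n))
        aggregated.append(centered_clip(shard, tau))
    return np.concatenate(aggregated)         # MERGE(ĝ(1), ..., ĝ(n))
```

The key design property visible here is that every peer aggregates only a 1/n slice of the model, so the per-peer communication and compute match those of a standard butterfly all-reduce.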
D.5. Byzantine-tolerant All-Reduce and its verification procedures

Algorithm 6 defines a single gradient aggregation step with additional verification procedures needed to reduce the negative influence of Byzantine peers. We explain the motivation for each of these procedures below.

Algorithm 6 Byzantine-Tolerant All-Reduce (BTARD)
Input: number of workers n; gradient vectors g_1, g_2, ..., g_n ∈ ℝᵈ on the workers; d > n; Δ_max > 0, the parameter for Verification 3
1: for workers i = 1, ..., n in parallel do
2:   ĝ = BUTTERFLYCLIP(i, g_i)   // described in Algorithm 5
3:   Send metadata for verification:
4:   Generate r via the MPRNG
5:   z = GETRANDOMVECTOR(r)
6:   for j = 1, ..., n do
7:     Δ_i^j = (g_i(j) − ĝ(j)) · min{1, τ / ‖g_i(j) − ĝ(j)‖₂}
8:     broadcast s_i^j = ⟨z[j], Δ_i^j⟩
9:     broadcast norm_{ij} = ‖g_i(j) − ĝ(j)‖₂
10:   for l = 1, ..., n and j = 1, ..., n do
11:     w_{lj} = min{1, τ / norm_{lj}}
12:   for j = 1, ..., n do
13:     Verification 1:
14:     if norm_{ji} ≠ ‖g_j(i) − ĝ(i)‖₂ then
15:       broadcast "norm_{ji} does not match c_j(i)"   // all recipients should run ACCUSE(i, j) (Algorithm 4)
16:     Verification 2:   // peer i knows Δ_j^i from CENTEREDCLIP
17:     if s_j^i ≠ ⟨z[i], Δ_j^i⟩ then
18:       broadcast "s_j^i does not match c_j(i)"   // all recipients should run ACCUSE(i, j) (Algorithm 4)
19:     if Σ_{l=1}^n s_l^j ≠ 0 then   // peer j lied that all s^j are correct
20:       broadcast "ĝ(j) is wrong"   // all recipients should run ACCUSE(i, j) (Algorithm 4)
21:     Verification 3:
22:     broadcast check_{ij} = [ ‖g_i(j) − ĝ(j)‖₂ > Δ_max ]
23:     if Σ_l check_{lj} > n/2 then
24:       CHECKAVERAGING(j)
25: return ĝ

Verifications 1 and 2. While good peers always run CENTEREDCLIP, Byzantine peers can violate the protocol arbitrarily: they can send an arbitrary vector instead of the result of CENTEREDCLIP. Verifications 1 and 2 are needed to prevent such violations and make it possible to identify them during the check of computations. First of all, both verifications are split into two rounds in order to let the aggregators of the corresponding part accuse those peers who send inconsistent norms or inner products. Next, in theory, we assume that all good peers find the exact solution of the CENTEREDCLIP equation (7). Therefore, it is possible to compute the weights from (7) for each worker i and each component j knowing only the norm of the difference of the corresponding vectors, i.e., one can compute min{1, τ/‖g_i(j) − ĝ(j)‖} from ‖g_i(j) − ĝ(j)‖. That is, if a Byzantine peer i sends norm_{ij} ≠ ‖g_i(j) − ĝ(j)‖, it will either be revealed by the j-th worker (if j ∈ G) or be revealed with some probability during the subsequent checks of computations. However, Verification 1 alone is insufficient to prevent malicious behavior: at iteration k, a Byzantine peer can send g_i^k(j) such that ‖g_i^k(j) − ĝ^k(j)‖ = ‖∇_{(j)}f(x^k, ξ_{i,k}) − ĝ^k(j)‖. If j ∈ B, it can be the case that the i-th worker commits the hash of ∇_{(j)}f(x^k, ξ_{i,k}) and the check of gradient computations will not identify the violation of the protocol. That is why Verification 2 is required.

GETRANDOMVECTOR is a function that generates a random unit vector z in the space of model parameters, based on a random seed r obtained from the MPRNG. The goal of Verification 2 is to check that the CENTEREDCLIP equation (7) holds for the received vector. The idea is simple: if

Σ_{l=1}^n (g_l(i) − ĝ(i)) · min{1, τ/‖g_l(i) − ĝ(i)‖} = 0,   (8)

then, for any z_i of an appropriate dimension,

Σ_{l=1}^n ⟨g_l(i) − ĝ(i), z_i⟩ · min{1, τ/‖g_l(i) − ĝ(i)‖} = 0.   (9)

Since z_i in BTARD is generated from the uniform distribution on the unit Euclidean sphere, we have

P{(8) does not hold and (9) holds} = 0.   (10)

However, it is impossible to verify (9) explicitly for workers j ≠ i. Therefore, in the algorithm, good workers check that Σ_{l=1}^n s_l^i = 0, where

s_l^i = ⟨g_l(i) − ĝ(i), z_i⟩ · min{1, τ/‖g_l(i) − ĝ(i)‖} if l ∈ G, and s_l^i is arbitrary if l ∈ B.   (11)

Unfortunately, Byzantine peers can send arbitrary s_l^i. This can lead to situations where the check Σ_{l=1}^n s_l^i = 0 passes while (9) and, as a consequence, (8) do not hold.
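Numerically, the check behind (8)-(11) amounts to projecting the clipped residuals onto the shared vector z and testing that the resulting scalars sum to zero. A small sketch of this test (our own; the function name is hypothetical):

```python
import numpy as np

def clip_residual_sum(parts: np.ndarray, aggregate: np.ndarray,
                      z: np.ndarray, tau: float) -> float:
    """Return sum_l <g_l(i) - g_hat(i), z> * min(1, tau / ||g_l(i) - g_hat(i)||).

    If g_hat(i) solves the CenteredClip fixed-point equation (7), the vector of
    weighted residuals sums to zero, so its projection onto any z is zero.
    An aggregator that shifted g_hat(i) fails this check with probability 1,
    because z is drawn uniformly on the sphere after the shift is committed."""
    diffs = parts - aggregate                             # g_l(i) - g_hat(i)
    norms = np.linalg.norm(diffs, axis=1)
    weights = np.minimum(1.0, tau / np.maximum(norms, 1e-12))
    s = (diffs @ z) * weights                             # the scalars s_l^i
    return float(s.sum())                                 # honest case: ~0
```

In the real protocol, each s_l^i is broadcast by peer l itself rather than computed centrally, which is precisely why the extra consistency checks against the committed hashes are needed.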
Below, we rigorously show that all possible violations of the protocol that are not detected by the verifications of BTARD can be detected, with some probability, by the auxiliary check of computations.

Verification 3. This is an additional verification that serves to limit the potential scope of aggregation attacks (as described in Appendix C). If the result of CENTEREDCLIP lands far from too many benign participants, BTARD verifies it by re-running the same aggregation across all peers. While this procedure is costly, our analysis proves that it has a very small probability of being triggered unless some of the peers perform aggregation attacks. In the latter case, verifying the gradient aggregation will root out such attacks and ban the corresponding peers.

Check of computations. As mentioned earlier, it is possible to violate the protocol without being detected by the verifications of BTARD. Therefore, extra checks of computations are required. In particular, after each aggregation in BTARD-SGD, 2m workers are selected uniformly at random: m workers check the computations made at the previous step by the other m workers. That is, each Byzantine peer is checked at iteration k with probability m/n by some good worker (see the proof of Theorem E.2).

Consider an arbitrary Byzantine peer j and all possible violations of the protocol at iteration k that are not detected by the verifications of BTARD. First, we notice that if c_j(i) ≠ hash(∇_{(i)}f(x^k, ξ_{j,k})), then this will be detected during the check of computations with some probability⁸. Moreover, if i ∈ G, then the j-th worker has to send c_j(i) = hash(g_j(i)) to avoid a ban. Therefore, the only non-trivial case is when i ∈ B as well. In this case, the j-th worker can commit c_j(i) ≠ hash(∇_{(i)}f(x^k, ξ_{j,k})), since it is meaningless for the i-th worker to accuse the j-th one. Since norm_{ij}, s_i^j, and ĝ(i) are known for all i and j, the j-th worker has to broadcast norm_{ji} = ‖∇_{(i)}f(x^k, ξ_{j,k}) − ĝ(i)‖ and s_j^i = ⟨∇_{(i)}f(x^k, ξ_{j,k}) − ĝ(i), z_i⟩ · min{1, τ/‖∇_{(i)}f(x^k, ξ_{j,k}) − ĝ(i)‖} to avoid a ban during the check of computations. Therefore, regardless of the choice of g_j(i), to pass Verification 2 the i-th worker should send such ĝ(i) that

Σ_{l∈G∪{j}} ⟨∇_{(i)}f(x^k, ξ_{l,k}) − ĝ(i), z_i⟩ · min{1, τ/‖∇_{(i)}f(x^k, ξ_{l,k}) − ĝ(i)‖} + Σ_{l∈B\{j}} s_l^i = 0.

In this case, the behavior of the j-th worker along the i-th component is equivalent to the behavior of a good one. It means that, to avoid a ban during the check of computations, each Byzantine worker l should broadcast norm_{li} = ‖∇_{(i)}f(x^k, ξ_{l,k}) − ĝ(i)‖ and s_l^i = ⟨∇_{(i)}f(x^k, ξ_{l,k}) − ĝ(i), z_i⟩ · min{1, τ/‖∇_{(i)}f(x^k, ξ_{l,k}) − ĝ(i)‖}, implying that the i-th worker should send such ĝ(i) that

Σ_{l=1}^n ⟨∇_{(i)}f(x^k, ξ_{l,k}) − ĝ(i), z_i⟩ · min{1, τ/‖∇_{(i)}f(x^k, ξ_{l,k}) − ĝ(i)‖} = 0.

In view of (10), this implies ĝ(i) = CENTEREDCLIP(∇_{(i)}f(x^k, ξ_{1,k}), ∇_{(i)}f(x^k, ξ_{2,k}), ..., ∇_{(i)}f(x^k, ξ_{n,k})), i.e., there are no violations of the protocol along the i-th component.

⁸ Here and below, this means that the attack/violation will be detected iff a non-Byzantine peer is chosen to validate the perpetrator.
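Schematically, the check of computations is a deterministic replay: since the batch seed s_{i,k} is public, a validator can recompute the gradient bit-for-bit and compare it against the committed hash. A sketch under these assumptions (our own; compute_gradients is a hypothetical stand-in for the model's actual gradient computation):

```python
import hashlib
import numpy as np

def compute_gradients(x: np.ndarray, seed: int) -> np.ndarray:
    """Stand-in for COMPUTEGRADIENTS(x^k, xi^k_j): the batch xi is derived
    deterministically from a public seed, so anyone can reproduce it."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(x.shape)   # hypothetical stochastic component
    return x + noise                       # hypothetical gradient

def check_computations(x: np.ndarray, seed: int, committed_hash: str) -> bool:
    """Validator u re-runs peer j's gradient computation from the shared seed
    and compares it with the hash that j broadcast before aggregation.
    A mismatch is verifiable evidence of a gradient attack -> ACCUSE(u, j)."""
    recomputed = compute_gradients(x, seed)
    return hashlib.sha256(recomputed.tobytes()).hexdigest() == committed_hash
```

Because the validator's verdict can be re-derived by every peer from broadcast data, a false accusation is itself provable, which is what makes the ACCUSE protocol safe against slander.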
D.6. BTARD-SGD training loop

Finally, Algorithm 7 combines all of the procedures above into a training loop for secure decentralized SGD. Algorithms 6-7 represent a formal version of Algorithm 1 from Section 3.1.

Algorithm 7 BTARD-SGD, the formal version of Algorithm 1
Input: x⁰, the starting point; γ, the stepsize; K, the number of iterations; {s_{i,k}}, the seeds for batch computations
1: C⁰ = Banned⁻¹ = ∅
2: for k = 0, 1, ..., K − 1 do
3:   Worker i computes g_i^k = ∇f(x^k, ξ_{i,k}) if i ∈ G^k \ C^k and an arbitrary vector if i ∈ B^k \ C^k, where ξ_{i,k} is generated via the seed s_{i,k} available to every worker
4:   ĝ^k, public_info^k = BTARD(g^k_{i^k_1}, ..., g^k_{i^k_{a_k}}), where {i^k_1, ..., i^k_{a_k}} = (G^k ∪ B^k) \ C^k   // BTARD is described in Algorithm 6
5:   Choose 2m workers c^{k+1}_1, ..., c^{k+1}_m, u^{k+1}_1, ..., u^{k+1}_m uniformly at random without replacement; C^{k+1} = {c^{k+1}_1, ..., c^{k+1}_m}, U^{k+1} = {u^{k+1}_1, ..., u^{k+1}_m}
6:   Banned^k = CHECKCOMPUTATIONS(C^{k+1}, U^{k+1}, public_info^k)
7:   x^{k+1} = proj_Q(x^k − γĝ^k) := argmin_{x∈Q} ‖x − (x^k − γĝ^k)‖
8:   G^{k+1} = G^k \ Banned^k
9:   B^{k+1} = B^k \ Banned^k

E. Convergence analysis: missing proofs and extra details

E.1. Preliminaries

For convenience, we provide the classical definitions and facts on smooth and strongly convex functions below.

Definition E.1 (L-smoothness). We say that a function f: Q → ℝ, Q ⊆ ℝᵈ, is L-smooth if it is differentiable and, for all x, y ∈ Q,

‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖.   (12)

One can show (Nesterov, 2003) that L-smoothness implies that, for all x, y ∈ Q,

f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (L/2)‖y − x‖²,   (13)

and, for all x ∈ Q,

‖∇f(x)‖² ≤ 2L(f(x) − f*),   (14)

where f* is a uniform lower bound for f.

Definition E.2 (µ-strong convexity). A differentiable function f: Q → ℝ, Q ⊆ ℝᵈ, is called µ-strongly convex if, for all x, y ∈ Q,

f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (µ/2)‖y − x‖².   (15)

E.2. Impossibility of Byzantine-tolerant learning in heterogeneous case

Several papers on Byzantine-tolerant optimization consider the non-homogeneous setup, in which good workers have different local functions (Wu et al., 2020; He et al., 2020). Formally, this means that instead of solving

min_{x∈Q⊆ℝᵈ} { f(x) := E_{ξ∼D}[f(x, ξ)] },   (16)

where good peers sample stochastic gradients from the full dataset (i.e., they can sample ξ from D), the following problem is considered:

min_{x∈Q⊆ℝᵈ} { f(x) := (1/|G|) Σ_{i∈G} f_i(x) },   (17)

where f_i(x) = E_{ξ_i∼D_i}[f(x, ξ_i)] and there exists ζ ≥ 0 such that, for all x ∈ Q and i ∈ G,

‖∇f_i(x) − ∇f(x)‖² ≤ ζ².   (18)

However, under the ζ-bounded heterogeneity assumption (18), it is in general impossible to solve (17) to an arbitrary predefined accuracy in the presence of Byzantine peers (He et al., 2020). Moreover, this is true even when a trusted parameter server is available.

Theorem E.1 (Theorem III from (He et al., 2020)). For any optimization method Alg, there exist n functions f₁(x), ..., fₙ(x) such that at least (1 − δ)n of them are good (the corresponding workers belong to G), 1-smooth, µ-strongly convex, and satisfy (18), and such that the output x̂ of Alg, given access to these n functions, has an error at least

E[f(x̂)] − min_{x∈ℝᵈ} f(x) = Ω(δζ²/µ)  and  E[‖∇f(x̂)‖²] = Ω(δζ²),   (19)

where the expectation is taken w.r.t. the randomness of Alg.

The intuition behind this negative result is as follows: since the only assumption on the similarity of the good functions is (18), Byzantine peers can shift the gradients by a vector with norm of order ζ without being detected. In this case, it is impossible to distinguish good peers from Byzantine ones, yet the solution of (17) depends on which workers are good and which are bad. Therefore, the best one can hope for is convergence to some neighborhood of the solution. The lower bounds from (19) are proportional to δζ² and cannot be made arbitrarily small for given δ and ζ². This means that convergence to an arbitrary predefined accuracy is impossible to achieve when the local loss functions are ζ-heterogeneous. In this sense, Byzantine-tolerant learning is impossible in the heterogeneous case.
Moreover, in some practical applications (e.g., in Federated Learning (McMahan et al., 2017)), ζ from (18) can be large, implying that one cannot achieve a reasonable accuracy when δ is not too small (e.g., δ ≥ 0.01). Finally, the strong convexity parameter µ is typically much smaller than 1 (assuming the smoothness parameter is 1). In such cases, δζ²/µ can be too large and, as a result, the methods do not converge at all.

E.3. Convergence guarantees for BTARD-SGD

E.3.1. ON ASSUMPTIONS 3.1 AND 3.2

First of all, Assumption 3.1 holds whenever the standard uniformly bounded variance (UBV) assumption is satisfied. Indeed, if E_{ξ∼D}[‖∇f(x, ξ) − ∇f(x)‖²] ≤ σ̂², then E_{ξ∼D}[(∇_i f(x, ξ) − ∇_i f(x))²] ≤ σ̂² for all i = 1, ..., d, since ‖∇f(x, ξ) − ∇f(x)‖² = Σ_{i=1}^d (∇_i f(x, ξ) − ∇_i f(x))². This implies that Assumption 3.1 holds with σ² ≤ dσ̂². However, σ² can be significantly smaller than dσ̂². For example, if the noise in the stochastic gradients is isotropic, e.g., Gaussian, then E_{ξ∼D}[(∇_1 f(x, ξ) − ∇_1 f(x))²] = ... = E_{ξ∼D}[(∇_d f(x, ξ) − ∇_d f(x))²], implying that E_{ξ∼D}[(∇_i f(x, ξ) − ∇_i f(x))²] = (1/d) E_{ξ∼D}[‖∇f(x, ξ) − ∇f(x)‖²] ≤ σ̂²/d for all i = 1, ..., d. Therefore, in this case, Assumption 3.1 holds with σ² = σ̂².

Next, it is possible to relax Assumption 3.1 to the classical UBV assumption. Indeed, in our proofs, Assumption 3.1 is used to bound the variance in the blocks of the stochastic gradients, where the blocks of components are chosen for the workers executing BTARD. If these blocks are chosen uniformly at random, i.e., the vector is split into several parts of the given sizes uniformly at random, then it is enough to have

E[‖∇_{[S]}f(x, ξ) − ∇_{[S]}f(x)‖²] ≤ sσ²   (20)

for a random subset S of {1, ..., d} with |S| = s, where the expectation is taken w.r.t. both ξ and S. To derive inequality (20) from the UBV assumption E_{ξ∼D}[‖∇f(x, ξ) − ∇f(x)‖²] ≤ σ̂², we use the tower property of expectation:

E[‖∇_{[S]}f(x, ξ) − ∇_{[S]}f(x)‖²] = E_{ξ∼D}[ E_S[‖∇_{[S]}f(x, ξ) − ∇_{[S]}f(x)‖²] ]
= E_{ξ∼D}[ Σ_{i=1}^d P{i ∈ S} (∇_i f(x, ξ) − ∇_i f(x))² ]
= (s/d) E_{ξ∼D}[ Σ_{i=1}^d (∇_i f(x, ξ) − ∇_i f(x))² ]
= (s/d) E_{ξ∼D}[‖∇f(x, ξ) − ∇f(x)‖²] ≤ sσ̂²,

i.e., (20) holds with σ² = σ̂².

Finally, as we show in Lemmas E.2 and E.4, under Assumption 3.2 and a proper choice of Δ_max, Verification 3 of BTARD triggers an extra check of computations at each iteration with probability roughly 1/n when all workers honestly follow the protocol. Therefore, extra computations either appear due to malicious manipulations by Byzantine peers, eventually leading to a ban for those who deviate from the protocol, or, when all workers honestly follow the protocol, occur only once per n iterations on average. There are a number of important machine learning tasks, such as training ResNet-50 on ImageNet (Zhang et al., 2020) and many other image classification problems, where the noise in the stochastic gradient has much lighter (sub-Gaussian) tails. That is, Assumption 3.2 is reasonable for a large class of practically important problems. Moreover, in Appendix E.4, we also provide an analysis of BTARD-CLIPPED-SGD and RESTARTED-BTARD-CLIPPED-SGD without any assumptions on the tails of the stochastic gradient distribution.

E.3.2. QUALITY OF THE AGGREGATION

The quality of the aggregation at each iteration of BTARD-SGD significantly affects the rate of the method; that is, the properties of ĝ^k are highly important for the convergence of BTARD-SGD. This aggregate is obtained via BTARD, which requires a tight estimate of the total number of Byzantine workers violating the protocol at iteration k: the clipping parameter τ depends on this quantity.
Therefore, it is natural to start with the relatively simple setup where the number of Byzantine workers violating the protocol is known at each iteration. Before formulating the first result, we introduce some useful notation. Let n_k be the total number of peers at iteration k, b_k be the total number of Byzantine peers at iteration k, and b̂_k be the total number of Byzantine peers violating the protocol at iteration k; define δ_k = b_k/n_k and δ̂_k = b̂_k/(n_k − m). In view of this notation, we start with the ideal situation when b̂_k is known to each worker at each iteration k. First, we need to estimate the quality of the aggregation for the good workers.

Lemma E.1 (Theorem IV from Karimireddy et al. (2020)). Let Assumption 3.1 hold, b ≤ 0.1(n − m), and i ∈ G^k \ C^k. Assume that b̂_k is known to each worker at iteration k and δ = δ̂_k is used to compute the clipping parameter τ_l for CENTEREDCLIP. If the total number of iterations T of CENTEREDCLIP satisfies T ≥ log_{9.7δ}( δσ² / (3E[‖v⁰ − ḡ^k‖²]) ), then

E[‖ĝ^k(i) − ḡ^k(i)‖² | x^k] ≤ 4001 δ̂_k σ²/(n_k − m),   (21)

where ḡ^k(i) = (1/|G^k \ C^k|) Σ_{j∈G^k\C^k} g_j^k(i).

Proof. The proof follows directly from (6). ∎

Unlike the good peers, Byzantine workers can cooperate and shift the result of CENTEREDCLIP in the components they aggregate without being revealed by Verification 2 of BTARD. However, they cannot produce arbitrarily large shifts due to Verification 3. The next lemma estimates the maximal possible magnitude of such a shift, together with the probability of triggering CHECKAVERAGING at iteration k for at least one worker.

Lemma E.2. Let Assumptions 3.1 and 3.2 hold, b ≤ 0.1(n − m), and i ∈ B^k \ C^k. Assume that b̂_k is known to each worker at iteration k, Δ^k_max = (1 + √3)√2σ/√(n_k − m), and δ = δ̂_k is used to compute the clipping parameter τ_l for CENTEREDCLIP. If the total number of iterations T of CENTEREDCLIP satisfies T ≥ log_{9.7δ}( δσ² / (3E[‖v⁰ − ḡ^k‖²]) ) and CHECKAVERAGING(i) is not triggered, then

E[‖ĝ^k(i) − ḡ^k(i)‖² | x^k] ≤ 4(1 + √3)² σ²/(n_k − m),   (22)

where ḡ^k(i) = (1/|G^k \ C^k|) Σ_{j∈G^k\C^k} g_j^k(i). Moreover, if b̂_k = 0 and n_k − m ≥ 170, then ĝ^k(i) = ḡ^k(i) and

P{CHECKAVERAGING is triggered for at least 1 peer | x^k} ≤ 149/(49(n_k − m)).   (23)

Proof. If CHECKAVERAGING(i) is not triggered at iteration k, then for r_k ≥ (n_k − m)/2 good workers i_1, i_2, ..., i_{r_k} ∈ G^k \ C^k we have ‖g^k_{i_j}(i) − ĝ^k(i)‖ ≤ Δ^k_max. Therefore, due to the independence of g_i^k, i ∈ G^k \ C^k, for a fixed x^k we have

E[‖ĝ^k(i) − ḡ^k(i)‖² | x^k]
≤ 2E[‖ĝ^k(i) − (1/r_k) Σ_{j=1}^{r_k} g^k_{i_j}(i)‖² | x^k] + 2E[‖(1/r_k) Σ_{j=1}^{r_k} g^k_{i_j}(i) − ḡ^k(i)‖² | x^k]
≤ (2/r_k) Σ_{j=1}^{r_k} E[‖ĝ^k(i) − g^k_{i_j}(i)‖² | x^k] + 4E[‖(1/r_k) Σ_{j=1}^{r_k} g^k_{i_j}(i) − ∇_{(i)}f(x^k)‖² | x^k] + 4E[‖∇_{(i)}f(x^k) − ḡ^k(i)‖² | x^k]
≤ 2(Δ^k_max)² + 4σ²/|G^k \ C^k| + 8σ²/r_k,

where we use |G^k \ C^k| ≥ r_k ≥ (n_k − m)/2 and ∇_{(i)}f(x^k) = E[g^k_{i_j}(i) | x^k].

Finally, let us estimate the probability of triggering CHECKAVERAGING when all workers follow the protocol. In this case, ĝ^k(i) = ḡ^k(i). Next, due to Assumption 3.2 and b ≤ 0.1(n − m), the deviation ‖ḡ^k(i) − ∇_{(i)}f(x^k)‖ exceeds its concentration threshold with probability at most 100/(49(n_k − m)²), and each ‖g_j^k(i) − ∇_{(i)}f(x^k)‖, j ∈ G^k \ C^k, is concentrated similarly. Consider the independent random variables η_j, j ∈ G^k \ C^k, where η_j = 1 if ‖g_j^k(i) − ∇_{(i)}f(x^k)‖ is below its concentration threshold and η_j = 0 otherwise, for a fixed x^k. Then, η_j is a Bernoulli random variable with success probability q ≥ 8/9. Applying Hoeffding's inequality, we get

P{ Σ_{j∈G^k\C^k} η_j ≤ (n_k − m)/2 } ≤ exp(−2(n_k − m)(q − 1/2)²) ≤ exp(−49(n_k − m)/162).

Since ‖ḡ^k(i) − g_j^k(i)‖ ≤ ‖ḡ^k(i) − ∇_{(i)}f(x^k)‖ + ‖∇_{(i)}f(x^k) − g_j^k(i)‖ for all j ∈ G^k \ C^k, the obtained bounds imply that CHECKAVERAGING is triggered for at least one worker at iteration k with probability not greater than

100/(49(n_k − m)) + (n_k − m) exp(−49(n_k − m)/162) ≤ 149/(49(n_k − m)),

where we use that x² exp(−49x/162) ≤ 1 for all x ≥ 170. ∎
We notice that Byzantine peers can trigger CHECKAVERAGING by violating the protocol. However, each Byzantine is checked at iteration k with probability p m/n (see Thm. E.2). Therefore, Byzantine workers can trigger only O (bn/m) extra rounds of communications and computations on average via triggering CHECKAVERAGING. In contrast, when there are no Byzantine workers or all workers follow the protocol CHECKAVERAGING is triggered only once per O(n m) iterations that is a negligible communication an computation overhead when n is large. Combining two previous lemmas we get the following result. Lemma E.3. Let As. 3.1 hold and b 0.1(n m). Assume that bbk is known for each worker at iteration k, k max = (1+ 2σ nk m and δ = bδk is used to compute clipping parameter τl for Centered Clip. If the total number of iterations T of Centered Clip satisfies T log9.7δ δσ2 3E[ v0 gk 2] and CHECKAVERAGING is not triggered for any worker, then E bgk gk 2 | xk Cbδkσ2, (24) E bgk 2 | xk 2Cbδkσ2 + 2 f(xk) 2 + 2σ2 n 2b m, (25) where gk = 1 |Gk\Ck| P j Gk\Ck gk j and C = 4001 + 4 (1 + Proof. We have E bgk gk 2 | xk = X i Gk\Ck E bgk(i) gk(i) 2 | xk + X i Bk\Ck E bgk(i) gk(i) 2 | xk (21),(22) (1 bδk)(nk m) 4001bδk σ2 nk m + bδk(nk m) 4 (1 + Next, using the independence of gk j for j Gk \ Ck and fixed xk we derive E bgk 2 | xk 2E bgk gk 2 | xk + 2E gk 2 | xk (24) 2Cbδkσ2 + 2 f(xk) 2 + 2E gk f(xk) 2 | xk 2Cbδkσ2 + 2 f(xk) 2 + 2σ2 2Cbδkσ2 + 2 f(xk) 2 + 2σ2 In view of the definition of (δ, c)-robust aggregator from Karimireddy et al. (2020), the result of BTARD at iteration k is ( bδk, C)-robust. However, we derive this property under assumption that bbk is known to all workers at each iteration k, which is impractical. When bbk is unknown the situation changes dramatically: in general, good peers can only know some upper bound for the fraction of Byzantine peers at iteration k. Unfortunately, if used without bans, this is not enough to converge to any accuracy of the solution since BTARD-SGD is a permutation-invariant algorithm in terms of Karimireddy et al. (2020). Therefore, in this case, we always use CENTEREDCLIP with τl = for all l 0, i.e., good peers compute an exact average. In this settings, even 1 Byzantine worker can significantly shift the average in all parts of the vector. The next lemma quantifies the negative effect of Byzantine workers in this case. Lemma E.4. Let As. 3.1 and 3.2 hold, b 0.1(n m), m (n 2b)/2. Assume that k max = (1+ 2σ nk m and δ = 0 is used to compute clipping parameter τl for Centered Clip. If CHECKAVERAGING is not triggered for any worker, then E bgk gk 2 | xk Cσ21k,v, (26) Secure Distributed Training at Scale E bgk 2 | xk 2Cσ21k,v + 2 f(xk) 2 + 2σ2 n 2b m, (27) where gk = 1 |Gk\Ck| P j Gk\Ck gk j , C = 4 (1 + 3)2 + 4 , and 1k,v is an indicator function of the event that at least 1 Byzantine peer violates the protocol at iteration k. Moreover, if bbk = 0 and nk m 170, then bgk(i) = gk(i) and P CHECKAVERAGING is triggered for 1 peer | xk 149 49(nk m). (28) Proof. If CHECKAVERAGING is not triggered for any worker, then bgk(i) gk(i) 2 ( k max)21k,v for all i (Gk Ck)\Ck implying E bgk gk 2 | xk = X i (Gk Ck)\Ck E bgk(i) gk(i) 2 | xk (nk m) 2 (1 + nk m 1k,v Cσ21k,v. Next, using the independence of gk j for j Gk \ Ck and fixed xk we derive E bgk 2 | xk 2E bgk gk 2 | xk + 2E gk 2 | xk (26) 2Cσ21k,v + 2 f(xk) 2 + 2E gk f(xk) 2 | xk 2Cσ21k,v + 2 f(xk) 2 + 2σ2 2Cσ21k,v + 2 f(xk) 2 + 2σ2 The proof of the final part of the lemma is identical to the proof of the same result from Lemma E.2. 
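The mechanism behind Lemmas E.3 and E.4 is easy to visualize: whatever the stepsize, a Byzantine aggregator that stays within the Verification 3 threshold can bias its block of the average by at most $\Delta_{\max}$, so the total aggregation error is bounded independently of γ. A toy sketch (treating the threshold as a per-coordinate bound for simplicity; the block assignment and names are our own illustration):

import numpy as np

rng = np.random.default_rng(1)
n, block, delta_max = 16, 8, 0.25   # peers, coords per block, threshold
d = n * block

grads = rng.normal(size=(n, d))     # honest stochastic gradients
true_avg = grads.mean(axis=0)

# Peer i aggregates coordinates [i * block, (i + 1) * block). A Byzantine
# aggregator can shift its whole block by up to delta_max per coordinate
# without triggering CHECKAVERAGING; the resulting aggregation error is
# bounded by block * delta_max^2 and does not depend on the stepsize.
aggregated = true_avg.copy()
byz = 3                             # index of the Byzantine aggregator
aggregated[byz * block:(byz + 1) * block] += delta_max

err = np.sum((aggregated - true_avg) ** 2)
print(err, block * delta_max ** 2)  # both equal 0.5 here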
E.3.3. NON-CONVEX CASE In this section, we provide the complete statements and the full proofs of the convergence results for BTARD-SGD when the objective function f is smooth, but can be non-convex. We start with the case when the number of attacking Byzantine workers is known at each iteration. Theorem E.2. Let As. 3.1 and As. 3.2 hold, Q = Rd, and f be L-smooth (see Def. E.1) and uniformly lower bounded by f . Moreover, assume that b 0.1(n m), m (n 2b)/2, and the exact number of attacking Byzantine peers is known to all good peers at each iteration. Next, assume that , k max = (1 + 2σ nk m , (29) where 0 = f(x0) f and k max is the parameter for verification 3 at iteration k of BTARD-SGD. Then, we have E[ f(x K) 2] ε2 after K iterations of BTARD-SGD, where and x K is picked uniformly at random from {x0, x1, . . . , x K 1}. Proof. From L-smoothness of f we have f(xk+1) (13) f(xk) + f(xk), xk+1 xk + L 2 xk+1 xk 2 = f(xk) γ f(xk), bgk + Lγ2 Secure Distributed Training at Scale Taking the conditional expectation E[ | xk] from the both sides of the previous inequality we obtain E f(xk+1) | xk f(xk) γ f(xk) 2 γ f(xk), E bgk gk | xk 2 E bgk 2 | xk (25) f(xk) γ 2 f(xk) 2 + γ E bgk gk | xk 2 +CLγ2bδkσ2 + Lγ2 f(xk) + Lγ2σ2 2 (1 2Lγ) f(xk) 2 + γ 2 E bgk gk 2 | xk +CLγ2bδkσ2 + Lγ2σ2 Since γ 1 4L we continue our derivations as E f(xk+1) | xk f(xk) γ 4 f(xk) 2 + γCσ2(1 + Lγ)bδk + Lγ2σ2 4 f(xk) 2 + 2γCσ2bδk + Lγ2σ2 Taking the full expectation from the both sides of the obtained inequality and summing up the results for k = 0, 1, . . . , K 1 we get k=0 E f(xk) 2 4 γK k=0 E f(xk) f(xk+1) + 8Cσ2 = 4 f(x0) E[f(x K)] 4(f(x0) f ) If a Byzantine peer deviates from the protocol at iteration k, it will be detected with some probability pk during the next iteration. One can lower bound this probability as nk = m(1 δk) Therefore, each individual Byzantine worker can violate the protocol no more than 1/p times on average implying that k=0 E f(xk) 2 4(f(x0) f ) γK + 8Cnbσ2 Km(n 2b m) + 4Lγσ2 4(f(x0) f ) γK + 16Cnbσ2 Km(n 2b) + 8Lγσ2 4(f(x0) f ) γK + 160Cnδσ2 7Km + 80Lγσ2 Since x K is picked uniformly at random from {x0, x1, . . . , x K 1} we have E f(x K) 2 4(f(x0) f ) γK + 160Cnδσ2 7Km + 80Lγσ2 Using the stepsize rule Secure Distributed Training at Scale E f(x K) 2 = O L 0 meaning that after iterations BTARD-SGD guarantees E f(x K) 2 ε2. In the main part of the paper, we notice that the rate of BTARD-SGD in the presence of bad workers is asymptotically the same as for SGD without Byzantine peers when ε is sufficiently small9. This phenomenon has a clear intuition. When the target accuracy ε is small, the stepsize γ is also needed to be small enough. However, as we show in Lemmas E.3 and E.4, Byzantine workers can produce only a bounded shift independent of the stepsize. Moreover, they can violate the protocol at only n/m iterations on average. Therefore, the overall impact of Byzantine workers on the convergence of BTARD-SGD decreases when the stepsize γ decreases. Next, we derive the result without assuming that bbk is known to all peers at each iteration. Theorem E.3. Let As. 3.1 and 3.2 hold, Q = Rd, and f be L-smooth (see Def. E.1) and uniformly lower bounded by f . Moreover, assume that b 0.1(n m), m (n 2b)/2, and δ = 0 is used to compute clipping parameter τl for Centered Clip. Next, assume that , k max = (1 + 2σ nk m , (31) where 0 = f(x0) f and k max is the parameter for verification 3 at iteration k of BTARD-SGD. 
Then, we have E[ f(x K) 2] ε2 after K iterations of BTARD-SGD, where and x K is picked uniformly at random from {x0, x1, . . . , x K 1}. Proof. The proof is almost identical to the proof of Theorem E.2. Following the same steps and using (26) and (27) instead of (24) and (25) respectively we obtain the same sequence of inequalities up to the following change: instead of bδk we should use 1k,v. Therefore, we have k=0 E f(xk) 2 4(f(x0) f ) If a Byzantine peer deviates from the protocol at iteration k, it will be detected with some probability pk during the next iteration. One can lower bound this probability as nk = m(1 δk) That is, each individual Byzantine worker can violate the protocol no more than 1/p times on average. However, even one Byzantine peer can create a shift of the order k max at each part of the resulting vector. Therefore, all Byzantine peers can violate the protocol no more than b/p times on average implying that k=0 E f(xk) 2 4(f(x0) f ) γK + 8Cnbσ2 4(f(x0) f ) γK + 8Cnbσ2 4(f(x0) f ) γK + 8Cnbσ2 Km + 80Lγσ2 9This is true for convex and strongly convex cases as well. Secure Distributed Training at Scale Since x K is picked uniformly at random from {x0, x1, . . . , x K 1} we have E f(x K) 2 4(f(x0) f ) γK + 8Cnbσ2 Km + 80Lγσ2 Using the stepsize rule E f(x K) 2 = O L 0 meaning that after iterations BTARD-SGD guarantees E f(x K) 2 ε2. As we notice in the main part of the paper, the third term of the obtained complexity result is significantly worse than in (30): it is proportional to b instead of δ = b/n. However, (32) is derived without assuming that bbk is known for all workers at each iteration. Moreover, as in (30), the third term in (32) has better dependence on ε than the second term implying that for small enough ε the rate of BTARD-SGD in the presence of bad workers without assuming that bbk is known at each iteration is asymptotically the same as for SGD without Byzantine peers10. E.3.4. CONVEX CASE In this section, we provide the complete statements and the full proofs of the convergence results for BTARD-SGD when the objective function f is smooth and convex. We start with the case when the number of attacking Byzantine workers is known at each iteration. Theorem E.4. Let As. 3.1 and 3.2 hold, Q = Rd, f be L-smooth (see Def. E.1), convex, and x be some optimum of f. Moreover, assume that b 0.1(n m), m (n 2b)/2, and the exact number of attacking Byzantine peers is known to all good peers at each iteration. Next, assume that 7n R2 0 120σ2K , m2R2 0 1440Cσ2n2δ , k max = (1 + 2σ nk m , (33) where R0 x0 x and k max is the parameter for verification 3 at iteration k of BTARD-SGD. Then, we have E[f(x K) f(x )] ε after K iterations of BTARD-SGD, where LR2 0 ε + σ2R2 0 nε2 + n and x K = 1 K PK 1 k=0 . Proof. Lemma E.3 implies E xk+1 x 2 | xk = E xk x γbgk 2 | xk = xk x 2 2γE xk x , bgk | xk + γ2E bgk 2 | xk (25) xk x 2 2γ xk x , f(xk) + 2γ2 f(xk) 2 2γE xk x , bgk gk | xk + 2γ2Cbδkσ2 + 2γ2σ2 10This is true for convex and strongly convex cases as well. 
Secure Distributed Training at Scale Next, we use convexity (see (15) with µ = 0) and L-smoothness of f: E xk+1 x 2 | xk (14),(15) xk x 2 2γ (1 2Lγ) f(xk) f(x ) 2γE xk x , bgk gk | xk + 2γ2Cσ2 bbk nk m + 2γ2σ2 To estimate the inner product in the right-hand side we apply Cauchy-Schwarz inequality: 2γE xk x , bgk gk | xk 2γ xk x E bgk gk | xk E bgk gk 2 | xk Cσ nk m xk x q n 2b m xk x q Putting all together and using b 0.1(n m), m (n 2b)/2, γ 1/4L, nk m n 2b m, we obtain E xk+1 x 2 | xk xk x 2 γ f(xk) f(x ) 5Cσ n xk x q bbk + 40γ2Cσ2 7n bbk + 40γ2σ2 Taking the full expectation from the both sides of the above inequality and summing up the results for k = 0, 1, . . . , K 1 we derive k=0 E[f(xk) f(x )] 1 K E xk x 2 E xk+1 x 2 + 40γ2σ2 k=0 E xk x q x0 x 2 E[ x K x 2] E [ xk x 2] E[bbk] + 40γ2Cσ2 k=0 E[bbk]. From Jensen s inequality we have f(x K) 1 K PK 1 k=0 f(xk), where x K = 1 K PK 1 k=0 xk. Using this and new notation Rk = xk x , k > 0, R0 x0 x we get 0 γE f(x K) f(x ) R2 0 E[R2 K] K + 40γ2σ2 E [R2 k] E[bbk] + 40γ2Cσ2 k=0 E[bbk] (35) implying (after changing the indices) that E[R2 k] R2 0 + 40γ2σ2k E [R2 l ] E[bbl] + 40γ2Cσ2 l=0 E[bbl] (36) holds for all k 0. In the remaining part of the proof we derive by induction that R2 0 + 40γ2σ2k E [R2 l ] E[bbl] + 40γ2Cσ2 l=0 E[bbl] 2R2 0 (37) Secure Distributed Training at Scale for all k = 0, . . . , K. For k = 0 this inequality trivially holds. Next, assume that it holds for all k = 0, 1, . . . , T 1, T K 1. Let us show that it holds for k = T as well. From (36) and (37) we have that E[R2 k] 2R2 0 for all k = 0, 1, . . . , T 1. Therefore, E[R2 T ] R2 0 + 40γ2σ2T E [R2 l ] E[bbl] + 40γ2Cσ2 R2 0 + 40γ2σ2T E[bbl] + 40γ2Cσ2 l=0 E[bbl]. If a Byzantine peer deviates from the protocol at iteration k, it will be detected with some probability pk during the next iteration. One can lower bound this probability as nk = m(1 δk) Therefore, each individual Byzantine worker can violate the protocol no more than 1/p times on average implying that E[R2 T ] R2 0 + 40γ2σ2T 10CbσR0 m n + 40γ2Cσ2nb = R2 0 + 40γ2σ2T m + 40γ2Cσ2nδ 7n R2 0 120σ2K , m2R2 0 1440Cσ2n2δ we ensure that 40γ2σ2T m + 40γ2Cσ2nδ 7m R2 0 3 + R2 0 3 + R2 0 3 = R2 0, and, as a result, we get E[R2 T ] 2R2 0. Therefore, (37) holds for all k = 0, 1, . . . , K. Together with (35) it implies E f(x K) f(x ) 2R2 0 γK . Next, from our stepsize rule (33) it follows that E f(x K) f(x ) = O LR2 0 K + σR0 meaning that after LR2 0 ε + σ2R2 0 nε2 + n iterations BTARD-SGD guarantees E[f(x K) f(x )] ε. In the convex case, similar observations hold as in the non-convex case. Next, we derive the result without assuming that bbk is known to all peers at each iteration. Theorem E.5. Let As. 3.1 and 3.2 hold, Q = Rd, f be L-smooth (see Def. E.1), convex, and x be some optimum of f. Moreover, assume that b 0.1(n m), m (n 2b)/2, and δ = 0 is used to compute clipping parameter τl for Centered Clip. Next, assume that 7n R2 0 120σ2K , m2R2 0 72Cσ2n2b2 , k max = (1 + 2σ nk m , (38) Secure Distributed Training at Scale where R0 x0 x and k max is the parameter for verification 3 at iteration k of BTARD-SGD. Then, we have E[f(x K) f(x )] ε after K iterations of BTARD-SGD, where K = O LR2 0 ε + σ2R2 0 nε2 + nbσR0 and x K = 1 K PK 1 k=0 . Proof. The proof is almost identical to the proof of Theorem E.4. Following the same steps and using (26) and (27) instead of (24) and (25) respectively we obtain the same sequence of inequalities up to the following change: instead of bδk we should use 1k,v. 
Therefore, we have E xk+1 x 2 | xk xk x 2 2γ (1 2Lγ) f(xk) f(x ) 2γE xk x , bgk gk | xk + 2γ2Cσ21k,v + 2γ2σ2 2γE xk x , bgk gk | xk 2γ Cσ xk x 1k,v, that result in E xk+1 x 2 | xk xk x 2 γ f(xk) f(x ) Cσ xk x 1k,v + 2γ2Cσ21k,v + 40γ2σ2 Taking the full expectation from the both sides of the above inequality and summing up the results for k = 0, 1, . . . , K 1 we derive k=0 E[f(xk) f(x )] 1 K E xk x 2 E xk+1 x 2 + 40γ2σ2 k=0 E xk x 1k,v + 2γ2Cσ2 k=0 E[1k,v] x0 x 2 E[ x K x 2] E [ xk x 2] E[1k,v] + 2γ2Cσ2 k=0 E[1k,v]. From Jensen s inequality we have f(x K) 1 K PK 1 k=0 f(xk), where x K = 1 K PK 1 k=0 xk. Using this and new notation Rk = xk x , k 0 we get 0 γE f(x K) f(x ) R2 0 E[R2 K] K + 40γ2σ2 E [R2 k] E[1k,v] + 2γ2Cσ2 k=0 E[1k,v] (40) implying (after changing the indices) that E[R2 k] R2 0 + 40γ2σ2k E [R2 l ] E[1l,v] + 2γ2Cσ2 k 1 X l=0 E[1l,v] (41) holds for all k 0. In the remaining part of the proof we derive by induction that R2 0 + 40γ2σ2k E [R2 l ] E[1l,v] + 2γ2Cσ2 k 1 X l=0 E[1l,v] 2R2 0 (42) Secure Distributed Training at Scale for all k = 0, . . . , K. For k = 0 this inequality trivially holds. Next, assume that it holds for all k = 0, 1, . . . , T 1, T K 1. Let us show that it holds for k = T as well. From (41) and (42) we have that E[R2 k] 2R2 0 for all k = 0, 1, . . . , T 1. Therefore, E[R2 T ] R2 0 + 40γ2σ2k E [R2 l ] E[1l,v] + 2γ2Cσ2 T 1 X l=0 E[1l,v] R2 0 + 40γ2σ2k E[1l,v] + 2γ2Cσ2 T 1 X l=0 E[1l,v]. If a Byzantine peer deviates from the protocol at iteration k, it will be detected with some probability pk during the next iteration. One can lower bound this probability as nk = m(1 δk) That is, each individual Byzantine worker can violate the protocol no more than 1/p times on average. However, even one Byzantine peer can create a shift of the order k max at each part of the resulting vector. Therefore, all Byzantine peers can violate the protocol no more than b/p times on average implying that E[R2 T ] R2 0 + 40γ2σ2T 2CσR0 m + 2γ2nb Cσ2 7n R2 0 120σ2K , m2R2 0 72Cσ2n2b2 we ensure that 2CσR0 m + 2γ2nb Cσ2 m R2 0 3 + R2 0 3 + R2 0 3 = R2 0, and, as a result, we get E[R2 T ] 2R2 0. Therefore, (42) holds for all k = 0, 1, . . . , K. Together with (40) it implies E f(x K) f(x ) 2R2 0 γK . Next, from our stepsize rule (38) it follows that E f(x K) f(x ) = O LR2 0 K + σR0 n K + nbσR0 meaning that after K = O LR2 0 ε + σ2R2 0 nε2 + nbσR0 iterations BTARD-SGD guarantees E[f(x K) f(x )] ε. E.3.5. STRONGLY CONVEX CASE: RESTARTED-BTARD-SGD In this section, we provide the complete statements and the full proofs of the convergence results for the restarted version of BTARD-SGD (RESTARTED-BTARD-SGD, Alg. 8) when the objective function f is smooth and strongly convex. Secure Distributed Training at Scale Algorithm 8 RESTARTED-BTARD-SGD Input: x0 starting point, r number of restarts, {γt}r t=1 stepsizes for BTARD-SGD, {Kt}r t=1 number of iterations for BTARD-SGD, {si,k,t}n,K 1,r i,k,t=0,0,0 seeds for batches computations 1: bx0 = x0 2: for t = 1, 2, . . . , r do 3: Run BTARD-SGD (Alg. 7) for Kt iterations with stepsize γt, starting point bxt 1, and seeds for batches com- putations {si,k,t}n,K 1 i,k=0,0. Define bxt as bxt = 1 Kt k=0 xk,t, where x0,t, x1,t, . . . , x Kt,t are the iterates produced by BTARD-SGD. Output: bxr We start with the case when the number of attacking Byzantine workers is known at each iteration. Theorem E.6. Let As. 3.1 and 3.2 hold, Q = Rd, f be L-smooth (see Def. E.1), µ-strongly convex (see Def. E.2), and x be some optimum of f. 
Moreover, assume that b 0.1(n m), m (n 2b)/2, and the exact number of attacking Byzantine peers is known to all good peers at each iteration. Next, assume that 7n R2 0 120 2tσ2Kt , m2R2 0 1440 2t Cσ2n2δ , k,t max = (1 + nt k m , (43) µ2R2 0 , 48 δσ2 t 2 mµR0 , r = log2 µR2 0 ε where R0 x0 x , k,t max is the parameter for verification 3 at iteration k of BTARD-SGD during the t-th restart, nt k is the total number of workers at iteration k of t-th restart. Then, we have E[f(bxr) f(x )] ε after r restarts of BTARD-SGD and the total number of executed iterations of BTARD-SGD is L µ log µR2 0 ε + σ2 Proof. Theorem E.4 implies that BTARD-SGD with 7n R2 0 120σ2K , m2R2 0 1440Cσ2n2δ E f(x K) f(x ) 2R2 0 γK after K iterations. Therefore, after the first restart we have E[f(bx1) f(x )] 2R2 0 γ1K1 µR2 0 4 . From µ-strong convexity of f and f(x ) = 0 we have 2 bx1 x 2 f(bx1) f(x ) = E[ bx1 x 2] R2 0 2 . Next, assume that we have E[f(bxt) f(x )] µR2 0 2t+1 , E[ bxt x 2] R2 0 2t for some t r 1. Then, Theorem E.4 implies that E[f(bxt+1) f(x ) | xt] 2 bxt x 2 Secure Distributed Training at Scale Taking the full expectation from the both sides of previous inequality we get E[f(bxt+1) f(x )] 2E[ bxt x 2] γt Kt 2R2 0 2tγt Kt µR2 0 2t+2 . From µ-strong convexity of f and f(x ) = 0 we have 2 bxt+1 x 2 f(bxt+1) f(x ) = E[ bxt+1 x 2] R2 0 2t+1 . Therefore, by mathematical induction we have that for all t = 1, . . . , r E[f(bxt) f(x )] µR2 0 2t+1 , E bxt x 2 R2 0 2t . Then, after r = l log2 µR2 0 ε m 1 restarts of BTARD-SGD we have E[f(bxr) f(x )] ε. The total number of iterations executed by BTARD-SGD is ( L µ , σ22t δσ2 t 2 mµR0 L µ r + σ22r δσ2 r 2 mµR0 L µ log µR2 0 ε + σ2 µ2R2 0 µR2 0 ε + n L µ log µR2 0 ε + σ2 In the strongly convex case, similar observations hold as in the non-convex case. Next, we derive the result without assuming that bbk is known to all peers at each iteration. Theorem E.7. Let As. 3.1 and 3.2 hold, Q = Rd, f be L-smooth (see Def. E.1), µ-strongly convex (see Def. E.2), and x be some optimum of f. Moreover, assume that b 0.1(n m), m (n 2b)/2, and δ = 0 is used to compute clipping parameter τl for Centered Clip. Next, assume that 7n R2 0 120 2tσ2Kt , m2R2 0 72 2t Cσ2n2b2 , k,t max = (1 + nt k m , (46) µ2R2 0 , 24 2Cnbσ2 t 2 mµR0 , r = log2 µR2 0 ε where R0 x0 x , k,t max is the parameter for verification 3 at iteration k of BTARD-SGD during the t-th restart, nt k is the total number of workers at iteration k of t-th restart. Then, we have E[f(bxr) f(x )] ε after r restarts of BTARD-SGD and the total number of executed iterations of BTARD-SGD is t=1 Kt = O L µ log µR2 0 ε + σ2 nµε + nbσ m µε Proof. Theorem E.5 implies that BTARD-SGD with 7n R2 0 120σ2K , m2R2 0 72Cσ2n2b2 Secure Distributed Training at Scale E f(x K) f(x ) 2R2 0 γK after K iterations. Therefore, after the first restart we have E[f(bx1) f(x )] 2R2 0 γ1K1 µR2 0 4 . From µ-strong convexity of f and f(x ) = 0 we have 2 bx1 x 2 f(bx1) f(x ) = E[ bx1 x 2] R2 0 2 . Next, assume that we have E[f(bxt) f(x )] µR2 0 2t+1 , E[ bxt x 2] R2 0 2t for some t r 1. Then, Theorem E.5 implies that E[f(bxt+1) f(x ) | xt] 2 bxt x 2 Taking the full expectation from the both sides of previous inequality we get E[f(bxt+1) f(x )] 2E[ bxt x 2] γt Kt 2R2 0 2tγt Kt µR2 0 2t+2 . From µ-strong convexity of f and f(x ) = 0 we have 2 bxt+1 x 2 f(bxt+1) f(x ) = E[ bxt+1 x 2] R2 0 2t+1 . Therefore, by mathematical induction we have that for all t = 1, . . . , r E[f(bxt) f(x )] µR2 0 2t+1 , E bxt x 2 R2 0 2t . 
Then, after r = l log2 µR2 0 ε m 1 restarts of BTARD-SGD we have E[f(bxr) f(x )] ε. The total number of iterations executed by BTARD-SGD is ( L µ , σ22t µ2R2 0 , nbσ2 t 2 mµR0 µ2R2 0 + nbσ2 r 2 mµR0 L µ log µR2 0 ε + σ2 µ2R2 0 µR2 0 ε + nbσ mµR0 µ log µR2 0 ε + σ2 nµε + nbσ m µε E.4. Convergence guarantees for BTARD-Clipped-SGD The results for BTARD-SGD and RESTARTED-BTARD-SGD rely on As. 3.2 that the stochastic gradients have not too heavy tails, i.e., sub-quadratically decreasing tails. The main reason why it is needed in the analysis is to prevent too often extra computations because of Verification 3 from BTARD when all workers honestly follow the protocol. However, in many important NLP tasks such as BERT training (Zhang et al., 2020), the noise in the stochastic gradient has such a heavy noise that As. 3.2 becomes unnatural. Secure Distributed Training at Scale Algorithm 9 BTARD-CLIPPED-SGD Input: x0 starting point, γ stepsize, K number of iterations, {si,k}n,K 1 i,k=0,0 seeds for batches computations, {λk}K 1 k=0 gradient clipping parameter 1: C0 = Banned 1 = 2: for k = 0, 1, . . . , K 1 do 3: Worker i computes egk i = ( min n 1, λk f(xk,ξi,k) o f(xk, ξi,k), if i Gk \ Ck, , if i Bk \ Ck, , where ξi,k is generated via seed si,k available to every worker 4: 5: bgk, public_infok = BTARD(egk ik 1, gk ik 1, . . . , egk ikak ), where {ik 1, . . . , ik ak} = (Gk Bk) \ Ck 6: 7: Choose 2m workers ck+1 1 , . . . , ck+1 m , uk+1 1 , . . . , uk+1 m uniformly at random without replacement, Ck+1 = {ck+1 1 , . . . , ck+1 m }, Uk+1 = {uk+1 1 , . . . , uk+1 m } 8: Bannedk = CHECKCOMPUTATIONS(Ck+1, Uk+1, public_infok) 9: xk+1 = proj Q(xk γbgk) := argminx Q x (xk γbgk) 10: Gk+1 = Gk \ Bannedk 1 11: Bk+1 = Bk \ Bannedk 1 To handle the problems with heavy-tailed noise distributions we consider BTARD-CLIPPED-SGD (see Algorithm 9) applied to solve (3) such that Q is bounded. Essentially, this algorithm coincides with BTARD-SGD up to the following change: all good peers i Gk \ Ck use clipped stochastic gradients egk i = (egk i (1) , . . . , egk i (nk m) ) , where egk i (l) = min n 1, λk gk i (l) o gk i (l), l = 1, . . . , nk m, and gk i is the stochastic gradient. Next, we introduce the following assumption. Assumption E.1. There exist such constant G > 0, s0 [d], and α (1, 2] that for any set of indices S = (i1, . . . , id), 1 i1 < i2 < . . . < is d, s s0 and arbitrary x Q stochastic gradient f(x, ξ) satisfy E[ f(x, ξ)] = f(x), E [S]f(x, ξ) α s G where [S]f(x, ξ) is defined in As. 3.1. This is a modified version of the assumption used in Zhang et al. (2020). When α < 2 the variance of the stochastic gradient can be unbounded. One can show that in such a regime vanilla SGD can diverge (Zhang et al., 2020). Under As. E.1 we derive the convergence results for convex and strongly convex problems. E.4.1. QUALITY OF THE AGGREGATION Since now we have As. E.1 instead of As. 3.1 and 3.2 it is needed to derive new guarantees for the quality of the aggregation. We start with the following useful lemma about the properties of clipped stochastic gradeints. Lemma E.5 (See also Lemma 9 from Zhang et al. (2020)). Let As. E.1 holds and i, j Gk \ Ck. Then, for all l = 1, 2, . . . , nk m we have E egk i (l) egk j (l) 4 | xk 4λ E gk(l) 2 | xk Gαλ2 α k (nk m) α E[gk(l) | xk] (l)f(xk) 2 G2α (nk m)αλ2(α 1) k , (52) where gk(l) = 1 |Gk\Ck| P i Gk\Ck egk i (l) for all l = 1, . . . , nk m. Secure Distributed Training at Scale Proof. 
First of all, we derive E egk i (l) egk j (l) 4 | xk = E egk i (l) egk j (l) α egk i (l) egk j (l) 4 α | xk 8λ4 α k E (l)f(xk, ξi,k) α + (l)f(xk, ξj,k) α | xk (49) 16λ4 α k implying (50). Next, for all i Gk \ Ck we have E egk i (l) 2 | xk = E egk i (l) α egk i (l) 2 α | xk λ2 α k E (l)f(xk, ξi,k) α | xk (49) Gαλ2 α k (nk m) α E gk(l) 2 | xk 1 |Gk \ Ck| i Gk\Ck E egk i (l) 2 | xk Gαλ2 α k (nk m) α Finally, for all i Gk \ Ck we derive E[egk i (l) | xk] (l)f(xk) = E[egk i (l) (l)f(xk, ξi,k) | xk] E egk i (l) (l)f(xk, ξi,k) | xk = E h egk i (l) (l)f(xk, ξi,k) 1{ (l)f(xk,ξi,k) λk} | xki E h (l)f(xk, ξi,k) 1{ (l)f(xk,ξi,k) λk} | xki E h (l)f(xk, ξi,k) α 1{ (l)f(xk,ξi,k) λk} | xki λα 1 k (49) Gα E[gk(l) | xk] (l)f(xk) 2 1 |Gk \ Ck| i Gk\Ck E egk i (l) (l)f(xk) 2 | xk (nk m)αλ2(α 1) k . Next, we derive the guarantees for the quality of the aggregation in the case when the number of Byzantine peers violating the protocol bbk is known at each iteration. Lemma E.6. Let As. E.1 hold and b 0.15(n m). Assume that bbk is known for each worker at iteration k, k max = 2λk = 2λ nk m and δ = bδk is used to compute clipping parameter τl for Centered Clip. If the total number of iterations T of Centered Clip satisfies T log0.94 2δσ2 E[ v0 gk 2] and CHECKAVERAGING is not triggered for any worker, then E bgk gk 2 | xk bδk(C1λ 4 α 2 + C2λ2), (53) E bgk 2 | xk 2bδk(C1λ 4 α 2 + C2λ2) + 2Gαλ2 α, (54) where gk = 1 |Gk\Ck| P j Gk\Ck gk j , C1 = 384, and C2 = 4. Secure Distributed Training at Scale Proof. Consider the i-th part of bgk, i.e., consider bgk(i). If i Gk \ Ck, then, in view of (50), we can directly apply Lemma E.1 and get E bgk(i) gk(i) 2 | xk 384bδkλ 4 = 384bδkλ 4 α Next, if i Bk \ Ck, then E bgk(i) gk(i) 2 | xk ( k max)2 = 4λ2 k = 4λ2 Putting all together, we derive E bgk gk 2 | xk = X i Gk\Ck E bgk(i) gk(i) 2 | xk + X i Bk\Ck E bgk(i) gk(i) 2 | xk (1 bδk)(nk m) 384bδkλ 4 α 2 nk m + bδk(nk m) 4λ2 bδk(C1λ 4 α Using (51) we obtain E bgk 2 | xk 2E bgk gk 2 | xk + 2E gk 2 | xk (53) 2bδk(C1λ 4 α 2 + C2λ2) + 2 X i (Gk Bk)\Ck Gαλ2 α k (nk m) α = 2bδk(C1λ 4 α 2 + C2λ2) + 2Gαλ2 α. We notice that Verification 3 can be simplified in the following way: if at least on good peer i notices that egk i (j) bgk(j) > k max = 2λk, then peer i should accuse j-th peer and both are removed from the training process. In this scenario, there is no sense for Byzantine workers in triggering to deviate significantly from the clipped stochastic gradients of the good peers. As for BTARD-SGD, when bbk is unknown we always use CENTEREDCLIP with τl = for all l 0, i.e., good peers compute an exact average. In this settings, even 1 Byzantine worker can significantly shift the average in all parts of the vector. The next lemma quantifies the negative effect of Byzantine workers in this case. Lemma E.7. Let As. E.1 hold and b 0.15(n m). Assume that bbk is known for each worker at iteration k, k max = 2λk = 2λ nk m and δ = bδk is used to compute clipping parameter τl for Centered Clip. If the total number of iterations T of Centered Clip satisfies T log0.94 2δσ2 E[ v0 gk 2] and CHECKAVERAGING is not triggered for any worker, then E bgk gk 2 | xk C2λ21k,v, (55) E bgk 2 | xk 2C2λ21k,v + 2Gαλ2 α, (56) where gk = 1 |Gk\Ck| P j Gk\Ck gk j , C2 = 4, and 1k,v is an indicator function of the event that at least 1 Byzantine peer violates the protocol at iteration k. Proof. For all i (Gk Bk) \ Ck we have E bgk(i) gk(i) 2 | xk ( k max)21k,v = 4λ2 k1k,v = 4λ2 E bgk gk 2 | xk = X i (Gk Bk)\Ck E bgk(i) gk(i) 2 | xk nk m1k,v = C2λ21k,v. 
Secure Distributed Training at Scale Using (51) we obtain E bgk 2 | xk 2E bgk gk 2 | xk + 2E gk 2 | xk (53) 2C2λ21k,v + 2 X i (Gk Bk)\Ck Gαλ2 α k (nk m) α 2 = 2C2λ21k,v + 2Gαλ2 α. E.4.2. CONVEX CASE In this section, we provide the complete statements and the full proofs of the convergence results for BTARD-CLIPPEDSGD when the objective function f is smooth and convex. We start with the case when the number of Byzantine peers violating the protocol bbk is known at each iteration. Theorem E.8. Let As. E.1 hold, Q is bounded, f be convex, x be some optimum of f, and f(x ) = 0. Moreover, assume that b 0.15(n m), m (n 2b)/2, and the exact number of attacking Byzantine peers is known to all good peers at each iteration. Next, assume that 6GK 1 α , m R0 10δ(C1K 4 α 2α + C2K 2 α ) , k max = 2λk = 2λ nk m, (57) λ = GK 1 α , (58) where R0 x0 x and k max is the parameter for verification 3 at iteration k of BTARD-CLIPPED-SGD. Then, we have E[f(x K) f(x )] ε after K iterations of BTARD-CLIPPED-SGD, where and x K = 1 K PK 1 k=0 . Proof. Non-expansiveness of the projection operator and convexity of f imply xk+1 x 2 = proj Q(xk γbgk) proj Q(x ) 2 xk x γbgk 2 = xk x 2 2γ xk x , bgk + γ2 bgk 2 = xk x 2 2γ xk x , f(xk) 2γ xk x , bgk f(xk) + γ2 bgk 2 xk x 2 2γ f(xk) f(x ) 2γ xk x , bgk f(xk) + γ2 bgk 2. Taking conditional expectation E[ | xk] from the both sides of previous inequality we derive E xk+1 x 2 | xk xk x 2 2γ f(xk) f(x ) 2γE xk x , bgk f(xk) | xk + γ2E bgk 2 | xk (54) xk x 2 2γ f(xk) f(x ) + 2γ2Gαλ2 α 2γ xk x , E bgk gk | xk + 2γ2bδk(C1λ 4 α = xk x 2 2γ f(xk) f(x ) + 2γ2G2K 2 α 2γ xk x , E bgk gk | xk + 2γ2G2(C1K 4 α 2α + C2K 2 α ) nk m bbk. Secure Distributed Training at Scale To estimate the inner product in the right-hand side we apply Cauchy-Schwarz inequality: 2γ xk x , E bgk gk | xk 2γ xk x E bgk gk | xk 2γ xk x E bgk gk | xk E bgk gk 2 | xk (53) 2γ xk x q bδk(C1λ 4 α = 2γG xk x q 2α + C2K 2 α nk m 2α + C2K 2 α ) where in the last inequality we use b 0.15(n m), m (n 2b)/2, γ 1/4L, nk m n 2b m 7 20n. Putting all together we obtain E xk+1 x 2 | xk xk x 2 2γ f(xk) f(x ) + 2γ2G2K 2 α + 2γG xk x q 2α + C2K 2 α ) +40γ2G2(C1K 4 α 2α + C2K 2 α ) 7n bbk. Taking the full expectation from the both sides of the above inequality and summing up the results for k = 0, 1, . . . , T 1 we derive k=0 E[f(xk) f(x )] 1 T E xk x 2 E xk+1 x 2 + 2γ2G2K 2 α 2α + C2K 2 α ) n T k=0 E xk x q +40γ2G2(C1K 4 α 2α + C2K 2 α ) 7n T x0 x 2 E[ x K x 2] K + 2γ2G2K 2 α 2α + C2K 2 α ) n T E [ xk x 2] E h bbk i +40γ2G2(C1K 4 α 2α + C2K 2 α ) 7n T k=0 E[bbk]. From Jensen s inequality we have f(x T ) 1 T PT 1 k=0 f(xk), where x T = 1 T PT 1 k=0 xk. Using this and new notation Rk = xk x , k > 0, R0 x0 x we get 0 2γE f(x T ) f(x ) R2 0 E[R2 T ] T + 2γ2G2K 2 α 2α + C2K 2 α ) n T E [R2 k] E h bbk i +40γ2G2(C1K 4 α 2α + C2K 2 α ) 7n T k=0 E[bbk] (60) Secure Distributed Training at Scale implying (after changing the indices) that E[R2 k] R2 0 + 2γ2G2k K 2 α 2α + C2K 2 α ) n E [R2 l ] E h bbl i +40γ2G2(C1K 4 α 2α + C2K 2 α ) 7n l=0 E[bbl] (61) holds for all k 0. In the remaining part of the proof we derive by induction that R2 0 + 2γ2G2k K 2 α 2α + C2K 2 α ) n E [R2 l ] E h bbl i +40γ2G2(C1K 4 α 2α + C2K 2 α ) 7n l=0 E[bbl] 2R2 0 (62) for all k = 0, . . . , K. For k = 0 this inequality trivially holds. Next, assume that it holds for all k = 0, 1, . . . , T 1, T K 1. Let us show that it holds for k = T as well. From (36) and (37) we have that E[R2 k] 2R2 0 for all k = 0, 1, . . . , T 1. 
Therefore, E[R2 T ] R2 0 + 2γ2G2TK 2 α 2α + C2K 2 α ) n E [R2 l ] E h bbl i +40γ2G2(C1K 4 α 2α + C2K 2 α ) 7n R2 0 + 2γ2G2TK 2 α α + 4γGR0 q 2α + C2K 2 α ) n +40γ2G2(C1K 4 α 2α + C2K 2 α ) 7n If a Byzantine peer deviates from the protocol at iteration k, it will be detected with some probability pk during the next iteration. One can lower bound this probability as nk = m(1 δk) Therefore, each individual Byzantine worker can violate the protocol no more than 1/p times on average implying that E[R2 T ] R2 0 + 2γ2G2TK 2 α α + 4γGR0n q 2α + C2K 2 α )b +40γ2G2(C1K 4 α 2α + C2K 2 α )nb 7nm T K R2 0 + 2γ2G2K 2 α + 4γGR0n q 2α + C2K 2 α )δ +40γ2G2(C1K 4 α 2α + C2K 2 α )nδ 7m . 6GK 1 α , m R0 10δ(C1K 4 α 2α + C2K 2 α ) Secure Distributed Training at Scale we ensure that 2γ2G2K 2 α + 4γGR0n q 2α + C2K 2 α )δ +40γ2G2(C1K 4 α 2α + C2K 2 α )nδ 7m R2 0 3 + R2 0 3 + R2 0 3 = R2 0 and, as a result, we get E[R2 T ] 2R2 0. Therefore, (62) holds for all k = 0, 1, . . . , K. Together with (60) it implies E f(x K) f(x ) R2 0 γK . Next, from our stepsize rule (57) it follows that E f(x K) f(x ) = O δGR0 m K 1 α meaning that after iterations BTARD-CLIPPED-SGD guarantees E[f(x K) f(x )] ε. If there are no Byzantine peers (δ = 0), the theorem establishes new result for the convergence of CLIPPED-SGD for convex objectives. In the strongly convex case, the theorem recovers the rates that are optimal in this setting as shown in Zhang et al. (2020). Next, when the number of attacking Byzantines is known at each iteration and n δ/m = O(1), the complexity bound is the same as in the case when δ = 0. This means that the negative impact of Byzantine workers is negligible. Finally, the derived theoretical guarantees do not benefit from the increase of the total number of peers n. However, the result holds even for non-smooth problems and it is known that parallelization does not help to improve the complexity bounds in such generality. Nevertheless, our results show that BTARD-CLIPPED-SGD provably converges to any predefined accuracy ε > 0. This is a property that the majority of previous methods does not have (Karimireddy et al., 2020). Next, we derive the result without assuming that bbk is known to all peers at each iteration. Theorem E.9. Let As. E.1 hold, Q is bounded, f be convex, x be some optimum of f, and f(x ) = 0. Moreover, assume that b 0.15(n m), m (n 2b)/2, and δ = 0 is used to compute clipping parameter τl for Centered Clip. Next, assume that 6GK 1 α , m R0 12 2C2Gnb K 1 α , k max = 2λk = 2λ nk m, (63) λ = GK 1 α , (64) where R0 x0 x and k max is the parameter for verification 3 at iteration k of BTARD-CLIPPED-SGD. Then, we have E[f(x K) f(x )] ε after K iterations of BTARD-CLIPPED-SGD, where α α 1 + nb GR0 and x K = 1 K PK 1 k=0 . Proof. The proof is almost identical to the proof of Theorem E.8. Following the same steps and using (55) and (56) instead of (53) and (54) respectively we obtain the same sequence of inequalities up to the following change: instead of bδk we should use 1k,v. Therefore, we have E xk+1 x 2 | xk xk x 2 2γ f(xk) f(x ) + 2γ2G2K 2 α 2γ xk x , E bgk gk | xk + 2γ2C2G2K 2 α 1k,v. Secure Distributed Training at Scale 2γ xk x , E bgk gk | xk 2γG xk x p C2K 1 α 1k,v, E xk+1 x 2 | xk xk x 2 2γ f(xk) f(x ) + 2γ2G2K 2 α C2K 1 α xk x 1k,v + 2γ2C2G2K 2 α 1k,v. Taking the full expectation from the both sides of the above inequality and summing up the results for k = 0, 1, . . . 
, T 1 we derive k=0 E[f(xk) f(x )] 1 T E xk x 2 E xk+1 x 2 + 2γ2G2K 2 α +2γG C2K 1 α T k=0 E xk x 1k,v + 2γ2C2G2K 2 α T k=0 E[1k,v] x0 x 2 E[ x K x 2] K + 2γ2G2K 2 α +2γG C2K 1 α T E [ xk x 2] E [1k,v] +2γ2C2G2K 2 α T k=0 E[1k,v]. From Jensen s inequality we have f(x T ) 1 T PT 1 k=0 f(xk), where x T = 1 T PT 1 k=0 xk. Using this and new notation Rk = xk x , k > 0, R0 x0 x we get 0 2γE f(x T ) f(x ) R2 0 E[R2 T ] T + 2γ2G2K 2 α +2γG C2K 1 α T E [R2 k] E [1k,v] +2γ2C2G2K 2 α T k=0 E[1k,v] (66) implying (after changing the indices) that E[R2 k] R2 0 + 2γ2G2k K 2 α E [R2 l ] E [1l,v] +2γ2C2G2K 2 α l=0 E[1l,v] (67) holds for all k 0. In the remaining part of the proof we derive by induction that R2 0 + 2γ2G2k K 2 α E [R2 l ] E [1l,v] +2γ2C2G2K 2 α l=0 E[1l,v] 2R2 0 (68) for all k = 0, . . . , K. For k = 0 this inequality trivially holds. Next, assume that it holds for all k = 0, 1, . . . , T 1, T K 1. Let us show that it holds for k = T as well. From (41) and (42) we have that E[R2 k] 2R2 0 for all Secure Distributed Training at Scale k = 0, 1, . . . , T 1. Therefore, E[R2 T ] R2 0 + 2γ2G2TK 2 α E [R2 l ] E [1l,v] +2γ2C2G2K 2 α l=0 E[1l,v] R2 0 + 2γ2G2TK 2 α α + 2γGR0 p +2γ2C2G2K 2 α l=0 E[1l,v] If a Byzantine peer deviates from the protocol at iteration k, it will be detected with some probability pk during the next iteration. One can lower bound this probability as nk = m(1 δk) That is, each individual Byzantine worker can violate the protocol no more than 1/p times on average. However, even one Byzantine peer can create a shift of the order k max at each part of the resulting vector. Therefore, all Byzantine peers can violate the protocol no more than b/p times on average implying that E[R2 T ] R2 0 + 2γ2G2TK 2 α α + 2γGR0 2C2K 1 α nb m + 2γ2C2G2K 2 α nb m . 6GK 1 α , m R0 12 2C2Gnb K 1 α we ensure that 2γ2G2TK 2 α α + 2γGR0 2C2K 1 α nb m + 2γ2C2G2K 2 α nb m R2 0 3 + R2 0 3 + R2 0 3 = R2 0 and, as a result, we get E[R2 T ] 2R2 0. Therefore, (68) holds for all k = 0, 1, . . . , K. Together with (66) it implies E f(x K) f(x ) R2 0 γK . Next, from our stepsize rule (63) it follows that E f(x K) f(x ) = O GR0 meaning that after α α 1 + nb GR0 iterations BTARD-CLIPPED-SGD guarantees E[f(x K) f(x )] ε. That is, when the number of attacking Byzantines is unknown the complexity bound becomes (nb/m) α/(α 1) times worse in comparison to (59). Secure Distributed Training at Scale E.4.3. STRONGLY CONVEX CASE: RESTARTED-BTARD-CLIPPED-SGD In this section, we provide the complete statements and the full proofs of the convergence results for the restarted version of BTARD-CLIPPED-SGD (RESTARTED-BTARD-CLIPPED-SGD, Alg. 8) when the objective function f is smooth and strongly convex. Algorithm 10 RESTARTED-BTARD-CLIPPED-SGD Input: x0 starting point, r number of restarts, {γt}r t=1 stepsizes for BTARD-CLIPPED-SGD, {Kt}r t=1 number of iterations for BTARD-CLIPPED-SGD, {si,k,t}n,K 1,r i,k,t=0,0,1 seeds for batches computations, {λk,t}Kt,r k,t=0,1 gradient clipping parameters 1: bx0 = x0 2: for t = 1, 2, . . . , r do 3: Run BTARD-CLIPPED-SGD (Alg. 9) for Kt iterations with stepsize γt, starting point bxt 1, gradient clipping parameters {λk,t}K 1 k=0 , and seeds for batches computations {si,k,t}n,K 1 i,k=0,0. Define bxt as bxt = 1 Kt k=0 xk,t, where x0,t, x1,t, . . . , x Kt,t are the iterates produced by BTARD-CLIPPED-SGD. Output: bxr We start with the case when the number of attacking Byzantine workers is known at each iteration. Theorem E.10. Let As. E.1 hold, Q is bounded, f be µ-strongly convex (see Def. 
E.2), x be some optimum of f, and f(x ) = 0. Moreover, assume that b 0.15(n m), m (n 2b)/2, and the exact number of attacking Byzantine peers is known to all good peers at each iteration. Next, assume that 1 α t , m R0 12 2 t 2 Gn q , k,t max = 2λk,t = 2λt p nt k m , (69) 6G 2 t 2 µR0 10δ(C1 + C2)2 t 2 1 α t , (70) r = log2 µR2 0 ε where R0 x0 x and k,t max is the parameter for verification 3 at iteration k of BTARD-CLIPPED-SGD, nt k is the total number of workers at iteration k of t-th restart. Then, we have E[f(bxr) f(x )] ε after r restarts of BTARD-CLIPPED-SGD and the total number of executed iterations of BTARD-CLIPPED-SGD is α 2(α 1 ) + Proof. Theorem E.8 implies that BTARD-CLIPPED-SGD with 6GK 1 α , m R0 10δ(C1K 4 α 2α + C2K 2 α ) E f(x K) f(x ) R2 0 γK after K iterations. Therefore, after the first restart we have E[f(bx1) f(x )] R2 0 γ1K1 µR2 0 4 . Secure Distributed Training at Scale From µ-strong convexity of f and f(x ) = 0 we have 2 bx1 x 2 f(bx1) f(x ) = E[ bx1 x 2] R2 0 2 . Next, assume that we have E[f(bxt) f(x )] µR2 0 2t+1 , E[ bxt x 2] R2 0 2t for some t r 1. Then, Theorem E.8 implies that E[f(bxt+1) f(x ) | xt] bxt x 2 Taking the full expectation from the both sides of previous inequality we get E[f(bxt+1) f(x )] E[ bxt x 2] γt Kt R2 0 2tγt Kt µR2 0 2t+2 . From µ-strong convexity of f and f(x ) = 0 we have 2 bxt+1 x 2 f(bxt+1) f(x ) = E[ bxt+1 x 2] R2 0 2t+1 . Therefore, by mathematical induction we have that for all t = 1, . . . , r E[f(bxt) f(x )] µR2 0 2t+1 , E bxt x 2 R2 0 2t . Then, after r = l log2 µR2 0 ε m 1 restarts of BTARD-CLIPPED-SGD we have E[f(bxr) f(x )] ε. The total number of iterations executed by BTARD-CLIPPED-SGD is G 2 t 2 µR0 δ2 t 2 mµR0 rα 2(α 1) , α α 1 µR2 0 ε ! α α 1 µR2 0 ε α 2(α 1 ) + In the strongly convex case, similar observations hold as in the convex case. Next, we derive the result without assuming that bbk is known to all peers at each iteration. Theorem E.11. Let As. E.1 hold, Q is bounded, f be µ-strongly convex (see Def. E.2), x be some optimum of f, and f(x ) = 0. Moreover, assume that b 0.15(n m), m (n 2b)/2, and δ = 0 is used to compute clipping parameter τl for Centered Clip. Next, assume that 1 α t , m R0 12 2 t 2 Gnb 2C2K , k,t max = 2λk,t = 2λt p nt k m , (73) 6G 2 t 2 µR0 24Gnb 2C22 t 2 mµR0 1 α t , (74) Secure Distributed Training at Scale r = log2 µR2 0 ε where R0 x0 x and k,t max is the parameter for verification 3 at iteration k of BTARD-CLIPPED-SGD, nt k is the total number of workers at iteration k of t-th restart. Then, we have E[f(bxr) f(x )] ε after r restarts of BTARD-CLIPPED-SGD and the total number of executed iterations of BTARD-CLIPPED-SGD is α 2(α 1 ) + nb Proof. Theorem E.9 implies that BTARD-CLIPPED-SGD with 6GK 1 α , m R0 12 2C2Gnb K 1 α E f(x K) f(x ) R2 0 γK after K iterations. Therefore, after the first restart we have E[f(bx1) f(x )] R2 0 γ1K1 µR2 0 4 . From µ-strong convexity of f and f(x ) = 0 we have 2 bx1 x 2 f(bx1) f(x ) = E[ bx1 x 2] R2 0 2 . Next, assume that we have E[f(bxt) f(x )] µR2 0 2t+1 , E[ bxt x 2] R2 0 2t for some t r 1. Then, Theorem E.9 implies that E[f(bxt+1) f(x ) | xt] bxt x 2 Taking the full expectation from the both sides of previous inequality we get E[f(bxt+1) f(x )] E[ bxt x 2] γt Kt R2 0 2tγt Kt µR2 0 2t+2 . From µ-strong convexity of f and f(x ) = 0 we have 2 bxt+1 x 2 f(bxt+1) f(x ) = E[ bxt+1 x 2] R2 0 2t+1 . Therefore, by mathematical induction we have that for all t = 1, . . . , r E[f(bxt) f(x )] µR2 0 2t+1 , E bxt x 2 R2 0 2t . 
Then, after $r = \left\lceil \log_2 \frac{\mu R_0^2}{\varepsilon} \right\rceil - 1$ restarts of BTARD-CLIPPED-SGD we have $\mathbb{E}[f(\widehat{x}^r) - f(x^*)] \le \varepsilon$. The total number of iterations executed by BTARD-CLIPPED-SGD is G 2 t 2 µR0 δ2 t 2 mµR0 rα 2(α 1) , Gnb rα 2(α 1) )! α α 1 µR2 0 ε α 2(α 1) , Gnb α α 1 µR2 0 ε α 2(α 1) )! α 2(α 1 ) + nb

F. Resisting Sybil attacks

In this section, we address Byzantine-tolerant training in a setup where new participants can join or leave the collaboration midway through training. This requirement arises naturally if a training run relies on volunteers or an open pool of paid participants (Kijsipongse et al., 2018; Ryabinin & Gusev, 2020; Atre et al., 2021; Diskin et al., 2021). In addition to all existing concerns from Section 3, this setup allows Byzantine attackers to assume a new identity each time they are blocked. Further yet, Byzantine participants can simultaneously use multiple identities in order to obtain a majority in the voting procedure, which is known as a Sybil attack (Douceur, 2002; Trifa & Khemakhem, 2014; Wang & Kangasharju, 2012).

In this analysis11, we consider a training run where Byzantine peers collectively possess δ < δmax of all compute resources (we explore the role of δmax < 1/2 later in this section). Intuitively, one can think of this setting as distributed training with n identical computers, δn of which are controlled by Byzantines. The Byzantine GPUs can be allocated between an arbitrary number of identities. For instance, one accelerator can run the full BTARD-SGD protocol for one peer or drop some of the computation and use the freed compute cycles to run computation for another participant. Theoretically, a device can run computation for an arbitrarily large number of peers, as long as it actually computes as many gradients as one benign participant does in the same time-frame.

To protect against this new attack type, we augment BTARD-SGD with a reputation system designed to tie the influence of pseudonymous identities to the actual underlying compute. We base this system on the following assumptions:

1. Unique and optimal computations: the gradients computed by peer i at step k cannot be circumvented or reused from other peers and/or previous steps.
2. Sybil-resistance of the message propagation protocol: the underlying message propagation protocol should be resistant to various forms of Sybil attacks: they should not harm the protocol's ability to deliver a broadcasted message to all honest peers. Vyzovitis et al. (2020) empirically show that GossipSub used in BTARD-SGD is resistant to such attacks.
3. Usage of digital signatures: peers should have unique public/private key pairs, know each other's public keys, sign their messages (Rivest et al., 1978), and ignore all received messages with invalid signatures.
4. Existence of a cryptographic hash function: peers should have access to a hash function such that finding a vector x satisfying hash(x) = y is infeasible for δn compute over the entire training duration.

We associate each participant with a public record that is used to verify that peer's legitimacy. These records can be securely stored in a Distributed Hash Table (see Appendix G). When a new peer joins the network, it begins with an empty record and is therefore "untrusted". Untrusted peers compute gradients normally, but cannot aggregate vectors from others and cannot serve as validators. More importantly, other peers exclude untrusted gradients from aggregation, using them only for the purpose of validating those peers.
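A minimal sketch of these public records and the resulting trust gate follows; the hash-chain entries that populate the records are specified in the next paragraph, and the names PeerRecord and T_TRUST are our own illustration, not the paper's implementation:

from dataclasses import dataclass

T_TRUST = 100  # the hyperparameter T: validator approvals needed for trust

@dataclass
class PeerRecord:
    """Public, DHT-stored record associated with one peer identity."""
    approved_entries: int = 0   # gradient hashes approved by validators
    banned: bool = False        # set once a validator invokes ACCUSE

def is_trusted(record: PeerRecord) -> bool:
    return not record.banned and record.approved_entries >= T_TRUST

def peers_for_aggregation(records: dict[str, PeerRecord]) -> list[str]:
    # Untrusted peers still compute gradients (so they can be validated),
    # but their updates are excluded from aggregation and they cannot
    # serve as validators or aggregate vectors from others.
    return [pid for pid, rec in records.items() if is_trusted(rec)]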
Each time a peer computes gradients $g_i^k$ over a publicly known batch $\xi_i^k$, it must write hash($g_i^k$) to its own public record and sign it with its private key. As in the original BTARD-SGD, some of those entries will be validated by other peers chosen by MPRNG. In turn, the chosen validators will either approve their entry or invoke ACCUSE to ban the peer.

In order to become trusted, a given peer must report consecutive gradients until it accumulates T entries approved by (provably) random peers. Here, T is a hyperparameter that should be large enough for the training to recover from any previous attacks and make some progress before previously banned malicious peers can earn trust again. In practice, T may be chosen experimentally by observing the number of iterations it takes to improve the loss upon its pre-attack value in case of the most effective attacks, as reported in Section 4. While T may be application-dependent, we note that its minimal value is small in terms of the relative training time in all our experiments. A value of T corresponding to 10% of the total training time is more than 3 times larger than the worst recovery time for both setups considered in Section 4, where almost a half of all peers are Byzantine. Moreover, Appendix I.1 suggests that recovery from the worst-case attack may happen even faster in case of a smaller share of Byzantines. In that setup (with 20% of peers being Byzantine), a value of T corresponding to 1% of the training time is already enough.

11 Note that we only provide rigorous convergence guarantees for the case of the Byzantine attacks. However, the heuristic described in this section helps with resisting the Sybil attacks in practice.

Once a peer becomes trusted, it must continue reporting gradient hashes to maintain trust. Even a single missing or invalidated hash breaks the chain and results in the corresponding peer being banned. To maintain this invariant, peers chosen as validators add the recalculated hashes into their own record in place of the skipped iteration. To protect against dilution attacks, a cooperative training run can simultaneously consider at most t/2 untrusted peers, where t is the number of currently trusted peers. All subsequent peers should wait in a queue until one of the untrusted peers becomes either trusted or banned.

Analysis. Under this formalism, a Sybil attacker will attempt to maximize the number of trusted identities it can control with a limited amount of compute. In the simplest case, an attacker has exactly one GPU that can be used to either run all computations for one identity or partial computations for multiple identities. In the latter case, an attacker can honestly compute gradients for identity A with probability p ∈ [0, 1] and for identity B with probability 1 − p. To avoid breaking the chain, the identity that does not compute gradients at a given step can report arbitrary (e.g., random) entries instead of hash($g_i^k$). Consider the expected number of trusted identities after enough steps for T validations by honest validators (on average, $Tn/(k(1 - \delta))$ steps). Identity A becomes trusted with probability $p^T$; otherwise it is banned. Similarly, identity B survives with probability $(1 - p)^T$. Thus, the expected number of trusted identities after these steps is $p^T + (1 - p)^T$. For T > 1, this expectation is maximal iff p ∈ {0, 1}.
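The expectation $p^T + (1 - p)^T$ can be evaluated directly; the small numeric check below (illustrative only) confirms that for T > 1 it is maximized only at the endpoints:

# Expected number of trusted identities when one compute unit is split
# between identities A (honest with probability p) and B (with 1 - p).
def expected_trusted(p: float, T: int) -> float:
    return p ** T + (1 - p) ** T

for T in (1, 2, 5, 10):
    values = {p: round(expected_trusted(p, T), 4)
              for p in (0.0, 0.25, 0.5, 0.75, 1.0)}
    print(T, values)
# For T = 1 any split gives expected value 1; for T > 1 the maximum
# value 1.0 is attained only at p = 0 or p = 1.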
Thus, if a peer needs more than one validation to become trusted, the optimal strategy for a Sybil attacker is to fully support one identity instead of spreading the resources between multiple ones. This observation can be generalized to distributing δn compute units over $m \ge \delta n$ pseudonymous identities, where maximizing the expected number of trusted identities requires fully supporting any δn identities and disregarding the rest (for T > 1, as before).

Overhead computation. When training without Byzantine participants, this modified version of BTARD-SGD requires, on average, $Tn/k$ additional gradient computations per participant at the very beginning. However, once all peers become trusted, the algorithm computes exactly the same number of gradients as regular BTARD-SGD, effectively training at $(n - k)/n$ efficiency of AR-SGD, plus the same communication overhead.

Remark 1: Temporary majority. Despite the fact that spreading 1 compute unit across multiple identities reduces the expected number of trusted identities, it may still be useful to establish a temporary majority, albeit with a small probability. For instance, splitting one compute unit evenly among m identities (each with p = 1/m) may result in all m identities temporarily gaining trust with probability

$P(\text{peer}_1 \wedge \ldots \wedge \text{peer}_m) = \left(\frac{1}{m}\right)^{Tm} = m^{-Tm}.$ (77)

A Sybil attacker can simply repeat this procedure on every step until it can establish a temporary majority and use this majority to harm training (e.g., ban non-malicious peers). A natural way to remedy this is to increase T to such an extent that (77) becomes negligibly small.

Remark 2: Extra compute for Byzantine nodes. Unlike benign peers, Byzantine attackers do not need to honestly validate each other. When a Byzantine peer is chosen as a validator, it can approve its target without actually computing the gradients. In turn, the freed compute resources can be used to support additional Byzantine identities. Thus, if a given training run has n trusted peers and chooses k validators on each step, Sybil attackers can control slightly more than δn of all identities by using the free compute cycles from validation to support additional peers. Thus, the proposed reputation system requires that the total share of computational power δmax available to Byzantines is less than 1/2 by a (typically small) margin that depends on n, k, and T.

Remark 3: Perpetual attacks. When training in open collaborations, one cannot ban the Byzantine peers entirely: a Byzantine attacker will always be able to assume a new identity at the cost of running honestly for $Tn/(k(1 - \delta))$ gradient steps. Thus, unlike in Appendix E, we cannot make BTARD-SGD unbiased by increasing τ. However, as we demonstrated in Section 4, the biased variant of BTARD-SGD with constant τ can still train real-world deep learning models with the same or virtually the same learning curves as regular SGD.

G. Secure distributed hash tables

Distributed Hash Tables (DHT) are protocols that establish a decentralized key-value storage over unreliable participants (Maymounkov & Mazieres, 2002; Balakrishnan et al., 2003; Zhao et al., 2003; Rowstron & Druschel, 2001). To determine which DHT peers are responsible for a given key-value pair, each participant samples a unique binary identifier (ID) uniformly from the space of hash function outputs. When storing a (key, value) on the DHT, one finds k peers whose IDs are nearest to hash(key) and sends the data to each one of those peers.
In turn, a peer that wants to read the value of a given key will also search for neighbors whose IDs are close to hash(key) and request the data from those peers. Thus, the data can be accessed as long as at least one of the k chosen peers remains active, with some DHT variants introducing additional replication protocols. Our specific implementation is based on Kademlia (Maymounkov & Mazieres, 2002), a popular DHT variant that determines nearest neighbors based on the XOR distance function over their IDs: $d(x, y) = \mathrm{int}(x \oplus y)$. More importantly, the Kademlia protocol organizes nodes in such a way that each individual peer only knows a small subset of $O(\log_2 n)$ direct neighbors; however, it is possible to navigate the neighborhood graph to find the globally nearest neighbors in $O(\log_2 n)$ network requests.

DHT protocols were originally designed for large-scale distributed systems such as BitTorrent, IPFS, and several cryptocurrencies. To maintain integrity in these applications, modern DHT protocols also employ security measures that make them resistant to Byzantine and Sybil attacks (Urdaneta et al., 2011). In our specific scenario, the most sensitive DHT entries are the personal records that determine whether or not a given peer is trusted. We protect these records by enforcing that every value stored in the DHT must be signed by its author's digital signature (Rivest et al., 1978). Thus, if a malicious peer attempts to modify a record it was not supposed to, all other peers will be able to detect that and eliminate such peers from the collective.

However, digital signatures are known to be vulnerable to replay attacks: every time a non-Byzantine peer stores a given key-value pair signed with its private key, a Byzantine eavesdropper can record the signed entry and replay it in the future. For ordinary DHTs, this would allow an attacker to revert any key-value pair to its previous state by replaying such pre-recorded messages. Our algorithm protects against replay attacks by associating each key-value pair with a third value denoted as the expiration time. Given two entries for the same key, DHT nodes will now prioritize the one with the latest expiration time and consider it valid up to that time. Furthermore, in order to store a new entry to the DHT, a peer must now sign the entire key-value-expiration tuple. Thus, if a Byzantine peer replays a pre-recorded message, it will not be able to overwrite newer DHT entries that were signed for a more recent expiration time.

H. Details of the ALBERT experiment setup

In Section 4.2, we pretrain ALBERT (Lan et al., 2019), a self-supervised Transformer model for learning representations of language data. We deliberately choose ALBERT instead of other models like BERT (Devlin et al., 2019) due to its high communication efficiency, which is caused by layerwise weight sharing and embedding layer factorization. In particular, we focus on a communication-efficient model because the connection speed between the workers can become a noticeable constraint when averaging gradients of models with hundreds of millions of parameters. We train ALBERT-large on sequences of 512 tokens from the WikiText-103 (Merity et al., 2017) dataset. The training procedure starts from a random initialization, but the subword vocabulary (Sennrich et al., 2016) is the same as the one created by the authors of the original ALBERT models.
This model is trained with two objectives: masked language modeling (given a sentence with several masked tokens, predict the tokens that were masked) and sentence order prediction (given two segments from the same document, determine if they were swapped). We use the LAMB optimizer (You et al., 2020) with batches that contain 4,096 examples, training with a peak learning rate of 0.00176 and a warmup of 5,000 gradient descent steps. In addition, we use gradient clipping with a maximum norm of 1 and weight decay regularization with the weight of 0.01. We run distributed training on 16 cloud instances, each equipped with a single Tesla T4 GPU. Each training run takes 2-3 days, depending on instance availability.

Figure 6. The ResNet-18 test accuracy in the case of various attacks performed at each step starting from step s = 10,000 by 7 Byzantines. (Panels: sign flipping, random direction, delayed gradients, label flipping, IPM with ε = 0.1, IPM with ε = 0.6, and ALIE; curves compare training without attacks, BTARD with τ = 1, BTARD with τ = 10, CClip with PS, geometric median, no defense, and BTARD with τ = 1 and 2 validators.)

I. Additional experiments

I.1. Extra evaluations on the CIFAR-10 classification task

In this section, we perform several additional experiments with BTARD-SGD used to train the ResNet-18 model on the CIFAR-10 classification task. We start with the same configuration as used in Section 4.1 and consider several changes to this setup. In this section, we omit reporting the behavior of the baselines in the presence of the attacks, since Figure 3 already shows that the baselines cannot withstand most of them.

First, we evaluate our method in case of s = 10,000: we make Byzantines behave honestly prior to step 10,000, then simultaneously attack on each step until they are banned (i.e., the attacks start closer to the convergence stage). The results are shown in Figure 6. The behavior of BTARD turns out to be similar to the case of s = 1000, with the same kinds of attacks being most efficient against the stronger and weaker clipping scenarios. BTARD with the stronger clipping and 2 validators is still enough to combat all considered attacks.

Next, we explore a situation where Byzantine peers send incorrect gradients periodically, e.g., once per T = 10 iterations. This reduces the attack intensity but allows them to stay undetected for longer. In this setting, we consider 7 Byzantine peers and reuse all parameters from the original setup, except for the new attack period. The attacks are performed at steps $s + kT$, $k \in \mathbb{N}$, until the attacker is eventually banned. As expected, this setup increases the duration of each attack by a factor of T, but decreases the peak attack influence (see Figure 7). In this case, even BTARD with one validator is enough to combat all kinds of attacks regardless of the clipping strength.

Next, we consider a situation where Byzantine peers are less numerous.
Figure 6. The ResNet-18 test accuracy under various attacks (sign flipping, random direction, delayed gradients, label flipping, IPM with ε = 0.1 and ε = 0.6, and ALIE) performed at each step starting from step s = 10,000 by 7 Byzantine peers. The compared methods are training without attacks, BTARD (ours) with τ = 1 and τ = 10, CClip with a parameter server (τ = 1), geometric median, and no defense; for ALIE, BTARD with τ = 1 and 2 validators is also shown.

I. Additional experiments

I.1. Extra evaluations on the CIFAR-10 classification task

In this section, we perform several additional experiments with BTARD-SGD used to train the ResNet-18 model on the CIFAR-10 classification task. We start from the same configuration as in Section 4.1 and consider several changes to this setup. We omit reporting the behavior of the baselines under attack, since Figure 3 already shows that the baselines cannot withstand most of the attacks.

First, we evaluate our method in the case of s = 10,000: the Byzantine peers behave honestly prior to step 10,000, then attack simultaneously at each step until they are banned (i.e., the attacks start closer to the convergence stage). The results are shown in Figure 6. The behavior of BTARD turns out to be similar to the case of s = 1,000, with the same kinds of attacks being most efficient against the stronger and weaker clipping scenarios. BTARD with the stronger clipping and 2 validators is still enough to combat all considered attacks.

Next, we explore a situation where Byzantine peers send incorrect gradients only periodically, e.g., once every T = 10 iterations. This reduces the attack intensity but allows the attackers to stay undetected for longer. In this setting, we consider 7 Byzantine peers and reuse all parameters from the original setup, except for the new attack period. The attacks are performed at steps s + kT, k ∈ ℕ, until the attacker is eventually banned. As expected, this setup increases the duration of each attack by a factor of T but decreases the peak attack influence (see Figure 7). In this case, even BTARD with one validator is enough to combat all kinds of attacks, regardless of the clipping strength.

Next, we consider a situation where Byzantine peers are less numerous. For this experiment, we use the same configuration as in Section 4.1, but with only 3 Byzantine peers out of 16 (just under 20%). Figure 8 demonstrates behavior similar to our original setup, but with significantly weaker attack magnitude across all attacks. Here, BTARD with one validator and any clipping strength is also enough to combat all kinds of attacks.

Figure 7. The ResNet-18 test accuracy under various attacks performed at every T-th step (T = 10) starting from step s = 1,000 by 7 Byzantine peers.

Finally, we evaluate the convergence and the final test accuracy of less computationally intensive variants of BTARD-SGD that limit the maximal number of iterations in the CenteredClip procedure to M, where M varies from 1 to 50. In the setup with τ = 1, we observe that M = 50 iterations are always enough for CenteredClip to converge with ε = 10⁻⁶ in the absence of attacks. Figure 9 demonstrates that stopping the procedure earlier harms the final test accuracy, and this negative effect becomes more significant for smaller values of M.

I.2. Evaluating computation overhead in terms of wall time

For this analysis, we consider the ALBERT-large training setup from Section 4.2. Our training swarm contains 16 peers with T4 GPUs and 1 GiB/s network bandwidth. On average over 1,000 training steps, a full training step for this model takes 28.56 seconds: approximately 23.96 seconds are spent on communication and the remaining 4.60 seconds on gradient aggregation with CENTEREDCLIP. Since MPRNG runs in the background, the only part of BTARD that affects the training time is Algorithm 2 (BUTTERFLYCLIP). Thus, we measure the running time of this algorithm for different numbers of internal iterations. During normal epochs where all Byzantine peers remained passive, the algorithm converged in 2–3 iterations for τ = 0.25 and in 5–10 iterations for τ = 0.125. We also noticed that the number of iterations temporarily increased by 2–3 times while Byzantine peers were performing their attack.

In Table 3, we report the wall time (mean and standard deviation over 10 runs) of our algorithm with different numbers of iterations in two hardware setups: an 8-core VM with a 3.1 GHz Intel Xeon 6148 CPU, and a single 1080 Ti GPU. Even the worst-case overhead (τ = 0.125, CPU) is less than 3% of the total step time without attacks and less than 4% when an attack is active. One important consideration is that this overhead is constant with respect to the number of peers due to the scaling properties of All-Reduce. Thus, if we train with hundreds of peers, the 0.3–0.6 second overhead can eventually become significant. However, it can be easily offset by moving the CENTEREDCLIP execution to the GPU, which at this stage is idle waiting for the CENTEREDCLIP results anyway.
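To make the iteration cap concrete, below is a minimal NumPy sketch of the centered clipping fixed-point iteration with the cap M and tolerance ε used above. The function name and interface are our own; the update v ← v + (1/n) Σᵢ clip(xᵢ − v, τ) follows the standard CenteredClip formulation, and our actual implementation may differ in details.

```python
import numpy as np


def centered_clip(xs: np.ndarray, tau: float, M: int = 50,
                  eps: float = 1e-6) -> np.ndarray:
    """Aggregate peer vectors `xs` (shape [n, d]) by centered clipping.

    Repeats v <- v + mean_i clip(x_i - v, tau), where clip(z, tau)
    rescales z to norm at most tau, until the correction is smaller
    than `eps` or the iteration cap `M` is reached.
    """
    v = xs.mean(axis=0)  # start from the plain average
    for _ in range(M):
        diffs = xs - v                                    # [n, d]
        norms = np.linalg.norm(diffs, axis=1, keepdims=True)
        scale = np.minimum(1.0, tau / np.maximum(norms, 1e-12))
        delta = (diffs * scale).mean(axis=0)              # clipped correction
        v = v + delta
        if np.linalg.norm(delta) <= eps:                  # converged
            break
    return v
```

With a small M, the procedure returns before reaching the fixed point, which is consistent with the accuracy degradation observed in Figure 9.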
Figure 8. The ResNet-18 test accuracy under various attacks performed at each step starting from step s = 1,000 by 3 Byzantine peers.

Figure 9. Convergence of BTARD-SGD with τ = 1 depending on the maximal number of iterations M (M ∈ {1, 5, 10, 50}) in the CenteredClip procedure.

Table 3. Computation overhead of BTARD in terms of wall time (mean ± standard deviation over 10 runs).

No. of iterations    Wall time (CPU), sec    Wall time (GPU), sec
3                    0.362 ± 0.003           0.040 ± 0.002
5                    0.430 ± 0.002           0.042 ± 0.002
10                   0.601 ± 0.003           0.056 ± 0.005
20                   0.943 ± 0.002           0.085 ± 0.009

Figure 10. The ALBERT-large training objective under BTARD-CLIPPED-SGD (τ = 0.125, with 31 out of 64 peers being Byzantine) compared to standard All-Reduce SGD without attacks. Left: the random direction attack starting at step 1,000; right: the sign flipping attack starting at step 5,000.

I.3. Experiments at a larger scale (64 machines)

In this section, we evaluate the most effective attacks against BTARD-SGD with a larger number of peers to verify that our algorithm scales well. We consider the ALBERT-large training setup from Section 4.2 and increase the number of machines to 64 (the largest hardware setup available to us), setting 31 of them to be Byzantine. To balance the increased number of peers, we divide the individual batch size of each peer by 4 and use 4 validators. Due to the large computation costs, we evaluate only the two most effective Byzantine strategies according to Figure 4, making a single training run for each of them: the random direction attack starting at step 1,000 and the sign flipping attack starting at step 5,000.

The results are shown in Figure 10. As in our previous experiments, the Byzantine peers manage to temporarily offset the training loss. As in the case with 16 peers, the sign flipping attack at step 5,000 reaches its peak distortion approximately 20 steps into the attack, while the random direction attack at step 1,000 has a longer but less intense effect. However, BTARD-SGD quickly detects and bans the attackers, banning all 31 Byzantine peers within 100–150 steps and catching up with the original learning curve after approximately 150 steps (or even temporarily surpassing it). We conclude that BTARD-SGD maintains its efficiency even at this scale.