Byzantine-Tolerant Methods for Distributed Variational Inequalities

Nazarii Tupitsa (MBZUAI, MIPT), Abdulla Jasem Almansoori (MBZUAI), Yanlin Wu (MBZUAI), Martin Takáč (MBZUAI), Karthik Nandakumar (MBZUAI), Samuel Horváth (MBZUAI), Eduard Gorbunov (MBZUAI)

Abstract. Robustness to Byzantine attacks is a necessity for various distributed training scenarios. When the training reduces to the process of solving a minimization problem, Byzantine robustness is relatively well understood. However, other problem formulations, such as min-max problems or, more generally, variational inequalities, arise in many modern machine learning and, in particular, distributed learning tasks. These problems significantly differ from the standard minimization ones and, therefore, require separate consideration. Nevertheless, only one work [Adibi et al., 2022] addresses this important question in the context of Byzantine robustness. Our work makes a further step in this direction by providing several (provably) Byzantine-robust methods for distributed variational inequalities, thoroughly studying their theoretical convergence, removing the limitations of the previous work, and providing numerical comparisons supporting the theoretical findings.

1 Introduction

Modern machine learning tasks require training large models with billions of parameters on huge datasets to achieve reasonable quality. Training such models is usually done in a distributed manner, since otherwise it can take a prohibitively long time [Li, 2020]. Despite the attractiveness of distributed training, it is associated with multiple difficulties that do not exist in standard training.
In this work, we focus on one particular aspect of distributed learning: Byzantine tolerance/robustness, i.e., the robustness of distributed methods to the presence of Byzantine workers², that is, workers that can send incorrect information (maliciously or due to some computation errors/faults) and are assumed to be omniscient. For example, this situation can appear in collaborative training, when several participants (companies, universities, individuals) that do not necessarily know each other train some model together [Kijsipongse et al., 2018, Diskin et al., 2021], or when the devices used in training are faulty [Ryabinin et al., 2021]. When the training reduces to a distributed minimization problem, the question of Byzantine robustness is studied relatively well both in theory and practice [Karimireddy et al., 2022, Lyu et al., 2020]. However, there are many problems that cannot be reduced to minimization, e.g., adversarial training [Goodfellow et al., 2015, Madry et al., 2018], generative adversarial networks (GANs) [Goodfellow et al., 2014], hierarchical reinforcement learning [Wayne and Abbott, 2014, Vezhnevets et al., 2017], adversarial examples games [Bose et al., 2020], and other problems arising in game theory, control theory, and differential equations [Facchinei and Pang, 2003]. Such problems lead to min-max or, more generally, variational inequality (VI) problems [Gidel et al., 2018] that have significant differences from minimization ones and require special consideration [Harker and Pang, 1990, Ryu and Yin, 2022]. Such problems can also be huge-scale, meaning that, in some cases, one has to solve them distributedly.

Corresponding author: eduard.gorbunov@mbzuai.ac.ae
²The term "Byzantine workers" is a standard term for the field [Lamport et al., 1982, Lyu et al., 2020]. We do not aim to offend any group of people but rather use common terminology.
37th Conference on Neural Information Processing Systems (NeurIPS 2023).
Therefore, similarly to the case of minimization, the necessity of Byzantine-robust methods for distributed VIs arises. The only existing work addressing this problem is [Adibi et al., 2022], where the authors propose the first Byzantine-tolerant distributed method for min-max and VI problems, called Robust Distributed Extragradient (RDEG). However, several interesting directions, such as the application of (δ, c)-robust aggregation rules, client momentum [Karimireddy et al., 2021], and checks of computations [Gorbunov et al., 2022b], studied for minimization problems, are left unexplored in the case of VIs. Moreover, [Adibi et al., 2022] prove convergence to the solution's neighborhood that can be reduced only via increasing the batchsize, and rely on the assumption that the number of workers is sufficiently large and the fraction of Byzantine workers is smaller than 1/16, which is much smaller than for SOTA results in the minimization case. Our work closes these gaps in the literature and resolves the limitations of the results from [Adibi et al., 2022].

1.1 Setting

To make the further presentation precise, we need to introduce the problem and assumptions we make. We consider the distributed unconstrained variational inequality (non-linear equation) problem³:

find x* ∈ R^d such that F(x*) = 0, where F(x) := (1/G) Σ_{i∈G} F_i(x),  (1)

where G denotes the set of regular/good workers and the operators F_i have an expectation form F_i(x) := E_{ξ_i}[g_i(x; ξ_i)]. We assume that n workers connected with a server take part in the learning/optimization process and [n] = G ⊔ B, where B is the set of Byzantine workers, i.e., the subset of workers from [n] that can deviate from the prescribed protocol (send incorrect information, e.g., arbitrary vectors instead of stochastic estimators) either intentionally or not, and are omniscient⁴, i.e., Byzantine workers can know the results of computations on regular workers and the aggregation rule used by the server.
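To make the setup concrete, here is a minimal NumPy sketch of problem (1) for linear operators F_i(x) = A_i x + b_i; all names and constants are illustrative choices of ours, not from the paper. The averaged operator vanishes at the solution, while a single Byzantine worker sending an arbitrary vector ruins plain averaging of the workers' reports.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_good = 5, 9

# Each good worker i holds a linear operator F_i(x) = A_i x + b_i;
# the global operator in (1) is the average over good workers only.
A = [np.eye(d) * (1.0 + 0.1 * i) for i in range(n_good)]
b = [rng.normal(size=d) for _ in range(n_good)]

def F(x):
    return np.mean([A[i] @ x + b[i] for i in range(n_good)], axis=0)

# For this linear example, F(x*) = 0 can be solved for directly.
x_star = np.linalg.solve(np.mean(A, axis=0), -np.mean(b, axis=0))

# One Byzantine worker sending an arbitrary vector ruins plain averaging:
reports = [A[i] @ x_star + b[i] for i in range(n_good)] + [1e6 * np.ones(d)]
corrupted = np.mean(reports, axis=0)  # far from F(x*) = 0
```

This is exactly why the aggregation rule, rather than the mean, is the central design question in the Byzantine setting.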
The number of Byzantine workers B = |B| is assumed to satisfy B ≤ δn, where δ < 1/2 (otherwise Byzantines form a majority and the problem becomes impossible to solve). The number of regular workers is denoted as G = |G|.

Assumptions. Here, we formulate the assumptions related to the stochasticity and properties of the operators {F_i}_{i∈G}.

Assumption 1. For all i ∈ G, the stochastic estimator g_i(x, ξ_i) is an unbiased estimator of F_i(x) with bounded variance, i.e., E_{ξ_i}[g_i(x, ξ_i)] = F_i(x) and, for some σ ≥ 0,

E_{ξ_i}[‖g_i(x, ξ_i) − F_i(x)‖²] ≤ σ².  (2)

The above assumption is known as the bounded variance assumption. It is classical for the analysis of stochastic optimization methods [Nemirovski et al., 2009, Juditsky et al., 2011] and is used in the majority of existing works on Byzantine robustness with theoretical convergence guarantees. Further, we assume that the data heterogeneity across the workers is bounded.

Assumption 2. There exists ζ ≥ 0 such that for all x ∈ R^d

(1/G) Σ_{i∈G} ‖F_i(x) − F(x)‖² ≤ ζ².  (3)

Condition (3) is a standard notion of data heterogeneity in Byzantine-robust distributed optimization [Wu et al., 2020, Zhu and Ling, 2021, Karimireddy et al., 2022, Gorbunov et al., 2023a]. It is worth mentioning that without any kind of bound on the heterogeneity of {F_i}_{i∈G}, it is impossible to tolerate Byzantine workers. In addition, the homogeneous case (ζ = 0) is also very important and arises in collaborative learning; see [Kijsipongse et al., 2018, Diskin et al., 2021]. Finally, we formulate here several assumptions on the operator F. Each particular result in this work relies only on a subset of the listed assumptions.

³We assume that problem (1) has a unique solution x*. This assumption can be relaxed, but for simplicity of exposition, we enforce it.
⁴This assumption gives Byzantine workers a lot of power and rarely holds in practice. Nevertheless, if the algorithm is robust to such workers, then it is provably robust to literally any type of workers deviating from the protocol.

Assumption 3. Operator F : R^d → R^d is L-Lipschitz, i.e.,

‖F(x) − F(y)‖ ≤ L‖x − y‖ for all x, y ∈ R^d.  (Lip)

Assumption 4. Operator F : R^d → R^d is µ-quasi-strongly monotone, i.e., for µ ≥ 0,

⟨F(x), x − x*⟩ ≥ µ‖x − x*‖² for all x ∈ R^d.  (QSM)

Assumption 5. Operator F : R^d → R^d is monotone, i.e.,

⟨F(x) − F(y), x − y⟩ ≥ 0 for all x, y ∈ R^d.  (Mon)

Assumption 6. Operator F : R^d → R^d is ℓ-star-cocoercive, i.e., for ℓ > 0,

⟨F(x), x − x*⟩ ≥ (1/ℓ)‖F(x)‖² for all x ∈ R^d.  (SC)

Assumptions 3 and 5 are quite standard in the literature on VIs. Assumptions 4 and 6 can be seen as structured non-monotonicity assumptions. Indeed, there exist examples of non-monotone (and even non-Lipschitz) operators such that Assumptions 4 and 6 hold [Loizou et al., 2021]. However, Assumptions 3 and 5 imply neither (QSM) nor (SC). It is worth mentioning that Assumption 4 is also known under different names: the strong stability [Mertikopoulos and Zhou, 2019] and strong coherence [Song et al., 2020] conditions.

Robust aggregation. We use the formalism proposed by Karimireddy et al. [2021, 2022].

Definition 1.1 ((δ, c)-RAGG [Karimireddy et al., 2021, 2022]). Let there exist a subset G of the random vectors {y_1, . . . , y_n} such that |G| ≥ (1 − δ)n for some δ < 1/2 and E‖y_i − y_j‖² ≤ ρ² for any fixed pair i, j ∈ G and some ρ ≥ 0. Then, ŷ = RAGG(y_1, . . . , y_n) is called a (δ, c)-robust aggregator if for some constant c > 0

E[‖ŷ − ȳ‖²] ≤ cδρ²,  (4)

where ȳ = (1/|G|) Σ_{i∈G} y_i. Further, if the value of ρ is not used to compute ŷ, then ŷ is called an agnostic (δ, c)-robust aggregator and denoted as ŷ = ARAGG(y_1, . . . , y_n).

The above definition is tight in the sense that for any estimate ŷ the best bound one can guarantee is E[‖ŷ − ȳ‖²] = Ω(δρ²) [Karimireddy et al., 2021]. Moreover, there are several examples of (δ, c)-robust aggregation rules that work well in practice; see Appendix B. Another important concept for Byzantine-robust learning is the notion of permutation invariance.

Definition 1.2 (Permutation-invariant algorithm).
Define the set of stochastic gradients computed by each of the n workers at some round t to be [g_{1,t}, . . . , g_{n,t}]. For a good worker i ∈ G, these represent the true stochastic gradients, whereas for a bad worker j ∈ B, these represent arbitrary vectors. The output of any optimization algorithm ALG is a function of these gradients. A permutation-invariant algorithm is one whose output remains unchanged under any set of permutations {π_1, . . . , π_t} over the t rounds if we permute the gradients:

ALG([g_{1,1}, ..., g_{n,1}], ..., [g_{1,t}, ..., g_{n,t}]) = ALG([g_{π_1(1),1}, ..., g_{π_1(n),1}], ..., [g_{π_t(1),t}, ..., g_{π_t(n),t}]).

As Karimireddy et al. [2021] prove, any permutation-invariant algorithm fails to converge to any predefined accuracy of the solution (under Assumption 1), even if all regular workers have the same operators/functions, i.e., even when ζ = 0.

Table 1: Summary of known and new complexity results for Byzantine-robust methods for distributed variational inequalities. Column "Setup" indicates the varying assumptions. By the complexity, we mean the number of stochastic oracle calls needed for a method to guarantee Metric ≤ ε (for RDEG, P{Metric ≤ ε} ≥ 1 − δ_RDEG, δ_RDEG ∈ (0, 1]), where Metric is taken from the corresponding column. For simplicity, we omit numerical and logarithmic factors in the complexity bounds. Column "BS" indicates the minimal batchsize used for achieving the corresponding complexity. Notation: c, δ = robust aggregator parameters; α = momentum parameter; β = ratio of inner and outer stepsizes in SEG-like methods; n = total number of peers; m = number of checking peers; G = number of peers following the protocol; R = any upper bound on ‖x⁰ − x*‖; µ = quasi-strong monotonicity parameter; ℓ = star-cocoercivity parameter; L = Lipschitzness parameter; σ² = bound on the variance. The definition of x̄^T can vary; see the corresponding theorems for the exact formulas.

Setup | Method | Citation | Metric | Complexity | BS
(QSM), (SC) | SGDA-RA | Cor. 1 | E[‖x̄^T − x*‖²] | ℓ/µ + 1/(cδn) | cδσ²/(µ²ε)
(QSM), (SC) | M-SGDA-RA | Cor. 4 | E[‖x̄^T − x*‖²] | ℓ/(µα²) + 1/(cδαn) | cδσ²/(α²µ²ε)
(QSM), (SC) | SGDA-CC | Cor. 6 | E[‖x̄^T − x*‖²] | ℓ/µ + σ²/(µ²nε) + σ²n²/(µ²mε) | 1
(QSM), (SC) | R-SGDA-CC | Cor. 8 | E[‖x̄^T − x*‖²] | ℓ/µ + σ²/(nµε) + n²σ/(mµ√ε) | 1
(Lip), (QSM) | SEG-RA | Cor. 3 | E[‖x^T − x*‖²] | L/(βµ) + 1/(βcδG) + 1/β | cδσ²/(βµ²ε)
(Lip), (QSM) | SEG-CC | Cor. 9 | E[‖x^T − x*‖²] | L/µ + 1/β + σ²/(β²µ²nε) + σ²n²/(βµ²mε) | 1
(Lip), (QSM) | R-SEG-CC | Cor. 11 | E[‖x^T − x*‖²] | L/µ + σ²/(nµε) + n²σ/(mµ√ε) | 1
(Lip), (QSM) | RDEG | Adibi et al. [2022]⁽¹⁾ | ‖x^T − x*‖² | L/µ | σ²/(µ²R²)
⁽¹⁾ considers only the homogeneous case (ζ = 0).

1.2 Our Contributions

Now we are ready to describe the main contributions of this work.

Methods with provably robust aggregation. We propose new methods called Stochastic Gradient Descent-Ascent and Stochastic Extragradient with Robust Aggregation (SGDA-RA and SEG-RA), variants of the popular SGDA [Dem'yanov and Pevnyi, 1972, Nemirovski et al., 2009] and SEG [Korpelevich, 1976, Juditsky et al., 2011]. We prove that SGDA-RA and SEG-RA work with any (δ, c)-robust aggregation rule and converge to the desired accuracy if the batchsize is large enough. In the experiments, we observe that SGDA-RA and SEG-RA outperform RDEG in several cases.

Client momentum. As the next step, we add client momentum to SGDA-RA and propose Momentum SGDA-RA (M-SGDA-RA). As shown by [Karimireddy et al., 2021, 2022], client momentum helps to break the permutation invariance of the method and ensures convergence to any predefined accuracy with any batchsize for non-convex minimization problems. In the case of star-cocoercive quasi-strongly monotone VIs, we prove convergence to a neighborhood of the solution; the size of the neighborhood can be reduced only via increasing the batchsize, similarly to the results for RDEG, SGDA-RA, and SEG-RA. We discuss this limitation in detail and point out the non-triviality of this issue. Nevertheless, we show in the experiments that client momentum does help to achieve better accuracy of the solution.

Methods with random checks of computations.
Finally, for the homogeneous data case (ζ = 0), we propose versions of SGDA and SEG with random checks of computations (SGDA-CC and SEG-CC, and their restarted versions R-SGDA-CC and R-SEG-CC). We prove that the proposed methods converge to any accuracy of the solution without any assumptions on the batchsize. This is the first result of this type on Byzantine robustness for distributed VIs. Moreover, when the target accuracy of the solution is small enough, the obtained convergence rates for R-SGDA-CC and R-SEG-CC are no worse than the ones for distributed SGDA and SEG derived in the case of δ = 0 (no Byzantine workers); see the comparison of the convergence rates in Table 1. In the numerical experiments, we consistently observe the superiority of the methods with checks of computations over the previously proposed methods.

1.3 Related Work

Byzantine-robust methods for minimization problems. Classical distributed methods like Parallel SGD [Zinkevich et al., 2010] cannot tolerate even one Byzantine worker. The most evident vulnerability of such methods is the aggregation rule (averaging). Therefore, many works focus on the design and application of different aggregation rules for Parallel SGD-like methods [Blanchard et al., 2017, Yin et al., 2018, Damaskinos et al., 2019, Guerraoui et al., 2018, Pillutla et al., 2022]. However, this is not sufficient for Byzantine robustness: there exist particular attacks [Baruch et al., 2019, Xie et al., 2019] that can bypass popular defenses. [Karimireddy et al., 2021] formalize the definition of robust aggregation (see Definition 1.1), show that many standard aggregation rules are non-robust according to that definition, and prove that any permutation-invariant algorithm with a fixed batchsize can converge only to a ball around the solution with an algorithm-independent radius.
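As an illustration of the kind of aggregation rule Definition 1.1 refers to, here is a sketch of one candidate rule, bucketing followed by a coordinate-wise median, in the spirit of the rules discussed by Karimireddy et al. [2022]; the function name and constants are ours, and this is not the exact aggregator analyzed in the paper.

```python
import numpy as np

def bucketing_median(vectors, s=2, rng=None):
    """Shuffle the inputs, average them within buckets of size s, and
    return the coordinate-wise median of the bucket means."""
    rng = rng or np.random.default_rng(0)
    vs = np.asarray(vectors, dtype=float)
    idx = rng.permutation(len(vs))
    bucket_means = [vs[idx[i:i + s]].mean(axis=0)
                    for i in range(0, len(vs), s)]
    return np.median(bucket_means, axis=0)

# 8 good vectors clustered around (1, 1) and 2 Byzantine outliers:
good = [np.array([1.0, 1.0]) + 0.01 * k for k in range(8)]
byzantine = [np.array([1e6, -1e6])] * 2
agg = bucketing_median(good + byzantine)
# agg stays close to the mean of the good vectors despite the outliers
```

Bucketing first mixes honest and (potentially) Byzantine reports so that a majority of bucket means is clean, after which the median discards the contaminated ones coordinate-wise.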
Therefore, more in-depth algorithmic changes are required; this also explains why RDEG, SGDA-RA, and SEG-RA do not converge to any accuracy without increasing the batchsize. One possible way to resolve this issue is to use client momentum [Karimireddy et al., 2021, 2022], which breaks permutation invariance and allows for convergence to any accuracy. It is also worth mentioning a recent approach by [Allouah et al., 2023], who propose an alternative definition of robust aggregation to the one considered in this paper, though to achieve convergence to any accuracy in the homogeneous case, [Allouah et al., 2023] apply client momentum as in [Karimireddy et al., 2021, 2022]. Another line of work achieves Byzantine robustness through the variance-reduction mechanism [Wu et al., 2020, Zhu and Ling, 2021, Gorbunov et al., 2023a]. Finally, for the homogeneous data case, one can apply validation tests [Alistarh et al., 2018, Allen-Zhu et al., 2021] or checks of computations [Gorbunov et al., 2022b]. For a summary of other advances, we refer to [Lyu et al., 2020].

Methods for min-max and variational inequality problems. As mentioned before, min-max/variational inequality (VI) problems have noticeable differences from standard minimization. In particular, this becomes evident from the differences in the algorithms' behavior. For example, a direct analog of Gradient Descent for min-max/VIs, Gradient Descent-Ascent (GDA) [Krasnosel'skii, 1955, Mann, 1953, Dem'yanov and Pevnyi, 1972, Browder, 1966], fails to converge for a simple bilinear game. Although GDA converges for a different class of problems (cocoercive/star-cocoercive ones), and its version with alternating steps works well in practice and even provably converges locally [Zhang et al., 2022], many works focus on Extragradient (EG) type methods [Korpelevich, 1976, Popov, 1980] due to their provable convergence for monotone Lipschitz problems and beyond [Tran-Dinh, 2023].
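The contrast between GDA and EG mentioned above can be reproduced in a few lines. For the bilinear game min_x max_y xy, the operator is F(x, y) = (y, −x); GDA spirals away from the unique equilibrium (0, 0), while the extragradient step contracts toward it (a minimal sketch; the stepsize and horizon are arbitrary illustrative choices).

```python
import numpy as np

def F(z):
    # Operator of the bilinear game min_x max_y x*y: F(x, y) = (y, -x).
    x, y = z
    return np.array([y, -x])

gamma, T = 0.1, 200
z_gda = np.array([1.0, 1.0])
z_eg = np.array([1.0, 1.0])
for _ in range(T):
    z_gda = z_gda - gamma * F(z_gda)   # GDA: rotate-and-expand, diverges
    z_mid = z_eg - gamma * F(z_eg)     # EG: extrapolation step
    z_eg = z_eg - gamma * F(z_mid)     # EG: update step, contracts
# ||z_gda|| has grown while ||z_eg|| has shrunk relative to the start
```

For this game each GDA step multiplies the distance to the equilibrium by √(1 + γ²) > 1, whereas each EG step multiplies it by √(1 − γ² + γ⁴) < 1 for γ < 1, which is exactly the behavior the loop exhibits.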
Stochastic versions of GDA and EG (SGDA and SEG) are studied relatively well; see, e.g., [Hsieh et al., 2020, Loizou et al., 2021, Mishchenko et al., 2020, Pethick et al., 2023] for recent advances.

On the results from [Adibi et al., 2022]. In the context of Byzantine robustness for distributed min-max/VIs, the only existing work is [Adibi et al., 2022]. The authors propose a method called Robust Distributed Extragradient (RDEG), a distributed version of EG that uses a univariate trimmed-mean estimator from [Lugosi and Mendelson, 2021] for aggregation. This estimator satisfies a property similar to (4) that is shown for δ < 1/16 and large enough n (see the discussion in Appendix B). In contrast, the known (δ, c)-robust aggregation rules allow larger δ and do not require large n. Despite these evident theoretical benefits, such aggregation rules were not considered in prior works on Byzantine robustness for distributed variational inequalities/min-max problems.

2 Main Results

In this section, we describe the three approaches proposed in this work and formulate our main results.

2.1 Methods with Robust Aggregation

We start with Stochastic Gradient Descent-Ascent with (δ, c)-robust aggregation (SGDA-RA):

x^{t+1} = x^t − γ·RAGG(g^t_1, . . . , g^t_n), where g^t_i = g_i(x^t, ξ^t_i) for i ∈ G and g^t_i is an arbitrary vector for i ∈ B,

where {g^t_i}_{i∈G} are sampled independently. The main result for SGDA-RA is given below.

Theorem 1. Let Assumptions 1, 2, 4 and 6 hold. Then after T iterations SGDA-RA (Algorithm 1) with (δ, c)-RAGG and γ ≤ 1/(2ℓ) outputs x̄^T such that

E‖x̄^T − x*‖² ≤ (1 − γµ)^T ‖x⁰ − x*‖² + 2γσ²/(µG) + 2γcδ(24σ² + 12ζ²)/µ + cδ(24σ² + 12ζ²)/µ².

The first two terms in the derived upper bound are standard for results on SGDA under Assumptions 1, 4, and 6; see, e.g., [Beznosikov et al., 2023]. The third and fourth terms come from the presence of Byzantine workers and robust aggregation, since the existing (δ, c)-robust aggregation rules explicitly depend on δ.
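A toy simulation of an SGDA-RA-style iteration, with a plain coordinate-wise median standing in for the (δ, c)-robust aggregator (the paper's method uses a general RAGG; the operator, noise level, and attack below are illustrative assumptions of ours):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, n_byz = 3, 10, 2
x_star = rng.normal(size=d)
gamma = 0.5

def g(x):
    # Good workers: noisy estimate of the strongly monotone F(x) = x - x*.
    return (x - x_star) + 0.01 * rng.normal(size=d)

x = 5.0 * np.ones(d)
for _ in range(100):
    reports = [g(x) for _ in range(n - n_byz)]
    reports += [np.full(d, 1e6)] * n_byz         # Byzantine reports
    x = x - gamma * np.median(reports, axis=0)   # robust aggregation step
# x ends up in a small neighborhood of x_star, as Theorem 1 predicts:
# linear convergence up to a noise- and delta-dependent ball
```

With plain averaging in place of the median, a single report of size 10⁶ would immediately throw the iterate far from the solution.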
The fourth term cannot be reduced without increasing the batchsize, even when ζ = 0 (homogeneous data case). This is expected since SGDA-RA is permutation invariant. When σ = 0 (regular workers compute full operators), SGDA-RA converges linearly to the ball centered at the solution with radius O(√(cδ)·ζ/µ), which matches the lower bound from [Karimireddy et al., 2022]. In contrast, the known results for RDEG are derived for the homogeneous data case (ζ = 0). The proof of Theorem 1 is deferred to Appendix D.1.

Using a similar approach, we also propose a version of the Stochastic Extragradient method with (δ, c)-robust aggregation, called SEG-RA:

x̃^t = x^t − γ₁·RAGG(g^t_{ξ_1}, . . . , g^t_{ξ_n}), where g^t_{ξ_i} = g_i(x^t, ξ^t_i) for i ∈ G and g^t_{ξ_i} is an arbitrary vector for i ∈ B,
x^{t+1} = x^t − γ₂·RAGG(g^t_{η_1}, . . . , g^t_{η_n}), where g^t_{η_i} = g_i(x̃^t, η^t_i) for i ∈ G and g^t_{η_i} is an arbitrary vector for i ∈ B,

where {g^t_{ξ_i}}_{i∈G} and {g^t_{η_i}}_{i∈G} are sampled independently. Our main convergence result for SEG-RA is presented in the following theorem; see Appendix D.2 for the proof.

Theorem 2. Let Assumptions⁵ 1, 2, 3 and 4 hold. Then after T iterations SEG-RA (Algorithm 2) with (δ, c)-RAGG, γ₁ ≤ 1/(2µ + 2L), and β = γ₂/γ₁ ≤ 1/4 outputs x^T such that

E‖x^T − x*‖² ≤ (1 − µβγ₁)^T ‖x⁰ − x*‖² + 8γ₁σ²/(µβG) + 8cδ(24σ² + 12ζ²)/(βµ²).

Similarly to the case of SGDA-RA, the bound for SEG-RA has a term that cannot be reduced without increasing the batchsize, even in the homogeneous data case. RDEG, which is also a modification of SEG, has the same linearly convergent term, but SEG-RA has a better dependence on the batchsize needed to obtain convergence to any predefined accuracy: O(ε⁻¹) versus O(ε⁻²) for RDEG; see Cor. 3. In the heterogeneous case, when σ = 0, SEG-RA also converges linearly to the ball centered at the solution with radius O(√(cδ)·ζ/µ), which matches the lower bound.

2.2 Client Momentum

Next, we focus on the version of SGDA-RA that utilizes worker momentum m^t_i, i.e.,

x^{t+1} = x^t − γ·RAGG(m^t_1, . . . , m^t_n), with m^t_i = (1 − α)m^{t−1}_i + αg^t_i, where g^t_i = g_i(x^t, ξ^t_i) for i ∈ G and g^t_i is an arbitrary vector for i ∈ B,

and {g^t_i}_{i∈G} are sampled independently. Our main convergence result for this version, called M-SGDA-RA, is summarized in the following theorem.

Theorem 3. Let Assumptions 1, 2, 4, and 6 hold. Then after T iterations M-SGDA-RA (Algorithm 3) with (δ, c)-RAGG outputs x̄^T such that

E[‖x̄^T − x*‖²] ≤ 2‖x⁰ − x*‖²/(µγαW_T) + 8γcδ(24σ² + 12ζ²)/(µα²G) + 4cδ(24σ² + 12ζ²)/(α²µ²),

where x̄^T = (1/W_T) Σ_{t=0}^T w_t x̂^t, x̂^t = (α/(1 − (1 − α)^{t+1})) Σ_{j=0}^t (1 − α)^{t−j} x^j, w_t = (1 − µγα)^{−(t+1)}, and W_T = Σ_{t=0}^T w_t.

⁵SGDA-based and SEG-based methods are typically analyzed under different assumptions. Although (SC) follows from (Lip) and (QSM) with ℓ = L²/µ, some operators may satisfy (SC) with a significantly smaller ℓ. Next, when µ = 0, SGDA is not guaranteed to converge [Gidel et al., 2018], while SEG is.

Despite the fact that M-SGDA-RA is the first algorithm (for VIs) that is not permutation invariant, it also requires large batches to achieve convergence to any accuracy. Even in the context of minimization, which is much easier than VIs, the known SOTA analysis of Momentum-SGD in the convex case relies on the unbiasedness of the estimator, which is not available due to robust aggregation. Nevertheless, we prove⁶ convergence to the ball centered at the solution with radius O(√(cδ)(ζ + σ)/(αµ)); see Appendix D.3. Moreover, we show in the experiments that M-SGDA-RA outperforms other methods that require large batches.

2.3 Random Checks of Computations

We start with Stochastic Gradient Descent-Ascent with Checks of Computations (SGDA-CC). At each iteration of SGDA-CC, the server selects m workers (uniformly at random) and requests them to check the computations of other m workers from the previous iteration. Let V_t be the set of workers that verify/check computations and A_t be the set of active workers at iteration t, with V_t ∩ A_t = ∅.
Then, the update of SGDA-CC can be written as

x^{t+1} = x^t − γg^t, if g^t = (1/|A_t|) Σ_{i∈A_t} g_i(x^t, ξ^t_i) is accepted,

where {g_i(x^t, ξ^t_i)}_{i∈G} are sampled independently. The acceptance (of the update) event occurs when the condition ‖g^t − g_i(x^t, ξ^t_i)‖ ≤ Cσ holds for the majority of workers. If g^t is rejected, then all workers re-sample g_i(x^t, ξ^t_i) until acceptance is achieved. The rejection probability is bounded, as per [Gorbunov et al., 2022b], and can be adjusted by choosing a constant C = O(1). We assume that the server knows the seeds used for generating randomness on the workers; thus, verification of the computations is possible. Following each aggregation of {g_i(x^t, ξ^t_i)}_{i∈G}, the server selects 2m workers uniformly at random: m workers check the computations performed at the previous step by the other m workers. For instance, at the (t + 1)-th iteration, the server asks a checking peer i to compute g_j(x^t, ξ^t_j), where j is the peer being checked. This is possible if all seeds are broadcast at the start of the training. Workers assigned to checking do not participate in the training while they check and do not contribute to g^t. Therefore, each Byzantine peer is checked at each iteration with probability m/n by some good worker (see the proof of Theorem 4). If the results mismatch, then both the checking and the checked peers are removed from the training. This design ensures that every such mismatch, whether it is caused by honest or Byzantine peers, eliminates at least one Byzantine peer and at most one honest peer (see details in Appendix E.1). It is worth noting that we assume any information is accessible to Byzantines, except for when each of them will be checked. As such, Byzantine peers can only reduce their relative numbers, which leads us to the main result for SGDA-CC, presented below.

Theorem 4. Let Assumptions 1, 4 and 6 hold.
Then after T iterations SGDA-CC (Algorithm 5) with γ ≤ 1/(2ℓ) outputs x̄^{T+1} such that

E‖x̄^{T+1} − x*‖² ≤ (1 − γµ)^{T+1} ‖x⁰ − x*‖² + 4γσ²/(µ(n − 2B − m)) + 2qγσ²nB/(µm),

where q = 2C² + 12 + 12/(n − 2B − m); q = O(1) since C = O(1).

The above theorem (see Appendix E.1 for the proof) provides the first result that does not require large batchsizes to converge to any predefined accuracy. The first and second terms in the convergence bound correspond to the SOTA results for SGDA [Loizou et al., 2021]. Similarly to vanilla SGDA, convergence can be obtained by decreasing the stepsize; however, such an approach does not benefit from collaboration, since the dominating term γqσ²nB/(µm) (coming from the presence of Byzantine workers) does not decrease with n. Moreover, the result is even worse than for single-node SGDA in terms of the dependence on n.

⁶In contrast to Theorems 1-2, the result from Theorem 3 is given for the averaged iterate. We consider the averaged iterate to make the analysis simpler. We believe that one can extend the analysis to the last iterate as well, but we do not do it since we expect that the same problem (the need for large batches) will remain in the last-iterate analysis.

To overcome this issue, we consider the restart technique for SGDA-CC and propose the next algorithm, called R-SGDA-CC. This method consists of r stages. On the t-th stage, R-SGDA-CC runs SGDA-CC with stepsize γ_t for K_t iterations from the starting point x̂^t, which is the output of the previous stage, and defines the obtained point as x̂^{t+1} (see details in Appendix E.2). The main result for R-SGDA-CC is given below.

Theorem 5. Let Assumptions 1, 4 and 6 hold. Then, after r = ⌈log₂(R²/ε)⌉ − 1 restarts, R-SGDA-CC (Algorithm 6) with

γ_t = min{ 1/(2ℓ), √(6σ²2^t/K_t), √(m²R²/(72qσ²2^t B²n²)) } and K_t = max{ 8ℓ/µ, 96σ²2^t/((n − 2B − m)µ²R²), 34nσB√(2^t)/(mµR) },

where R ≥ ‖x⁰ − x*‖, outputs x̂^r such that E‖x̂^r − x*‖² ≤ ε.
Moreover, the total number of executed iterations of SGDA-CC is

Σ_{t=1}^r K_t = O( (ℓ/µ)·log(µR₀²/ε) + σ²/((n − 2B − m)µε) + nBσ/(mµ√ε) ).

The above result implies that R-SGDA-CC also converges to any accuracy without large batchsizes (see Appendix E.2 for details). Moreover, as the accuracy tends to zero, the dominant term σ²/((n − 2B − m)µε) inversely depends on the number of workers. This makes R-SGDA-CC benefit from collaboration, as the algorithm becomes more efficient with an increasing number of workers. Moreover, when B and m are small, the derived complexity result for R-SGDA-CC matches the one for parallel SGDA [Loizou et al., 2021], which is obtained for the case of no Byzantine workers.

Next, we present a modification of Stochastic Extragradient with Checks of Computations (SEG-CC):

x̃^t = x^t − γ₁g^t_ξ, if g^t_ξ = (1/|A_t|) Σ_{i∈A_t} g_i(x^t, ξ^t_i) is accepted,
x^{t+1} = x^t − γ₂g^t_η, if g^t_η = (1/|A_t|) Σ_{i∈A_t} g_i(x̃^t, η^t_i) is accepted,

where {g_i(x^t, ξ^t_i)}_{i∈G} and {g_i(x̃^t, η^t_i)}_{i∈G} are sampled independently. The acceptance events for g^t_ξ and g^t_η occur if ‖g^t_ξ − g_i(x^t, ξ^t_i)‖ ≤ Cσ or ‖g^t_η − g_i(x̃^t, η^t_i)‖ ≤ Cσ, respectively, holds for the majority of workers. An iteration of SEG-CC effectively represents two subsequent iterations of SGDA-CC, so we refer to the beginning of the section for more details. Our main convergence results for SEG-CC are summarized in the following theorem; see Appendix E.3 for the proof.

Theorem 6. Let Assumptions 1, 3 and 4 hold. Then after T iterations SEG-CC (Algorithm 7) with γ₁ ≤ 1/(2µ + 2L) and β = γ₂/γ₁ ≤ 1/4 outputs x^T such that

E‖x^T − x*‖² ≤ (1 − µβγ₁)^T ‖x⁰ − x*‖² + 8γ₁σ²/(βµ(n − 2B − m)) + 2γ₁qσ²nB/(µm),

where q = 2C² + 12 + 12/(n − 2B − m); q = O(1) since C = O(1).

Similarly to SGDA-CC, SEG-CC does not require large batchsizes to converge to any predefined accuracy but does not benefit from collaboration, though the first two terms correspond to the SOTA convergence results for SEG under the bounded variance assumption [Juditsky et al., 2011]. The last term appears due to the presence of the Byzantine workers.
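The seed-based verification underlying the checks of computations can be sketched as follows: since a worker's stochastic estimate is a deterministic function of its seed and the current point, any peer holding the seed can recompute the report and compare (the helper names and the toy estimator below are ours, not the paper's exact protocol).

```python
import numpy as np

def worker_estimate(seed, x):
    # A worker's stochastic estimate is deterministic given (seed, x),
    # so any peer that knows the seed can recompute it exactly.
    rng = np.random.default_rng(seed)
    return x + 0.01 * rng.normal(size=x.shape)

def check_peer(report, seed, x):
    # Recompute from the broadcast seed and compare with the reported value.
    return bool(np.allclose(report, worker_estimate(seed, x)))

x = np.ones(3)
honest = worker_estimate(42, x)
forged = honest + 1.0                    # a Byzantine deviation
ok_honest = check_peer(honest, 42, x)    # True: the checked peer keeps training
ok_forged = check_peer(forged, 42, x)    # False: checker and checked are removed
```

The key property the analysis relies on is that a mismatch is unambiguous: a forged report can never pass the recomputation, so every detected mismatch removes at least one Byzantine peer.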
The restart technique can also be applied; see Appendix E.4 for the proof.

Theorem 7. Let Assumptions 1, 3, 4 hold. Then, after r = ⌈log₂(R²/ε)⌉ − 1 restarts, R-SEG-CC (Algorithm 8) with

γ_{1t} = min{ 1/(2L), √(16σ²2^t/K_t), √(mR²/(8qσ²2^t Bn)) }, γ_{2t} = min{ 1/(4L), √(m²R²/(64qσ²2^t B²n²)) },

and K_t = max{ 8L/µ, 256σ²2^t/((G − B − m)µ²R²) }, where R ≥ ‖x⁰ − x*‖, outputs x̂^r such that E‖x̂^r − x*‖² ≤ ε. Moreover, the total number of executed iterations of SEG-CC is

Σ_{t=1}^r K_t = O( (L/µ)·log(µR₀²/ε) + σ²/((n − 2B − m)µε) + nBσ/(mµ√ε) ).

The above result states that R-SEG-CC also converges to any accuracy without large batchsizes; see Appendix E.4. As the accuracy tends to zero (ε → 0), the dominating term σ²/((n − 2B − m)µε) inversely depends on the number of workers; hence, R-SEG-CC benefits from collaboration. Moreover, when B and m are small, the derived complexity result for R-SEG-CC matches the one for parallel/minibatched SEG [Juditsky et al., 2011], which is obtained for the case of no Byzantine workers.

3 Numerical Experiments

Quadratic game. To illustrate our theoretical results, we conduct numerical experiments on a quadratic game

min_y max_z (1/s) Σ_{i=1}^s [ (1/2)yᵀA_{1,i}y + yᵀA_{2,i}z − (1/2)zᵀA_{3,i}z + b_{1,i}ᵀy − b_{2,i}ᵀz ].

The above problem can be reformulated as a special case of (1) with F defined as follows:

F(x) = (1/s) Σ_{i=1}^s (A_i x + b_i), where x = (yᵀ, zᵀ)ᵀ, b_i = (b_{1,i}ᵀ, b_{2,i}ᵀ)ᵀ,  (7)

with symmetric matrices A_{j,i} such that µI ⪯ A_{j,i} ⪯ ℓI, A_i ∈ R^{d×d}, and b_i ∈ R^d; see Appendix F for the detailed description. We set ℓ = 100, µ = 0.1, s = 1000, and d = 50. Only one peer checked the computations on each iteration (m = 1). We used RFA (geometric median) with bucketing as the aggregator since it showed the best performance. For approximating the geometric median, we used Weiszfeld's method with 10 iterations and parameter ν = 0.1 [Pillutla et al., 2022]. RDEG [Adibi et al., 2022] provably works only if n ≥ 100, so here we provide experiments with n = 150, B = 20, γ = 2·10⁻⁵.
We set the parameter α = 0.1 for M-SGDA-RA, and the following parameters for RDEG: α_RDEG = 0.06, δ_RDEG = 0.9, and the theoretical value of ϵ; see Appendix F for more experiments. We tested the algorithms under the following attacks: bit flipping (BF), random noise (RN), inner product manipulation (IPM) [Xie et al., 2019], and a little is enough (ALIE) [Baruch et al., 2019].

Robust neural network training. Let f(u; x, y) be the loss function of a neural network with parameters u ∈ R^d, given input x ∈ R^m and label y. In our experiments, f is the cross-entropy loss, and {(x_i, y_i)}_{i=1}^N is the MNIST dataset. Now consider the following objective:

min_{u∈R^d} max_{v∈R^m} (1/N) Σ_{i=1}^N f(u; x_i + v, y_i) + (λ₁/2)‖u‖²₂ − (λ₂/2)‖v‖²₂.  (8)

This min-max objective adds an extra adversarial noise variable to the input data such that it maximizes the loss, so the neural network should become robust to such noise as it minimizes the loss. We can reformulate this objective as a variational inequality with x = (uᵀ, vᵀ)ᵀ,

F_i(x) = ( ∇_u f(u; x_i + v, y_i) + λ₁u, −∇_v f(u; x_i + v, y_i) + λ₂v ), F(x) = (1/N) Σ_{i=1}^N F_i(x).  (9)

We let n = 20, B = 4, λ₁ = 0, and λ₂ = 100. We fix the learning rate to 0.01 and use a batchsize of 32. We run the algorithms for 50 epochs and average our results across 3 runs. We test the algorithms under the following attacks: i) bit flipping (BF), ii) label flipping (LF), iii) inner product manipulation (IPM) [Xie et al., 2019], and iv) a little is enough (ALIE) [Baruch et al., 2019]. We compare our algorithm SGDA-CC against the following algorithms: i) SGDA-RA, ii) M-SGDA-RA, and iii) RDEG [Adibi et al., 2022]. We use RFA with bucket size 2 as the robust aggregator. The results are shown in Figure 2. Specifically, we show the validation error on MNIST after each epoch. We can see that SGDA-CC performs the best, followed closely by M-SGDA-RA.
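A small instance of the quadratic game operator (7) can be generated as follows; since F is affine, the solution of (1) is available in closed form via a linear solve, which is convenient for measuring the distance to the solution in experiments (the dimensions and the spectrum-sampling routine are illustrative, not the paper's exact generator).

```python
import numpy as np

rng = np.random.default_rng(0)
dy = dz = 3
mu, ell = 0.1, 100.0

def sym_bounded(dim):
    # Random symmetric matrix with spectrum inside [mu, ell].
    Q, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
    return Q @ np.diag(rng.uniform(mu, ell, size=dim)) @ Q.T

A1, A3 = sym_bounded(dy), sym_bounded(dz)
A2 = rng.normal(size=(dy, dz))
b1, b2 = rng.normal(size=dy), rng.normal(size=dz)

def F(x):
    # Saddle operator (gradient in y, minus gradient in z) of the game.
    y, z = x[:dy], x[dy:]
    return np.concatenate([A1 @ y + A2 @ z + b1, A3 @ z - A2.T @ y + b2])

# F is affine: F(x) = M x + c, so the solution of (1) is x* = -M^{-1} c.
M = np.block([[A1, A2], [-A2.T, A3]])
c = np.concatenate([b1, b2])
x_star = np.linalg.solve(M, -c)
# F(x_star) is numerically zero
```

The symmetric part of M is block-diagonal with blocks A1 and A3, so the operator is µ-strongly monotone and the solve is well posed.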
4 Conclusion

This paper proposes several new algorithms for Byzantine-robust distributed variational inequalities and provides rigorous theoretical convergence analysis for the developed methods. In particular, we

Figure 1: Error plots for quadratic games experiments under different Byzantine attacks. The first row shows the outperformance of M-SGDA-RA over methods without checks of computations. The second row illustrates advantages of SGDA-CC and SEG-CC.

Figure 2: Error plots for the robust neural network experiment on MNIST under different Byzantine attacks (BF, LF, IPM, and ALIE).
Each algorithm is shown with a consistent choice of color and style across plots, as indicated in the legends.

propose the first methods in this setting that provably converge to any predefined accuracy in the case of homogeneous data. We believe this is an important step towards building a strong theory of Byzantine robustness in the case of distributed VIs. However, our work has several limitations. First of all, one can consider different/more general assumptions about operators [Beznosikov et al., 2023, Gorbunov et al., 2022a, 2023b] in the analysis of the proposed methods. Next, as we mention in the discussion after Theorem 3, our result for M-SGDA-RA requires large batchsizes, and it remains unclear to us whether this requirement can be removed. Finally, the only results that do not require large batchsizes are derived using the checks of computations, which create a (small) computation overhead. Obtaining similar results without checks of computations remains an open problem. Addressing these limitations is a prominent direction for future research.

Acknowledgments and Disclosure of Funding

The work of N. Tupitsa was supported by a grant for research centers in the field of artificial intelligence, provided by the Analytical Center for the Government of the Russian Federation in accordance with the subsidy agreement (agreement identifier 000000D730321P5Q0002) and the agreement with the Moscow Institute of Physics and Technology dated November 1, 2021 No. 70-2021-00138.

References

A. Adibi, A. Mitra, G. J. Pappas, and H. Hassani. Distributed statistical min-max learning in the presence of Byzantine agents. In 2022 IEEE 61st Conference on Decision and Control (CDC), pages 4179-4184. IEEE, 2022.

D. Alistarh, Z. Allen-Zhu, and J. Li. Byzantine stochastic gradient descent. Advances in Neural Information Processing Systems, 31, 2018.

Z. Allen-Zhu, F. Ebrahimianghazani, J. Li, and D. Alistarh. Byzantine-resilient non-convex stochastic gradient descent.
In International Conference on Learning Representations, 2021.

Y. Allouah, S. Farhadkhani, R. Guerraoui, N. Gupta, R. Pinot, and J. Stephan. Fixing by mixing: A recipe for optimal Byzantine ML under heterogeneity. In International Conference on Artificial Intelligence and Statistics, pages 1232-1300. PMLR, 2023.

M. Baruch, G. Baruch, and Y. Goldberg. A little is enough: Circumventing defenses for distributed learning, 2019.

A. Beznosikov, E. Gorbunov, H. Berard, and N. Loizou. Stochastic gradient descent-ascent: Unified theory and new efficient methods. In International Conference on Artificial Intelligence and Statistics, pages 172-235. PMLR, 2023.

P. Blanchard, E. M. El Mhamdi, R. Guerraoui, and J. Stainer. Machine learning with adversaries: Byzantine tolerant gradient descent. Advances in Neural Information Processing Systems, 30, 2017.

J. Bose, G. Gidel, H. Berard, A. Cianflone, P. Vincent, S. Lacoste-Julien, and W. Hamilton. Adversarial example games. Advances in Neural Information Processing Systems, 33:8921-8934, 2020.

F. E. Browder. Existence and approximation of solutions of nonlinear variational inequalities. Proceedings of the National Academy of Sciences, 56(4):1080-1086, 1966.

Y. Chen, L. Su, and J. Xu. Distributed statistical machine learning in adversarial settings: Byzantine gradient descent. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 1(2):1-25, 2017.

G. Damaskinos, E.-M. El-Mhamdi, R. Guerraoui, A. Guirguis, and S. Rouault. AggregaThor: Byzantine machine learning via robust gradient aggregation. Proceedings of Machine Learning and Systems, 1:81-106, 2019.

V. F. Dem'yanov and A. B. Pevnyi. Numerical methods for finding saddle points. USSR Computational Mathematics and Mathematical Physics, 12(5):11-52, 1972.

M. Diskin, A. Bukhtiyarov, M. Ryabinin, L. Saulnier, A. Sinitsin, D. Popov, D. V. Pyrkin, M. Kashirin, A. Borzunov, A. Villanova del Moral, et al. Distributed deep learning in open collaborations.
Advances in Neural Information Processing Systems, 34:7879-7897, 2021.

F. Facchinei and J.-S. Pang. Finite-dimensional variational inequalities and complementarity problems. Springer, 2003.

G. Gidel, H. Berard, G. Vignoud, P. Vincent, and S. Lacoste-Julien. A variational inequality perspective on generative adversarial networks. In International Conference on Learning Representations, 2018. URL https://openreview.net/pdf?id=r1laEnA5Ym.

I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. Advances in Neural Information Processing Systems, 27, 2014.

I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. International Conference on Learning Representations, 2015.

E. Gorbunov, D. Kovalev, D. Makarenko, and P. Richtárik. Linearly converging error compensated SGD. Advances in Neural Information Processing Systems, 33:20889-20900, 2020.

E. Gorbunov, H. Berard, G. Gidel, and N. Loizou. Stochastic extragradient: General analysis and improved rates. In International Conference on Artificial Intelligence and Statistics, pages 7865-7901. PMLR, 2022a.

E. Gorbunov, A. Borzunov, M. Diskin, and M. Ryabinin. Secure distributed training at scale. In International Conference on Machine Learning, pages 7679-7739. PMLR, 2022b. URL https://proceedings.mlr.press/v162/gorbunov22a/gorbunov22a.pdf.

E. Gorbunov, S. Horváth, P. Richtárik, and G. Gidel. Variance reduction is an antidote to Byzantines: Better rates, weaker assumptions and communication compression as a cherry on the top. International Conference on Learning Representations, 2023a.

E. Gorbunov, A. Taylor, S. Horváth, and G. Gidel. Convergence of proximal point and extragradient-based methods beyond monotonicity: the case of negative comonotonicity. In International Conference on Machine Learning, pages 11614-11641. PMLR, 2023b.

R. Guerraoui, S. Rouault, et al.
The hidden vulnerability of distributed learning in Byzantium. In International Conference on Machine Learning, pages 3521-3530. PMLR, 2018.

I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of Wasserstein GANs. Advances in Neural Information Processing Systems, 30, 2017.

P. T. Harker and J.-S. Pang. Finite-dimensional variational inequality and nonlinear complementarity problems: a survey of theory, algorithms and applications. Mathematical Programming, 48(1-3):161-220, 1990.

Y.-G. Hsieh, F. Iutzeler, J. Malick, and P. Mertikopoulos. Explore aggressively, update conservatively: Stochastic extragradient methods with variable stepsize scaling. Advances in Neural Information Processing Systems, 33:16223-16234, 2020.

A. Juditsky, A. Nemirovski, and C. Tauvel. Solving variational inequalities with stochastic mirror-prox algorithm. Stochastic Systems, 1(1):17-58, 2011.

S. P. Karimireddy, L. He, and M. Jaggi. Learning from history for Byzantine robust optimization. In International Conference on Machine Learning, pages 5311-5319. PMLR, 2021.

S. P. Karimireddy, L. He, and M. Jaggi. Byzantine-robust learning on heterogeneous datasets via bucketing. In International Conference on Learning Representations, 2022. URL https://arxiv.org/pdf/2006.09365.pdf.

E. Kijsipongse, A. Piyatumrong, et al. A hybrid GPU cluster and volunteer computing platform for scalable deep learning. The Journal of Supercomputing, 74(7):3236-3263, 2018.

D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

G. M. Korpelevich. The extragradient method for finding saddle points and other problems. Matecon, 12:747-756, 1976.

M. A. Krasnosel'skii. Two remarks on the method of successive approximations. Uspekhi Matematicheskikh Nauk, 10(1):123-127, 1955.

A. Krizhevsky, G. Hinton, et al. Learning multiple layers of features from tiny images. 2009.

L. Lamport, R. Shostak, and M. Pease. The Byzantine generals problem.
ACM Transactions on Programming Languages and Systems, 4(3):382-401, 1982.

C. Li. Demystifying GPT-3 language model: A technical overview, 2020. https://lambdalabs.com/blog/demystifying-gpt-3.

N. Loizou, H. Berard, G. Gidel, I. Mitliagkas, and S. Lacoste-Julien. Stochastic gradient descent-ascent and consensus optimization for smooth games: Convergence analysis under expected co-coercivity. Advances in Neural Information Processing Systems, 34:19095-19108, 2021.

G. Lugosi and S. Mendelson. Robust multivariate mean estimation: the optimality of trimmed mean. 2021.

L. Lyu, H. Yu, X. Ma, L. Sun, J. Zhao, Q. Yang, and P. S. Yu. Privacy and robustness in federated learning: Attacks and defenses. arXiv preprint arXiv:2012.06337, 2020.

A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018.

W. R. Mann. Mean value methods in iteration. Proceedings of the American Mathematical Society, 4(3):506-510, 1953.

P. Mertikopoulos and Z. Zhou. Learning in games with continuous action sets and unknown payoff functions. Mathematical Programming, 173(1):465-507, 2019.

K. Mishchenko, D. Kovalev, E. Shulgin, P. Richtárik, and Y. Malitsky. Revisiting stochastic extragradient. In International Conference on Artificial Intelligence and Statistics, pages 4573-4582. PMLR, 2020.

T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida. Spectral normalization for generative adversarial networks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net, 2018. URL https://openreview.net/forum?id=B1QRgziT-.

A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574-1609, 2009.

T. Pethick, O. Fercoq, P. Latafat, P. Patrinos, and V. Cevher.
Solving stochastic weak Minty variational inequalities without increasing batch size. In The Eleventh International Conference on Learning Representations, 2023.

K. Pillutla, S. M. Kakade, and Z. Harchaoui. Robust aggregation for federated learning. IEEE Transactions on Signal Processing, 70:1142-1154, 2022.

L. D. Popov. A modification of the Arrow-Hurwicz method for search of saddle points. Mathematical Notes of the Academy of Sciences of the USSR, 28:845-848, 1980.

M. Ryabinin, E. Gorbunov, V. Plokhotnyuk, and G. Pekhimenko. Moshpit SGD: Communication-efficient decentralized training on heterogeneous unreliable devices. Advances in Neural Information Processing Systems, 34:18195-18211, 2021.

E. K. Ryu and W. Yin. Large-Scale Convex Optimization: Algorithms & Analyses via Monotone Operators. Cambridge University Press, 2022.

C. Song, Z. Zhou, Y. Zhou, Y. Jiang, and Y. Ma. Optimistic dual extrapolation for coherent non-monotone variational inequalities. Advances in Neural Information Processing Systems, 33:14303-14314, 2020.

S. U. Stich. Unified optimal analysis of the (stochastic) gradient method. arXiv preprint arXiv:1907.04232, 2019.

Q. Tran-Dinh. Sublinear convergence rates of extragradient-type methods: A survey on classical and recent developments. arXiv preprint arXiv:2303.17192, 2023.

A. S. Vezhnevets, S. Osindero, T. Schaul, N. Heess, M. Jaderberg, D. Silver, and K. Kavukcuoglu. FeUdal networks for hierarchical reinforcement learning. In International Conference on Machine Learning, pages 3540-3549. PMLR, 2017.

G. Wayne and L. Abbott. Hierarchical control using networks trained with higher-level forward models. Neural Computation, 26(10):2163-2193, 2014.

Z. Wu, Q. Ling, T. Chen, and G. B. Giannakis. Federated variance-reduced stochastic gradient descent with robustness to Byzantine attacks. IEEE Transactions on Signal Processing, 68:4583-4596, 2020.

C. Xie, S. Koyejo, and I. Gupta.
Fall of empires: Breaking Byzantine-tolerant SGD by inner product manipulation, 2019.

Y.-R. Yang and W.-J. Li. BASGD: Buffered asynchronous SGD for Byzantine learning. In International Conference on Machine Learning, pages 11751-11761. PMLR, 2021.

D. Yin, Y. Chen, R. Kannan, and P. Bartlett. Byzantine-robust distributed learning: Towards optimal statistical rates. In International Conference on Machine Learning, pages 5650-5659. PMLR, 2018.

G. Zhang, Y. Wang, L. Lessard, and R. B. Grosse. Near-optimal local convergence of alternating gradient descent-ascent for minimax optimization. In International Conference on Artificial Intelligence and Statistics, pages 7659-7679. PMLR, 2022.

H. Zhu and Q. Ling. BROADCAST: Reducing both stochastic and compression noise to robustify communication-efficient federated learning. arXiv preprint arXiv:2104.06685, 2021.

M. Zinkevich, M. Weimer, L. Li, and A. Smola. Parallelized stochastic gradient descent. Advances in Neural Information Processing Systems, 23, 2010.

Contents

1 Introduction
1.1 Setting
1.2 Our Contributions
1.3 Related Work
2 Main Results
2.1 Methods with Robust Aggregation
2.2 Client Momentum
2.3 Random Checks of Computations
3 Numerical Experiments
4 Conclusion
A Examples of (δ, c)-Robust Aggregation Rules
A.1 Aggregators
A.2 Bucketing algorithm
A.3 Robust Aggregation examples
B Further Details on RDEG
C Auxiliary results
C.1 Basic Inequalities
C.2 Useful Lemmas
D Methods that use robust aggregators
D.1 Proofs for SGDA-RA
D.1.1 Quasi-Strongly Monotone Case
D.2 Proofs for SEG-RA
D.2.1 Auxiliary results
D.2.2 Quasi-Strongly Monotone Case
D.3 Proofs for M-SGDA-RA
D.3.1 Quasi-Strongly Monotone Case
E Methods with random check of computations
E.1 Proofs for SGDA-CC
E.1.1 Star Co-coercive Case
E.1.2 Quasi-Strongly Monotone Case
E.1.3 Monotone Case
E.2 Proofs for R-SGDA-CC
E.2.1 Quasi-Strongly Monotone Case
E.3 Proofs for SEG-CC
E.3.1 Auxiliary results
E.3.2 Lipschitz Case
E.3.3 Quasi-Strongly Monotone Case
E.3.4 Lipschitz Monotone Case
E.4 Proofs for R-SEG-CC
E.4.1 Quasi-Strongly Monotone Case
F Extra Experiments and Experimental details
F.1 Quadratic games
F.2 Generative Adversarial Networks

A Examples of (δ, c)-Robust Aggregation Rules

This section shows how to construct an aggregator satisfying Definition 1.1.

A.1 Aggregators

This subsection examines various aggregators that lack robustness: new attacks can easily be designed to exploit each aggregation scheme and cause its failure. We analyze three commonly employed, representative defenses.

Krum. For i ≠ j, let i → j denote that xⱼ belongs to the n − q − 2 vectors closest to xᵢ. Then KRUM(x₁, . . . , xₙ) := argminᵢ Σ_{i→j} ‖xᵢ − xⱼ‖². Krum is computationally expensive, requiring O(n²) work by the server [Blanchard et al., 2017].

CM. Coordinate-wise median computes, for the k-th coordinate, [CM(x₁, . . . , xₙ)]ₖ := median([x₁]ₖ, . . . , [xₙ]ₖ) = argmin_{[xᵢ]ₖ} Σⱼ₌₁ⁿ |[xᵢ]ₖ − [xⱼ]ₖ|. Coordinate-wise median is fast to implement, requiring only O(n) time [Chen et al., 2017].

RFA. Robust federated averaging (RFA) computes the geometric median RFA(x₁, . . . , xₙ) := argmin_v Σᵢ₌₁ⁿ ‖v − xᵢ‖₂. Although there is no closed-form solution for the geometric median, an approximation technique presented by Pillutla et al. [2022] performs several iterations of the smoothed Weiszfeld algorithm, with each iteration requiring O(n) computation.

A.2 Bucketing algorithm

We use the process of s-bucketing, proposed by [Yang and Li, 2021, Karimireddy et al., 2022], to randomly divide the n inputs x₁, . . . , xₙ into ⌈n/s⌉ buckets, each containing no more than s elements. After averaging the contents of each bucket to create y₁, . . . , y_{⌈n/s⌉}, we input them into the aggregator AGGR. The Bucketing Algorithm outlines the procedure. Our approach's main feature is that the resulting set of averaged y₁, . . .
, y_{⌈n/s⌉} is more homogeneous (with lower variance) than the original inputs.

Algorithm: Bucketing
1: input {x₁, . . . , xₙ}, s ∈ ℕ, aggregation rule AGGR
2: pick a random permutation π of [n]
3: compute yᵢ ← (1/s) Σ_{k=(i−1)s+1}^{min(n, i·s)} x_{π(k)} for i ∈ {1, . . . , ⌈n/s⌉}
4: output x̂ ← AGGR(y₁, . . . , y_{⌈n/s⌉}) // aggregate after bucketing

A.3 Robust Aggregation examples

Next we recall the result from [Karimireddy et al., 2022], which shows that the aggregators above can be made to satisfy Definition 1.1 by combining them with bucketing.

Theorem 8. Suppose we are given n inputs {x₁, . . . , xₙ} such that E‖xᵢ − xⱼ‖² ≤ ρ² for any fixed pair i, j ∈ G and some ρ ≥ 0, for some δ ≤ δmax, with δmax to be defined. Then, running the Bucketing Algorithm with s = ⌊δmax/δ⌋ yields the following:
- Krum: E‖KRUM∘BUCKETING(x₁, . . . , xₙ) − x̄‖² ≤ O(δρ²) with δmax < 1/4;
- Geometric median: E‖RFA∘BUCKETING(x₁, . . . , xₙ) − x̄‖² ≤ O(δρ²) with δmax < 1/2;
- Coordinate-wise median: E‖CM∘BUCKETING(x₁, . . . , xₙ) − x̄‖² ≤ O(dδρ²) with δmax < 1/2.

Note that all these methods satisfy the notion of an agnostic Byzantine-robust aggregator (Definition 1.1).

B Further Details on RDEG

Originally, RDEG was proposed for min-max problems and represents a variation of SEG with the Univariate Trimmed-Mean Estimator aggregation rule. For convenience, we give here the RDEG pseudocode we used in experiments. In this section we use the notation π ∈ (0, 1) for a confidence level.

Robust Distributed Extra-Gradient (RDEG)
Input: TRIM_{ϵ,α,δ}, γ
1: for t = 1, . . . do
2:  for worker i ∈ [n] in parallel:
3:   gᵗ_{ξᵢ} ← gᵢ(xᵗ, ξᵢ)
4:   send gᵗ_{ξᵢ} if i ∈ G; Byzantine workers may send anything
5:  ĝ_{ξᵗ}(xᵗ) = TRIM_{ϵ,α,δ}(gᵗ_{ξ₁}, . . . , gᵗ_{ξₙ})
6:  x̃ᵗ ← xᵗ − γ₁ĝ_{ξᵗ}(xᵗ)
7:  for worker i ∈ [n] in parallel:
8:   gᵗ_{ηᵢ} ← gᵢ(x̃ᵗ, ηᵢ)
9:   send gᵗ_{ηᵢ} if i ∈ G; Byzantine workers may send anything
10:  ĝ_{ηᵗ}(x̃ᵗ) = TRIM_{ϵ,α,δ}(gᵗ_{η₁}, . . . , gᵗ_{ηₙ})
11:  xᵗ⁺¹ ← xᵗ − γ₂ĝ_{ηᵗ}(x̃ᵗ)

Performance of the Univariate Trimmed-Mean Estimator.
The TRIM operator takes n vectors as input and applies, coordinate-wise, the univariate trimmed-mean estimator from Lugosi and Mendelson [2021], described below as the Univariate Trimmed-Mean Estimator Algorithm.

Univariate Trimmed-Mean Estimator Algorithm [Lugosi and Mendelson, 2021]
Input: corrupted data set Z₁, . . . , Z_{n/2}, Z̃₁, . . . , Z̃_{n/2}, corruption fraction δ, and confidence level π.
1: Set ϵ = 8δ + 24 log(4/π)/n.
2: Let Z*₁ ≤ Z*₂ ≤ . . . ≤ Z*_{n/2} be a non-decreasing arrangement of {Zᵢ}_{i∈[n/2]}. Compute the quantiles γ = Z*_{ϵn/2} and β = Z*_{(1−ϵ)n/2}.
3: Compute the robust mean estimate µ̂_Z = (2/n) Σᵢ₌₁^{n/2} φ_{γ,β}(Z̃ᵢ), where φ_{γ,β}(x) = β if x > β; x if x ∈ [γ, β]; γ if x < γ.

The following result on the performance of the Univariate Trimmed-Mean Estimator plays a key role in the analysis of RDEG.

Theorem [Adibi et al., 2022, Theorem 1]. Consider the trimmed-mean estimator. Suppose δ ∈ [0, 1/16), and let π ∈ (0, 1) be such that π ≥ 4e^{−n/2}. Then, there exists a universal constant c such that, with probability at least 1 − π, |µ̂_Z − µ_Z| ≤ cσ_Z(√δ + √(log(1/π)/n)).

Using this componentwise result, the authors state that ‖ĝ_{ξᵗ}(xᵗ) − F(xᵗ)‖ ≤ cσ(√δ + √(log(1/π)/n)). In fact, this result is very similar to Definition 1.1. The main difference is that for the Univariate Trimmed-Mean Estimator the bound holds only with some probability. The other difference is that, writing the result with ρ² = c²σ², i.e., ‖ĝ_{ξᵗ}(xᵗ) − F(xᵗ)‖² ≤ δρ² + ρ² log(1/π)/n with probability at least 1 − π, the Univariate Trimmed-Mean Estimator has an additional term inversely proportional to the number of workers. Moreover, the result requires δ ∈ [0, 1/16), in contrast to the aggregators we used, which work for the wider range of corruption levels δ ∈ [0, 1/5].

Performance guarantees for RDEG. The authors of [Adibi et al., 2022] consider only the homogeneous case.

Theorem [Adibi et al., 2022, Theorem 3]. Suppose Assumptions 3 and 4 hold in conjunction with the following assumptions on δ and n: the fraction δ of corrupted devices satisfies δ ∈ [0, 1/16), and the number of agents n is sufficiently large: n ≥ 48 log(16dT²).
Then, with π = 1/(4dT²) and step-size η ≤ 1/(4L), RDEG guarantees the following with probability at least 1 − 1/T: ‖x* − x_{T+1}‖² ≤ 2e^{−Tκ/4} R² + 8cσRκ(√δ + √(log(4dT²)/n)), where κ = µ/L.

The result implies that RDEG benefits from collaboration only when the corruption level is small. In fact, the term log(4dT²)/n ≤ log(4dT²)/(48 log(16dT²)) ≤ 1/48, so the corruption level should be less than 1/48 for RDEG to benefit significantly from collaboration, in contrast to our SEG-CC, which only requires a corruption level less than 1/5. Moreover, in the case of a larger corruption level, RDEG converges to a ball centered at the solution whose squared radius is Õ(cσRκ√δ), in contrast to our methods SGDA-RA, SEG-RA and M-SGDA-RA, which converge to a ball centered at the solution of squared radius Õ(cδσ²/µ²); the latter has a better dependence on σ. This is crucial with an increasing batchsize b, since σ² depends on the batchsize as 1/b.

C Auxiliary results

C.1 Basic Inequalities

For all a, b ∈ ℝⁿ and λ > 0, q ∈ (0, 1]:

|⟨a, b⟩| ≤ ‖a‖₂²/(2λ) + λ‖b‖₂²/2, (11)
‖a + b‖₂² ≤ 2‖a‖₂² + 2‖b‖₂², (12)
‖a + b‖₂² ≤ (1 + λ)‖a‖₂² + (1 + 1/λ)‖b‖₂², (13)
2⟨a, b⟩ = ‖a + b‖₂² − ‖a‖₂² − ‖b‖₂², (14)
2⟨a, b⟩ = −‖a − b‖₂² + ‖a‖₂² + ‖b‖₂², (15)
‖Σᵢ₌₁ᵏ aᵢ‖₂² ≤ k Σᵢ₌₁ᵏ ‖aᵢ‖₂², (16)
2⟨a, b⟩ ≤ 2‖a‖₂‖b‖₂, (17)
1 − q ≤ 1/(1 + q). (18)

C.2 Useful Lemmas

We write gᵗᵢ or simply gᵢ instead of gᵢ(xᵗ, ξᵗᵢ) when there is no ambiguity.

Lemma C.1. Suppose that the operator F is given in the form (1) and Assumptions 1 and 6 hold. Then E_ξ‖g(x, ξ)‖² ≤ ℓ⟨F(x), x − x*⟩ + σ²/G, where E_ξ := Πᵢ E_{ξᵢ} and g(x, ξ) = (1/G) Σ_{i∈G} gᵢ(x; ξᵢ).

Proof of Lemma C.1. First of all, one can decompose the squared norm of a difference and obtain E_ξ‖g(x, ξ) − F(x)‖² = E_ξ‖g(x, ξ)‖² − 2⟨E_ξ g(x, ξ), F(x)⟩ + ‖F(x)‖². Since g(x, ξ) = (1/G) Σ_{i∈G} gᵢ(x; ξᵢ), by the definition (1) of F and by Assumption 1, one has E_ξ g(x, ξ) = (1/G) Σ_{i∈G} E_{ξᵢ} gᵢ(x, ξᵢ) = (1/G) Σ_{i∈G} Fᵢ(x) = F(x), and consequently E_ξ‖g(x, ξ)‖² = E_ξ‖g(x, ξ) − F(x)‖² + ‖F(x)‖².
(20) One can bound E_ξ‖g(x, ξ) − F(x)‖² as E_ξ‖g(x, ξ) − F(x)‖² = E_ξ‖(1/G) Σ_{i∈G} (gᵢ(x; ξᵢ) − Fᵢ(x))‖², which by independence of the ξᵢ equals (1/G²) Σ_{i∈G} E_{ξᵢ}‖gᵢ(x; ξᵢ) − Fᵢ(x)‖² ≤ σ²/G, where the last inequality follows from (SC). The above chain together with (20) and (SC) implies the statement of the lemma.

Lemma C.2. Let K > 0 be a positive integer and η₁, η₂, . . . , η_K be random vectors such that E_k[η_k] := E[η_k | η₁, . . . , η_{k−1}] = 0 for k = 2, . . . , K. Then E‖Σ_{k=1}^K η_k‖² = Σ_{k=1}^K E[‖η_k‖²]. (21)

Proof. We start with the following derivation:
E‖Σ_{k=1}^K η_k‖² = E[‖η_K‖²] + 2E⟨η_K, Σ_{k=1}^{K−1} η_k⟩ + E‖Σ_{k=1}^{K−1} η_k‖² = E[‖η_K‖²] + 2E⟨E_K[η_K], Σ_{k=1}^{K−1} η_k⟩ + E‖Σ_{k=1}^{K−1} η_k‖² = E[‖η_K‖²] + E‖Σ_{k=1}^{K−1} η_k‖².
Applying similar steps to E‖Σ_{k=1}^{K−1} η_k‖², E‖Σ_{k=1}^{K−2} η_k‖², . . . , E‖Σ_{k=1}^{2} η_k‖², we get the result.

Lemma C.3. Suppose r_K ≤ r₀(1 − aγ)^K + c₁γ holds for all γ ≤ γ₀. Then the choice of γ = min{γ₀, ε/(3c₁)} together with K ≥ (1/a) max{1/γ₀, 3c₁/ε} log(3r₀/ε) implies that r_K ≤ ε.

Proof. The choice of γ guarantees c₁γ ≤ ε/3. Since γ = min{γ₀, ε/(3c₁)}, we have 1/γ = max{1/γ₀, 3c₁/ε}, so the choice of K guarantees r₀(1 − aγ)^K ≤ r₀e^{−aγK} ≤ ε/3, which finishes the proof.

Lemma C.4 (see also Lemma 2 from Stich [2019] and Lemma D.2 from Gorbunov et al. [2020]). Let {r_k}_{k≥0} satisfy r_K ≤ r₀(1 − aγ)^{K+1} + c₁γ + c₂γ² (23) for all K ≥ 0 with some constants a, c₂ > 0, c₁ ≥ 0, γ ≤ γ₀. Then, for γ = min{γ₀, ln(max{2, min{ar₀K/c₁, a²r₀K²/c₂}})/(a(K+1))}, we have r_K = Õ( r₀ exp(−aγ₀(K+1)) + c₁/(aK) + c₂/(a²K²) ). (24) Moreover, r_K ≤ ε after Õ( 1/(aγ₀) + c₁/(aε) + √(c₂/(a²ε)) ) iterations.

Proof. We have r_K ≤ r₀(1 − aγ)^{K+1} + c₁γ + c₂γ² ≤ r₀ exp(−aγ(K+1)) + c₁γ + c₂γ². (25) Next we consider two possible situations.
1. If γ₀ ≥ ln(max{2, min{ar₀K/c₁, a²r₀K²/c₂}})/(a(K+1)), then we choose γ = ln(max{2, min{ar₀K/c₁, a²r₀K²/c₂}})/(a(K+1)) and get from (25) that r_K ≤ r₀ exp(−ln max{2, min{ar₀K/c₁, a²r₀K²/c₂}}) + c₁γ + c₂γ² = Õ( c₁/(aK) + c₂/(a²K²) ).
2. If γ₀ < ln(max{2, min{ar₀K/c₁, a²r₀K²/c₂}})/(a(K+1)), then we choose γ = γ₀, which implies from (25) that r_K ≤ r₀ exp(−aγ₀(K+1)) + c₁γ₀ + c₂γ₀² = Õ( r₀ exp(−aγ₀(K+1)) + c₁/(aK) + c₂/(a²K²) ).
Combining the obtained bounds, we get the result.
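Lemma C.2 can be sanity-checked numerically. The snippet below (our own illustrative check, not from the paper) draws independent zero-mean Gaussian vectors, a special case of the martingale-difference condition, and verifies that E‖Σ_k η_k‖² matches Σ_k E‖η_k‖² by Monte Carlo:

```python
import numpy as np

# Empirically check E||sum_k eta_k||^2 == sum_k E||eta_k||^2
# for independent zero-mean vectors (special case of Lemma C.2).
rng = np.random.default_rng(0)
K, d, samples = 5, 3, 200_000

etas = rng.standard_normal((samples, K, d))  # eta_k ~ N(0, I_d)
lhs = (etas.sum(axis=1) ** 2).sum(axis=1).mean()  # E||sum_k eta_k||^2
rhs = (etas ** 2).sum(axis=(1, 2)).mean()         # sum_k E||eta_k||^2
# Both sides estimate K * d = 15 and should agree up to sampling error.
```

For truly dependent martingale differences the identity still holds by the tower-property argument in the proof; the independent case above is just the easiest to simulate.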
D Methods that use robust aggregators

First of all, we provide the result of Karimireddy et al. [2022] that describes the error of RAGG, where mᵗ = αgᵗ + (1 − α)mᵗ⁻¹.

Lemma D.1 (Aggregation error; Karimireddy et al. [2022]). Given that RAGG satisfies Definition 1.1, the error between the ideal average momentum m̄ᵗ and the output mᵗ of the robust aggregation rule can, for any t ≥ 1, be bounded as E‖mᵗ − m̄ᵗ‖² ≤ cδ(ρᵗ)², where we define, for t ≥ 1, (ρᵗ)² := 4(6ασ² + 3ζ²) + 4(6σ² − 3ζ²)(1 − α)^{t+1}. For t = 0 we can simplify the bound as (ρ⁰)² := 24σ² + 12ζ². Moreover, one can state a uniform bound for (ρᵗ)²: (ρᵗ)² ≤ ρ² = 24σ² + 12ζ². (26)

D.1 Proofs for SGDA-RA

Algorithm 1 SGDA-RA
Input: RAGG, γ
1: for t = 0, . . . do
2:  for worker i ∈ [n] in parallel:
3:   gᵗᵢ ← gᵢ(xᵗ, ξᵢ)
4:   send gᵗᵢ if i ∈ G; Byzantine workers may send anything
5:  ĝᵗ = RAGG(gᵗ₁, . . . , gᵗₙ) and xᵗ⁺¹ ← xᵗ − γĝᵗ // update params using robust aggregate

D.1.1 Quasi-Strongly Monotone Case

Theorem (Theorem 1 duplicate). Let Assumptions 1, 2, 4 and 6 hold. Then, after T iterations, SGDA-RA (Algorithm 1) with a (δ, c)-RAGG and γ ≤ 1/(2ℓ) outputs x^T such that E‖x^T − x*‖² ≤ (1 − γµ)^T‖x⁰ − x*‖² + 2γσ²/(µG) + 2γcδρ²/µ, where ρ² = 24σ² + 12ζ² by Lemma D.1 with α = 1.

Proof of Theorem 1. We start the proof with
‖xᵗ⁺¹ − x*‖² = ‖xᵗ − x* − γĝᵗ‖² = ‖xᵗ − x*‖² − 2γ⟨ĝᵗ, xᵗ − x*⟩ + γ²‖ĝᵗ‖².
Since ĝᵗ = (ĝᵗ − gᵗ) + gᵗ, one has
‖xᵗ⁺¹ − x*‖² = ‖xᵗ − x*‖² − 2γ⟨ĝᵗ − gᵗ, xᵗ − x*⟩ − 2γ⟨gᵗ, xᵗ − x*⟩ + γ²‖ĝᵗ‖².
Applying (11) to ⟨ĝᵗ − gᵗ, xᵗ − x*⟩ with λ = γµ/2 and (12) to ‖ĝᵗ‖² = ‖(ĝᵗ − gᵗ) + gᵗ‖², we derive
‖xᵗ⁺¹ − x*‖² ≤ (1 + γ²µ²/2)‖xᵗ − x*‖² − 2γ⟨gᵗ, xᵗ − x*⟩ + (2/µ)‖ĝᵗ − gᵗ‖² + 2γ²‖ĝᵗ − gᵗ‖² + 2γ²‖gᵗ‖².
Next, by taking the expectation E_ξ of both sides of the above inequality and rearranging terms, we obtain
E_ξ‖xᵗ⁺¹ − x*‖² ≤ (1 + γ²µ²/2)‖xᵗ − x*‖² − 2γ⟨F(xᵗ), xᵗ − x*⟩ + (2/µ)E_ξ‖ĝᵗ − gᵗ‖² + 2γ²E_ξ‖ĝᵗ − gᵗ‖² + 2γ²E_ξ‖gᵗ‖².
Next, we use Lemmas C.1 and D.1 to derive
E_ξ‖xᵗ⁺¹ − x*‖² ≤ (1 + γ²µ²/2)‖xᵗ − x*‖² + (2γ²ℓ − 2γ)⟨F(xᵗ), xᵗ − x*⟩ + 2γ²σ²/G + 2cδρ²γ²,
which together with the choice γ ≤ 1/(2ℓ) and Assumption (QSM) allows us to obtain
E_ξ‖xᵗ⁺¹ − x*‖² ≤ (1 − γµ)‖xᵗ − x*‖² + 2γ²σ²/G + 2cδρ²γ².
Next, we take the full expectation of both sides and obtain
E‖xᵗ⁺¹ − x*‖² ≤ (1 − γµ)E‖xᵗ − x*‖² + 2γ²σ²/G + 2cδρ²γ².
Unrolling the recursion, the latter implies E‖x^T − x*‖² ≤ (1 − γµ)^T‖x⁰ − x*‖² + 2γσ²/(µG) + 2γcδρ²/µ, where ρ is bounded by Lemma D.1 with α = 1.

Corollary 1. Let the assumptions of Theorem 1 hold. Then E‖x^T − x*‖² ≤ ε holds after T = O( (ℓ/µ + σ²/(Gµ²ε) + cδσ²/(µ²ε)) log(‖x⁰ − x*‖²/ε) ) iterations of SGDA-RA with γ = min{1/(2ℓ), ε/(3(2σ²/(µG) + 48cδσ²/µ))} and b ≥ 72cδσ²/(…).

Proof. If ζ = 0, then ρ² = 24σ², and the result of Theorem 1 can be simplified as E‖x^T − x*‖² ≤ (1 − γµ)^T‖x⁰ − x*‖² + 2γσ²/(µG) + 48γcδσ²/µ. Applying Lemma C.3 to the last bound, we get the result of the corollary.

D.2 Proofs for SEG-RA

Algorithm 2 SEG-RA
Input: RAGG, γ
1: for t = 1, . . . do
2:  for worker i ∈ [n] in parallel:
3:   gᵗ_{ξᵢ} ← gᵢ(xᵗ, ξᵢ)
4:   send gᵗ_{ξᵢ} if i ∈ G; Byzantine workers may send anything
5:  ĝ_{ξᵗ}(xᵗ) = RAGG(gᵗ_{ξ₁}, . . . , gᵗ_{ξₙ})
6:  x̃ᵗ ← xᵗ − γ₁ĝ_{ξᵗ}(xᵗ) // update params using robust aggregate
7:  for worker i ∈ [n] in parallel:
8:   gᵗ_{ηᵢ} ← gᵢ(x̃ᵗ, ηᵢ)
9:   send gᵗ_{ηᵢ} if i ∈ G; Byzantine workers may send anything
10:  ĝ_{ηᵗ}(x̃ᵗ) = RAGG(gᵗ_{η₁}, . . . , gᵗ_{ηₙ})
11:  xᵗ⁺¹ ← xᵗ − γ₂ĝ_{ηᵗ}(x̃ᵗ) // update params using robust aggregate

To analyze the convergence of SEG, we introduce the following notation:
g_{ξᵏ}(xᵏ) = (1/G) Σ_{i∈G} gᵢ(xᵏ, ξᵏᵢ), ĝ_{ξᵏ}(xᵏ) = RAGG(g₁(xᵏ, ξᵏ₁), . . . , gₙ(xᵏ, ξᵏₙ)), x̃ᵏ = xᵏ − γ₁ĝ_{ξᵏ}(xᵏ), g_{ηᵏ}(x̃ᵏ) = (1/G) Σ_{i∈G} gᵢ(x̃ᵏ, ηᵏᵢ),
where ξᵏᵢ, i ∈ G, and ηᵏⱼ, j ∈ G, are i.i.d. samples satisfying Assumption 1; due to this independence we have:

Corollary 2. Suppose that the operator F is given in the form (1) and Assumption 1 holds. Then E_{ξᵏ}‖g_{ξᵏ}(xᵏ) − F(xᵏ)‖² ≤ σ²/G (27) and E_{ηᵏ}‖g_{ηᵏ}(x̃ᵏ) − F(x̃ᵏ)‖² ≤ σ²/G. (28)

D.2.1 Auxiliary results

Lemma D.2. Let Assumptions 2, 3, 4 and Corollary 2 hold.
Then, for SEG-RA (Algorithm 2), g_{ηᵏ}(x̃ᵏ) = g_{ηᵏ}(xᵏ − γ₁ĝ_{ξᵏ}(xᵏ)) satisfies the following inequality:
γ₁²E[‖g_{ηᵏ}(x̃ᵏ)‖² | xᵏ] ≤ 2P̂ᵏ + 8γ₁²σ²/G + 4γ₁²cδρ², (30)
where P̂ᵏ = γ₁E_{ξᵏ,ηᵏ}⟨g_{ηᵏ}(x̃ᵏ), xᵏ − x*⟩ and ρ² = 24σ² + 12ζ² by Lemma D.1 with α = 1.

Proof. Using the auxiliary iterate x̂ᵏ⁺¹ = xᵏ − γ₁g_{ηᵏ}(x̃ᵏ), we get
‖x̂ᵏ⁺¹ − x*‖² = ‖xᵏ − x*‖² − 2γ₁⟨xᵏ − x*, g_{ηᵏ}(x̃ᵏ)⟩ + γ₁²‖g_{ηᵏ}(x̃ᵏ)‖² (31)
= ‖xᵏ − x*‖² − 2γ₁⟨xᵏ − γ₁ĝ_{ξᵏ}(xᵏ) − x*, g_{ηᵏ}(x̃ᵏ)⟩ − 2γ₁²⟨ĝ_{ξᵏ}(xᵏ), g_{ηᵏ}(x̃ᵏ)⟩ + γ₁²‖g_{ηᵏ}(x̃ᵏ)‖². (33)
Taking the expectation E_{ξᵏ,ηᵏ}[·] = E[· | xᵏ] conditioned on xᵏ, using the tower property E_{ξᵏ,ηᵏ}[·] = E_{ξᵏ}[E_{ηᵏ}[·]] and the µ-quasi strong monotonicity of F, we derive from (QSM) and (14) that
E_{ξᵏ,ηᵏ}‖x̂ᵏ⁺¹ − x*‖² ≤ ‖xᵏ − x*‖² − γ₁²E_{ξᵏ,ηᵏ}‖ĝ_{ξᵏ}(xᵏ)‖² + γ₁²E_{ξᵏ,ηᵏ}‖ĝ_{ξᵏ}(xᵏ) − g_{ηᵏ}(x̃ᵏ)‖².
To upper bound the last term we use the simple inequality (16) and apply the L-Lipschitzness of F:
E_{ξᵏ,ηᵏ}‖ĝ_{ξᵏ}(xᵏ) − g_{ηᵏ}(x̃ᵏ)‖² ≤ 4E_{ξᵏ}‖g_{ξᵏ}(xᵏ) − ĝ_{ξᵏ}(xᵏ)‖² + 4E_{ξᵏ}‖F(xᵏ) − F(x̃ᵏ)‖² + 4E_{ξᵏ}‖g_{ξᵏ}(xᵏ) − F(xᵏ)‖² + 4E_{ξᵏ,ηᵏ}‖g_{ηᵏ}(x̃ᵏ) − F(x̃ᵏ)‖²,
so that by (Lip), (27), (28) and Lemma D.1,
E_{ξᵏ,ηᵏ}‖x̂ᵏ⁺¹ − x*‖² ≤ ‖xᵏ − x*‖² − γ₁²(1 − 4γ₁²L²)E_{ξᵏ}‖ĝ_{ξᵏ}(xᵏ)‖² + 8γ₁²σ²/G + 4γ₁²cδρ² ≤ ‖xᵏ − x*‖² + 8γ₁²σ²/G + 4γ₁²cδρ².
Finally, we use the above inequality together with (31):
‖xᵏ − x*‖² − 2P̂ᵏ + γ₁²E[‖g_{ηᵏ}(x̃ᵏ)‖² | xᵏ] ≤ ‖xᵏ − x*‖² + 8γ₁²σ²/G + 4γ₁²cδρ²,
where P̂ᵏ = γ₁E_{ξᵏ,ηᵏ}⟨g_{ηᵏ}(x̃ᵏ), xᵏ − x*⟩. Rearranging the terms, we obtain (30).

Lemma D.3. Consider SEG-RA (Algorithm 2). Let Assumptions 2, 3, 4 and Corollary 2 hold. If γ₁ ≤ 1/(2µ + 2L), (34) then g_{ηᵏ}(x̃ᵏ) = g_{ηᵏ}(xᵏ − γ₁ĝ_{ξᵏ}(xᵏ)) satisfies the following inequalities:
P̂ᵏ ≥ µγ₁‖xᵏ − x*‖² + (γ₁²/4)E_{ξᵏ}‖ĝ_{ξᵏ}(xᵏ)‖² − 8γ₁²σ²/G − 9γ₁²cδρ² and, in particular, P̂ᵏ ≥ µγ₁‖xᵏ − x*‖² − 4γ₁²σ²/G − 4γ₁²cδρ², (35)
where P̂ᵏ = γ₁E_{ξᵏ,ηᵏ}⟨g_{ηᵏ}(x̃ᵏ), xᵏ − x*⟩ and ρ² = 24σ² + 12ζ² by Lemma D.1 with α = 1. Proof.
Since Eξk,ηk[ ] = E[ | xk] and gηk(exk) = gηk xk γ1bgξk(xk) , we have b Pk = γ1Eξk,ηk gηk(exk), xk x = γ1Eξk Eηk[gηk(exk)], xk γ1bgξk(xk) x γ2 1E gηk(exk), bgξk(xk) (14) = γ1Eξk F(xk γ1bgξk(xk)), xk γ1bgξk(xk) x γ2 1 2 Eξk,ηk h gηk(exk) 2i γ2 1 2 Eξk h bgξk(xk) 2i +γ2 1 2 Eξk,ηk h gηk(exk) bgξk(xk) 2i (QSM),(16) µγ1Eξk,ηk h xk x γ1bgξk(xk) 2i γ2 1 2 Eξk h bgξk(xk) 2i +4γ2 1 2 Eξk h gξk(xk) bgξk(xk) 2i +4γ2 1 2 Eξk h F(xk) F exk 2i +4γ2 1 2 Eξk h gξk(xk) F(xk) 2i +4γ2 1 2 Eξk,ηk h gηk exk F exk 2i (17),(Lip),Lem. D.1,Cor. 2 xk x 2 γ2 1 2 (1 2γ1µ 4γ2 1L2)Eξk h bgξk(xk) 2i 2G + 4γ2 1σ2 2G + 4γ2 1cδρ2 xk x 2 γ2 1 2 Eξk h bgξk(xk) 2i + 4γ2 1σ2 G + 4γ2 1cδρ2 So one have xk x 2 γ2 1 4 Eξk h gξk(xk) 2i + 4γ2 1σ2 G + 9γ2 1cδρ2 xk x 2 + 4γ2 1σ2 G + 4γ2 1cδρ2 that concludes the proof. D.2.2 Quasi-Strongly Monotone Case Combining Lemmas D.2 and D.3, we get the following result. Theorem (Theorem 2 duplicate). Let Assumptions 1, 2, 3 and 4 hold. Then after T iterations SEG-RA (Algorithm 2) with (δ, c)-RAGG, γ1 1 2µ+2L and β = γ2/γ1 1/4 outputs x T such that E x T x 2 1 µβγ1 T x0 x 2 + 8γ1σ2 µβG + 8cδρ2 γ1 where ρ2 = 24σ2 + 12ζ2 by Lemma D.1 with α = 1. Proof of Theorem 2. 
Since xk+1 = xk γ2bgηk exk , we have xk+1 x 2 = xk γ2bgηk exk x 2 = xk x 2 2γ2 bgηk exk , xk x + γ2 2 bgηk exk 2 xk x 2 2γ2 gηk exk , xk x + 2γ2 2 gηk exk 2 + 2γ2 2 gηk exk bgηk exk 2 + 2γ2 gηk exk bgηk exk , xk x (1 + λ) xk x 2 2γ2 gηk exk , xk x + 2γ2 2 gηk exk 2 gηk exk bgηk exk 2 Taking the expectation, conditioned on xk, Eξk,ηk xk+1 x 2 (1 + λ) xk x 2 2βγ1Eξk,ηk gηk exk , xk x +2β2γ2 1Eξk,ηk gηk exk 2 + γ2 2cδρ2 2 + 1 using the definition of b Pk = γ1Eξk,ηk gηk(exk), xk x , we continue our derivation: Eξk,ηk h xk+1 x 2i = (1 + λ) xk x 2 2β b Pk + 2β2γ2 1Eξk,ηk gηk exk 2 +γ2 2cδρ2 2 + 1 (30) (1 + λ) xk x 2 2β b Pk + 2β2 2 b Pk + 8γ2 1σ2 G + 4γ2 1cδρ2 +γ2 2cδρ2 2 + 1 0 β 1/2 (1 + λ) xk x 2 2 b Pk β 2β2 + 16γ2 2σ2 G + 8γ2 2cδρ2 +γ2 2cδρ2 2 + 1 (35) (1 + λ) xk x 2 +2β(1 2β) µγ1 xk x 2 + 4γ2 1σ2 G + 4γ2 1cδρ2 G + 8γ2 2cδρ2 + γ2 2cδρ2 2 + 1 1 + λ 2β(1 2β)µγ1 G + γ2 1cδρ2 + 16γ2 2σ2 G + 8γ2 2cδρ2 + γ2 2cδρ2 2 + 1 0 β 1/4 1 + λ µγ2 xk x 2 + σ2 G γ2 1 + 16γ2 2 +cδρ2 γ2 1 + 10γ2 2 + γ2 2 λ λ=µγ2/4 1 µγ2 xk x 2 + σ2 G γ2 1 + 16γ2 2 +cδρ2 γ2 1 + 10γ2 2 + 4γ2 Next, we take the full expectation from the both sides E h xk+1 x 2i 1 µγ2 E h xk x 2i + σ2 G γ2 1 + 16γ2 2 +cδρ2 γ2 1 + 10γ2 2 + 4γ2 Unrolling the recurrence, together with the bound on ρ given by Lemma D.1 we derive the result of the theorem: E h x K x 2i 1 µγ2 K x0 x 2 + 4σ2(γ2 1 + 16γ2 2) µγ2G + 4cδρ2(γ2 1 + 10γ2 2 + 4γ2 Corollary 3. Let assumptions of Theorem 2 hold. Then E x T x 2 ε holds after βµ + 1 3βcδG iterations of SEG-RA with γ1 = min 1 2µ+2L, 1 2µ+ µ 12cδG and b 288cδσ2 Proof. Next, we plug γ2 = βγ1 γ1/4 into the result of Theorem 2 and obtain E h xk+1 x 2i 1 µβγ1 x0 x 2 + 8γ2 1σ2 µβG + 8γ1cδρ2 βµ + 16cδρ2 If ζ = 0, ρ2 = 24σ2 the last reccurence can be unrolled as E x T x 2 1 µβγ1 T x0 x 2 + 8γ1σ2 µβG + 8 24γ1cδσ2 βµ + 16 24cδσ2 Applying Lemma C.3 to the last bound we get the result of the corollary. D.3 Proofs for M-SGDA-RA Algorithm 3 M-SGDA-RA Input: RAGG, γ, α [0, 1] 1: for t = 0, ... 
do 2: for worker i [n] in parallel 3: gt i gi(xt, ξi) and mt i (1 α)mt 1 i + αgt i // worker momentum 4: send mt i if i G, else send if Byzantine 5: c mt = RAGG (mt 1, . . . , mt n) and xt+1 xt γc mt. // update params using robust aggregate D.3.1 Quasi-Strongly Monotone Case Theorem (Theorem 3 duplicate). Let Assumptions 1, 2, 4, and 6 hold. Then after T iterations M-SGDA-RA (Algorithm 3) with (δ, c)-RAGG outputs x T such that E h x T x 2i 2 x0 x 2 µγαWT + 4cδρ2 µ2α2 + 8γcδρ2 where x T = 1 WT PT t=0 wtbxt, bxt = α 1 (1 α)t+1 Pt j=0(1 α)t jxj, wt = 1 µγα and WT = PT t=0 wt and ρ2 = 24σ2 + 12ζ2 by Lemma D.1. Proof of Theorem 3. Since c mt = c mt mt + mt and mt = αgt + (1 α)mt 1 one has xt+1 x 2 = xt x γc mt 2 = xt x 2 2γ c mt, xt x + γ2 c mt 2 = = xt x 2 2γ c mt mt, xt x + γ2 c mt 2 2γ mt, xt x . Next, unrolling the following recursion mt, xt x = α gt, xt x + (1 α) mt 1, xt x = α gt, xt x + (1 α) mt 1, xt 1 x + (1 α) mt 1, xt xt 1 = α gt, xt x + (1 α) mt 1, xt 1 x + (1 α)γ mt 1, c mt one obtains mt, xt x = α j=0 (1 α)t j gj, xj x (1 α)γ j=1 (1 α)t j mj 1, c mj Applying the latter and (11) for c mt mt, xt x with λ = µγα 2 we obtain j=0 (1 α)t j gj, xj x xt x 2 xt+1 x 2 + 2γ +γ2 c mt 2 + 2γ2(1 α) j=1 (1 α)t j mj 1, c mj (11) 1 + µγα xt x 2 xt+1 x 2 + 2γ +γ2 c mt 2 + γ2 t X j=1 (1 α)t j c mj 2 +γ2(1 α)2 t X j=1 (1 α)t j mj 1 2 (12) 1 + µγα xt x 2 xt+1 x 2 + 2γ j=1 (1 α)t j c mj mj 2 j=1 (1 α)t j mj 1 2 Since mt = α Pt j=0(1 α)t jgj and hence mt 2 α Pt j=0(1 α)t j gj 2 one has j=0 (1 α)t j gj, xj x xt x 2 xt+1 x 2 + 2γ j=1 (1 α)t j c mj mj 2 j=1 (1 α)t j j X i=0 (1 α)j i gi 2. Next by taking an expectation Eξ of both sides of the above inequality and rearranging terms obtain j=0 (1 α)t j F j, xj x xt x 2 Eξ xt+1 x 2 µαEξ c mt mt 2 + 4γ2 t X j=1 (1 α)t j Eξ c mj mj 2 j=1 (1 α)t j j X i=0 (1 α)j i Eξ gi 2. Next we use Lemma C.1 to bound Eξ gj 2 and Assumption (QSM) to obtain the following bound j=0 (1 α)t j F j, xj x α j=0 (1 α)t j xj x 2 αµ xt x 2. 
Gathering the above results we have j=0 (1 α)t j F j, xj x xt x 2 Eξ xt+1 x 2 µαEξ c mt mt 2 + 4γ2 t X j=1 (1 α)t j Eξ c mj mj 2 j=1 (1 α)t j j X i=0 (1 α)j i F(xi), xi x j=1 (1 α)t j j X i=0 (1 α)j i. Next we take full expectation of both sides and obtain j=0 (1 α)t j E F j, xj x E xt x 2 E xt+1 x 2 µαE c mt mt 2 + 4γ2 t X j=1 (1 α)t j E c mj mj 2 j=1 (1 α)t j j X i=0 (1 α)j i E F(xi), xi x Introducing the following notation j=0 (1 α)t j E F j, xj x and using that E c mt mt 2 cδρ2, where ρ is given by (26) we have E xt x 2 E xt+1 x 2 + 2γcδρ2 j=1 (1 α)t j Zj + 3γ2σ2 Next we sum the above inequality T times with weights wt = 1 µγα 2 t 1 where WT = PT t=0 wt t=0 wt Zt 1 µγα t=0 wt E xt x 2 t=0 wt E xt+1 x 2 j=0 (1 α)t j Zj µα + 4γ2cδρ2 Since 1 µγα 2 wt = wt 1 t=0 wt Zt x0 x 2 + 3γ2αℓ j=0 (1 α)t j Zj µα + 4γ2cδρ2 t+i wi 1 + µγα t i wi 1 + α t=0 wt Zt x0 x 2 t j (1 α)t jwj Zj µα + 4γ2cδρ2 x0 x 2 + 3γ2αℓ µα + 4γ2cδρ2 t=0 wt Zt ! X µα + 4γ2cδρ2 x0 x 2 + 6γ2ℓ t=0 wt Zt ! µα + 4γ2cδρ2 If γ α 12ℓthen the following is true t=0 wt Zt x0 x 2 + WT µα + 4γ2cδρ2 Using the notations for Zt and (QSM) we have j=0 (1 α)t j E F j, xj x µ j=0 (1 α)t j E xj x . and consequently by Jensen s inequality j=0 (1 α)t j E xj x µ1 (1 α)t+1 1 (1 α)t+1 xj With the definition bxt = α 1 (1 α)t+1 Pt j=0(1 α)t jxj then the above implies that Zt µE bxt x , that together with (40) gives t=0 wt E bxt x x0 x 2 + WT µα + 4γ2cδρ2 Applying the Jensen inequality again with x T = 1 WT PT t=0 wtbxt we derive the final result E h x T x 2i 2 x0 x 2 µγαWT + 4cδρ2 µ2α2 + 8γcδρ2 µα2G , (42) together with the bound on ρ given by Lemma D.1. Corollary 4. Let assumptions of Theorem 3 hold. Then E x T x 2 ε holds after µα + 1 8cδG iterations of M-SGDA-RA with γ = min α 12ℓ, 1 2µ+ µ 16cδG Proof. If ζ = 0, ρ2 = 24σ2 the result of Theorem 3 can be simplified as E h x T x 2i 2 x0 x 2 µγαWT + 4 24cδσ2 µ2α2 + 8 24γcδσ2 Since 2 µγαWT 1 µγα 2 T +1 we can apply Lemma C.3 and get the result of the corollary. 
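M-SGDA-RA (Algorithm 3) combines worker-side momentum with robust aggregation of the momentum buffers. The following is a minimal Python sketch of that loop, not the authors' implementation: it uses a coordinate-wise median as a stand-in for the $(\delta, c)$-robust aggregator RAGG, a toy quasi-strongly monotone operator, and illustrative stepsize/momentum/noise values; all names are ours.

```python
import random
from statistics import median

def m_sgda_ra(F, x0, n=10, byz=2, gamma=0.05, alpha=0.5, T=2000,
              noise=0.1, seed=0):
    """Sketch of M-SGDA-RA: honest workers send momentum-averaged noisy
    estimates of F(x); Byzantine workers send arbitrary large vectors;
    the server aggregates with a coordinate-wise median (standing in
    for a (delta, c)-robust aggregator) and steps against the aggregate."""
    rng = random.Random(seed)
    d = len(x0)
    x = list(x0)
    m = [[0.0] * d for _ in range(n)]          # per-worker momentum m_i^t
    for _ in range(T):
        Fx = F(x)
        msgs = []
        for i in range(n):
            if i < n - byz:                     # honest worker
                g = [Fx[k] + noise * rng.gauss(0.0, 1.0) for k in range(d)]
                m[i] = [(1 - alpha) * m[i][k] + alpha * g[k] for k in range(d)]
                msgs.append(list(m[i]))
            else:                               # Byzantine worker
                msgs.append([100.0 * rng.gauss(0.0, 1.0) for _ in range(d)])
        agg = [median(v[k] for v in msgs) for k in range(d)]  # robust aggregate
        x = [x[k] - gamma * agg[k] for k in range(d)]
    return x
```

On a toy operator such as $F(x) = Ax$ with $A = \begin{pmatrix} 0.3 & 1 \\ -1 & 0.3 \end{pmatrix}$ (unique solution $x^\star = 0$), the iterates settle near zero despite the two Byzantine workers, since the median ignores the outlier reports as long as honest workers form a majority.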
E Methods with random check of computations We replace the (δmax, c)-RAGG aggregator with the simple mean, but introduce an additional verification that has to be passed for the mean to be accepted. The advantage of such an aggregation is that it coincides with the mean over "good" workers whenever no peer violates the protocol. But if there is at least one peer violating the protocol at iteration $t$, we can bound the variance similarly to Lemma D.1. Algorithm 4 CheckComputations Input: $t$, $G_t \cup B_t$, $C_t$, $\text{Banned}_t = \emptyset$ 1: $C_{t+1} = \{c_1^{t+1}, \ldots, c_m^{t+1}\}$, $C_{t+1} \subseteq (G_t \cup B_t) \setminus C_t$ and $U_{t+1} = \{u_1^{t+1}, \ldots, u_m^{t+1}\}$, $U_{t+1} \subseteq (G_t \cup B_t) \setminus C_t$, where the $2m$ workers $c_1^{t+1}, \ldots, c_m^{t+1}, u_1^{t+1}, \ldots, u_m^{t+1}$ are chosen uniformly at random without replacement. 2: for $i = 1, \ldots, m$ in parallel: $c_i^{t+1}$ checks the computations of $u_i^{t+1}$ during the next iteration 3: $c_i^{t+1}$ receives a query to recalculate $g(x^t, \xi^t_{u_i^{t+1}})$ 4: $c_i^{t+1}$ sends the recalculated $g(x^t, \xi^t_{u_i^{t+1}})$ 5: for $i = 1, \ldots, m$ do 6: if $g(x^t, \xi^t_{u_i^t}) \neq g^t_{u_i^t}$ then 7: $\text{Banned}_t = \text{Banned}_t \cup \{u_i^t, c_i^t\}$. 8: end if Output: $C_{t+1}$, $(G_t \cup B_t) \setminus \text{Banned}_t$ Lemma E.1. Let Assumption 2 be satisfied with $\zeta = 0$. Then the error between the ideal average $g^t$ and the average with the recomputation rule $\hat g^t$ can be bounded as $\mathbb{E}_\xi \|\hat g^t - g^t\|^2 \le \rho^2 \mathbb{1}_t$, where $\rho^2 = q\sigma^2$ with $q = 2C^2 + 12 + \frac{12}{n - 2B - m}$ and $C = O(1)$. Proof of Lemma E.1. Denote the set $\tilde G$ in the following way: $\tilde G = \{i \in G_t \setminus C_t : \|\hat g^t - g^t_i\| \le C\sigma\}$. The aggregate is the plain mean $\hat g^t = \frac{1}{n}\sum_{i=1}^n g^t_i$, accepted if the number of workers passing the check exceeds $n/2$, and recomputed otherwise. So we can decompose the error $\hat g^t - g^t$ into the deviations of the workers in $\tilde G$ and of those outside $\tilde G$. If $\delta \le 1/4$, then an acceptance of $\hat g^t$ implies that $|\tilde G| > n/4$ and $|\tilde G| > |G_t \setminus C_t|/3$.
i e G gt i gt gt i gt 2 1 3 |Gt \ Ct| Bringing the above results together gives that E bgt gt 2 2C2σ2 + 6 |Gt \ Ct| i G E gt i gt 2 Since checks of computations are only possible in homogeneous case (ζ = 0) then E gt i F t 2 = σ2 and Eξ gt i gt 2 2Eξ gt i F t 2 + 2Eξ F t gt 2 2σ2 + 2σ2 Since |Gt \ Ct| > n 2B m Eξ bgt gt 2 2C2σ2 + 12σ2 + 12σ2 |Gt \ Ct| 2C2σ2 + 12σ2 + 12σ2 E.1 Proofs for SGDA-CC Algorithm 5 SGDA-CC 1: C0 = 2: for t = 1, ... do 3: for worker i (Gt Bt) \ Ct in parallel 4: send gt i = gi(xt, ξi), if i Gt \ Ct, , if i Bt \ Ct,, 5: bgt = 1 Wt P i Wt gt i, Wt = (Gt Bt) \ Ct 6: 7: if |{i Wt | bgt gt i Cσ}| |Wt|/2 then 8: xt+1 xt γbgt. 9: else 10: recompute 11: end if 12: Ct+1, Gt+1 Bt+1 = Check Computations(Ct, Gt Bt) E.1.1 Star Co-coercieve Case Next we provide convergence guarantees for SGDA-CC (Algorithm 5) under Assumption 6. Theorem 9. Let Assumptions 1 and 6 hold. Next, assume that where ρ2 = qσ2 with q = 2C2 + 12 + 12 n 2B m and C = O(1) by Lemma E.1 and R x0 x . Then after K iterations of SGDA-CC (Algorithm 5) it outputs x T such that k=0 E F(xk) ℓ k=0 E[ F(xk), xk x ] 2ℓR2 Proof. Since |Gt \ Ct| n 2B m one can derive using the results of Lemmas C.1 and E.1 that E xk+1 x 2 | xk = E xk x γbgk 2 | xk = xk x 2 2γE xk x , bgk | xk + γ2E bgk 2 | xk xk x 2 2γ xk x , F(xk) + 2ℓγ2 xk x , F(xk) 2γE xk x , bgk gk | xk + 2γ2ρ21k + 2γ2σ2 where 1k is an indicator function of the event that at least 1 Byzantine peer violates the protocol at iteration k. To estimate the inner product in the right-hand side we apply Cauchy-Schwarz inequality and then Lemma E.1 2γE xk x , bgk gk | xk 2γ xk x E bgk gk | xk E bgk gk 2 | xk 2γρ xk x 1k. Since γ 1 2ℓthe above results imlies γ xk x , F(xk) xk x 2 E xk+1 x 2 | xk 2γE xk x , bgk gk | xk + 2γ2ρ21k + 2γ2σ2 xk x 2 E xk+1 x 2 | xk +2γ2ρ21k + 2γ2σ2 n 2B m + 2γρ xk x 1k. Taking the full expectation from the both sides of the above inequality and summing up the results for k = 0, 1, . . . 
, K 1 we derive k=0 E[ F(xk), xk x ] E xk x 2 E xk+1 x 2 + 2γ2σ2 k=0 E xk x 1k + 2γ2ρ2 x0 x 2 E[ x K x 2] E [ xk x 2] E[1k] + 2γ2ρ2 Since F satisfies Assumption 6, K 1 P k=0 E[ F(xk), xk x ] 0. Using this and new notation Rk = xk x , k > 0, R0 x0 x we get 0 R2 0 E[R2 K] K + 2γ2σ2 E [R2 k] E[1k] + 2γ2ρ2 k=0 E[1k] (45) implying (after changing the indices) that E[R2 k] R2 0 + 2γ2σ2k n 2B m + 2γρ E [R2 l ] E[1l] + 2γ2ρ2 k 1 X l=0 E[1l] (46) holds for all k 0. In the remaining part of the proof we derive by induction that R2 0 + 2γ2σ2k n 2B m + 2γρ E [R2 l ] E[1l] + 2γ2ρ2 k 1 X l=0 E[1k] 2R2 0 (47) for all k = 0, . . . , K. For k = 0 this inequality trivially holds. Next, assume that it holds for all k = 0, 1, . . . , T 1, T K 1. Let us show that it holds for k = T as well. From (46) and (47) we have that E[R2 k] 2R2 0 for all k = 0, 1, . . . , T 1. Therefore, E[R2 T ] R2 0 + 2γ2σ2T n 2B m + 2γρ E [R2 l ] E[1l] + 2γ2ρ2 T 1 X R2 0 + 2γ2σ2T n 2B m + 2 E[1l] + 2γ2ρ2 T 1 X If a Byzantine peer deviates from the protocol at iteration k, it will be detected with some probability pk during the next iteration. One can lower bound this probability as nk = m(1 δk) Therefore, each individual Byzantine worker can violate the protocol no more than 1/p times on average implying that E[R2 T ] R2 0 + 2γ2σ2T n 2B m + 2 m + 2γ2ρ2n B (n 2B m)R2 0 6σ2K , m2R2 0 72ρ2B2n2 we ensure that 2γ2σ2T n 2B m + 2 m + 2γ2ρ2n B m R2 0 3 + R2 0 3 + R2 0 3 = R2 0, and, as a result, we get E[R2 T ] 2R2 0 2R (48) . Therefore, (47) holds for all k = 0, 1, . . . , K. Together with (45) it implies k=0 E[ F(xk), xk x ] 2R2 0 γ . The last inequality together with Assumption 6 implies k=0 E F(xk) 2ℓR2 0 γ . Corollary 5. Let assumptions of Theorem 9 hold. Then 1 k=0 E F(xk) ε holds after nε2 + σn2ℓR iterations of SGDA-CC. 
k=0 E F(xk) 2ℓR2 6σ2K (n 2B m)R2 + (n 2B m)K + 17ρBnℓR Let us chose K such that each of the last three terms less or equal ε/3, then K = max 6ℓ2R2 ε , 216σ2ℓ2R2 (n 2B m)ε2 , 51ρBnℓR where ρ2 = qσ2 with q = 2C2 + 12 + 12 n 2B m and C = O(1) by Lemma E.1. The latter implies that 1 K k=0 E F(xk) ε. Using the definition of ρ from Lemma E.1 and if B n 4 , m << n the bound for K can be easily derived. E.1.2 Quasi-Strongly Monotone Case Theorem (Theorem 4 duplicate). Let Assumptions 1, 4 and 6 hold. Then after T iterations SGDA-CC (Algorithm 5) with γ 1 2ℓoutputs x T such that E x T +1 x 2 1 γµ T +1 x0 x 2 + 4γσ2 µ(n 2B m) + 2ρ2n B where ρ2 = qσ2 with q = 2C2 + 12 + 12 n 2B m and C = O(1) by Lemma E.1. Proof of Theorem 4. The proof is similar to the proof of Theorem 1 µE h x K x 2i = µE k=0 E h xk x 2i (QSM) 1 K k=0 E F(xk), xk x Since bgt = bgt F t + F t one has xt+1 x 2 = xt x 2 2γ bgt gt, xt x 2γ gt, xt x + γ2 bgt 2. Applying (11) for gt, xt x with λ = γµ 2 and (12) for bgt 2 = bgt gt + gt 2 we derive xt+1 x 2 1 + γµ xt x 2 2γ gt, xt x bgt gt 2 + 2γ2 bgt gt 2 + 2γ2 gt 2. Next by taking an expectation Eξ of both sides of the above inequality and rearranging terms obtain Eξ xt+1 x 2 1 + γµ xt x 2 2γ F(xt), xt x µ Eξ bgt gt 2 + 2γ2Eξ bgt gt 2 + 2γ2Eξ gt 2. The difference with the proof of Theorem 1 is that we suppose that the number of peer violating the protocol at an iteration t is known to any "good" peer. So the result of Lemma E.1 writes as follows Eξ bgt gt 2 ρ21t, where 1t is an indicator function of the event that at least 1 Byzantine peer violates the protocol at iteration t. Together with Lemma C.1 we can proceed as follows m Eξ xt+1 x 2 1 + γµ xt x 2 + 2γ2ℓ 2γ F(xt), xt x G + 21tρ2 γ Since γ 1 2ℓand Assumption (QSM) holds we derive Eξ xt+1 x 2 1 γµ xt x 2 + 2γ2σ2 G + 21tρ2 γ Next we take full expectation of both sides and obtain E xt+1 x 2 1 γµ E xt x 2 + 2γ2σ2 n 2B m + 2ρ2 γ µ + γ2 E1t. 
The latter implies E x T +1 x 2 1 γµ T +1 x0 x 2 + 4γσ2 If a Byzantine peer deviates from the protocol at iteration t, it will be detected with some probability pt during the next iteration. One can lower bound this probability as nt = m(1 δt) Therefore, each individual Byzantine worker can violate the protocol no more than 1/p times on average implying that that implies The latter together with the above bound on E x T +1 x 2 implies the result of the theorem. E x T +1 x 2 1 γµ T +1 x0 x 2 + 4γσ2 µ(n 2B m) + 2ρ2n B Corollary 6. Let assumptions of Theorem 4 hold. Then E x T x 2 ε holds after µ2(n 2B m)ε + qσ2Bn µ2mε + qσ2Bn iterations of SGDA-CC (Alg. 5) with 1 2ℓ, 2 ln max n 2, min n m(n 2B m)µ2R2K 8mσ2+4qσ2n B(n 2B m), mµ2R2K2 Proof. Using the definition of ρ (ρ2 = qσ2 = O(σ2)) from Lemma E.1 and if B n 4 , m << n the result of Theorem 4 can be simplified as E x T +1 x 2 1 γµ T +1 x0 x 2 + 4γσ2 µ(n 2B m) + 2qσ2n B Applying Lemma C.4 to the last bound we get the result of the corollary. E.1.3 Monotone Case Theorem 10. Suppose the assumptions of Theorem 9 and Assumption 5 hold. Next, assume that where ρ2 = qσ2 with q = 2C2 + 12 + 12 n 2B m and C = O(1) by Lemma E.1 and R x0 x . Then after K iterations of SGDA-CC (Algorithm 5) E h Gap BR(x ) where Gap BR(x ) x K = max u BR(x ) F(u), x K u , x K = 1 k=0 xk and R x0 x . Proof. Combining (16), (14) one can derive 2γ F(xk), xk u xk u 2 xk+1 u 2 2γ bgk gk, xk u 2γ gk F k, xk u +2γ2 bgk gk 2 + 2γ2 gk 2. Assumption 5 implies that F(u), xk u F(xk), xk u (52) and consequently by Jensen inequality 2γK F(u), x K u x0 u 2 2γ bgk gk, xk u gk F k, xk u + 2γ2 K 1 X bgk gk 2 + gk 2 . Then maximization in u gives 2γKGap BR(x ) x K max u BR(x ) x0 u 2 + 2γ2 K 1 X bgk gk 2 + gk 2 +2γ max u BR(x ) gk bgk, xk u ! +2γ max u BR(x ) F k gk, xk u ! 
Taking the full expectation from the both sides of the previous inequality gives 2γKE h Gap BR(x ) x K i max u BR(x ) max u BR(x ) gk bgk, xk u !# max u BR(x ) F k gk, xk u !# bgk gk 2 + gk 2 # Firstly obtain the bound for the terms that do not depend on u using Assumption 1, Lemma E.1 and Theorem 9 bgk gk 2 + gk 2 # |Gt \ Ct| + 2γ2ℓ k=0 E F k, xk x |Gt \ Ct| + 4ℓγR2. Since E xk u E xk x + x u E xk x + max u BR(x ) x u (48) 3R one can derive that max u BR(x ) gk bgk, xk u !# max u BR(x ) gk bgk, x u ! gk bgk, xk x # k=0 E max u BR(x ) gk bgk, x u + 2γE gk bgk, xk x # k=0 E max u BR(x ) gk bgk x u + 2γE gk bgk xk x | xk ## k=0 E R gk bgk + 2γE k=0 ρ1k xk x # Following Beznosikov et al. [2023] one can derive he bound for the next term: k=0 F k gk, xk k=0 E[F k gk | xk], xk k=0 F k gk, x0 E[F k gk], x0 = 0, max u BR(x ) k=0 F k gk, xk u k=0 F k gk, xk max u BR(x ) k=0 F k gk, u max u BR(x ) k=0 F k gk, u k=0 F k gk, x0 max u BR(x ) k=0 F k gk, u max u BR(x ) k=0 (F k gk), x0 u max u BR(x ) k=0 (F k gk) k=0 (F k gk) + max u BR(x ) x0 u 2. We notice that E[F k gk | F 0 g0, . . . , F k 1 gk 1] = 0 for all k 1, i.e., conditions of Lemma C.2 are satisfied. Therefore, applying Lemma C.2, we get max u BR(x ) k=0 F k gk, xk u k=0 E[ F k gk 2] + max u BR(x ) x0 u 2 (53) |Gt \ Ct| + max u BR(x ) x0 u 2. (54) Assembling the above results together gives 2γKE h Gap BR(x ) 2 max u BR(x ) x0 u 2 + 2γ2n Bρ2 |Gt \ Ct| + 4ℓγR2 + 6n BγRρ 2 max u BR(x ) x0 u 2 + 2γ2n Bρ2 +4ℓγR2 + 6n BγRρ Corollary 7. Let assumptions of Theorem 10 hold. Then E h Gap BR(x ) x K i ε holds after iterations of SGDA-CC. E h Gap BR(x ) 6σ2K (n 2B m)R2 + (n 2B m)K + 26ρBn R Let us chose K such that each of the last three terms less or equal ε/3, then K = max 18ℓR2 ε , 9 54σ2R2 (n 2B m)ε2 , 78ρBn R guarantees that E h Gap BR(x ) Using the definition of ρ from Lemma E.1 and if B n 4 , m << n the bound for K can be easily derived. 
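The acceptance test at the heart of SGDA-CC (lines 7–11 of Algorithm 5) is simple to state in code. Below is a minimal scalar sketch with names of our own choosing: the plain mean is accepted only if at least half of the reported gradients lie within $C\sigma$ of it; otherwise the caller triggers a recomputation (and, eventually, CheckComputations).

```python
from statistics import fmean

def aggregate_cc(reports, c_sigma):
    """Accept-or-recompute rule of SGDA-CC, scalar case: return the
    plain mean together with a flag saying whether at least half of the
    reports fall within c_sigma of that mean."""
    g_bar = fmean(reports)
    n_close = sum(1 for g in reports if abs(g - g_bar) <= c_sigma)
    return g_bar, n_close >= len(reports) / 2
```

With all-honest reports the mean is accepted; a couple of large Byzantine reports drag the mean far from the honest cluster, so no report stays within $C\sigma$ of it and the test rejects, forcing a recomputation.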
E.2 Proofs for R-SGDA-CC E.2.1 Quasi-Strongly Monotone Case Algorithm 6 R-SGDA-CC Input: x0 starting point, r number of restarts, {γt}r t=1 stepsizes for SGDA-CC (Alg. 5), {Kt}r t=1 number of iterations for SGDA-CC (Alg. 5) 1: bx0 = x0 2: for t = 1, 2, . . . , r do 3: Run SGDA-CC (Alg. 5) for Kt iterations with stepsize γt, starting point bxt 1. 4: Define bxt as bxt = 1 Kt k=0 xk,t, where x0,t, x1,t, . . . , x Kt,t are the iterates produced by SGDA-CC (Alg. 5). 5: end for Output: bxr Theorem (Theorem 5 duplicate). Let Assumptions 1, 4 and 6 hold. Then, after r = l log2 R2 restarts R-SGDA-CC (Algorithm 6) with γt = min 1 2ℓ, q 6σ22t Kt , q m2R2 72qσ22t B2n2 Kt = max 8ℓ µ , 96σ22t (n 2B m)µ2R2 , 34nσB , where R x0 x , outputs bxr such that E bxr x 2 ε. Moreover, the total number of executed iterations of SGDA-CC is t=1 Kt = O ℓ µ log µR2 0 ε + σ2 (n 2B m)µε + n Bσ With q = 2C2 + 12 + 12 n 2B m and C = O(1) by Lemma E.1. Proof of Theorem 5. x K = 1 K PK 1 k=0 xk µE h x K x 2i = µE k=0 E h xk x 2i (QSM) 1 K k=0 E F(xk), xk x Theorem 9 implies that SGDA-CC with (n 2B m)R2 0 6σ2K , m2R2 0 72ρ2B2n2 µE h x K x 2i 2 after K iterations. After the first restart we have E h bx1 x 2i 2R2 0 µγ1K1 R2 0 2 . Next, assume that we have E[ bxt x 2] R2 0 2t for some t r 1. Then, Theorem 9 implies that E h bxt+1 x 2 | bxti 2 bxt x 2 Taking the full expectation from the both sides of previous inequality we get E h bxt+1 x 2i 2E[ bxt x 2] µγt Kt 2R2 0 2tµγt Kt R2 0 2t+1 . Therefore, by mathematical induction we have that for all t = 1, . . . , r E bxt x 2 R2 0 2t . Then, after r = l log2 R2 0 ε m 1 restarts of SGDA-CC we have E bxr x 2 ε. The total number of iterations executed by SGDA-CC is ( ℓ µ, σ22t (n 2B m)µ2R2 0 , n Bρ2 t 2 mµR0 (n 2B m)µ2R2 0 + n Bρ2 r 2 mµR0 ℓ µ log µR2 0 ε + σ2 (n 2B m)µ2R2 0 µR2 0 ε + n Bρ µ log µR2 0 ε + σ2 (n 2B m)µε + n Bρ Corollary 8. Let assumptions of 5 hold. Then E bxr x 2 ε holds after t=1 Kt = O ℓ nµε + n2σ m µε iterations of SGDA-CC. Proof. 
Using the definition of ρ from Lemma E.1 and if B n 4 , m << n the bound for r P be easily derived. E.3 Proofs for SEG-CC Algorithm 7 SEG-CC Input: RAGG, γ 1: for t = 1, ... do 2: for worker i [n] in parallel 3: gt ξi gi(xt, ξi) 4: send gt ξi if i Gt, else send if Byzantine 5: bgξt(xt) = 1 Wt 1 2 gt ξi, Wt 1 2 = (Gt Bt) \ Ct 6: if n i Wt 1 2 | bgξt(xt) gt ξt i 7: ext xt γ1bgξt(xt) 8: else 9: recompute 10: end if 11: Ct+ 1 2 = Check Computations(Ct, Gt Bt) 12: for worker i [n] in parallel 13: gt ηi gi(ext, ηi) 14: send gt ηi if i G, else send if Byzantine 15: bgηt(ext) = 1 Wt P i Wt gt ηi, Wt = (Gt+ 1 2 ) \ Ct+ 1 2 16: if i Wt | bgηt(ext) gt ηi Cσ |Wt|/2 then 17: xt+1 xt γ2bgηt(ext) 18: else 19: recompute 20: end if 21: Ct+1, Gt+1 Bt+1 = Check Computations(Ct+ 1 E.3.1 Auxilary results Similarly to Section E.1 we state the following. If a Byzantine peer deviates from the protocol at iteration k, it will be detected with some probability pk during the next iteration. One can lower bound this probability as nk = m(1 δk) Therefore, each individual Byzantine worker can violate the protocol no more than 1/p times on average implying that X l=0 E[1l] + l=0 E h 1l 1 Lemma E.2. Let Assumption 3 holds. Let Algorithm 7 is run with γ1 1/2L and β = γ2/γ1 1/2. Then its iterations satisfy gηk(exk), exk u xk u 2 xk+1 u 2 2γ2 bgηk(exk) gηk(exk), xk u +2γ2 2 bgηk(exk) gηk(exk) 2 + 4γ1γ2 gηk(exk) F(exk) 2 +4γ1γ2 F(xk) gξk(xk) 2 + 4γ1γ2 gξk(xk) bgξk(xk) 2. Proof. Since xk+1 = xk γ2bgηk(exk), we have xk+1 u 2 = xk γ2bgηk(exk) u 2 = xk u 2 2γ2 bgηk(exk), xk u + γ2 2 bgηk(exk) 2. Rearranging the terms gives that gηk(exk), xk u = xk u 2 xk+1 u 2 2γ2 bgηk(exk) gηk(exk), xk u + γ2 2 bgηk(exk) 2. 
Next we use (14) gηk(exk), xk u = 2 gηk(exk), xk exk + 2 gηk(exk), exk u gηk(exk), bgξk(xk) + 2 gηk(exk), exk u (14) = γ1 gηk(exk) bgξk(xk) 2 bgξk(xk) 2 gηk(exk) 2 gηk(exk), exk u and obtain the following gηk(exk), exk u = xk u 2 xk+1 u 2 2γ2 bgηk(exk) gηk(exk), xk u +γ2 2 bgηk(exk) 2 +γ1γ2 gηk(exk) bgξk(xk) 2 bgξk(xk) 2 gηk(exk) 2 (16) xk u 2 xk+1 u 2 2γ2 bgηk(exk) gηk(exk), xk u +2γ2 2 bgηk(exk) gηk(exk) 2 + 2γ2 2 gηk(exk) 2 +γ1γ2 gηk(exk) bgξk(xk) 2 bgξk(xk) 2 gηk(exk) 2 . If β = γ2/γ1 1/2 gηk(exk), exk u xk u 2 xk+1 u 2 2γ2 bgηk(exk) gηk(exk), xk u +2γ2 2 bgηk(exk) gηk(exk) 2 +γ1γ2 gηk(exk) bgξk(xk) 2 bgξk(xk) 2 . Combining the latter with the result of the following chain gηk(exk) bgξk(xk) 2 (16) 4 gηk(exk) F(exk) 2 + 4 F(exk) F(xk) 2 +4 F(xk) gξk(xk) 2 + 4 gξk(xk) bgξk(xk) 2 (3) 4 gηk(exk) F(exk) 2 + 4L2 exk xk 2 +4 F(xk) gξk(xk) 2 + 4 gξk(xk) bgξk(xk) 2 = 4 gηk(exk) F(exk) 2 + 4 F(xk) gξk(xk) 2 +4 gξk(xk) bgξk(xk) 2 + 4γ2 1L2 bgξk(xk) 2 we obtain if γ1 1/2L gηk(exk), exk u xk u 2 xk+1 u 2 2γ2 bgηk(exk) gηk(exk), xk u +2γ2 2 bgηk(exk) gηk(exk) 2 + 4γ1γ2 gηk(exk) F(exk) 2 +4γ1γ2 F(xk) gξk(xk) 2 + 4γ1γ2 gξk(xk) bgξk(xk) 2.(58) E.3.2 Lipschitz Case Theorem 11. Let Assumptions 1, 3, 5 hold. And let where ρ2 = qσ2 with q = 2C2 + 12 + 12 n 2B m and C = O(1) by Lemma E.1. Then iterations of SEG-CC (Algorithm 7) satisfy for k 1 E[R2 k] 2R, (61) k=0 E F(exk), exk x R2 where Rk = xk x and R x0 x . Proof. Substituting u = x into the result of Lemma E.2 and taking expectation over ηk one obtains gηk(exk), exk x (16) xk x 2 Eηk h xk+1 x 2i 2γ2Eηk bgηk(exk) gηk(exk), xk x +γ1γ24Eηk h gηk(exk) F(exk) 2i + γ1γ24 F(xk) gξk(xk) 2 +γ1γ24 gξk(xk) bgξk(xk) 2 + 2γ2 2Eηk h bgηk(exk) gηk(exk) 2i To estimate the inner product in the right-hand side we apply Cauchy-Schwarz inequality: 2γ2Eηk bgηk(exk) gηk(exk), xk x 2γ2Eηk xk x bgηk(exk) gηk(exk) 2γ2ρ xk x 1k. 
Then Assumption 6 implies 2γ2 F(exk), exk x xk x 2 Eηk h xk+1 x 2i +2γ2ρ xk x 1k + 2γ2 2ρ21k + 4γ1γ2σ2 +4γ1γ2 F(xk) gξk(xk) 2 + 4γ1γ2 gξk(xk) bgξk(xk) 2. Taking expectation over ξk one obtains that 2γ2Eξk,ηk F(exk), exk x xk x 2 Eξk,ηk h xk+1 x 2i + 2γ2ρ xk x 1k + 2γ2 2ρ21k |Gk \ Ck| + 4Eξk h F(xk) gξk(xk) 2i + 4Eξk h gξk(xk) bgξk(xk) 2i xk x 2 Eξk,ηk h xk+1 x 2i + 2γ2ρ xk x 1k + 2γ2 2ρ21k |Gk \ Ck| + 4σ2 Gk+ 1 xk x 2 Eξk,ηk h xk+1 x 2i + 2γ2ρ xk x 1k + 2γ2 2ρ21k n 2B m + 4σ2 n 2B m + 4ρ21k 1 Finally taking the full expectation gives that 2γ2E F(exk), exk x E h xk x 2i E h xk+1 x 2i + 2γ2ρE xk x 1k + 2γ2 2ρ2E[1k] n 2B m + 4ρ2E h 1k 1 Summing up the results for k = 0, 1, . . . , K 1 we derive k=0 E F(exk), exk x E xk x 2 E xk+1 x 2 + 8γ1γ2σ2 k=0 E xk x 1k + 2γ2 2ρ2 k=0 E[1k] + 4γ1γ2ρ2 k=0 E h 1k 1 x0 x 2 E[ x K x 2] K + 8γ1γ2σ2 E [ xk x 2] E[1k] + 2γ2 2ρ2 k=0 E[1k] + 4γ1γ2ρ2 k=0 E h 1k 1 Assumption 5 implies that 0 F(x ), exk x F(exk), exk x . Using this and a the notation Rk = xk x , k > 0, R0 x0 x we get 0 R2 0 E[R2 K] K + 8γ1γ2σ2 E [ xk x 2] E[1k] + 2γ2 2ρ2 k=0 E[1k] + 4γ1γ2ρ2 k=0 E h 1k 1 implying (after changing the indices) that E[R2 k] R2 0 + 8γ1γ2σ2k n 2B m + 2γ2ρ E[R2 l ]E[1l] (63) +2γ2 2ρ2 k 1 X l=0 E[1l] + 4γ1γ2ρ2 k 1 X l=0 E h 1l 1 holds for all k 0. In the remaining part of the proof we derive by induction that E[R2 k] R2 0 + 8γ1γ2σ2k n 2B m + 2γ2ρ E[R2 l ]E[1l] (65) +2γ2 2ρ2 k 1 X l=0 E[1l] + 4γ1γ2ρ2 k 1 X l=0 E h 1l 1 i 2R2 0 (66) for all k = 0, . . . , K. For k = 0 this inequality trivially holds. Next, assume that it holds for all k = 0, 1, . . . , T 1, T K 1. Let us show that it holds for k = T as well. From (64) and (66) we have that E[R2 k] 2R2 0 for all k = 0, 1, . . . , T 1. 
Therefore, E[R2 T ] R2 0 + 8γ1γ2σ2k n 2B m + 2γ2ρ E[R2 l ]E[1l] + 2γ2 2ρ2 k 1 X +4γ1γ2ρ2 k 1 X l=0 E h 1l 1 R2 0 + 8γ1γ2σ2k n 2B m + 2γ2ρR0 E[1l] + 2γ2 2ρ2 k 1 X +4γ1γ2ρ2 k 1 X l=0 E h 1l 1 The latter together with the expected number of at least one peer violations (57) implies E[R2 T ] R2 0 + 8γ1γ2σ2k n 2B m + 2γ2ρR0 n B m + 2γ2 2ρ2 n B m + 4γ1γ2ρ2 n B (n 2B m)R2 0 16σ2K , m R2 0 8ρ2Bn m2R2 0 64ρ2B2n2 , 1 (n 2B m)R2 0 64σ2K , m R2 0 32ρ2Bn m2R2 0 64ρ2B2n2 , (n 2B m)R2 0 64σ2K we satisfy conditions of Lemma E.2 and ensure that 8γ1γ2σ2k n 2B m + 2γ2ρR0 n B m + 2γ2 2ρ2 n B m + 4γ1γ2ρ2 n B m R2 0 4 + R2 0 4 + R2 0 4 + R2 0 4 = R2 0, and, as a result, we get E[R2 T ] 2R2 0 2R. (67) Therefore, (66) holds for all k = 0, 1, . . . , K. Together with (62) it implies k=0 E F(exk), exk x R2 0 γ2 . E.3.3 Quasi-Strongly Monotone Case Lemma E.3. Let Assumptions 3, 4 and Corollary 2 hold. If then gηk(exk) = gηk xk γ1bgξk(xk) satisfies the following inequality γ2 1E h gηk(exk) 2 | xki 2 b Pk + 8γ2 1σ2 G + 4γ2 1ρ21k 1 where b Pk = γ1Eξk,ηk gηk(exk), xk x and ρ2 = qσ2 with q = 2C2 + 12 + 12 n 2B m and C = O(1) by Lemma E.1. Proof. Using the auxiliary iterate bxk+1 = xk γ1gηk(exk), we get bxk+1 x 2 = xk x 2 2γ1 xk x , gηk(exk) + γ2 1 gηk(exk) 2 (70) = xk x 2 2γ1 xk γbgξk(xk) x , gηk(exk) 2γ2 1 bgξk(xk), gηk(exk) + γ2 1 gηk(exk) 2. (72) Taking the expectation Eξk,ηk [ ] = E | xk conditioned on xk from the above identity, using tower property Eξk,ηk[ ] = Eξk[Eηk[ ]], and µ-quasi strong monotonicity of F(x), we derive Eξk,ηk h bxk+1 x 2i = xk x 2 2γ1Eξk,ηk xk γ1bgξk(xk) x , gηk(exk) 2γ2 1Eξk,ηk bgξk(xk), gηk(exk) + γ2 1Eξk,ηk h gηk(exk) 2i 2γ1Eξk xk γ1bgξk(xk) x , F xk γ1bgξk(xk) 2γ2 1Eξk bgξk(xk), gηk(exk) + γ2 1Eξk,ηk h gηk(exk) 2i (QSM),(14) xk x 2 γ2 1Eξk,ηk h bgξk(xk) 2i +γ2 1Eξk,ηk h bgξk(xk) gηk(exk) 2i . 
To upper bound the last term we use simple inequality (16), and apply L-Lipschitzness of F(x): Eξk,ηk h bxk+1 x 2i (16) xk x 2 γ2 1Eξk h bgξk(xk) 2i +4γ2 1Eξk h gξk(xk) bgξk(xk) 2i +4γ2 1Eξk h F(xk) F exk 2i +4γ2 1Eξk h gξk(xk) F(xk) 2i +4γ2 1Eξk,ηk h gηk exk F exk 2i (Lip),(27),(28) xk x 2 γ2 1 1 4L2γ2 1 Eξk h bgξk(xk) 2i +4γ2 1Eξk h gξk(xk) bgξk(xk) 2i G + 4γ2 1σ2 (16),Lem. D.1 xk x 2 γ2 1 1 4γ2 1L2 Eξk h bgξk(xk) 2i G + 4γ2 1ρ21k 1 (68) xk x 2 + 8γ2 1σ2 G + 4γ2 1ρ21k 1 Finally, we use the above inequality together with (70): xk x 2 2 b Pk + γ2 1E h gηk(exk) 2 | xki xk x 2 + 8γ2 1σ2 G + 4γ2 1ρ21k 1 where b Pk = γ1Eξk,ηk gηk(exk), xk x . Rearranging the terms, we obtain (69). Lemma E.4. Let Assumptions 3, 4 and Corollary 2 hold. If γ1 1 2µ + 2L, (73) then gηk(exk) = gηk xk γ1bgξk(xk) satisfies the following inequality xk x 2 + γ2 1 4 Eξk h gξk(xk) 2i 8γ2 1σ2 G 9γ2 1ρ21k 1 xk x 2 + 4γ2 1σ2 G + 4γ2 1ρ21k 1 where b Pk = γ1Eξk,ηk gηk(exk), xk x , where ρ2 = qσ2 with q = 2C2 + 12 + 12 n 2B m and C = O(1) by Lemma E.1. Proof. Since Eξk,ηk[ ] = E[ | xk] and gηk(exk) = gηk xk γ1bgξk(xk) , we have b Pk = γ1Eξk,ηk gηk(exk), xk x = γ1Eξk Eηk[gηk(exk)], xk γ1bgξk(xk) x γ2 1E gηk(exk), bgξk(xk) (14) = γ1Eξk F(xk γ1bgξk(xk)), xk γ1bgξk(xk) x γ2 1 2 Eξk,ηk h gηk(exk) 2i γ2 1 2 Eξk h bgξk(xk) 2i +γ2 1 2 Eξk,ηk h gηk(exk) bgξk(xk) 2i (QSM),(16) µγ1Eξk,ηk h xk x γ1bgξk(xk) 2i γ2 1 2 Eξk h bgξk(xk) 2i +4γ2 1 2 Eξk h gξk(xk) bgξk(xk) 2i +4γ2 1 2 Eξk h F(xk) F exk 2i +4γ2 1 2 Eξk h gξk(xk) F(xk) 2i +4γ2 1 2 Eξk,ηk h gηk exk F exk 2i (17),(Lip),Lem.D.1,Cor.2 xk x 2 γ2 1 2 (1 2γ1µ 4γ2 1L2)Eξk h bgξk(xk) 2i 2g + 4γ2 1σ2 2g + 4γ2 1ρ21k 1 xk x 2 γ2 1 2 Eξk h bgξk(xk) 2i G + 4γ2 1ρ21k 1 So one have xk x 2 γ2 1 4 Eξk h gξk(xk) 2i + 4γ2 1σ2 G + 9γ2 1ρ21k 1 xk x 2 + 4γ2 1σ2 G + 4γ2 1ρ21k 1 that concludes the proof. Combining Lemmas E.3 and E.4, we get the following result. Theorem (Theorem 6 duplicate). Let Assumptions 1, 3 and 4 hold. 
Then after $T$ iterations SEG-CC (Algorithm 7) with $\gamma_1 \le \frac{1}{2\mu + 2L}$ and $\beta = \nicefrac{\gamma_2}{\gamma_1} \le \nicefrac{1}{4}$ outputs $x^T$ such that
$$
\mathbb{E}\|x^T - x^*\|^2 \le \left(1 - \mu\beta\gamma_1\right)^{T}\|x^0 - x^*\|^2 + \frac{8\sigma^2\gamma_1}{\beta\mu(n-2B-m)} + \frac{2q\sigma^2\gamma_1^2 nB}{m} + \frac{4q\sigma^2\beta\gamma_1}{\mu}\cdot\frac{nB}{m},
$$
where $q = 2C^2 + 12 + \frac{12}{n-2B-m}$; $q = O(1)$ since $C = O(1)$.

Proof of Theorem 6. Since $x^{k+1} = x^k - \gamma_2\widehat{g}_{\eta_k}(\widetilde{x}^k)$, we have
\begin{align*}
\|x^{k+1} - x^*\|^2 &= \|x^k - \gamma_2\widehat{g}_{\eta_k}(\widetilde{x}^k) - x^*\|^2\\
&= \|x^k - x^*\|^2 - 2\gamma_2\langle\widehat{g}_{\eta_k}(\widetilde{x}^k), x^k - x^*\rangle + \gamma_2^2\|\widehat{g}_{\eta_k}(\widetilde{x}^k)\|^2\\
&\le \|x^k - x^*\|^2 - 2\gamma_2\langle g_{\eta_k}(\widetilde{x}^k), x^k - x^*\rangle + 2\gamma_2^2\|g_{\eta_k}(\widetilde{x}^k)\|^2\\
&\quad + 2\gamma_2^2\|g_{\eta_k}(\widetilde{x}^k) - \widehat{g}_{\eta_k}(\widetilde{x}^k)\|^2 + 2\gamma_2\langle g_{\eta_k}(\widetilde{x}^k) - \widehat{g}_{\eta_k}(\widetilde{x}^k), x^k - x^*\rangle\\
&\le (1+\lambda)\|x^k - x^*\|^2 - 2\gamma_2\langle g_{\eta_k}(\widetilde{x}^k), x^k - x^*\rangle + 2\gamma_2^2\|g_{\eta_k}(\widetilde{x}^k)\|^2\\
&\quad + \left(2\gamma_2^2 + \frac{\gamma_2^2}{\lambda}\right)\|g_{\eta_k}(\widetilde{x}^k) - \widehat{g}_{\eta_k}(\widetilde{x}^k)\|^2.
\end{align*}
Taking the expectation conditioned on $x^k$,
$$
\mathbb{E}_{\xi_k,\eta_k}\|x^{k+1} - x^*\|^2 \le (1+\lambda)\|x^k - x^*\|^2 - 2\beta\gamma_1\,\mathbb{E}_{\xi_k,\eta_k}\langle g_{\eta_k}(\widetilde{x}^k), x^k - x^*\rangle + 2\beta^2\gamma_1^2\,\mathbb{E}_{\xi_k,\eta_k}\|g_{\eta_k}(\widetilde{x}^k)\|^2 + \gamma_2^2\rho^2\left(2 + \frac{1}{\lambda}\right)\mathbb{1}_{k-1}.
$$
Using the definition of $\widehat{P}_k = \gamma_1\,\mathbb{E}_{\xi_k,\eta_k}\langle g_{\eta_k}(\widetilde{x}^k), x^k - x^*\rangle$, we continue our derivation:
\begin{align}
\mathbb{E}_{\xi_k,\eta_k}\left[\|x^{k+1} - x^*\|^2\right] &\le (1+\lambda)\|x^k - x^*\|^2 - 2\beta\widehat{P}_k + 2\beta^2\gamma_1^2\,\mathbb{E}_{\xi_k,\eta_k}\|g_{\eta_k}(\widetilde{x}^k)\|^2 + \gamma_2^2\rho^2\left(2 + \frac{1}{\lambda}\right)\mathbb{1}_{k-1}\tag{69}\\
&\le (1+\lambda)\|x^k - x^*\|^2 - 2\beta\widehat{P}_k + 2\beta^2\left(2\widehat{P}_k + \frac{8\gamma_1^2\sigma^2}{n-2B-m} + 4\gamma_1^2\rho^2\mathbb{1}_{k-1}\right) + \gamma_2^2\rho^2\left(2 + \frac{1}{\lambda}\right)\mathbb{1}_{k-1}\nonumber\\
&\overset{0\le\beta\le\nicefrac12}{\le} (1+\lambda)\|x^k - x^*\|^2 - 2\beta(1-2\beta)\widehat{P}_k + \frac{16\gamma_2^2\sigma^2}{n-2B-m} + 8\gamma_2^2\rho^2\mathbb{1}_{k-1} + \gamma_2^2\rho^2\left(2 + \frac{1}{\lambda}\right)\mathbb{1}_{k-1}.\nonumber
\end{align}
Next, by (74) we have $-\widehat{P}_k \le -\mu\gamma_1\|x^k - x^*\|^2 + \frac{4\gamma_1^2\sigma^2}{n-2B-m} + 4\gamma_1^2\rho^2\mathbb{1}_{k-1}$, and hence, using $0 \le \beta \le \nicefrac{1}{4}$ and $\lambda = \nicefrac{\mu\gamma_2}{4}$,
$$
\mathbb{E}_{\xi_k,\eta_k}\left[\|x^{k+1} - x^*\|^2\right] \le (1 - \mu\gamma_2)\|x^k - x^*\|^2 + \frac{4\sigma^2(\gamma_1^2 + 16\gamma_2^2)}{n-2B-m} + \rho^2(\gamma_1^2 + 8\gamma_2^2)\mathbb{1}_{k-1} + 2\rho^2\left(\gamma_2^2 + \frac{2\gamma_2}{\mu}\right)\mathbb{1}_{k-1},
$$
where we used $\gamma_2^2\rho^2(2 + \nicefrac{1}{\lambda}) = 2\rho^2(\gamma_2^2 + \nicefrac{2\gamma_2}{\mu})$. Next, taking the full expectation of both sides, we obtain
$$
\mathbb{E}\left[\|x^{k+1} - x^*\|^2\right] \le (1 - \mu\gamma_2)\,\mathbb{E}\left[\|x^k - x^*\|^2\right] + \frac{4\sigma^2(\gamma_1^2 + 16\gamma_2^2)}{n-2B-m} + \rho^2(\gamma_1^2 + 8\gamma_2^2)\,\mathbb{E}[\mathbb{1}_{k-1}] + 2\rho^2\left(\gamma_2^2 + \frac{2\gamma_2}{\mu}\right)\mathbb{E}[\mathbb{1}_{k-1}].
$$
Unrolling the recurrence, we derive the rest of the result:
$$
\mathbb{E}\|x^{K+1} - x^*\|^2 \le (1 - \mu\gamma_2)^{K+1}\|x^0 - x^*\|^2 + \frac{4\sigma^2(\gamma_1^2 + 16\gamma_2^2)}{\gamma_2\mu(n-2B-m)} + \rho^2(\gamma_1^2 + 8\gamma_2^2)\sum_{k=0}^{K}\mathbb{E}[\mathbb{1}_{k-1}] + 2\rho^2\left(\gamma_2^2 + \frac{2\gamma_2}{\mu}\right)\sum_{k=0}^{K}\mathbb{E}[\mathbb{1}_{k-1}],
$$
and, using the bound (57) on the expected number of rounds with at least one peer violation, i.e., $\sum_k\mathbb{E}[\mathbb{1}_k] \le \frac{nB}{m}$, we derive
$$
\mathbb{E}\|x^{K+1} - x^*\|^2 \le (1 - \mu\gamma_2)^{K+1}\|x^0 - x^*\|^2 + \frac{4\sigma^2(\gamma_1^2 + 16\gamma_2^2)}{\gamma_2\mu(n-2B-m)} + \rho^2(\gamma_1^2 + 10\gamma_2^2)\frac{nB}{m} + \frac{4\rho^2\gamma_2}{\mu}\cdot\frac{nB}{m},
$$
which together with $\rho^2 = q\sigma^2$, where $q = 2C^2 + 12 + \frac{12}{n-2B-m}$ and $C = O(1)$ by Lemma E.1, gives the result of the theorem. $\square$

Corollary 9. Let the assumptions of Theorem 6 hold. Then $\mathbb{E}\|x^T - x^*\|^2 \le \varepsilon$ holds after
$$
\widetilde{O}\left(\frac{L}{\mu\beta} + \frac{\sigma^2}{\beta^2\mu^2(n-2B-m)\varepsilon} + \frac{q\sigma^2 Bn}{\beta\mu^2 m\varepsilon} + \sqrt{\frac{q\sigma^2 Bn}{\beta^2\mu^2 m\varepsilon}}\right)
$$
iterations of SEG-CC with
$$
\gamma_1 = \min\left\{\frac{1}{2L + 2\mu},\ \frac{4}{\mu\beta K}\ln\max\left\{2,\ \min\left\{\frac{m(n-2B-m)\beta^2\mu^2 R^2 K}{32m\sigma^2 + 4q\sigma^2\beta^2 nB(n-2B-m)},\ \frac{m\mu^2\beta^2 R^2 K^2}{32qnB\sigma^2}\right\}\right\}\right\}.
$$

Proof. We plug $\gamma_2 = \beta\gamma_1 \le \nicefrac{\gamma_1}{4}$ into the result of Theorem 6 and obtain
$$
\mathbb{E}\left[\|x^{T} - x^*\|^2\right] \le (1 - \mu\beta\gamma_1)^{T}\|x^0 - x^*\|^2 + \frac{8\sigma^2\gamma_1}{\beta\mu(n-2B-m)} + 2\rho^2\gamma_1^2\frac{nB}{m} + \frac{4\rho^2\beta\gamma_1}{\mu}\cdot\frac{nB}{m}.
$$
Using the definition of $\rho$ ($\rho^2 = q\sigma^2 = O(\sigma^2)$) from Lemma E.1, and assuming $B \le \nicefrac{n}{4}$ and $m \ll n$, the result of Theorem 6 can be simplified as
$$
\mathbb{E}\|x^T - x^*\|^2 \le (1 - \mu\beta\gamma_1)^{T}\|x^0 - x^*\|^2 + \frac{8\sigma^2\gamma_1}{\beta\mu(n-2B-m)} + 2q\sigma^2\gamma_1^2\frac{nB}{m} + \frac{4q\sigma^2\beta\gamma_1}{\mu}\cdot\frac{nB}{m}.
$$
Applying Lemma C.4 to the last bound, we get the result of the corollary. $\square$

E.3.4 Lipschitz Monotone Case

Theorem 12. Suppose the assumptions of Theorem 11 and Assumption 5 hold. Then after $K$ iterations of SEG-CC (Algorithm 7),
$$
\mathbb{E}\left[\mathrm{Gap}_{B_R(x^*)}(\overline{x}^K)\right] \le \frac{3R^2}{2\gamma_2 K},\tag{77}
$$
where $\mathrm{Gap}_{B_R(x^*)}(\overline{x}^K) = \max_{u\in B_R(x^*)}\langle F(u), \overline{x}^K - u\rangle$, $\overline{x}^K = \frac{1}{K}\sum_{k=0}^{K-1}\widetilde{x}^k$, and $R \ge \|x^0 - x^*\|$.

Proof. We start the proof with the result of Lemma E.2:
\begin{align*}
2\gamma_2\langle g_{\eta_k}(\widetilde{x}^k), \widetilde{x}^k - u\rangle &\le \|x^k - u\|^2 - \|x^{k+1} - u\|^2 - 2\gamma_2\langle \widehat{g}_{\eta_k}(\widetilde{x}^k) - g_{\eta_k}(\widetilde{x}^k), x^k - u\rangle\\
&\quad + 2\gamma_2^2\|\widehat{g}_{\eta_k}(\widetilde{x}^k) - g_{\eta_k}(\widetilde{x}^k)\|^2 + 4\gamma_1\gamma_2\|g_{\eta_k}(\widetilde{x}^k) - F(\widetilde{x}^k)\|^2\\
&\quad + 4\gamma_1\gamma_2\|F(x^k) - g_{\xi_k}(x^k)\|^2 + 4\gamma_1\gamma_2\|g_{\xi_k}(x^k) - \widehat{g}_{\xi_k}(x^k)\|^2,
\end{align*}
which leads to the following inequality:
\begin{align*}
2\gamma_2\langle F(\widetilde{x}^k), \widetilde{x}^k - u\rangle &\le \|x^k - u\|^2 - \|x^{k+1} - u\|^2 + 2\gamma_2^2\|\widehat{g}_{\eta_k}(\widetilde{x}^k) - g_{\eta_k}(\widetilde{x}^k)\|^2\\
&\quad + 2\gamma_2\langle F(\widetilde{x}^k) - g_{\eta_k}(\widetilde{x}^k), \widetilde{x}^k - u\rangle - 2\gamma_2\langle \widehat{g}_{\eta_k}(\widetilde{x}^k) - g_{\eta_k}(\widetilde{x}^k), x^k - u\rangle\\
&\quad + 4\gamma_1\gamma_2\left(\|g_{\eta_k}(\widetilde{x}^k) - F(\widetilde{x}^k)\|^2 + \|F(x^k) - g_{\xi_k}(x^k)\|^2 + \|g_{\xi_k}(x^k) - \widehat{g}_{\xi_k}(x^k)\|^2\right).
\end{align*}
Assumption 5 implies that
$$
\langle F(u), \widetilde{x}^k - u\rangle \le \langle F(\widetilde{x}^k), \widetilde{x}^k - u\rangle\tag{78}
$$
and consequently, summing over $k = 0, \ldots, K-1$ and applying the Jensen inequality,
\begin{align*}
2\gamma_2 K\langle F(u), \overline{x}^K - u\rangle &\le \|x^0 - u\|^2 + 2\gamma_2^2\sum_{k=0}^{K-1}\|\widehat{g}_{\eta_k}(\widetilde{x}^k) - g_{\eta_k}(\widetilde{x}^k)\|^2\\
&\quad + 2\gamma_2\sum_{k=0}^{K-1}\langle F(\widetilde{x}^k) - g_{\eta_k}(\widetilde{x}^k), \widetilde{x}^k - u\rangle - 2\gamma_2\sum_{k=0}^{K-1}\langle \widehat{g}_{\eta_k}(\widetilde{x}^k) - g_{\eta_k}(\widetilde{x}^k), x^k - u\rangle\\
&\quad + 4\gamma_1\gamma_2\sum_{k=0}^{K-1}\left(\|g_{\eta_k}(\widetilde{x}^k) - F(\widetilde{x}^k)\|^2 + \|F(x^k) - g_{\xi_k}(x^k)\|^2 + \|g_{\xi_k}(x^k) - \widehat{g}_{\xi_k}(x^k)\|^2\right),
\end{align*}
where $\overline{x}^K = \frac{1}{K}\sum_{k=0}^{K-1}\widetilde{x}^k$. Then maximization in $u \in B_R(x^*)$ gives $2\gamma_2 K\,\mathrm{Gap}_{B_R(x^*)}(\overline{x}^K) \le \max_{u\in B_R(x^*)}\|x^0 - u\|^2$ plus the same terms, with the two inner-product sums maximized over $u \in B_R(x^*)$.

By Lemma E.1,
$$
2\gamma_2^2\,\mathbb{E}\sum_{k=0}^{K-1}\|\widehat{g}_{\eta_k}(\widetilde{x}^k) - g_{\eta_k}(\widetilde{x}^k)\|^2 \le 2\gamma_2^2\rho^2\sum_{k=0}^{K-1}\mathbb{E}[\mathbb{1}_k] \le 2\gamma_2^2\rho^2\frac{nB}{m},\qquad
4\gamma_1\gamma_2\,\mathbb{E}\sum_{k=0}^{K-1}\|g_{\xi_k}(x^k) - \widehat{g}_{\xi_k}(x^k)\|^2 \le 4\gamma_1\gamma_2\rho^2\frac{nB}{m},
$$
and both right-hand sides are bounded by constant fractions of $R^2$ by the stepsize conditions (59), (60). By Corollary 2,
$$
4\gamma_1\gamma_2\,\mathbb{E}\sum_{k=0}^{K-1}\left(\|g_{\eta_k}(\widetilde{x}^k) - F(\widetilde{x}^k)\|^2 + \|F(x^k) - g_{\xi_k}(x^k)\|^2\right) \le \frac{8\gamma_1\gamma_2 K\sigma^2}{n-2B-m} \overset{(59),(60)}{\le} \text{const}\cdot R^2.
$$
For the next term,
\begin{align*}
2\gamma_2\max_{u\in B_R(x^*)}\sum_{k=0}^{K-1}\langle g_{\eta_k}(\widetilde{x}^k) - \widehat{g}_{\eta_k}(\widetilde{x}^k), x^k - u\rangle
&\le 2\gamma_2\sum_{k=0}^{K-1}\|x^k - x^*\|\,\|\widehat{g}_{\eta_k}(\widetilde{x}^k) - g_{\eta_k}(\widetilde{x}^k)\|\\
&\quad + 2\gamma_2\max_{u\in B_R(x^*)}\sum_{k=0}^{K-1}\|x^* - u\|\,\|\widehat{g}_{\eta_k}(\widetilde{x}^k) - g_{\eta_k}(\widetilde{x}^k)\|,
\end{align*}
and taking the full expectation of both sides of this chain we derive
$$
2\gamma_2\,\mathbb{E}\max_{u\in B_R(x^*)}\sum_{k=0}^{K-1}\langle g_{\eta_k}(\widetilde{x}^k) - \widehat{g}_{\eta_k}(\widetilde{x}^k), x^k - u\rangle \le 6\gamma_2\rho R_0\sum_{k=0}^{K-1}\mathbb{E}[\mathbb{1}_k] \le 6\gamma_2\rho R_0\frac{nB}{m}.
$$
It remains to bound the last term,
$$
2\gamma_2\,\mathbb{E}\max_{u\in B_R(x^*)}\sum_{k=0}^{K-1}\langle F(\widetilde{x}^k) - g_{\eta_k}(\widetilde{x}^k), \widetilde{x}^k - u\rangle.\tag{82}
$$
Following Beznosikov et al. [2023], one can split it as
\begin{align*}
\max_{u\in B_R(x^*)}\sum_{k=0}^{K-1}\langle F(\widetilde{x}^k) - g_{\eta_k}(\widetilde{x}^k), \widetilde{x}^k - u\rangle
&= \sum_{k=0}^{K-1}\langle F(\widetilde{x}^k) - g_{\eta_k}(\widetilde{x}^k), \widetilde{x}^k - x^0\rangle + \max_{u\in B_R(x^*)}\sum_{k=0}^{K-1}\langle F(\widetilde{x}^k) - g_{\eta_k}(\widetilde{x}^k), x^0 - u\rangle\\
&\le \sum_{k=0}^{K-1}\langle F(\widetilde{x}^k) - g_{\eta_k}(\widetilde{x}^k), \widetilde{x}^k - x^0\rangle + \frac{\gamma_2}{2}\left\|\sum_{k=0}^{K-1}\left(F(\widetilde{x}^k) - g_{\eta_k}(\widetilde{x}^k)\right)\right\|^2 + \frac{1}{2\gamma_2}\max_{u\in B_R(x^*)}\|x^0 - u\|^2,
\end{align*}
where the first sum has zero expectation since $\mathbb{E}[F(\widetilde{x}^k) - g_{\eta_k}(\widetilde{x}^k) \mid \widetilde{x}^k] = 0$. We notice that
$$
\mathbb{E}\left[F(\widetilde{x}^k) - g_{\eta_k}(\widetilde{x}^k) \mid F(\widetilde{x}^0) - g_{\eta_0}(\widetilde{x}^0), \ldots, F(\widetilde{x}^{k-1}) - g_{\eta_{k-1}}(\widetilde{x}^{k-1})\right] = 0
$$
for all $k \ge 1$, i.e., the conditions of Lemma C.2 are satisfied. Therefore, applying Lemma C.2 together with the stepsize conditions (59), (60), we get
$$
2\gamma_2\,\mathbb{E}\max_{u\in B_R(x^*)}\sum_{k=0}^{K-1}\langle F(\widetilde{x}^k) - g_{\eta_k}(\widetilde{x}^k), \widetilde{x}^k - u\rangle \le \gamma_2^2\sum_{k=0}^{K-1}\mathbb{E}\left[\|F(\widetilde{x}^k) - g_{\eta_k}(\widetilde{x}^k)\|^2\right] + \max_{u\in B_R(x^*)}\|x^0 - u\|^2 \overset{(59),(60)}{\le} \frac{9}{8}R^2.\tag{84–87}
$$
Assembling the above results together gives
$$
2\gamma_2 K\,\mathbb{E}\left[\mathrm{Gap}_{B_R(x^*)}(\overline{x}^K)\right] \le 3R^2.\tag{88}
$$
$\square$

Corollary 10. Let the assumptions of Theorem 12 hold. Then $\mathbb{E}\left[\mathrm{Gap}_{B_R(x^*)}(\overline{x}^K)\right] \le \varepsilon$ holds after
$$
K = \max\left\{\frac{18LR^2}{\varepsilon},\ \frac{144\cdot 9\,\sigma^2R^2}{(n-2B-m)\varepsilon^2},\ \frac{36\rho BnR}{m\varepsilon}\right\}
$$
iterations of SEG-CC.

Proof. Substituting the stepsize into the bound of Theorem 12 yields
$$
\mathbb{E}\left[\mathrm{Gap}_{B_R(x^*)}(\overline{x}^K)\right] \le \frac{6LR^2}{K} + \frac{12\sigma R}{\sqrt{(n-2B-m)K}} + \frac{12\rho BnR}{mK}.
$$
Let us choose $K$ such that each of the three terms is less than or equal to $\nicefrac{\varepsilon}{3}$; then
$$
K = \max\left\{\frac{18LR^2}{\varepsilon},\ \frac{144\cdot 9\,\sigma^2R^2}{(n-2B-m)\varepsilon^2},\ \frac{36\rho BnR}{m\varepsilon}\right\},
$$
where $\rho^2 = q\sigma^2$ with $q = 2C^2 + 12 + \frac{12}{n-2B-m}$ and $C = O(1)$ by Lemma E.1. The latter implies that $\mathbb{E}\left[\mathrm{Gap}_{B_R(x^*)}(\overline{x}^K)\right] \le \varepsilon$. Using the definition of $\rho$ from Lemma E.1 and assuming $B \le \nicefrac{n}{4}$ and $m \ll n$, the simplified bound for $K$ can be easily derived. $\square$
E.4 Proofs for R-SEG-CC

E.4.1 Quasi-Strongly Monotone Case

Algorithm 8 R-SEG-CC
Input: $x^0$ — starting point, $r$ — number of restarts, $\{\gamma_t\}_{t=1}^{r}$ — stepsizes for SEG-CC (Alg. 7), $\{K_t\}_{t=1}^{r}$ — numbers of iterations for SEG-CC (Alg. 7)
1: $\widehat{x}^0 = x^0$
2: for $t = 1, 2, \ldots, r$ do
3: Run SEG-CC (Alg. 7) for $K_t$ iterations with stepsize $\gamma_t$ and starting point $\widehat{x}^{t-1}$.
4: Define $\widehat{x}^t$ as $\widehat{x}^t = \frac{1}{K_t}\sum_{k=0}^{K_t - 1} x^{k,t}$, where $x^{0,t}, x^{1,t}, \ldots, x^{K_t,t}$ are the iterates produced by SEG-CC.
5: end for
Output: $\widehat{x}^r$

Theorem (Theorem 7 duplicate). Let Assumptions 1, 3, 4 hold. Then, after $r = \left\lceil\log_2\frac{R^2}{\varepsilon}\right\rceil$ restarts, R-SEG-CC (Algorithm 8) with
$$
\gamma_{1t} = \min\left\{\frac{1}{2L},\ \sqrt{\frac{m(n-2B-m)R^2}{16\sigma^2 2^t K_t}},\ \sqrt{\frac{mR^2}{8q\sigma^2 2^t Bn}}\right\},\qquad
\gamma_{2t} = \min\left\{\frac{1}{4L},\ \sqrt{\frac{m^2R^2}{64q\sigma^2 2^t B^2n^2}},\ \sqrt{\frac{m(n-2B-m)R^2}{64\sigma^2 2^t K_t}}\right\},
$$
and
$$
K_t = \max\left\{\frac{8L}{\mu},\ \frac{256\sigma^2 2^t}{(n-2B-m)\mu^2R^2},\ \frac{nB\rho\,2^{\nicefrac{t}{2}}}{m\mu R}\right\},
$$
where $R \ge \|x^0 - x^*\|$, outputs $\widehat{x}^r$ such that $\mathbb{E}\|\widehat{x}^r - x^*\|^2 \le \varepsilon$. Moreover, the total number of executed iterations of SEG-CC is
$$
\sum_{t=1}^{r} K_t = O\left(\frac{L}{\mu}\log\frac{\mu R_0^2}{\varepsilon} + \frac{\sigma^2}{(n-2B-m)\mu\varepsilon} + \frac{nB\sigma\sqrt{q}}{m\mu\sqrt{\varepsilon}}\right).
$$

Proof of Theorem 7. Since $\overline{x}^K = \frac{1}{K}\sum_{k=0}^{K-1}\widetilde{x}^k$, by the Jensen inequality and quasi-strong monotonicity (QSM),
$$
\mu\,\mathbb{E}\left[\|\overline{x}^K - x^*\|^2\right] \le \frac{\mu}{K}\sum_{k=0}^{K-1}\mathbb{E}\left[\|\widetilde{x}^k - x^*\|^2\right] \overset{\text{(QSM)}}{\le} \frac{1}{K}\sum_{k=0}^{K-1}\mathbb{E}\left[\langle F(\widetilde{x}^k), \widetilde{x}^k - x^*\rangle\right].
$$
Theorem 11 implies that SEG-CC with the parameters above satisfies
$$
\mu\,\mathbb{E}\left[\|\overline{x}^K - x^*\|^2\right] \le \frac{R_0^2}{\gamma_2 K}
$$
after $K$ iterations. After the first restart we have
$$
\mathbb{E}\left[\|\widehat{x}^1 - x^*\|^2\right] \le \frac{R_0^2}{\mu\gamma_{21}K_1} \le \frac{R_0^2}{2},
$$
since $\mu\gamma_{21}K_1 \ge 2$. Next, assume that $\mathbb{E}[\|\widehat{x}^t - x^*\|^2] \le \frac{R_0^2}{2^t}$ for some $t \le r - 1$. Then, Theorem 11 implies that
$$
\mathbb{E}\left[\|\widehat{x}^{t+1} - x^*\|^2 \mid \widehat{x}^t\right] \le \frac{\|\widehat{x}^t - x^*\|^2}{\mu\gamma_{2t}K_t}.
$$
Taking the full expectation of both sides of the previous inequality, we get
$$
\mathbb{E}\left[\|\widehat{x}^{t+1} - x^*\|^2\right] \le \frac{\mathbb{E}\left[\|\widehat{x}^t - x^*\|^2\right]}{\mu\gamma_{2t}K_t} \le \frac{R_0^2}{2^t\mu\gamma_{2t}K_t} \le \frac{R_0^2}{2^{t+1}}.
$$
Therefore, by mathematical induction, we have for all $t = 1, \ldots, r$ that $\mathbb{E}\|\widehat{x}^t - x^*\|^2 \le \frac{R_0^2}{2^t}$. Then, after $r = \left\lceil\log_2\frac{R_0^2}{\varepsilon}\right\rceil$ restarts of SEG-CC we have $\mathbb{E}\|\widehat{x}^r - x^*\|^2 \le \varepsilon$. The total number of iterations executed by SEG-CC is
$$
\sum_{t=1}^{r}K_t = O\left(\sum_{t=1}^{r}\max\left\{\frac{L}{\mu},\ \frac{\sigma^2 2^t}{(n-2B-m)\mu^2R_0^2},\ \frac{nB\rho\,2^{\nicefrac{t}{2}}}{m\mu R_0}\right\}\right)
= O\left(\frac{L}{\mu}\log\frac{\mu R_0^2}{\varepsilon} + \frac{\sigma^2 2^r}{(n-2B-m)\mu^2R_0^2} + \frac{nB\rho\,2^{\nicefrac{r}{2}}}{m\mu R_0}\right)
$$
$$
= O\left(\frac{L}{\mu}\log\frac{\mu R_0^2}{\varepsilon} + \frac{\sigma^2}{(n-2B-m)\mu\varepsilon} + \frac{nB\rho}{m\mu\sqrt{\varepsilon}}\right),
$$
which together with $\rho^2 = q\sigma^2$, where $q = 2C^2 + 12 + \frac{12}{n-2B-m}$ and $C = O(1)$ given by Lemma E.1, implies the result of the theorem. $\square$

Corollary 11. Let the assumptions of Theorem 7 hold.
Then $\mathbb{E}\|\widehat{x}^r - x^*\|^2 \le \varepsilon$ holds after
$$
\sum_{t=1}^{r}K_t = O\left(\frac{L}{\mu}\log\frac{\mu R_0^2}{\varepsilon} + \frac{\sigma^2}{n\mu\varepsilon} + \frac{n^2\sigma}{m\mu\sqrt{\varepsilon}}\right)
$$
iterations of SEG-CC.

Proof. Using the definition of $\rho$ from Lemma E.1 and assuming $B \le \nicefrac{n}{4}$ and $m \ll n$, the bound for $\sum_{t=1}^{r}K_t$ can be easily derived from Theorem 7. $\square$

F Extra Experiments and Experimental Details

F.1 Quadratic games

Firstly, let us clarify the notation used in (7):
$$
A_i = \begin{pmatrix} A_{1,i} & A_{2,i}\\ A_{2,i} & A_{3,i} \end{pmatrix},\qquad b_i = \begin{pmatrix} b_{1,i}\\ b_{2,i} \end{pmatrix},
$$
with symmetric matrices $A_{j,i}$ such that $\mu I \preccurlyeq A_{j,i} \preccurlyeq \ell I$, $A_i \in \mathbb{R}^{d\times d}$ and $b_i \in \mathbb{R}^d$.

Data generation. For each $j$ and $i$ we sample a real random matrix $B_i$ with elements drawn from a normal distribution. Then we compute the eigendecomposition and obtain $B_i = U_i^\top D_i U_i$ with diagonal $D_i$. Next, we rescale $D_i$ and obtain $\widehat{D}_i$ with elements lying in $[\mu, \ell]$. Finally, we compose $A_{j,i}$ as $A_{j,i} = U_i^\top \widehat{D}_i U_i$. This process ensures that the eigenvalues of $A_{j,i}$ all lie between $\mu$ and $\ell$, and thus $F(x)$ is strongly monotone and cocoercive. The vectors $b_i \in \mathbb{R}^d$ are sampled from a normal distribution with variance $10/d$.

Experimental setup. For all the experiments we choose $\ell = 100$, $\mu = 0.1$, $s = 1000$ and $d = 50$. For the experiments presented in the appendix we simulate $n = 20$ nodes on a single machine and set $B = 4$. For methods with checks of computations, only one peer attacks per iteration. We used RFA with bucketing (5 buckets) as an aggregator, since it showed the best performance. For approximating the median we used Weiszfeld's method with 10 iterations and parameter $\nu = 0.1$ [Pillutla et al., 2022]. RDEG [Adibi et al., 2022] provably works only if $n \ge 100$; we manually selected its parameter $\epsilon = 0.5$ using a grid search, picking the best-performing value. We present experiments with different attacks (bit flipping (BF), random noise (RN), inner product manipulation (IPM) [Xie et al., 2019] and a little is enough (ALIE) [Baruch et al., 2019]) and different batchsizes (bs): 1, 10 and 100. If an attack has a parameter, it is specified in brackets on each plot.

No checks.
Firstly, we provide a detailed comparison between methods that do not check computations, with a fixed learning rate value $\gamma = 3.3\text{e-}5$. Code for quadratic games is available at https://github.com/nazya/sgda-ra.7

7Code is based on https://github.com/hugobb/sgda.

Figure 3: Distance to the solution under ALIE attack with various parameter values and batchsizes.
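The matrix construction described under "Data generation" above can be sketched in a few lines of NumPy. The helper name and the explicit symmetrization step are our own assumptions, added so that the eigendecomposition is guaranteed to be real:

```python
import numpy as np

def sample_game_matrix(d, mu, ell, rng):
    """Sample a matrix with spectrum rescaled into [mu, ell].

    We draw a random matrix, take its symmetric part (an assumption about
    the intended construction, so that eigenpairs are real), rescale the
    eigenvalues into [mu, ell], and reassemble the matrix.
    """
    B = rng.standard_normal((d, d))
    S = (B + B.T) / 2                      # symmetric part => real eigenpairs
    eigvals, U = np.linalg.eigh(S)         # S = U @ diag(eigvals) @ U.T
    lo, hi = eigvals.min(), eigvals.max()
    scaled = mu + (eigvals - lo) * (ell - mu) / (hi - lo)  # spectrum in [mu, ell]
    return U @ np.diag(scaled) @ U.T

rng = np.random.default_rng(0)
A = sample_game_matrix(50, 0.1, 100.0, rng)        # mu*I <= A <= ell*I
b = rng.normal(scale=np.sqrt(10 / 50), size=50)    # b_i with variance 10/d
```

By construction every eigenvalue of `A` lies in $[\mu, \ell]$, which is what makes the resulting operator strongly monotone and cocoercive.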
Figure 4: Distance to the solution under BF attack with various batchsizes.

Figure 5: Distance to the solution under IPM attack with various parameter values and batchsizes.
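For reference, the attacks compared in these plots admit short implementations. The sketch below follows the standard formulations from the cited works; the exact scalings, the sign convention for ALIE, and the sign-flip reading of "bit flipping" are our assumptions:

```python
import numpy as np

def ipm_attack(honest_grads, eps=1.0):
    """Inner-product manipulation (Xie et al. [2019]): send the negated,
    scaled mean of the honest gradients."""
    return -eps * honest_grads.mean(axis=0)

def alie_attack(honest_grads, z=1.0):
    """A little is enough (Baruch et al. [2019]): shift the honest mean by
    z per-coordinate standard deviations, staying inside the honest cloud
    (the sign of the shift is a convention; implementations differ)."""
    return honest_grads.mean(axis=0) - z * honest_grads.std(axis=0)

def bit_flip_attack(honest_grad):
    """Bit flipping in its simplest sign-flip form (an assumption about
    the exact variant used in the plots)."""
    return -honest_grad

def random_noise_attack(shape, sigma=100.0, rng=None):
    """Random noise: a zero-mean Gaussian vector with standard deviation sigma."""
    rng = np.random.default_rng() if rng is None else rng
    return sigma * rng.standard_normal(shape)
```

The parameter shown in brackets on each plot corresponds to `eps`, `z`, or `sigma` above.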
Figure 6: Distance to the solution under RN attack with various parameter values and batchsizes.

Checks of Computations. Next, using the same setup, we compare M-SGDA-RA, which showed the best performance in the above experiments, with methods that check computations. With checks of computations, the best strategy for the attackers is for only one peer to attack at each iteration, since this maximizes the expected number of rounds with actively malicious peers present. The comparison in this paragraph is therefore performed in this setup.
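M-SGDA-RA and the other *-RA methods rely on the RFA aggregator with bucketing described in the experimental setup. A minimal sketch, with the smoothed Weiszfeld iteration of Pillutla et al. [2022]; the function names and the bucket-splitting details are our assumptions:

```python
import numpy as np

def geometric_median(points, nu=0.1, iters=10):
    """Approximate the geometric median via smoothed Weiszfeld iterations;
    `nu` caps the inverse distances so the weights stay bounded."""
    z = points.mean(axis=0)
    for _ in range(iters):
        w = 1.0 / np.maximum(np.linalg.norm(points - z, axis=1), nu)
        z = (w[:, None] * points).sum(axis=0) / w.sum()
    return z

def rfa_bucketing(grads, n_buckets=5, seed=0, nu=0.1, iters=10):
    """RFA with bucketing: randomly partition the workers' vectors into
    buckets, average inside each bucket, then aggregate the bucket means
    with the (approximate) geometric median."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(grads))
    buckets = np.array_split(np.asarray(grads)[idx], n_buckets)
    means = np.stack([b.mean(axis=0) for b in buckets])
    return geometric_median(means, nu=nu, iters=iters)
```

With a minority of Byzantine workers, most buckets remain clean, so the geometric median of the bucket means stays close to the honest gradients even under large-magnitude attacks.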
Figure 7: Distance to the solution under ALIE attack with various parameter values and batchsizes.

Figure 8: Distance to the solution under BF attack with various batchsizes.
Figure 9: Distance to the solution under IPM attack with various parameter values and batchsizes.
Figure 10: Distance to the solution under RN attack with various parameter values and batchsizes.

Learning rates. We conducted extra experiments to show the dependence on different learning rate values: 1e-5, 2e-5 and 5e-6.
Figure 11: Distance to the solution under various attacks, batchsizes (bs) and learning rates (lr).

F.2 Generative Adversarial Networks

One of the most well-known frameworks whose training objective is a variational inequality is generative adversarial networks (GANs) [Goodfellow et al., 2014]. In the simplest case of this setting, we have a generator $G: \mathbb{R}^z \to \mathbb{R}^d$ and a discriminator $D: \mathbb{R}^d \to \mathbb{R}$, where $z$ denotes the dimension of the latent space. The objective function can be written as
$$
\min_G \max_D\ \mathbb{E}_{x}\left[\log D(x)\right] + \mathbb{E}_{z}\left[\log\left(1 - D(G(z))\right)\right].\tag{91}
$$
Here, it is understood that $D$ and $G$ are modeled as neural networks and can be optimized in the distributed setting with gradient descent-ascent algorithms. However, due to the complexity of the GAN framework, tricks and adjustments are employed to ensure good results, such as the Wasserstein GAN formulation [Gulrajani et al., 2017] with a Lipschitz constraint on $D$, and the spectral normalization trick [Miyato et al., 2018] to ensure the Lipschitzness of $D$. The discriminator can thus benefit in practice from multiple gradient ascent steps per gradient descent step on the generator. In addition, Adam [Kingma and Ba, 2014] is often used for GANs, as they can be very slow to converge and may not perform as well with vanilla SGD. Therefore, in our implementation of GANs in the distributed setting, we employ all of these techniques and show improvements when we add checks of gradient computations to the server.
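As a concrete reading of objective (91), the sketch below estimates it by Monte Carlo for toy one-dimensional linear models; the logistic discriminator and linear generator are illustrative assumptions, not the architectures used in our experiments:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def gan_objective(theta_d, theta_g, x_real, z):
    """Monte-Carlo estimate of objective (91) for toy scalar models
    D(x) = sigmoid(theta_d * x) and G(z) = theta_g * z."""
    d_real = sigmoid(theta_d * x_real)       # discriminator on real samples
    d_fake = sigmoid(theta_d * theta_g * z)  # discriminator on generated samples
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

rng = np.random.default_rng(0)
x_real = rng.normal(size=1000)
z = rng.normal(size=1000)
value = gan_objective(0.0, 1.0, x_real, z)   # D == 1/2 everywhere => -2*log(2)
```

Gradient descent-ascent then alternately decreases this quantity in the generator parameters and increases it in the discriminator parameters, which is exactly the min-max structure that makes GAN training a variational-inequality problem.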
As for the gradients in our implementation, we can think of the accumulated Adam steps within the clients as generalized gradients and aggregate them at the server with checks of computations (by rewinding the model and optimizer state). We tried aggregation after each descent or ascent step, after a full descent-ascent step, and after multiple descent-ascent steps. In the first case, we found that GANs converge much more slowly. In the third case, the performance is better, but checks of computations take more time. Thus, we choose to report the performance for the second case: aggregation after a full descent-ascent step. We note that experiments for the other cases suggest similar improvements. The dataset we chose for this experiment is CIFAR-10 [Krizhevsky et al., 2009], because it is more realistic than MNIST yet still tractable to simulate in the distributed setting. We let $n = 10$, $B = 2$, and choose a learning rate of 0.001, $\beta_1 = 0.5$, and $\beta_2 = 0.9$ with a batch size of 64. We run the algorithms for 4600 epochs. We could not average across runs, as the simulation is very compute-intensive and the benefits are clear. We compare SGDA-RA (RFA with bucket size 2) and SGDA-CC under the following Byzantine attacks: i) no attack (NA), ii) label flipping (LF), iii) inner product manipulation (IPM) [Xie et al., 2019], and iv) a little is enough (ALIE) [Baruch et al., 2019]. The architecture of the GAN follows Miyato et al. [2018]. We show the results in Figure 12. The improvements are most significant for the ALIE attack. Even when no attacks are present, checks of computations only slightly affect convergence speed. This experiment should further justify our proposed algorithm and its real-world benefits, even for a setting as complex as distributed GANs. Code for GANs is available at https://github.com/zeligism/vi-robust-agg.

Figure 12: Comparison of FID to CIFAR-10 per epoch between SGDA-CC and SGDA-RA. The FID is calculated on 50,000 samples (lower = better).