# Byzantine Stochastic Gradient Descent

Dan Alistarh (IST Austria, dan.alistarh@ist.ac.at) · Zeyuan Allen-Zhu (Microsoft Research AI, zeyuan@csail.mit.edu) · Jerry Li (Simons Institute, jerryzli@berkeley.edu)

(Authors are in alphabetical order. The full version of this paper can be found at https://arxiv.org/abs/1803.08917.)

This paper studies the problem of distributed stochastic optimization in an adversarial setting where, out of m machines which allegedly compute stochastic gradients every iteration, an α-fraction are Byzantine, and may behave adversarially. Our main result is a variant of stochastic gradient descent (SGD) which finds ε-approximate minimizers of convex functions in T = Õ(1/(ε²m) + α²/ε²) iterations. In contrast, traditional mini-batch SGD needs T = O(1/(ε²m)) iterations, but cannot tolerate Byzantine failures. Further, we provide a lower bound showing that, up to logarithmic factors, our algorithm is information-theoretically optimal both in terms of sample complexity and time complexity.

1 Introduction

Machine learning applications are becoming increasingly decentralized, either because data is naturally distributed, as in applications such as federated learning [17], or because data is partitioned across machines to parallelize computation, e.g. [2]. Fault-tolerance is a critical concern in such distributed settings. Machines in a data center may crash or fail in unpredictable ways; even worse, in some settings one must be able to tolerate a fraction of adversarial or faulty workers sending corrupted or even malicious data. This Byzantine failure model, where a small fraction of bad workers are allowed to behave arbitrarily, has a rich history in distributed computing [19]. By contrast, the design of machine learning algorithms which are robust to such Byzantine failures is a relatively recent topic, but is rapidly becoming a major research direction at the intersection of machine learning, distributed computing, and security.

We measure algorithms in this setting against two fundamental criteria: sample complexity, which requires high accuracy from few data samples, and computational complexity, i.e. preserving the runtime speedups achieved by distributing computation. These criteria should hold even under adversarial conditions. Another important consideration in the design of these algorithms is that they should remain useful in high dimensions.

System Model. We study stochastic optimization in the Byzantine setting. We assume an unknown distribution D over functions ℝ^d → ℝ, and wish to minimize f(x) := E_{s∼D}[f_s(x)]. We consider a standard setting with m workers and a master (coordinator), where an α-fraction of the workers may be Byzantine, with α < 1/2. Each worker has access to T sample functions from the distribution D. We proceed in iterations, structured as follows: workers first perform some local computation, then synchronously send information to the master, which compiles the information and sends new information to the workers. At the end, the master should output an approximate minimizer of the function f.

While our negative results will apply for this general setting, our algorithms will be expressed in the standard framework of distributed stochastic gradient methods: in each iteration k, the master broadcasts the current iterate x_k ∈ ℝ^d to the worker machines, and each worker is supposed to compute a stochastic gradient at x_k and return it to the master. A good worker returns ∇f_s(x_k) for a random sample s ∼ D, but a Byzantine worker machine may adversarially return any vector.
This stochastic optimization framework is general and very well studied, and captures many important problems such as regression, learning SVMs, logistic regression, and training deep neural networks. Traditional methods such as mini-batch stochastic gradient descent (SGD) are vulnerable to even a single Byzantine failure. Our results are presented in the master-worker distribution model, but can be generalized to a coordinator-free distributed setting using standard techniques [12], assuming authenticated point-to-point channels.

In this setting, sample complexity is measured as the number of functions f_s(·) accessed. Since every machine gets one sample per iteration, minimizing sample complexity is equivalent to minimizing the number of iterations. Time complexity is likewise determined by the number of iterations.

Our Results. In this work, we study the convex formulation of this Byzantine stochastic optimization problem: we assume f(x) is convex, although each of the functions f_s(x) may not necessarily be convex. We provide the first algorithms that, in the presence of Byzantine machines, guarantee the following, up to logarithmic and lower-order terms: (1) achieve optimal sample complexity, (2) achieve an optimal number of stochastic gradient computations, (3) match the sample and time complexities of traditional SGD as α → 0, and (4) achieve (1)-(3) even as the dimension grows, without losing additional dimension factors. In addition, our algorithms are optimally robust, supporting a fraction of α < 1/2 Byzantine workers.

Despite significant recent interest, e.g. [6, 8, 13, 26, 27, 30, 31], to the best of our knowledge, prior to our work there were no algorithms for stochastic optimization in high dimensions that achieved any of the four objectives highlighted above. Previous algorithms either provided weak robustness guarantees, or had sample or time complexities which degrade polynomially with the dimension d or with the error ε.

Technical Contribution. A direct way to deal with Byzantine workers is to perform a robust aggregation step to compute gradients, such as median of means: for each (good) worker machine i ∈ [m], whenever a query point x_k is provided by the master, the worker takes n stochastic gradient samples and computes their average, which we call v_i. If n = Θ̃(ε⁻²), one can show that for each good machine i, it holds that ‖v_i − ∇f(x_k)‖ ≤ ε with high probability. Therefore, in each iteration k, we can determine a vector v_med ∈ {v_1, ..., v_m} satisfying ‖v_med − ∇f(x_k)‖ ≤ 2ε, and move in the negative direction of v_med.

However, the above idea requires too many computations of stochastic gradients. In the non-strongly convex setting, each worker machine needs to compute ε⁻² stochastic gradients per iteration, and the overall number of iterations will be at least ε⁻¹. This is because, even when f_s(x) = f(x) and α = 0, gradient descent converges in ε⁻¹ iterations. This amounts to a sample complexity with a linear dependence on ε⁻³.

We take a different approach. We run the algorithm for T iterations, where each machine i ∈ [m] only computes one stochastic gradient per iteration. Let v_i^{(k)} be the stochastic gradient allegedly computed by machine i ∈ [m] at iteration k ∈ [T]. By martingale concentration, B_i := (v_i^{(1)} + ⋯ + v_i^{(T)})/T should concentrate around B* := (∇f(x_1) + ⋯ + ∇f(x_T))/T for each good machine i, up to an additive error proportional to 1/√T. Hence, if ‖B_i − B*‖ ≫ 1/√T for machine i, we can safely declare that i is Byzantine.
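To make this detection idea concrete, here is a minimal NumPy sketch (ours, not from the paper: the function name `flag_suspects`, the majority-pivot search, and the constant in the threshold are illustrative assumptions). It flags machines whose running gradient average strays beyond the concentration radius from a majority-supported pivot:

```python
import numpy as np

def flag_suspects(running_sums, k, threshold):
    """Keep machines whose running average agrees with the majority.

    running_sums: (m, d) array; row i holds v_i^(1) + ... + v_i^(k).
    k:            number of iterations aggregated so far.
    threshold:    concentration radius; martingale concentration suggests
                  something on the order of V * sqrt(log(m*T) / k) for good
                  machines (the exact constant is a tuning choice here).
    Returns a boolean mask over machines; good machines pass w.h.p.
    """
    m = running_sums.shape[0]
    avgs = running_sums / k
    pivot = None
    for i in range(m):
        dists = np.linalg.norm(avgs - avgs[i], axis=1)
        if np.sum(dists <= threshold) > m / 2:  # majority certifies row i
            pivot = avgs[i]
            break
    if pivot is None:  # cannot happen while a majority of machines are good
        raise RuntimeError("no majority-supported pivot found")
    # Good machines are threshold-close to B*, hence 2*threshold-close to a
    # good pivot; anything farther can be safely declared Byzantine.
    return np.linalg.norm(avgs - pivot, axis=1) <= 2 * threshold
```

Algorithm 1 below refines this idea with two such sequences (A_i and B_i) plus a per-iteration gradient check, so that any Byzantine machine that survives the checks provably has negligible influence.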
Two non-trivial technical obstacles remain. First, we cannot restart the algorithm every time we discover a new Byzantine machine, since that would ruin its time complexity. Second, Byzantine machines may successfully disguise themselves by not violating the above criterion. To address the first issue, we keep track of the running average B_i^{(k)} := (v_i^{(1)} + ⋯ + v_i^{(k)})/k at each step k; if a machine strays away too much from B*^{(k)} := (∇f(x_1) + ⋯ + ∇f(x_k))/k, it is labeled as Byzantine and removed from future consideration. We prove that restarts are not necessary. For the second problem, we construct a similar safety criterion in terms of the sequence A_i^{(k)} := (⟨v_i^{(1)}, x_1 − x_0⟩ + ⋯ + ⟨v_i^{(k)}, x_k − x_0⟩)/k. We prove that good machines will satisfy both criteria; more importantly, any Byzantine machine which satisfies both of them must have negligible negative influence on the algorithm's convergence.

Convex case (goal: f(x) − f(x*) ≤ ε):

| algorithm | # sampled functions per machine | total work | per-iteration per-machine work |
|---|---|---|---|
| SGD (α = 0 only) (folklore) | O(1/ε + 1/(ε²m)) | O(m/ε + 1/ε²) | O(1) |
| Byzantine SGD (this paper, Theorem 3.2) | Õ(1/ε + 1/(ε²m) + α²/ε²) | Õ(m/ε + 1/ε² + α²m/ε²) | O(1) |
| GD (α = 0 only) (folklore) | Õ(d/(ε²m)) | Õ(d/ε³) | Õ(d/(ε²m)) |
| Median-GD (Yin et al. [31]) | Õ(1/ε + d/(ε²m) + α²/ε²) | Õ(m/ε + d/ε³ + α²m/ε²) | Õ(d/(ε²m) + α²/ε²) |
| lower bound (folklore, c.f. [29, Theorem 11]) | Ω(1/ε + 1/(ε²m)) | | |
| lower bound (this paper, Theorem 4.3) | Ω(1/(ε²m) + α²/ε²) | | |

σ-strongly convex case:

| algorithm | # sampled functions per machine | total work | per-iteration per-machine work |
|---|---|---|---|
| SGD (α = 0 only) (folklore) | O(1/σ + 1/(σεm)) | O(m/σ + 1/(σε)) | O(1) |
| Byzantine SGD (this paper, Theorem 3.4) | Õ(1/σ + 1/(σεm) + α²/(σε)) | Õ(m/σ + 1/(σε) + α²m/(σε)) | O(1) |
| GD (α = 0 only) (folklore) | Õ(d/(σεm)) | Õ(d/(σ²ε)) | Õ(d/(σεm)) |
| Median-GD (Yin et al. [31]) | Õ(1/σ + d/(σεm) + α²/(σε)) | Õ(m/σ + d/(σ²ε) + α²m/(σε)) | Õ(d/(σεm) + α²/(σε)) |
| lower bound (folklore, c.f. [29, Appendix C.5]) | Ω(1/σ + 1/(σεm)) | | |
| lower bound (this paper, Theorem 4.4) | Ω(1/(σεm) + α²/(σε)) | | |

Table 1: Comparison of Byzantine optimization for smooth convex minimization of f(x) = E_{s∼D}[f_s(x)]. Remark 1: in this table, we have hidden the parameters L (smoothness), V (variance), and D (diameter); the goal is to achieve f(x) − f(x*) ≤ ε, and σ is the strong convexity parameter of f(x). Remark 2: "# sampled functions" is the number of f_s(·) to sample for each machine. Remark 3: total and per-iteration work are measured in the number of stochastic gradient computations ∇f_s(·).

Related Work. The closest work to ours is the concurrent and independent work of Yin et al. [31]. They consider a similar Byzantine model, but for gradient descent (GD). In their algorithm, each of the m machines receives n sample functions upfront. In iteration k, machine i allegedly computes n stochastic gradients at the point x_k and averages them (the n stochastic gradients are taken with respect to the n sampled functions stored on machine i). Their proposed algorithm then aggregates the average vectors from the m machines and performs a coordinate-wise median operation to determine the descent direction. In contrast, our algorithm is a Byzantine variant of SGD: a total of Tm functions are sampled, and a total of Tm stochastic gradient computations are performed.

To be robust against Byzantine machines, they average stochastic gradients within a single iteration and compare them across machines; in contrast, we average stochastic gradients (and other quantities) across iterations. Further, in terms of sample complexity (i.e., the number of functions f_s(·) to be sampled), their algorithm's complexity is higher by a linear factor in the dimension d (see Table 1). This is in large part due to their coordinate-wise median operation, which in high dimensions leads to sub-optimal statistical rates. In terms of total computational complexity, each iteration of Yin et al. [31] requires a full pass over the (sampled) dataset.
In contrast, an entire run of Byzantine SGD requires only one pass over the data. Finally, their algorithm works under a weaker set of assumptions than ours: they assume that the stochastic error in the gradients (namely, ∇f_s(x) − ∇f(x)) has bounded variance and skewness, whereas we assume that ∇f_s(x) − ∇f(x) is bounded with probability 1. Our stronger assumption (which is standard) turns out to simplify our algorithm and analysis. We leave it as future work to extend Byzantine SGD to the bounded-skewness setting.

Yin et al. [31] also provided a lower bound on the sample complexity (the number of functions f_s(·) that need to be sampled) in the presence of Byzantine machines. When translated to our language, their result is essentially the same as the strongly convex part of Theorem 4.4. The results in this paper are the first to also cover the case of non-strongly convex functions.

Byzantine Stochastic Optimization. There has been a lot of recent work on Byzantine stochastic optimization, and in particular on SGD [6, 8, 13, 26, 27, 30]. One of the first references to consider this setting is Feng et al. [13], which investigated distributed PCA and regression in the Byzantine distributed model. Their general framework has each machine run a robust learning algorithm locally and aggregates the results via a robust estimator. However, the algorithm requires careful parametrization of the sample size at each machine to obtain good error bounds, which renders it suboptimal with respect to sample complexity. Our work introduces new techniques which address these limitations. Su and Vaidya [26, 27] consider a similar setting: Su and Vaidya [26] focus on the single-dimensional (d = 1) case, whereas Su and Vaidya [27] consider the multidimensional setting, but only for a restricted family of consensus-based algorithms.

Blanchard et al. [6] propose a general Byzantine-resilient gradient aggregation rule called Krum for selecting a valid gradient update. This rule has local complexity O(m²(d + log m)), which makes it relatively expensive to compute when d and/or m are large. Moreover, in each iteration the algorithm chooses a gradient corresponding to a constant number of correct workers, so the scheme does not achieve a speedup with respect to the number of distributed workers, which negates an important benefit of distributed training. Xie et al. [30] consider gradient aggregation rules in a generalized Byzantine setting where a subset of the messages sent between servers can be corrupted. The complexity of their selection rule can be as low as Õ(dm), but their approach is far from sample-optimal. Chen et al. [8] leverage the geometric median-of-means idea in a novel way, which allows their algorithm to be significantly more sample-efficient and applicable for a wider range of parameters. At the same time, their technique only applies in the strongly convex setting, and is suboptimal in terms of convergence rate by a factor of √m.

Adversarial Noise. Optimization and learning in the presence of adversarial noise is a well-studied problem [4, 5, 15, 20, 22, 28]. Recently, efficient algorithms for high-dimensional optimization which are tolerant to a small fraction of adversarial corruptions have been developed [1, 7, 11, 16, 24], building on new algorithms for high-dimensional robust statistics [5, 7, 9, 18]. This setting is different from ours.
For instance, in their setting, there are statistical barriers so that no algorithm can achieve an optimization error below some fixed threshold, no matter how many samples are taken. In contrast, in the Byzantine setting considered here, the adversarial corruptions can only occur on a fraction of the machines (as opposed to each machine having some adversarial corruptions). For this reason, our results do not extend to their scenario.

2 Preliminaries

Throughout this paper, we denote by ‖·‖ the Euclidean norm and let [n] := {1, 2, ..., n}. We reiterate some definitions regarding strong convexity, smoothness, and Lipschitz continuity (for other equivalent definitions, see Nesterov [21]).

Definition 2.1. For a differentiable function f: ℝ^d → ℝ,
- f is σ-strongly convex if for all x, y ∈ ℝ^d it satisfies $f(y) \ge f(x) + \langle \nabla f(x), y - x\rangle + \frac{\sigma}{2}\|y - x\|^2$;
- f is L-Lipschitz smooth (or L-smooth for short) if for all x, y ∈ ℝ^d, $\|\nabla f(x) - \nabla f(y)\| \le L\|x - y\|$;
- f is G-Lipschitz continuous if for all x ∈ ℝ^d, $\|\nabla f(x)\| \le G$.

Byzantine Convex Stochastic Optimization. We let m be the number of worker machines and assume at most an α-fraction of them are Byzantine, for α ∈ [0, 1/2). We denote by good ⊆ [m] the set of good (i.e. non-Byzantine) machines. Obviously, the algorithm does not know good. We let D be a distribution over (not necessarily convex) functions f_s: ℝ^d → ℝ. Our goal is to approximately minimize the objective f(x) := E_{s∼D}[f_s(x)], where we assume f is convex.

In each iteration k = 1, 2, ..., T, the algorithm is allowed to specify a point x_k and query the m machines. Each machine i ∈ [m] gives back a vector ∇_{k,i} ∈ ℝ^d satisfying:

Assumption 2.2. For each iteration k ∈ [T] and every i ∈ good, we have ∇_{k,i} = ∇f_s(x_k) for a random sample s ∼ D, and ‖∇_{k,i} − ∇f(x_k)‖ ≤ V.

Remark 2.3. For each k ∈ [T] and i ∉ good, the vector ∇_{k,i} can be adversarially chosen and may depend on {∇_{k′,i′}}_{k′ ≤ k, i′ ∈ [m]}. In particular, the Byzantine machines can even collude within an iteration.

The next fact is completely classical (for projected mirror descent).

Fact 2.4. If $x_{k+1} = \arg\min_{y:\,\|y - x_1\| \le D}\big\{\tfrac{1}{2}\|y - x_k\|^2 + \langle \delta, y - x_k\rangle\big\}$, then for all u with ‖u − x_1‖ ≤ D:
$$\langle \delta,\, x_k - u\rangle \;\le\; \langle \delta,\, x_k - x_{k+1}\rangle - \frac{\|x_k - x_{k+1}\|^2}{2} + \frac{\|x_k - u\|^2}{2} - \frac{\|x_{k+1} - u\|^2}{2}.$$

3 Description and Analysis of Byzantine SGD

Without loss of generality, in this section we assume that we are given a starting point x_1 ∈ ℝ^d and want to solve the following more general problem:²
$$\min_{\|x - x_1\| \le D}\; f(x) := \mathbb{E}_{s\sim\mathcal{D}}[f_s(x)]. \tag{3.1}$$
We denote by x* an arbitrary minimizer of Problem (3.1).

²This is so because, even in the unconstrained setting, classical SGD requires knowing an upper bound D on ‖x_1 − x*‖ in order to choose the learning rate; we can thus add the constraint to the objective.

Our algorithm Byzantine SGD is formally stated in Algorithm 1. In each iteration at point x_k, Byzantine SGD tries to identify a set good_k of candidate good machines, and then performs the stochastic gradient update only with respect to good_k ⊆ [m], using the direction $\nabla_k := \frac{1}{m}\sum_{i\in good_k} \nabla_{k,i}$.

The set good_k is maintained by constructing two estimation sequences. Namely, for each machine i ∈ [m], we maintain a real value $A_i = \sum_{t=1}^{k}\langle \nabla_{t,i},\, x_t - x_1\rangle$ and a vector $B_i = \sum_{t=1}^{k} \nabla_{t,i}$. Then, we denote by A_med the median of {A_1, ..., A_m}, and by B_med some vector median of {B_1, ..., B_m}. We also define ∇_med to be some vector median of {∇_{k,1}, ..., ∇_{k,m}}.

For instance, for {∇_{k,1}, ..., ∇_{k,m}}, our vector median is defined as follows. We select ∇_med to be any ∇_{k,i} as long as |{j ∈ [m]: ‖∇_{k,j} − ∇_{k,i}‖ ≤ 2V}| > m/2. Such an index i ∈ [m] can be efficiently computed because our later lemmas shall ensure that at least (1 − α)m indices in [m] are valid choices for i. Therefore, one can for instance guess a random index i and verify whether it is valid. In expectation at most 2 guesses are needed, so finding these quantities can be done in linear time.
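The randomized search just described is easy to implement; the following is a minimal NumPy sketch (ours; the name `vector_median` and the retry loop are illustrative, under the assumption that a strict majority of indices are valid):

```python
import numpy as np

def vector_median(vectors, radius, rng=np.random.default_rng()):
    """Return some row that is radius-close to a strict majority of rows.

    vectors: (m, d) array of candidates (e.g., the B_i, or the grad_{k,i}).
    radius:  closeness threshold (e.g., T_B for the B_i, 2V for gradients).
    Each trial costs O(m*d) time; since at least (1 - alpha)*m > m/2 indices
    are valid choices, about 2 random guesses suffice in expectation.
    """
    m = vectors.shape[0]
    while True:
        i = rng.integers(m)  # guess a random index ...
        dists = np.linalg.norm(vectors - vectors[i], axis=1)
        if np.sum(dists <= radius) > m / 2:  # ... and verify it is valid
            return vectors[i]
```

Algorithm 1 uses this subroutine twice per iteration: once with radius T_B on the running sums B_i (line 8), and once with radius 2V on the current gradients ∇_{k,i} (line 9).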
Starting from good_0 = [m], we define good_k to be all the machines i from good_{k−1} whose A_i is T_A-close to A_med, whose B_i is T_B-close to B_med, and whose ∇_{k,i} is 4V-close to ∇_med. We will prove that if the thresholds T_A and T_B are chosen appropriately, then good_k always contains all machines in good.

Algorithm 1 ByzantineSGD(η, x_1, D, T, T_A, T_B)

Input: learning rate η > 0, starting point x_1 ∈ ℝ^d, diameter D > 0, number of iterations T, thresholds T_A, T_B > 0; theory suggests T_A = 4DV√(T log(16mT/δ)) and T_B = 4V√(T log(16mT/δ)), where δ is the confidence parameter.

1: good_0 ← [m];
2: for k ← 1 to T do
3:   for i ← 1 to m do
4:     receive ∇_{k,i} ∈ ℝ^d from machine i ∈ [m];  ▷ we have E[∇_{k,i}] = ∇f(x_k) if i ∈ good
5:     A_i ← Σ_{t=1}^{k} ⟨∇_{t,i}, x_t − x_1⟩ and B_i ← Σ_{t=1}^{k} ∇_{t,i};
6:   end for
7:   A_med ← median{A_1, ..., A_m};
8:   B_med ← B_i, where i ∈ [m] is any machine such that |{j ∈ [m]: ‖B_j − B_i‖ ≤ T_B}| > m/2;  ▷ all machines i ∈ good are valid choices, see Claim A.3b
9:   ∇_med ← ∇_{k,i}, where i ∈ [m] is any machine such that |{j ∈ [m]: ‖∇_{k,j} − ∇_{k,i}‖ ≤ 2V}| > m/2;  ▷ all machines i ∈ good are valid choices due to Assumption 2.2
10:  good_k ← {i ∈ good_{k−1} : |A_i − A_med| ≤ T_A and ‖B_i − B_med‖ ≤ T_B and ‖∇_{k,i} − ∇_med‖ ≤ 4V};  ▷ with high probability good_k ⊇ good
11:  x_{k+1} ← arg min_{y: ‖y − x_1‖ ≤ D} { ½‖y − x_k‖² + η⟨(1/m)·Σ_{i∈good_k} ∇_{k,i}, y − x_k⟩ };
12: end for

Bounding the Error. As we shall see, the error incurred by Byzantine SGD contains two parts:
$$\mathrm{Error}_1 := \frac{1}{m}\sum_{k=1}^{T}\sum_{i\in good_k}\langle \nabla_{k,i} - \nabla f(x_k),\, x_k - x^*\rangle \quad\text{and}\quad \mathrm{Error}_2 := \max_{k\in[T]}\Big\|\frac{1}{m}\sum_{i\in good_k} \nabla_{k,i} - \nabla f(x_k)\Big\|.$$
Error₁ is due to the bias created by the stochastic gradients (of good machines) and the adversarial noise (of Byzantine machines), while Error₂ is the variance of using ∇_k to approximate ∇f(x_k). As we shall see, Error₂ is almost always well bounded. However, the adversarial noise incurred in Error₁ can sometimes destroy the convergence of SGD. We therefore use {A_i}_i and {B_i}_i to perform a reasonable estimation of Error₁, and remove the bad machines if they misbehave. Note that even at the end of the algorithm, good_T may still contain some Byzantine machines; however, their adversarial noise must be negligible and shall not impact the performance of the algorithm.

We have the following argument to establish bounds on the two error terms:

Lemma 3.1. With probability at least 1 − δ, we simultaneously have
$$|\mathrm{Error}_1| \;\le\; 4DV\sqrt{TC/m} + 16\alpha DV\sqrt{TC} \qquad\text{and}\qquad \mathrm{Error}_2^2 \;\le\; 32\alpha^2 V^2 + \frac{4V^2 C}{m},$$
where C := log(16mT/δ).

The proof of this lemma is in two parts: first, we define a set of deterministic conditions and show that these conditions hold with high probability. Then, we demonstrate that, assuming these concentration results hold, the error is bounded. The details of the proof are deferred to the full version of this paper. With this crucial lemma, we can now prove convergence rates for our algorithm.

Smooth functions. We first consider the setting where our objective is smooth, and prove:

Theorem 3.2. Suppose in Problem (3.1) our f(x) is L-smooth and Assumption 2.2 holds. Suppose η ≤ 1/(2L), T_A = 4DV√(TC), and T_B = 4V√(TC). Then, with probability at least 1 − δ, letting C := log(16mT/δ) and x̄ := (x_2 + ⋯ + x_{T+1})/T, we have
$$f(\bar{x}) - f(x^*) \;\le\; \frac{D^2}{\eta T} + 2\eta V^2\Big(32\alpha^2 + \frac{4C}{m}\Big) + 8DV\sqrt{\frac{C}{Tm}} + 32\alpha DV\sqrt{\frac{C}{T}}.$$
If η is chosen optimally, then f(x̄) − f(x*) ≤ O(LD²/T) + Õ(DV/√(Tm) + αDV/√T).

We remark that:
- The first term O(LD²/T) is the classical error rate for gradient descent on smooth objectives [21] and should exist even if V = 0 (so every ∇_{k,i} exactly equals ∇f(x_k)) and α = 0.
- The first two terms Õ(LD²/T + DV/√(Tm)) together match the classical mini-batch error rate for SGD on smooth objectives, and should exist even if α = 0 (so we have no Byzantine machines).
- The third term Õ(αDV/√T) is optimal in our Byzantine setting, due to Theorem 4.3.
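Before giving the proof, it is worth spelling out the routine calculation behind the optimally-chosen-η claim; the following up-to-constants derivation is ours, for convenience. Balancing the first two terms of the bound at
$$\eta \;\approx\; \min\Big\{\frac{1}{2L},\; \frac{D}{V}\sqrt{\frac{m}{T\,(C + \alpha^2 m)}}\Big\}$$
gives f(x̄) − f(x*) ≤ O(LD²/T) + Õ(DV/√(Tm) + αDV/√T); setting the right-hand side to ε and solving for T yields
$$T \;=\; \tilde{O}\Big(\frac{LD^2}{\varepsilon} + \frac{D^2V^2}{\varepsilon^2 m} + \frac{\alpha^2 D^2V^2}{\varepsilon^2}\Big),$$
which is the T = Õ(1/(ε²m) + α²/ε²) iteration bound quoted in the abstract once L, D, and V are treated as constants.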
Proof of Theorem 3.2. Applying Fact 2.4 for k = 1, 2, ..., T with u = x* and δ = η∇_k, summing over k, and telescoping the last two terms of Fact 2.4 (using ‖x_1 − x*‖ ≤ D), we have
$$\sum_{k=1}^{T}\langle \eta\nabla_k,\, x_k - x^*\rangle \;\le\; \frac{D^2}{2} + \sum_{k=1}^{T}\Big(\langle \eta\nabla_k,\, x_k - x_{k+1}\rangle - \frac{1}{2}\|x_k - x_{k+1}\|^2\Big). \tag{3.2}$$
We notice that the left-hand side of (3.2) satisfies
$$\sum_{k=1}^{T}\langle \eta\nabla_k,\, x_k - x^*\rangle \;=\; \eta\sum_{k=1}^{T}\Big[\frac{|good_k|}{m}\langle \nabla f(x_k),\, x_k - x^*\rangle + \frac{1}{m}\sum_{i\in good_k}\langle \nabla_{k,i} - \nabla f(x_k),\, x_k - x^*\rangle\Big] \;\overset{①}{\ge}\; \frac{\eta}{2}\sum_{k=1}^{T}\big(f(x_k) - f(x^*)\big) + \eta\cdot\mathrm{Error}_1, \tag{3.3}$$
where inequality ① uses the convexity of f(·) (so that ⟨∇f(x_k), x_k − x*⟩ ≥ f(x_k) − f(x*) ≥ 0, together with |good_k| ≥ (1 − α)m ≥ m/2) and the definition of Error₁. Moreover, each term on the right of (3.3) can be further lower bounded by
$$f(x_k) - f(x^*) \;\overset{②}{\ge}\; \big(f(x_{k+1}) - f(x^*)\big) - \langle \nabla f(x_k),\, x_{k+1} - x_k\rangle - \frac{L}{2}\|x_k - x_{k+1}\|^2, \tag{3.4}$$
where inequality ② uses the smoothness of f(·), which implies f(x_{k+1}) ≤ f(x_k) + ⟨∇f(x_k), x_{k+1} − x_k⟩ + (L/2)‖x_k − x_{k+1}‖². Putting (3.3) and (3.4) back into (3.2) and rearranging, we have
$$\frac{1}{T}\sum_{k=1}^{T}\big(f(x_{k+1}) - f(x^*)\big) \;\le\; \frac{D^2}{\eta T} + \frac{2\,|\mathrm{Error}_1|}{T} + 2\eta\,\mathrm{Error}_2^2. \tag{3.5}$$
Above, we used the fact that η ≤ 1/(2L) implies ½ − ηL/2 ≥ ¼, together with Young's inequality ⟨a, b⟩ ≤ ½‖a‖² + ½‖b‖², which absorbs the cross terms η⟨∇_k − ∇f(x_k), x_k − x_{k+1}⟩ into the quadratic terms at a cost of η²·Error₂² per iteration. Finally, we conclude the proof by plugging Lemma 3.1 and the convexity inequality f(x̄) ≤ (1/T)·Σ_{k=1}^{T} f(x_{k+1}) into (3.5). □

Nonsmooth Functions. We also derive a similarly tight result when the objective is not assumed to be smooth. The proof is similar to the previous one, and we defer it to the supplementary material.

Theorem 3.3. Suppose in Problem (3.1) our f(x) is differentiable, G-Lipschitz continuous, and Assumption 2.2 holds. Suppose η > 0, T_A = 4DV√(TC), and T_B = 4V√(TC). Then, with probability at least 1 − δ, letting C := log(16mT/δ) and x̄ := (x_1 + ⋯ + x_T)/T, we have
$$f(\bar{x}) - f(x^*) \;\le\; \frac{D^2}{\eta T} + 2\eta\Big(G^2 + 32\alpha^2 V^2 + \frac{4V^2C}{m}\Big) + 8DV\sqrt{\frac{C}{Tm}} + 32\alpha DV\sqrt{\frac{C}{T}}.$$
If η is chosen optimally, then f(x̄) − f(x*) ≤ O(DG/√T) + Õ(DV/√(Tm) + αDV/√T).

We remark that, as for Theorem 3.2, the first two terms are asymptotically tight for SGD in this setting, and the last term is necessary in our Byzantine setting, as we show in Theorem 4.3.

Strongly convex functions. We now consider the problem³
$$\min_{x\in\mathbb{R}^d}\; f(x) := \mathbb{E}_{s\sim\mathcal{D}}[f_s(x)], \quad\text{where } f(x) \text{ is σ-strongly convex.} \tag{3.6}$$

³To present the simplest result, we have assumed that Problem (3.6) is unconstrained. One can also impose an additional constraint ‖x − x₀‖ ≤ D, but we refrain from doing so.

In this setting, we can obtain similarly optimal rates to those obtained before, by reducing the problem to repeatedly solving non-strongly convex ones, as in Hazan and Kale [14]. When the function is additionally smooth, we obtain:

Theorem 3.4. Suppose in Problem (3.6) our f(x) is L-smooth and Assumption 2.2 holds. Given x₀ ∈ ℝ^d with the guarantee ‖x₀ − x*‖ ≤ D, one can repeatedly apply Byzantine SGD to find a point x satisfying, with probability at least 1 − δ₀, f(x) − f(x*) ≤ ε and ‖x − x*‖² ≤ 2ε/σ in
$$T = \tilde{O}\Big(\frac{L}{\sigma} + \frac{V^2}{\sigma\varepsilon m} + \frac{\alpha^2 V^2}{\sigma\varepsilon}\Big)$$
iterations, where the Õ notation hides logarithmic factors in D, m, L, V, σ⁻¹, ε⁻¹, and δ₀⁻¹.

When the function is non-smooth, we instead obtain:

Theorem 3.5. Suppose in Problem (3.6) our f(x) is differentiable, G-Lipschitz continuous, and Assumption 2.2 holds. Given x₀ ∈ ℝ^d with the guarantee ‖x₀ − x*‖ ≤ D, one can repeatedly apply Byzantine SGD to find a point x satisfying, with probability at least 1 − δ₀, f(x) − f(x*) ≤ ε and ‖x − x*‖² ≤ 2ε/σ in
$$T = \tilde{O}\Big(\frac{G^2}{\sigma\varepsilon} + \frac{V^2}{\sigma\varepsilon m} + \frac{\alpha^2 V^2}{\sigma\varepsilon}\Big)$$
iterations, where the Õ notation hides logarithmic factors in D, m, G, V, σ⁻¹, ε⁻¹, and δ₀⁻¹.

We defer the proofs to the supplementary material, but we remark that, again, each of these rates has three terms. Just as in the rates for non-strongly convex functions, the first two terms are necessary even when there are no Byzantine workers, and the last term matches the lower bound we give in Theorem 4.4 for Byzantine optimization.
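To close this section, here is a compact NumPy sketch of the whole of Algorithm 1 (ours, for illustration only: `grad_oracle` is a hypothetical callback returning the m reported gradients at the current iterate, with Byzantine rows possibly arbitrary, and `vector_median` is the majority-pivot subroutine sketched earlier):

```python
import numpy as np

def byzantine_sgd(grad_oracle, x1, eta, D, T, TA, TB, V, m):
    """Sketch of Algorithm 1; returns the average iterate (x_2+...+x_{T+1})/T.

    grad_oracle(k, x) -> (m, d) array of gradients reported by the machines.
    """
    d = x1.shape[0]
    x = x1.copy()
    good = np.ones(m, dtype=bool)   # good_0 = [m]
    A = np.zeros(m)                 # A_i = sum_t <grad_{t,i}, x_t - x_1>
    B = np.zeros((m, d))            # B_i = sum_t grad_{t,i}
    x_sum = np.zeros(d)
    for k in range(1, T + 1):
        g = grad_oracle(k, x)       # lines 3-6: collect reports, update A, B
        A += g @ (x - x1)
        B += g
        A_med = np.median(A)                 # line 7
        B_med = vector_median(B, TB)         # line 8
        g_med = vector_median(g, 2 * V)      # line 9
        good &= np.abs(A - A_med) <= TA      # line 10: shrink good_k
        good &= np.linalg.norm(B - B_med, axis=1) <= TB
        good &= np.linalg.norm(g - g_med, axis=1) <= 4 * V
        # line 11: the mirror step equals a gradient step projected back
        # onto the ball of radius D around x_1 (cf. Fact 2.4)
        x = x - eta * g[good].sum(axis=0) / m
        r = np.linalg.norm(x - x1)
        if r > D:
            x = x1 + (x - x1) * (D / r)
        x_sum += x
    return x_sum / T
```

The per-iteration overhead on top of plain mini-batch SGD is O(md) for the median computations, which is consistent with the practical outlook discussed in the conclusion.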
4 Lower Bounds for Byzantine Stochastic Optimization

In this section, we prove that the convergence rates we obtain in Section 3 are optimal up to logarithmic factors, even in d = 1 dimension. Recall that a random vector X ∈ ℝ^d is subgaussian with variance proxy V² if uᵀX is a univariate subgaussian random variable with variance proxy V² for all unit vectors u ∈ ℝ^d. We require the following definition:

Definition 4.1 (Stochastic estimator). Given X ⊆ ℝ^d and f: X → ℝ, we say a random function f_s (with s drawn from some distribution D) is a stochastic estimator for f if E[f_s(x)] = f(x) for all x ∈ X. Furthermore, we say f_s is subgaussian with variance proxy V² if ∇f_s(x) − ∇f(x) is a subgaussian random variable with variance proxy V²/d for all x ∈ X.

Note that the normalization factor of 1/d in this definition ensures that E‖∇f_s(x) − ∇f(x)‖² ≤ O(V²), which matches the normalization used in this paper and throughout the literature. However, in our lower bound constructions it turns out that it suffices to take d = 1.

We prove our lower bounds only against subgaussian stochastic estimators. This is different from Assumption 2.2 used in our upper-bound theorems, where we assumed that ‖∇f_s(x) − ∇f(x)‖ ≤ V uniformly for all x in the domain.

Remark 4.2. This difference is negligible because, by concentration, if f_s is a sample from a subgaussian stochastic estimator with variance proxy V², then ‖∇f_s(x) − ∇f(x)‖ ≤ Õ(V) with overwhelming probability. As a result, this impacts our lower bounds only by a log(mT) factor. For simplicity of exposition, we state our theorems only for subgaussian stochastic estimators.

Our result for non-strongly convex stochastic optimization is the following:

Theorem 4.3. For any D, V, ε > 0 and α ∈ (0, 0.1), there exists a linear function f: [−D, D] → ℝ (of Lipschitz constant G = ε/D) with a subgaussian stochastic estimator with variance proxy V² so that, given m machines, of which αm are Byzantine, and T samples from the stochastic estimator per machine, no algorithm can output x so that f(x) − f(x*) < ε with probability ≥ 2/3 unless
$$T = \Omega\Big(\frac{V^2D^2}{\varepsilon^2 m} + \frac{\alpha^2 V^2 D^2}{\varepsilon^2}\Big),$$
where x* = arg min_{x∈[−D,D]} f(x).

Observe that, up to log factors, this matches the upper bound in Theorem 3.3 exactly, demonstrating that both are tight. We get a similarly tight result for the strongly convex case:

Theorem 4.4. For any V, σ > 0 and α ∈ (0, 0.1), there exists a σ-strongly convex quadratic function f: ℝ → ℝ with a subgaussian stochastic estimator of variance proxy V² so that, given m machines, of which αm are Byzantine, and T samples from the stochastic estimator per machine, no algorithm can output x so that |x − x*| < ε̂ with probability ≥ 2/3 unless
$$T = \Omega\Big(\frac{V^2}{m\sigma^2\hat{\varepsilon}^2} + \frac{\alpha^2 V^2}{\sigma^2\hat{\varepsilon}^2}\Big),$$
where x* = arg min_{x∈ℝ} f(x).

Since f(x) − f(x*) ≤ ε = σε̂²/2 implies |x − x*| ≤ ε̂ by the strong convexity of f, Theorem 4.4 also implies the following corollary for function-value approximation:

Corollary 4.5. In the same setting as Theorem 4.4, no algorithm can output x so that f(x) − f(x*) ≤ ε with probability ≥ 2/3 unless
$$T = \Omega\Big(\frac{V^2}{m\sigma\varepsilon} + \frac{\alpha^2 V^2}{\sigma\varepsilon}\Big).$$

Remark 4.6. The lower bound of Yin et al. [31] uses essentially the same construction as we do in the proof of Theorem 4.4. However, in d dimensions, they use a subgaussian estimator for f with variance proxy dV² (so that E‖∇f_s(x) − ∇f(x)‖² ≤ O(dV²)). As a result, their lower bound appears to have an additional d factor in it. Once re-normalized to have variance proxy V², the hard instance in [31] yields exactly the same lower bound as our Theorem 4.4.
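For concreteness, the reduction behind Corollary 4.5 is just the substitution ε̂² = 2ε/σ in Theorem 4.4 (a one-line calculation, spelled out here):
$$\hat{\varepsilon}^2 = \frac{2\varepsilon}{\sigma} \;\Longrightarrow\; \frac{V^2}{m\sigma^2\hat{\varepsilon}^2} + \frac{\alpha^2 V^2}{\sigma^2\hat{\varepsilon}^2} \;=\; \frac{V^2}{2m\sigma\varepsilon} + \frac{\alpha^2 V^2}{2\sigma\varepsilon} \;=\; \Omega\Big(\frac{V^2}{m\sigma\varepsilon} + \frac{\alpha^2 V^2}{\sigma\varepsilon}\Big).$$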
5 Conclusion

We have presented the first tight (up to logarithmic factors) sample and time complexity bounds for distributed SGD in the Byzantine setting, by leveraging concentration bounds to obtain a new set of detection criteria for malevolent machines. While this setting is arguably the most fundamental setting for Byzantine SGD, there remain a number of open questions to explore. For instance, our methods require strong concentration of the gradients, strong enough to invoke Pinelis' 1994 inequality [23]. Is it possible to achieve similar results under weaker assumptions on the gradients? Alternatively, is it possible that the problem provably becomes more difficult?

There are two additional interesting questions for future work. The first is to study Byzantine-resilient variants of our protocol in a decentralized model, where there is no correct central coordinator which can safely aggregate gradients. A second important question is exploring practical implementations of our algorithm. Our algorithm only requires adding simple, efficiently implementable checks to traditional mini-batch SGD. As a result, we believe that in practice our algorithm should add minimal overhead, while providing strong robustness guarantees against machine failure, essentially "for free". We leave such a real-world evaluation of our method to future work.

Finally, we believe that the general algorithmic framework developed in this paper may find further applications to robust distributed estimation problems. Philosophically, our algorithm enforces conditions on the malicious machines in an online fashion, as the data arrives in every iteration. This is in contrast to previous approaches to Byzantine optimization, such as [31], which instead enforce similar conditions using offline techniques, i.e. by looking at the entire dataset. The main advantage of our technique is that the per-iteration time complexity is substantially faster, since we do not need to inspect the entire dataset every time. It is an interesting question whether similar techniques can yield fast distributed algorithms for other estimation problems.

Acknowledgement

We would like to thank Yuval Peres for suggesting reference [23]. Jerry Li is supported by NSF CAREER Award CCF-1453261, CCF-1565235, a Google Faculty Research Award, and an NSF Graduate Research Fellowship.

References

[1] Sivaraman Balakrishnan, Simon S. Du, Jerry Li, and Aarti Singh. Computationally efficient robust sparse estimation in high dimensions. In Conference on Learning Theory, pages 169–212, 2017.

[2] Ron Bekkerman, Mikhail Bilenko, and John Langford. Scaling Up Machine Learning: Parallel and Distributed Approaches. Cambridge University Press, 2011.

[3] Aharon Ben-Tal and Arkadi Nemirovski. Lectures on Modern Convex Optimization. Society for Industrial and Applied Mathematics, January 2013. ISBN 978-0-89871-491-3. doi: 10.1137/1.9780898718829.

[4] Kush Bhatia, Prateek Jain, Parameswaran Kamalaruban, and Purushottam Kar. Consistent robust regression. In Advances in Neural Information Processing Systems 30, pages 2107–2116, 2017.

[5] Kush Bhatia, Prateek Jain, and Purushottam Kar. Robust regression via hard thresholding. In Advances in Neural Information Processing Systems, pages 721–729, 2015.

[6] Peva Blanchard, El Mahdi El Mhamdi, Rachid Guerraoui, and Julien Stainer. Machine learning with adversaries: Byzantine tolerant gradient descent. In NIPS, pages 118–128, 2017.

[7] Moses Charikar, Jacob Steinhardt, and Gregory Valiant. Learning from untrusted data. In STOC, pages 47–60. ACM, 2017.

[8] Yudong Chen, Lili Su, and Jiaming Xu. Distributed statistical machine learning in adversarial settings: Byzantine gradient descent. arXiv preprint arXiv:1705.05491, 2017.
[9] Ilias Diakonikolas, Gautam Kamath, Daniel M. Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Robust estimators in high dimensions without the computational intractability. In FOCS, pages 655–664. IEEE, 2016.

[10] Ilias Diakonikolas, Gautam Kamath, Daniel M. Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Robustly learning a Gaussian: Getting optimal error, efficiently. In SODA, pages 2683–2702. SIAM, 2018.

[11] Ilias Diakonikolas, Gautam Kamath, Daniel M. Kane, Jerry Li, Jacob Steinhardt, and Alistair Stewart. Sever: A robust meta-algorithm for stochastic optimization. arXiv preprint arXiv:1803.02815, 2018.

[12] Paul Feldman and Silvio Micali. Optimal algorithms for Byzantine agreement. In Proceedings of the Twentieth Annual ACM Symposium on Theory of Computing, pages 148–161. ACM, 1988.

[13] Jiashi Feng, Huan Xu, and Shie Mannor. Distributed robust learning. arXiv preprint arXiv:1409.5937, 2014.

[14] Elad Hazan and Satyen Kale. Beyond the regret minimization barrier: Optimal algorithms for stochastic strongly-convex optimization. The Journal of Machine Learning Research, 15(1):2489–2512, 2014.

[15] Peter J. Huber and Elvezio M. Ronchetti. Robust Statistics. Wiley, New York, 2009.

[16] Adam Klivans, Pravesh K. Kothari, and Raghu Meka. Efficient algorithms for outlier-robust regression. arXiv preprint arXiv:1803.03241, 2018.

[17] Jakub Konečný, H. Brendan McMahan, Felix X. Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016.

[18] Kevin A. Lai, Anup B. Rao, and Santosh Vempala. Agnostic estimation of mean and covariance. In FOCS, pages 665–674. IEEE, 2016.

[19] Leslie Lamport, Robert Shostak, and Marshall Pease. The Byzantine generals problem. ACM Transactions on Programming Languages and Systems (TOPLAS), 4(3):382–401, 1982.

[20] Nasser M. Nasrabadi, Trac D. Tran, and Nam Nguyen. Robust lasso with missing and grossly corrupted observations. In Advances in Neural Information Processing Systems (NIPS), 2011.

[21] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, 2004. ISBN 1402075537.

[22] Nam H. Nguyen and Trac D. Tran. Exact recoverability from dense corrupted observations via ℓ₁-minimization. IEEE Transactions on Information Theory, 59(4):2017–2035, 2013.

[23] Iosif Pinelis. Optimum bounds for the distributions of martingales in Banach spaces. The Annals of Probability, pages 1679–1706, 1994.

[24] Adarsh Prasad, Arun Sai Suggala, Sivaraman Balakrishnan, and Pradeep Ravikumar. Robust estimation via robust gradient estimation. arXiv preprint arXiv:1802.06485, 2018.

[25] Alexander Rakhlin, Ohad Shamir, and Karthik Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. In ICML, 2012.

[26] Lili Su and Nitin H. Vaidya. Fault-tolerant multi-agent optimization: Optimal iterative distributed algorithms. In PODC, pages 425–434. ACM, 2016.

[27] Lili Su and Nitin H. Vaidya. Defending non-Bayesian learning against adversarial attacks. ISDC, 2016.

[28] John W. Tukey. Mathematics and the picturing of data. In Proceedings of the International Congress of Mathematicians, volume 6, pages 523–531, 1975.

[29] Blake Woodworth and Nati Srebro. Tight complexity bounds for optimizing composite objectives. In NIPS, 2016.

[30] Cong Xie, Oluwasanmi Koyejo, and Indranil Gupta. Generalized Byzantine-tolerant SGD. arXiv preprint arXiv:1802.10116, 2018.

[31] Dong Yin, Yudong Chen, Kannan Ramchandran, and Peter Bartlett.
Byzantine-robust distributed learning: Towards optimal statistical rates. arXiv preprint arXiv:1803.01498, 2018.