# Distributed Zero-Order Optimization under Adversarial Noise

Arya Akhavan (CSML, Istituto Italiano di Tecnologia and CREST, ENSAE, IP Paris; aria.akhavanfoomani@iit.it)
Massimiliano Pontil (CSML, Istituto Italiano di Tecnologia and University College London; massimiliano.pontil@iit.it)
Alexandre B. Tsybakov (CREST, ENSAE, IP Paris; alexandre.tsybakov@ensae.fr)

*35th Conference on Neural Information Processing Systems (NeurIPS 2021).*

**Abstract.** We study the problem of distributed zero-order optimization for a class of strongly convex functions. They are formed by the average of local objectives, associated with different nodes in a prescribed network. We propose a distributed zero-order projected gradient descent algorithm to solve the problem. Exchange of information within the network is permitted only between neighbouring nodes. An important feature of our procedure is that it can query only function values, subject to a general noise model that does not require zero-mean or independent errors. We derive upper bounds for the average cumulative regret and optimization error of the algorithm which highlight the role played by a network connectivity parameter, the number of variables, the noise level, the strong convexity parameter, and smoothness properties of the local objectives. The bounds indicate some key improvements of our method over the state of the art, both in the distributed and standard zero-order optimization settings. We also comment on lower bounds and observe that the dependency of the bounds on certain function parameters is nearly optimal.

## 1 Introduction

We study the problem of distributed optimization where each node (or agent) $i$ has an objective function $f_i : \mathbb{R}^d \to \mathbb{R}$ and exchange of information is limited to neighbouring agents within a prescribed network of connections. The goal is to minimize the average of these objectives over a closed bounded convex set $\Theta \subset \mathbb{R}^d$,

$$\min_{x \in \Theta} f(x) \quad \text{where} \quad f(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x). \tag{1}$$

Distributed optimization has been widely studied in the literature; we refer to Tsitsiklis et al. [1986], Nedic and Ozdaglar [2009], Nedic et al. [2010], Boyd et al. [2011], Duchi et al. [2012], Jakovetić et al. [2014], Lobel et al. [2011], Kia et al. [2015], Shi et al. [2014], Jakovetić [2019], Scaman et al. [2019], Pu et al. [2021] and references therein. This problem has broad applications, such as multi-agent target seeking Liu et al. [2017], distributed learning Kraska et al. [2013], and wireless networks Park et al. [2020], among others.

We address problem (1) from the perspective of zero-order distributed optimization. That is, we assume that only function values can be queried by the algorithm, subject to measurement noise. During the optimization procedure, each agent maintains a local copy of the variables, which is sequentially updated based on queries of the local and neighbouring functions. We wish to devise such optimization procedures that are efficient in bounding the average optimization error and cumulative regret in terms of the properties of the functions and the network topology.

**Contributions** Our principal contribution is a distributed zero-order optimization algorithm, introduced in Section 2, which we show to achieve tight rates of convergence under certain assumptions on the objective functions, outlined in Section 3. Specifically, we assume that the local objectives $f_i$ are $\beta$-Hölder and the average objective $f$ is $\alpha$-strongly convex.
The algorithm relies on a novel zero-order gradient estimator, presented in Section 4. Although conceptually very simple, this estimator, when employed within our algorithm, allows us to obtain an $O(d^2)$ computational gain as well as improved error rates compared with previous state-of-the-art zero-order optimization procedures Akhavan et al. [2020], Bach and Perchet [2016], in the special case of the standard (undistributed) setting. Another key advantage of our approach is the general noise model, presented in Section 5, under which function values are queried. The noise variables need not be zero-mean or independently sampled, and thus they include adversarial noise. In Section 6, we derive the rates of convergence for the cumulative regret and the optimization error of the proposed algorithm, and in Section 7 we consider the special case of 2-smooth functions. The rates highlight the dependency on the number of variables $d$, the number of function queries $T$, the spectral gap $1-\rho$ of the network matrix, and the parameters $n$, $\alpha$ and $\beta$. The bounds enjoy a better dependency on $1-\rho$ than previous bounds on zero-order distributed optimization Qu and Li [2018], Yu et al. [2019], Tang et al. [2019]. We also compare our bounds to related lower bounds in Akhavan et al. [2020] for the undistributed setting, observing that our rates are optimal either with respect to $T$ and $\alpha$, or with respect to $T$ and $d$.

**Previous Work** We briefly comment on previous related work and defer to Section 8 a more in-depth discussion and comparison. For both the deterministic and the stochastic scenario of problem (1), a large body of literature is devoted to first-order gradient based methods with a consensus scheme (see the papers cited above and references therein). On the other hand, the study of zero-order methods started only recently Qu and Li [2018], Sahu et al. [2018b,a], Hajinezhad et al. [2019], Yu et al. [2019], Tang et al. [2019]. The works Qu and Li [2018], Yu et al. [2019], Tang et al. [2019] deal with zero-order distributed methods in noise-free settings, while the noisy setting is developed in Hajinezhad et al. [2019], Sahu et al. [2018b,a]. Namely, Hajinezhad et al. [2019] considers 2-point zero-order methods with stochastic queries for non-convex optimization but assumes that the noise is the same for both queries, which makes the problem analogous to the noise-free scenario in terms of optimization rates. The papers Sahu et al. [2018b,a] study zero-order distributed optimization for strongly convex and $\beta$-smooth functions $f_i$ with $\beta \in \{2, 3\}$. They derive bounds on the optimization error, though without providing closed-form expressions.

**Notation** Throughout, we denote by $\langle\cdot,\cdot\rangle$ and $\|\cdot\|$ the standard inner product and Euclidean norm on $\mathbb{R}^d$, respectively, and by $\|\cdot\|$ also the spectral norm of a matrix. The notation $I$ is used for the $n \times n$ identity matrix and $\mathbf{1}$ for the vector in $\mathbb{R}^n$ with all entries equal to 1. We denote by $e_j$ the $j$-th canonical basis vector in $\mathbb{R}^d$. For any set $A$, the number of elements in $A$ is denoted by $|A|$. For $x \in \mathbb{R}$, the value $\lfloor x \rfloor$ is the maximal integer less than $x$. For every closed convex set $\Theta \subset \mathbb{R}^d$ and $x \in \mathbb{R}^d$ we denote by $\mathrm{Proj}_\Theta(x) = \mathrm{argmin}\{\|z - x\| : z \in \Theta\}$ the Euclidean projection of $x$ onto $\Theta$. We denote by $\mathrm{diam}(\Theta)$ the Euclidean diameter of $\Theta$. Finally, we let $U[-1, 1]$ be the uniform distribution on $[-1, 1]$.

## 2 The Problem

Let $n$ be the number of agents and let $G = (V, E)$ be an undirected graph, where $V = \{1, \dots, n\}$ is the set of nodes and $E \subseteq V \times V$ is the set of edges.
The adjacency matrix of $G$ is the symmetric matrix $(A_{ij})_{i,j=1}^n$ defined by $A_{ij} = 1$ if $(i, j) \in E$ and $A_{ij} = 0$ otherwise. We consider the following sequential learning framework, in which each agent $i$ gets values of the function $f_i$ corrupted by noise and shares information with other agents. At step $t$, agent $i$ acts as follows:

- makes queries and gets noisy values of $f_i$,
- computes a local output $u_i(t)$ based on these queries and on the past information,
- broadcasts $u_i(t)$ to the neighbouring agents,
- updates its local variable using information from the other agents as $\sum_{j=1}^n W_{ij} u_j(t)$,

where $W = (W_{ij})_{i,j=1}^n$ is a given matrix called the consensus matrix. Below we use the following condition on the consensus matrix.

**Assumption A.** Matrix $W$ is symmetric, doubly stochastic, and $\rho := \big\|W - \tfrac{1}{n}\mathbf{1}\mathbf{1}^\top\big\| < 1$.

Matrix $W$ accounts for the connectivity properties of the network. If $W_{ij} = 0$ the agents $i$ and $j$ are not connected (do not exchange information). Often $W$ is defined as a doubly stochastic matrix function of the adjacency matrix $A$ of the graph. One popular example is

$$W_{ij} = \begin{cases} \dfrac{A_{ij}}{\gamma \max\{d(i), d(j)\}} & \text{if } i \neq j,\\[2mm] 1 - \displaystyle\sum_{k \neq i} \dfrac{A_{ki}}{\gamma \max\{d(i), d(k)\}} & \text{if } i = j,\end{cases}$$

where $d(i) = \sum_{j=1}^n A_{ij}$ is the degree of node $i$ and $\gamma > 0$ is a constant. Then, clearly, $W = (W_{ij})$ is a symmetric and doubly stochastic matrix, and $W_{ij} = 0$ if agents $i$ and $j$ are not connected. Moreover, we have $\rho \le 1 - c/n^2$ for a constant $c > 0$ (see Qu and Li [2018], Olshevsky [2014]). Values of $\rho$ for some other matrices $W$ reflecting different network topologies can be found in Duchi et al. [2012]. Typically, $\rho \le 1 - a_n$, where $a_n = \Omega(n^{-1})$ or $a_n = \Omega(n^{-2})$. The parameter $\rho$ can be viewed as a measure of the difference between the distributed problem and a standard optimization problem. If the communication graph is complete, a natural choice is $W = n^{-1}\mathbf{1}\mathbf{1}^\top$, and then $\rho = 0$. For more examples of consensus matrices $W$, see Olshevsky and Tsitsiklis [2009], Duchi et al. [2012] and references therein.

**Algorithm 1** Distributed Zero-Order Gradient
- **Input** Communication matrix $(W_{ij})_{i,j=1}^n$, step sizes $(\eta_t > 0)_{t=1}^{T_0-1}$
- **Initialization** Choose initial vectors $x_1(1) = \dots = x_n(1) \in \mathbb{R}^d$
- **For** $t = 1, \dots, T_0 - 1$ and **for** $i = 1, \dots, n$:
  1. Build an estimate $g_i(t)$ of the gradient $\nabla f_i(x_i(t))$ using noisy evaluations of $f_i$
  2. Update $x_i(t+1) = \sum_{k=1}^n W_{ik}\, \mathrm{Proj}_\Theta\big(x_k(t) - \eta_t g_k(t)\big)$
- **Output** Approximate minimizer $\bar{x}(T_0) = \frac{1}{n}\sum_{i=1}^n x_i(T_0)$ of the average objective $f = \frac{1}{n}\sum_{i=1}^n f_i$

The local outputs $u_i$ can be defined in different ways. Our approach is outlined in Algorithm 1. At Step 1, an estimate of the gradient of the local objective $f_i$ at $x_i(t)$ is constructed. This involves a randomized procedure that we describe and justify in Section 4. The local output $u_i$ is defined as an update of the projected gradient algorithm with such an estimated gradient. At Step 2 of the algorithm, each agent computes the next point by a local consensus gradient descent step, which uses local and neighbour information. Step 2 of the algorithm is known as the gossip method (see, e.g., Boyd et al. [2006]), which was initially introduced as an approach for networks in which the connections between the nodes change over time. We also refer to Sayin et al. [2017] for similar algorithms in the context of distributed stochastic first-order gradient methods.
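As a concrete illustration, the following is a minimal sketch (ours, not the authors' code) of building such a consensus matrix from an adjacency matrix and checking Assumption A; the ring-graph example, the helper name `consensus_matrix`, and the choice $\gamma = 2$ are illustrative assumptions.

```python
import numpy as np

def consensus_matrix(A, gamma=2.0):
    """Consensus matrix from an adjacency matrix A, following the
    construction in the text: W_ij = A_ij / (gamma * max(d(i), d(j)))
    for i != j, with the diagonal chosen so that every row sums to one.
    The result is symmetric and doubly stochastic (Assumption A)."""
    n = A.shape[0]
    deg = A.sum(axis=1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and A[i, j] > 0:
                W[i, j] = 1.0 / (gamma * max(deg[i], deg[j]))
        W[i, i] = 1.0 - W[i].sum()
    return W

# Example: a ring of n agents, each connected to its two neighbours.
n = 8
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0

W = consensus_matrix(A)
# rho = || W - (1/n) 1 1^T || (spectral norm); Assumption A needs rho < 1.
rho = np.linalg.norm(W - np.ones((n, n)) / n, ord=2)
print(f"rho = {rho:.4f}")  # strictly below 1 since the ring is connected
```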
## 3 Assumptions on Local Objectives

In this section, we give some definitions and introduce our assumptions on the local objective functions $f_1, \dots, f_n$.

**Definition 1.** Denote by $\mathcal{F}_\beta(L)$ the set of all functions $f : \mathbb{R}^d \to \mathbb{R}$ that are $\ell = \lfloor\beta\rfloor$ times differentiable and satisfy, for all $x, z \in \mathbb{R}^d$, the Hölder-type condition

$$\Big|f(z) - \sum_{0 \le |m| \le \ell} \frac{1}{m!} D^m f(x)(z - x)^m\Big| \le L\|z - x\|^\beta, \tag{2}$$

where $L > 0$, the sum is over the multi-indices $m = (m_1, \dots, m_d) \in \mathbb{N}^d$, we use the notation $m! = m_1! \cdots m_d!$, $|m| = m_1 + \dots + m_d$, and we define, for every $\nu = (\nu_1, \dots, \nu_d) \in \mathbb{R}^d$,

$$D^m f(x)\nu^m = \frac{\partial^{|m|} f(x)}{\partial^{m_1}x_1 \cdots \partial^{m_d}x_d}\,\nu_1^{m_1} \cdots \nu_d^{m_d}.$$

Elements of the class $\mathcal{F}_\beta(L)$ are referred to as $\beta$-Hölder functions.

**Definition 2.** A function $f : \mathbb{R}^d \to \mathbb{R}$ is called 2-smooth if it is differentiable on $\mathbb{R}^d$ and there exists $L > 0$ such that, for every $(x, x') \in \mathbb{R}^d \times \mathbb{R}^d$, it holds that $\|\nabla f(x) - \nabla f(x')\| \le L\|x - x'\|$.

**Definition 3.** Let $\alpha > 0$. A function $f : \mathbb{R}^d \to \mathbb{R}$ is called $\alpha$-strongly convex if $f$ is differentiable on $\mathbb{R}^d$ and

$$f(x) - f(x') \ge \langle\nabla f(x'), x - x'\rangle + \frac{\alpha}{2}\|x - x'\|^2, \qquad \forall\, x, x' \in \mathbb{R}^d.$$

**Assumption B.** The functions $f_1, \dots, f_n$: (i) belong to the class $\mathcal{F}_\beta(L)$ for some $\beta \ge 2$, and (ii) are 2-smooth.

In Section 6 we analyse the convergence properties of Algorithm 1 when the objective function $f$ in (1) is $\alpha$-strongly convex. We stress that we do not need the functions $f_1, \dots, f_n$ to be $\alpha$-strongly convex as well. It is enough to make such an assumption on the compound function $f$, while the local functions $f_i$ only need to satisfy the smoothness conditions stated in Assumption B above.

## 4 Gradient Estimator

In this section, we detail our choice of the gradient estimators $g_i(t)$ used at Step 1 of Algorithm 1. We consider Algorithm 2. For any function $F : \mathbb{R}^d \to \mathbb{R}$ and any point $x$, the vector $g$ returned by Algorithm 2 is an estimate of $\nabla F(x)$ based on noisy observations of $F$ at randomized points. The estimator is computed for every node $i$ at each step $t$, thus giving the vectors $g = g_i(t)$ in Algorithm 1.

**Algorithm 2** Gradient Estimator with $2d$ Queries
- **Input** Function $F : \mathbb{R}^d \to \mathbb{R}$ and point $x \in \mathbb{R}^d$
- **Requires** Kernel $K : [-1, 1] \to \mathbb{R}$, parameter $h > 0$
- **Initialization** Generate a random $r$ from the uniform distribution on $[-1, 1]$
- **For** $j = 1, \dots, d$:
  1. Obtain the noisy values $y_j = F(x + hre_j) + \xi_j$ and $y_j' = F(x - hre_j) + \xi_j'$
  2. Compute $g_j = \frac{1}{2h}(y_j - y_j')K(r)$
- **Output** $g = (g_j)_{j=1}^d \in \mathbb{R}^d$, an estimator of $\nabla F(x)$

The gradient estimator crucially requires a kernel function $K : [-1, 1] \to \mathbb{R}$ that allows us to take advantage of possible higher-order smoothness properties of $f$. Specifically, in what follows we assume that

$$\int u K(u)\,du = 1, \qquad \int u^j K(u)\,du = 0, \quad j = 0, 2, 3, \dots, \ell, \qquad \kappa_\beta \triangleq \int |u|^\beta |K(u)|\,du < \infty, \tag{3}$$

for given $\beta \ge 2$ and $\ell = \lfloor\beta\rfloor$. As shown in Polyak and Tsybakov [1990], such kernels can be constructed as weighted sums of Legendre polynomials, in which case $\kappa_\beta \le 2\sqrt{2}\,\beta$ for $\beta \ge 1$; see also Appendix A.3 in Bach and Perchet [2016] for a derivation.

The gradient estimator in Algorithm 2 differs from the standard $2d$-point Kiefer-Wolfowitz type estimator in that it uses multiplication by a random variable $K(r)$ with a well-chosen kernel $K$. On the other hand, it also differs from the previous kernel-based estimators in the zero-order optimization literature Polyak and Tsybakov [1990], Bach and Perchet [2016], Akhavan et al. [2020] in that it needs $2d$ function queries per step, whereas those estimators require only one or two queries; see, in particular, Algorithm 1 in Akhavan et al. [2020] for a comparison. At first sight, this seems a big drawback of the estimator proposed here; however, we will show below that thanks to this estimator we achieve both a more efficient optimization procedure and better rates of convergence for the optimization error.
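A minimal sketch of Algorithm 2 (ours, not the authors' code) is given below. For simplicity we take $\beta = 2$ and use the odd kernel $K(u) = 3u$, which satisfies the moment conditions in the form $\mathbb{E}[K(r)] = 0$, $\mathbb{E}[rK(r)] = 1$, $\mathbb{E}[r^2K(r)] = 0$ for $r \sim U[-1, 1]$ (we normalize (3) as expectations over $r$; the paper's integral normalization may differ by a constant factor, and weighted Legendre kernels would be needed for $\beta > 2$). The toy objective and its constant, non-zero-mean perturbation are illustrative assumptions.

```python
import numpy as np

def kernel(u):
    # Odd kernel for beta = 2: E[K(r)] = 0 and E[r K(r)] = 3 E[r^2] = 1
    # for r ~ U[-1, 1].
    return 3.0 * u

def grad_estimate_2d_queries(F, x, h, rng):
    """Sketch of Algorithm 2: the 2d-query kernel gradient estimator.

    A single random r multiplies all d coordinates (one generator call
    per step). Since E[K(r)] = 0 and r is independent of the noise, the
    noise may be adversarial: it need not be zero-mean or independent.
    """
    d = x.shape[0]
    r = rng.uniform(-1.0, 1.0)
    g = np.zeros(d)
    for j in range(d):
        e_j = np.zeros(d)
        e_j[j] = 1.0
        y_plus = F(x + h * r * e_j)    # noisy query of F
        y_minus = F(x - h * r * e_j)   # noisy query of F
        g[j] = (y_plus - y_minus) / (2.0 * h) * kernel(r)
    return g

# Toy check: f(x) = ||x||^2 / 2, queries corrupted by a constant bias.
rng = np.random.default_rng(0)
F = lambda x: 0.5 * x @ x + 0.01       # noise with non-zero mean
x = np.array([1.0, -0.5, 2.0])
g_avg = np.mean([grad_estimate_2d_queries(F, x, 0.1, rng)
                 for _ in range(2000)], axis=0)
print(g_avg)  # close to the true gradient, here equal to x
```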
When the estimator in Algorithm 2 is used at the $t$-th outer step of Algorithm 1, it should be understood as a random variable that depends on the randomization used during the current estimation at the given node, as well as on the randomness of the past iterations, inducing the $\sigma$-algebra $\mathcal{F}_t$ (see Section 5 for the definition). Bounds for the bias of this estimator conditional on the past and for its second moment play an important role below, in our analysis of the convergence rates. These bounds are presented in the next two lemmas, whose proofs are given in Appendix B. We state them in the simpler setting of Algorithm 2, with no reference to the filtration $(\mathcal{F}_t)_{t \in \mathbb{N}}$.

**Lemma 1.** Let $f : \mathbb{R}^d \to \mathbb{R}$ be a function in $\mathcal{F}_\beta(L)$, $\beta \ge 2$, and let the random variables $\xi_1, \dots, \xi_d$ and $\xi_1', \dots, \xi_d'$ be independent of $r$ and satisfy $\mathbb{E}[|\xi_j|] < \infty$, $\mathbb{E}[|\xi_j'|] < \infty$ for $j = 1, \dots, d$. Let the kernel $K$ satisfy conditions (3). If $g$ is the gradient estimator of $f$ given by Algorithm 2, then for all $x \in \mathbb{R}^d$,

$$\big\|\mathbb{E}[g] - \nabla f(x)\big\| \le L\kappa_\beta \sqrt{d}\, h^{\beta - 1}.$$

It is straightforward to see that the bound of Lemma 1 holds when the estimators are built recursively during the execution of Algorithm 1 and the expectation is taken conditionally on $\mathcal{F}_t$. This will be used in the proofs.

**Lemma 2.** Let $f : \mathbb{R}^d \to \mathbb{R}$ be 2-smooth, and let $\max_{x \in \Theta}\|\nabla f(x)\| \le G$ and $\kappa \triangleq \int K^2(u)\,du < \infty$. Let the random variables $\xi_1, \dots, \xi_d$ and $\xi_1', \dots, \xi_d'$ be independent of $r$ and satisfy $\mathbb{E}[\xi_j^2] \le \sigma^2$, $\mathbb{E}[(\xi_j')^2] \le \sigma^2$ for $j = 1, \dots, d$. If $g$ is defined by Algorithm 2, where $x$ is a random variable with values in $\Theta$ independent of $r$ and depending on $\xi_1, \dots, \xi_d$ and $\xi_1', \dots, \xi_d'$ in an arbitrary way, then

$$\mathbb{E}\big[\|g\|^2\big] \le \frac{9\kappa\, d\,\sigma^2}{4h^2} + 9G^2\kappa.$$

## 5 Noise Model

Algorithm 2 is called to compute estimators of the gradients of the local functions $f_i$, $i = 1, \dots, n$, at each iteration $t$ of Algorithm 1. Thus, we assume that agent $i$ at iteration $t$ generates a uniform random variable $r_i(t) \sim U[-1, 1]$ and gets $2d$ noisy observations, defined for $j = 1, \dots, d$ by

$$y_{i,j}(t) = f_i\big(x_i(t) + h_t r_i(t)e_j\big) + \xi_{i,j}(t), \qquad y_{i,j}'(t) = f_i\big(x_i(t) - h_t r_i(t)e_j\big) + \xi_{i,j}'(t),$$

where the parameters $h_t > 0$ will be specified later. In what follows, we denote by $\mathcal{F}_t$ the $\sigma$-algebra generated by the random variables $x_i(t)$, $i = 1, \dots, n$. In order to meet the conditions of Lemmas 1 and 2 for each $(i, t)$, we impose the following assumption on the collection of random variables $(r_i(t), \xi_{i,j}(t), \xi_{i,j}'(t))$.

**Assumption C.** For all integers $t$ and all $i \in \{1, \dots, n\}$ the following properties hold: (i) the random variables $r_i(t) \sim U[-1, 1]$ are independent of $\xi_{i,1}(t), \dots, \xi_{i,d}(t), \xi_{i,1}'(t), \dots, \xi_{i,d}'(t)$ and of the $\sigma$-algebra $\mathcal{F}_t$; (ii) $\mathbb{E}[(\xi_{i,j}(t))^2] \le \sigma^2$ and $\mathbb{E}[(\xi_{i,j}'(t))^2] \le \sigma^2$ for $j = 1, \dots, d$ and some $\sigma \ge 0$.

Assumption C is very mild. Indeed, part (i) holds as a matter of course, since it is unnatural to assume dependence between the random environment noise and the artificial random variables $r_i(t)$ generated by the agents. We state (i) only for the purpose of formal rigor. Remarkably, we do not assume the noises $\xi_{i,j}(t)$ and $\xi_{i,j}'(t)$ to have zero mean. What is more, these variables can be deterministic, and no independence between them is required for different $i, j, t$, so we consider an adversarial environment. Having such a relaxed assumption on the noise is possible because of the multiplication by the zero-mean variable $K(r)$ in Algorithm 2. This, and the fact that all components of the vectors are treated separately, allows the proofs to go through without the zero-mean assumption and under arbitrary dependence between the noises.
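To make the mechanism explicit, the following one-line computation (our unpacking of the argument, combining conditions (3) with Assumption C) shows why the noise contributes no bias to the $j$-th coordinate of the estimator:

$$\mathbb{E}\Big[\frac{\xi_{i,j}(t) - \xi_{i,j}'(t)}{2h_t}\,K\big(r_i(t)\big)\Big] = \mathbb{E}\Big[\frac{\xi_{i,j}(t) - \xi_{i,j}'(t)}{2h_t}\Big]\,\mathbb{E}\big[K\big(r_i(t)\big)\big] = 0,$$

where the factorization uses the independence in Assumption C(i), and $\mathbb{E}[K(r_i(t))] = \frac{1}{2}\int_{-1}^1 K(u)\,du = 0$ by the case $j = 0$ of conditions (3). No assumption on the means of $\xi_{i,j}(t)$ and $\xi_{i,j}'(t)$ is needed.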
## 6 Main Results

In this section, we provide upper bounds on the performance of the proposed algorithms. Recall that $T_0$ is the number of outer iterations of Algorithm 1, and let $T$ be the total number of times that we observe noisy values of each $f_i$. At each step, Algorithm 2 makes $2d$ queries. Thus, to keep the total budget equal to $T$ we need to make $T_0 = T/(2d)$ steps of Algorithm 1 (assuming that $T/(2d)$ is an integer). We compare our results to lower bounds for any algorithm with a total budget of $T$ queries. For given $\beta \ge 2$, we choose the tuning parameters $\eta_t$ and $h = h_t$ in Algorithms 1 and 2 as

$$\eta_t = \frac{2}{\alpha t}, \qquad h_t = t^{-\frac{1}{2\beta}}. \tag{4}$$

Inspection of the proofs in Appendix C shows that these values of $\eta_t$ and $h_t$ lead to the best rates minimizing the bounds. As one can expect, there are two contributions to the bounds, one representing the usual stochastic optimization error, while the second accounts for the distributed character of the problem. This second contribution is driven by the following quantity, which we call the mean discrepancy:

$$\Delta(t) \triangleq \frac{1}{n}\sum_{i=1}^n \mathbb{E}\big[\|x_i(t) - \bar{x}(t)\|^2\big].$$

It plays an important role in our argument and may be of interest in its own right, cf. Tang et al. [2019]. The next lemma gives a control of the mean discrepancy.

**Lemma 3.** Let Assumptions A, B, and C hold. Let $\Theta$ be a convex compact subset of $\mathbb{R}^d$. Assume that $\mathrm{diam}(\Theta) \le K$ and $\max_{x \in \Theta}\|\nabla f(x)\| \le G$. If the updates $x_i(t)$, $\bar{x}(t)$ are defined by Algorithm 1, in which the gradient estimators for the $i$-th agent are defined by Algorithm 2 with $F = f_i$, $i = 1, \dots, n$, and parameters (4), then

$$\Delta(t) \le A\,\frac{\rho}{1 - \rho}\cdot\frac{d}{\alpha^2}\, t^{-\frac{2\beta - 1}{\beta}}, \tag{5}$$

where $A$ is a constant independent of $t, d, \alpha, n, \rho$. The explicit value of $A$ can be found in the proof.

*Proof Sketch.* Let $V(t) = \sum_{i=1}^n \|x_i(t) - \bar{x}(t)\|^2$ and $z_i(t) = \mathrm{Proj}_\Theta\big(x_i(t) - \eta_t g_i(t)\big) - \big(x_i(t) - \eta_t g_i(t)\big)$. The first step is to show that, due to the definition of the algorithm and Assumption A on the matrix $W$, we have

$$V(t + 1) \le \rho^2 \sum_{i=1}^n \big\|x_i(t) - \bar{x}(t) - \eta_t\big(g_i(t) - \bar{g}(t)\big) + z_i(t) - \bar{z}(t)\big\|^2, \tag{6}$$

where $\bar{g}(t)$ and $\bar{z}(t)$ denote the averages of the $g_i(t)$'s and the $z_i(t)$'s over the agents $i$. From (6), by using the fact that $\|z_i(t)\| \le \eta_t\|g_i(t)\|$, applying Lemma 1 conditionally on $\mathcal{F}_t$, taking expectations and then applying Lemma 2, we deduce the recursion

$$\Delta(t + 1) \le \rho\,\Delta(t) + A_1\rho^2\,\frac{d}{\alpha^2}\, t^{-\frac{2\beta - 1}{\beta}},$$

where $A_1 > 0$ is a constant. The initialization of Algorithm 1 is chosen so that $\Delta(1) = 0$. It follows that $\Delta(t)$ is bounded by a discrete convolution that can be carefully evaluated, leading to (5). □

Using Lemma 3 we obtain the following theorem.

**Theorem 4.** Let $f$ be an $\alpha$-strongly convex function and let the assumptions of Lemma 3 be satisfied. Then for any $x \in \Theta$ the cumulative regret satisfies

$$\sum_{t=1}^{T_0} \mathbb{E}\big[f(\bar{x}(t)) - f(x)\big] \le \frac{d\, T_0^{1/\beta}}{\alpha(1 - \rho)}\big(B_1 + B_2\rho^2\big) + \frac{B_3}{\alpha(1 - \rho)}\big(\log T_0 + 1\big),$$

where the positive constants $B_i$ are independent of $T_0, d, \alpha, n, \rho$. The explicit values of these constants can be found in the proof. Furthermore, if $x^*$ is the minimizer of $f$ over $\Theta$, the optimization error of the averaged estimator $\hat{x}(T_0) = \frac{1}{T_0}\sum_{t=1}^{T_0}\bar{x}(t)$ satisfies

$$\mathbb{E}\big[f(\hat{x}(T_0)) - f(x^*)\big] \le \frac{d}{\alpha(1 - \rho)\, T_0^{\frac{\beta - 1}{\beta}}}\big(B_1 + B_2\rho^2\big) + \frac{B_3}{\alpha(1 - \rho)}\cdot\frac{\log T_0 + 1}{T_0}. \tag{7}$$

*Proof sketch.* Note first that, due to the definition of Algorithm 1 and the properties of the matrix $W$, we have $\bar{x}(t+1) = \bar{x}(t) - \eta_t\bar{g}(t) + \bar{z}(t)$. This resembles the usual recursion of the gradient algorithm, with an additional term $\bar{z}(t) = n^{-1}\sum_{i=1}^n z_i(t)$, where $\|z_i(t)\| \le \eta_t\|g_i(t)\|$.
Using this bound and the $\alpha$-strong convexity of $f$, analyzing the recursion in the standard way and taking conditional expectations, we obtain that for any $x \in \Theta$,

$$f(\bar{x}(t)) - f(x) \le \frac{1}{2\eta_t}\,\mathbb{E}\big[a_t - a_{t+1}\,\big|\,\mathcal{F}_t\big] - \frac{\alpha a_t}{2} + \frac{\eta_t}{n}\sum_{i=1}^n \mathbb{E}\big[\|g_i(t)\|^2\,\big|\,\mathcal{F}_t\big] + \underbrace{\big\|\mathbb{E}\big[\bar{g}(t)\,\big|\,\mathcal{F}_t\big] - \nabla f(\bar{x}(t))\big\|\,\big\|\bar{x}(t) - x\big\|}_{\mathrm{Bias}_1} + \underbrace{\eta_t^{-1}\,\mathbb{E}\big[\langle\bar{z}(t),\,\bar{x}(t) - x\rangle\,\big|\,\mathcal{F}_t\big]}_{\mathrm{Bias}_2}, \tag{8}$$

where $a_t = \|\bar{x}(t) - x\|^2$. Here, the term $\mathrm{Bias}_2$ is entirely induced by the distributed nature of the problem. Using the properties of the Euclidean projection and some algebra, it can be bounded as

$$\mathrm{Bias}_2 \le \frac{3\eta_t}{2(1 - \rho)n}\sum_{i=1}^n\Big(\mathbb{E}\big[\|g_i(t)\|^2\,\big|\,\mathcal{F}_t\big] + \frac{1 - \rho}{\eta_t^2}\,\|x_i(t) - \bar{x}(t)\|^2\Big). \tag{9}$$

On the other hand, $\mathrm{Bias}_1$ accumulates two contributions, the first due to the gradient approximation (cf. Lemma 1) and the second due to the distributed nature of the problem:

$$\mathrm{Bias}_1 \le L\kappa_\beta\sqrt{d}\,h_t^{\beta-1}\,\|\bar{x}(t) - x\| + \frac{L}{n}\sum_{i=1}^n\|x_i(t) - \bar{x}(t)\|\,\|\bar{x}(t) - x\| \le \frac{L^2\kappa_\beta^2}{\alpha}\,d\,h_t^{2(\beta-1)} + \frac{\alpha a_t}{2} + \frac{L^2}{\alpha n}\sum_{i=1}^n\|x_i(t) - \bar{x}(t)\|^2. \tag{10}$$

Next, we combine inequalities (8)–(10), take expectations of both sides of the resulting inequality, and use Lemmas 2 and 3 to bound the second moments $\mathbb{E}\|g_i(t)\|^2$ and the mean discrepancy. The final result is obtained by summing from $t = 1$ to $t = T_0$ and recalling that $\eta_t = \frac{2}{\alpha t}$, $h_t = t^{-\frac{1}{2\beta}}$. □

Due to the $\alpha$-strong convexity of $f$, Theorem 4 immediately implies a bound on the estimation error $\mathbb{E}[\|\hat{x}(T_0) - x^*\|^2]$: it is of the order of the right-hand side of (7) divided by $\alpha$. Furthermore, we get the following result about the local estimators, which follows from a slight modification of Lemma 3 and Theorem 4.

**Corollary 5.** Let Assumptions A, B, and C hold. Let $\Theta$ be a convex compact subset of $\mathbb{R}^d$. Assume that $\mathrm{diam}(\Theta) \le K$ and $\max_{x \in \Theta}\|\nabla f(x)\| \le G$. If the updates $x_i(t)$ are defined by Algorithm 1, in which the gradient estimators for the $i$-th agent are defined by Algorithm 2 with $F = f_i$, $i = 1, \dots, n$, and parameters $\eta_t = \frac{4}{\alpha(t+1)}$, $h_t = t^{-\frac{1}{2\beta}}$, then the local weighted average estimator $\hat{x}_i(T_0) = \frac{2}{T_0(T_0+1)}\sum_{t=1}^{T_0} t\,x_i(t)$ satisfies

$$\mathbb{E}\big[\|\hat{x}_i(T_0) - x^*\|^2\big] \le C\min\Big(1,\ \frac{d}{\alpha^2(1-\rho)}\,T_0^{-\frac{\beta-1}{\beta}}\Big), \qquad i = 1, \dots, n,$$

where $C > 0$ is a positive constant independent of $T_0, d, \alpha, n, \rho$.

We now state a corollary of Theorem 4 for an algorithm with a total budget of $T$ queries. Assume that $T_0 = T/(2d)$ is an integer. As our algorithm makes $2d$ queries per step, the estimator $\hat{x}(T/(2d))$ uses the total budget of $T$ queries. Combining Theorem 4 with the trivial bound $\mathbb{E}[f(\hat{x}(T/(2d))) - f(x^*)] \le GK$ we get the following result.

**Corollary 6.** Let $T \ge 2d$ and let the assumptions of Theorem 4 be satisfied. Then

$$\mathbb{E}\big[f(\hat{x}(T/(2d))) - f(x^*)\big] \le C\min\Big(1,\ \frac{d^{2-1/\beta}}{\alpha(1-\rho)}\,T^{-\frac{\beta-1}{\beta}}\Big),$$

where $C > 0$ is a positive constant independent of $T, d, \alpha, n, \rho$.

We now state several important implications of our results.

**Remark 1.** Previous bounds on zero-order distributed optimization Qu and Li [2018], Yu et al. [2019], Tang et al. [2019] contain a dependency of $(1-\rho)^{-2}$ on the "connectivity" parameter $\rho$. While Theorem 4 covers a more difficult noisy setting, our bound displays a better dependency of $(1-\rho)^{-1}$. Since the most common values of $1-\rho$ are of the order $n^{-2}$ (or $n^{-1}$), this represents a substantial gain.

**Remark 2.** The case $n = 1$, $\rho = 0$ corresponds to usual (undistributed) zero-order stochastic optimization. Then Corollary 6 gives a bound of order $\min\big(1,\ \frac{d^{2-1/\beta}}{\alpha}T^{-\frac{\beta-1}{\beta}}\big)$. This improves upon the bound$^1$ $\min\big(1,\ \frac{d^{2}}{\alpha}T^{-\frac{\beta-1}{\beta}}\big)$ obtained under the same assumptions in Akhavan et al. [2020]. Still, our bound does not match the minimax lower bound established in Akhavan et al. [2020], which equals

$$\min\Big(\max\big(\alpha,\ T^{-1/2+1/\beta}\big),\ \frac{d}{\sqrt{T}},\ \frac{d}{\alpha}\,T^{-\frac{\beta-1}{\beta}}\Big). \tag{11}$$

For $\alpha \ge 1$ the lower bound (11) scales as $\min\big(1,\ \frac{d}{\alpha}T^{-\frac{\beta-1}{\beta}}\big)$. It has the same behavior in the interesting regime of $\alpha$ not too small ($\alpha \ge T^{-1/2+1/\beta}$) and $T$ large enough relative to $d$. Note, however, that the lower bound (11) is obtained for the setting with i.i.d. noise, while our upper bound is valid under adversarial noise. Therefore, it may seem rather surprising that the ratio between the two bounds is only $d^{1-1/\beta}$.

$^1$The recent work Novitskii and Gasnikov [2021] obtains the same improvement using the gradient estimator of Akhavan et al. [2020]. However, as we explain in Remark 3, that estimator is less appealing from the computational point of view.

**Remark 3.** With the same budget of queries $T$, the $2d$-point method in Algorithm 2 is computationally simpler than the methods with one or two queries per step Polyak and Tsybakov [1990], Bach and Perchet [2016], Akhavan et al. [2020] previously suggested for the same setting. For example, the method in Bach and Perchet [2016], Akhavan et al. [2020] prescribes, at each step $t = 1, \dots, T$, to generate a random variable uniformly distributed on the unit sphere in $\mathbb{R}^d$. This requires of the order of $d$ calls to a one-dimensional random variable generator. Overall, in $T$ steps, the number of calls is of order $dT$. For our method with the same budget $T$, we make of the order of $T_0 = T/(2d)$ steps, and at each step we need to call the generator only once in order to get $r \sim U[-1, 1]$. Thus, with the same budget of queries, Algorithm 2 needs a factor of $d^2$ fewer calls to the random variable generator than the gradient estimators in Bach and Perchet [2016], Akhavan et al. [2020]. In Appendix E, we present a numerical comparison between the algorithm proposed in this paper and that of Akhavan et al. [2020]. The results confirm our theoretical findings: the algorithm of this paper converges faster, and the advantage is more pronounced as $d$ increases.
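Pulling the pieces together, here is a minimal end-to-end sketch of Algorithm 1 with the estimator of Algorithm 2 and the schedule (4); it is our illustration, not the authors' code, and it reuses `consensus_matrix`, `n`, `A`, `W` and `grad_estimate_2d_queries` from the sketches above. The quadratic local objectives, the ball projection and all numerical values are illustrative assumptions.

```python
import numpy as np

def distributed_zo_gd(fs, W, proj, alpha, beta, T0, d, seed=0):
    """Sketch of Algorithm 1: distributed zero-order projected gradient
    with consensus, using eta_t = 2/(alpha t), h_t = t^(-1/(2 beta))."""
    rng = np.random.default_rng(seed)
    n_agents = len(fs)
    x = np.zeros((n_agents, d))                 # x_1(1) = ... = x_n(1)
    for t in range(1, T0):
        eta = 2.0 / (alpha * t)
        h = t ** (-1.0 / (2.0 * beta))
        g = np.stack([grad_estimate_2d_queries(fs[i], x[i], h, rng)
                      for i in range(n_agents)])
        local = np.stack([proj(x[i] - eta * g[i]) for i in range(n_agents)])
        x = W @ local                           # gossip / consensus step
    return x.mean(axis=0)                       # xbar(T0)

# Toy run on the ring network from Section 2: f_i(x) = ||x - c_i||^2 / 2
# with adversarially biased queries; f is then 1-strongly convex.
d, beta, alpha = 3, 2.0, 1.0
centers = np.random.default_rng(1).normal(size=(n, d))
fs = [lambda x, c=c: 0.5 * (x - c) @ (x - c) + 0.01 for c in centers]
proj = lambda x: x / max(1.0, np.linalg.norm(x) / 10.0)  # ball of radius 10
xbar = distributed_zo_gd(fs, W, proj, alpha, beta, T0=2000, d=d)
print(np.linalg.norm(xbar - centers.mean(axis=0)))  # small: near argmin of f
```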
## 7 Improved Bounds for β = 2

In this section we provide improved upper bounds for the case $\beta = 2$ of Corollary 6, relaxing the dependency on $d$ from $d^{3/2}$ to $d$. Following the literature on undistributed zero-order optimization, we use a standard 2-point method with elements of the analysis developed in Flaxman et al. [2005], Agarwal et al. [2010], Duchi et al. [2015], Shamir [2013, 2017], Akhavan et al. [2020], among others. Specifically, we define

$$g_i(t) = \frac{d}{2h_t}\big(y_i(t) - y_i'(t)\big)\,\zeta_i(t), \tag{12}$$

where $y_i(t) = f_i(x_i(t) + h_t\zeta_i(t)) + \xi_i(t)$ and $y_i'(t) = f_i(x_i(t) - h_t\zeta_i(t)) + \xi_i'(t)$, with random variables $\zeta_i(t)$, $1 \le i \le n$, $1 \le t \le T$, that are i.i.d. uniformly distributed on the unit Euclidean sphere in $\mathbb{R}^d$. We make the following assumption on the noise, analogous to Assumption C.

**Assumption D.** For all integers $t$ and all $i \in \{1, \dots, n\}$ the following properties hold: (i) the random variables $\zeta_i(t)$ are independent of $\xi_i(t)$, $\xi_i'(t)$ and of the $\sigma$-algebra $\mathcal{F}_t$; (ii) $\mathbb{E}[(\xi_i(t))^2] \le \sigma^2$ and $\mathbb{E}[(\xi_i'(t))^2] \le \sigma^2$ for some $\sigma \ge 0$.

**Theorem 7.** Let $f$ be an $\alpha$-strongly convex function. Let Assumptions A, B, and D hold with $\beta = 2$. Let $\Theta$ be a convex compact subset of $\mathbb{R}^d$, and assume that $\mathrm{diam}(\Theta) \le K$ and $\max_{x \in \Theta}\|\nabla f_i(x)\| \le G$ for $1 \le i \le n$. Let the updates $x_i(t)$, $\bar{x}(t)$ be defined by Algorithm 1, in which the gradient estimator for the $i$-th agent is defined by (12), and

$$\eta_t = \frac{1}{\alpha t}, \qquad h_t = \Big(\frac{3d^2\sigma^2}{2L\alpha t + 9L^2 d^2}\Big)^{1/4}.$$

Then for the estimator $\hat{x}(T) = \frac{1}{T - \lfloor T/2\rfloor}\sum_{t=\lfloor T/2\rfloor + 1}^{T}\bar{x}(t)$ we have

$$\mathbb{E}\big[f(\hat{x}(T)) - f(x^*)\big] \le \frac{B}{1 - \rho}\cdot\frac{d}{\alpha\sqrt{T}},$$

where $B > 0$ is a constant independent of $T, d, \alpha, n, \rho$.

The main idea of the proof is to use surrogate functions $\hat{f}_t^i(x)$, $1 \le i \le n$, defined for every $x \in \mathbb{R}^d$ as $\hat{f}_t^i(x) = \mathbb{E}\,f_i(x + h_t\tilde{\zeta})$, where the expectation is taken with respect to the random vector $\tilde{\zeta}$ uniformly distributed on the unit ball $B^d = \{u \in \mathbb{R}^d : \|u\| \le 1\}$. A result which can be traced back to Nemirovsky and Yudin [1983] implies that $g_i(t)$ is an unbiased estimator of the gradient of the surrogate function $\hat{f}_t^i$ at $x_i(t)$. Thus, we can consider Algorithm 1 as a gradient descent for the surrogate functions. Then, replacing $f_i$ and $f$ by the surrogate functions at a cost of order $h_t^2$, we can recover the initial problem. This method does not work for $\beta > 2$, since the error of approximation by the surrogate function becomes of bigger order than the optimal rate $T^{-\frac{\beta-1}{\beta}}$. The results that we use as tools in this section are given in Appendix D.
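For comparison with Algorithm 2, here is a minimal sketch (ours) of the two-point estimator (12); drawing the direction uniformly on the unit sphere by normalizing a Gaussian vector is an implementation choice not specified in the text.

```python
import numpy as np

def grad_estimate_2point(F, x, h, rng):
    """Sketch of the 2-point estimator (12) used for beta = 2.

    Up to the O(h^2) smoothing error, this is an unbiased estimate of
    the gradient of the surrogate f_hat(x) = E[F(x + h * zeta_tilde)],
    with zeta_tilde uniform on the unit ball (Section 7). It costs about
    d calls to the scalar generator per step, against one for Algorithm 2.
    """
    d = x.shape[0]
    zeta = rng.standard_normal(d)
    zeta /= np.linalg.norm(zeta)      # uniform direction on the sphere
    y_plus = F(x + h * zeta)          # noisy query
    y_minus = F(x - h * zeta)         # noisy query
    return d / (2.0 * h) * (y_plus - y_minus) * zeta
```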
Combining Theorem 7 with the obvious bound $\mathbb{E}[f(\hat{x}(T)) - f(x^*)] \le GK$ we obtain

$$\mathbb{E}\big[f(\hat{x}(T)) - f(x^*)\big] \le \frac{B}{1 - \rho}\,\min\Big(1,\ \frac{d}{\alpha\sqrt{T}}\Big), \tag{13}$$

where $B > 0$ is a constant independent of $T, d, \alpha, n, \rho$. By comparing this upper bound with the minimax lower bound (11) for $\beta = 2$, one can note that (13) is optimal with respect to the parameters $T$ and $d$ when $\alpha \ge 1$.

## 8 Discussion

We expand here our discussion of previous related work, comparing our results to the state of the art in the distributed and undistributed zero-order optimization settings, and highlight a few key open problems.

**Comparison to Zero-Order Distributed Settings** Distributed optimization with noisy function queries was considered in detail in Sahu et al. [2018b,a], where the setting differs from ours in some key aspects: the updates are obtained not as in Step 2 of Algorithm 1 but rather via decentralized techniques, the matrix $W$ is random, the noise is zero-mean random rather than adversarial, and a 2-point gradient estimator is used. The papers Sahu et al. [2018b,a] provide, for $\beta = 2$ and $\beta = 3$, bounds on $\mathbb{E}[\|x_i(T) - x^*\|^2]$ of order at least $\frac{n^{3/2}}{(1-\rho)^2}\,T^{-1/2}$ and $\frac{n^{3/2}}{(1-\rho)^2}\,T^{-2/3}$, respectively, as functions of $n$, $\rho$ and $T$. Their bounds contain uncontrolled terms of the form $\mathbb{E}[\|x_i(k_0) - x^*\|^2]$ for some large enough $k_0 = k_0(n, \alpha, d)$, leaving the resulting rate unclear. The paper Hajinezhad et al. [2019] considers 2-point methods with stochastic queries but assumes that the noise is the same for both queries and deals with non-convex optimization. Noise-free zero-order distributed optimization is studied by Qu and Li [2018], Yu et al. [2019], Tang et al. [2019]. Of these, Tang et al. [2019] is the closest to our work, as it builds on updates as at Step 2 of Algorithm 1 (though without projections). The bounds obtained therein are of order $(1-\rho)^{-2}$, considered as functions of $\rho$, although they hold for the larger class of gradient dominant functions. As noted in Remark 1, the bound of Theorem 4 scales only as $(1-\rho)^{-1}$, and this bound holds true, in particular, in the noise-free setting, which is the special case corresponding to $\sigma = 0$. Since the most common values of $1-\rho$ are of the order $n^{-2}$ (or $n^{-1}$), this represents a substantial gain. Moreover, Theorem 4 covers a difficult noise setting, as we deal with adversarial noise. It is also worthwhile to note that first-order distributed optimization exhibits a much better dependency on $\rho$, since bounds that scale as $(1-\rho)^{-1/2}$ can be achieved; see Duchi et al. [2015], Scaman et al. [2019]. Some of the references mentioned above consider unconstrained optimization, while our results deal with constrained optimization. Note that the only difference in the proofs of the upper bounds between the constrained and unconstrained cases is the presence of an additional term proportional to the second moment of the gradient at the update (see Lemma 2.4 in Akhavan et al. [2020] for a similar argument). Since this additional term is bounded independently of $\rho$, the overall dependency on $\rho$ remains the same.
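The dependencies on the spectral gap quoted above can be summarized as follows (our tabulation of the rates discussed in this section):

| Setting | Noise | Dependency on $1-\rho$ |
|---|---|---|
| Zero-order: Qu and Li [2018], Yu et al. [2019], Tang et al. [2019] | noise-free | $(1-\rho)^{-2}$ |
| Zero-order: this paper (Theorem 4, Theorem 7) | adversarial | $(1-\rho)^{-1}$ |
| First-order: Duchi et al. [2015], Scaman et al. [2019] | — | $(1-\rho)^{-1/2}$ |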
**Computational and Statistical Advantage of the Proposed Gradient Estimator** As highlighted in Section 6, the gradient estimator in Algorithm 2 requires $2d$ function queries. At first sight this seems problematic when the dimension $d$ is high, as it needs at least $T = 2d$ queries. However, the lower bounds in Shamir [2013], Akhavan et al. [2020] reported in (11) above indicate that no estimator can achieve a nontrivial convergence rate for zero-order optimization when $T \le d^{\frac{\beta}{\beta-1}}$. Thus, having a total budget of $T \ge d$ queries is a necessary condition for the success of any zero-order stochastic optimization method. Algorithms with one or two queries per step can, of course, be realized for $T \le d$, but in this case they do not enjoy any nontrivial error behavior. Moreover, by Remark 3, with the same total budget of queries $T$, the gradient estimator of Algorithm 2 is computationally more efficient$^2$ than the estimators in Polyak and Tsybakov [1990], Bach and Perchet [2016], Akhavan et al. [2020]. Indeed, with the same budget of queries, it needs a factor of $d^2$ fewer calls to the random variable generator than the gradient estimators in Bach and Perchet [2016], Akhavan et al. [2020]. At the same time, as detailed in Remark 2, the proposed gradient estimator yields better rates for the optimization error. We conclude that the proposed zero-order optimization procedure provides both a computational and a statistical improvement over the state-of-the-art methods in Akhavan et al. [2020].

$^2$One may object that the computational bottleneck in zero-order optimization is the function evaluation; however, such costs are external to the optimization procedure: for example, they may be incurred by black-box software running on external machines or devices. Thus such costs should not be taken into account in evaluating the procedure itself. In this sense our computational speedup is important for high-dimensional settings.

**Limitations and Future Work** A main problem, which remains open, is to study whether the dependency $(1-\rho)^{-1}$ in the upper bounds of Corollary 6 and Theorem 7 is minimax optimal. Moreover, in the standard (undistributed) setting it remains an open problem to design a zero-order optimization procedure that meets the minimax lower bound (11) with respect to all problem parameters ($T$, $d$, $\beta$ and $\alpha$). Further directions of research include the analysis of distributed zero-order algorithms for larger classes of functions, such as $\alpha$-gradient dominant ones, as well as the extension of our results to stochastic updates or asynchronous activation schemes.

**Funding Disclosure** A. Akhavan and M. Pontil were partially supported by SAP SE. The research of A. B. Tsybakov is supported by a grant of the French National Research Agency (ANR), Investissements d'Avenir (LabEx Ecodec/ANR-11-LABX-0047).

## References

A. Agarwal, O. Dekel, and L. Xiao. Optimal algorithms for online convex optimization with multi-point bandit feedback. In Proc. 23rd International Conference on Learning Theory, pages 28–40, 2010.

A. Akhavan, M. Pontil, and A. B. Tsybakov. Exploiting higher order smoothness in derivative-free optimization and continuous bandits. In Advances in Neural Information Processing Systems 33, 2020.

F. Bach and V. Perchet. Highly-smooth zero-th order online optimization. In Proc. 29th Annual Conference on Learning Theory, pages 1–27, 2016.

A. Belloni, T. Liang, H. Narayanan, and A. Rakhlin. Escaping the local minima via simulated annealing: Optimization of approximately convex functions. In Proc. 28th Annual Conference on Learning Theory, pages 240–265, 2015.

S. Boyd, A. Ghosh, B. Prabhakar, and D. Shah. Randomized gossip algorithms. IEEE Transactions on Information Theory, 52(6):2508–2530, 2006.

S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3:1–122, 2011.
L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, New York, 1996.

J. C. Duchi, A. Agarwal, and M. J. Wainwright. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592–606, 2012.

J. C. Duchi, M. I. Jordan, M. J. Wainwright, and A. Wibisono. Optimal rates for zero-order convex optimization: The power of two function evaluations. IEEE Transactions on Information Theory, 61(5):2788–2806, 2015.

A. D. Flaxman, A. T. Kalai, and H. B. McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In Proc. 16th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 385–394, 2005.

D. Hajinezhad, M. Hong, and A. Garcia. Zeroth order nonconvex multi-agent optimization over networks. IEEE Transactions on Automatic Control, 64(10):3995–4010, 2019.

D. Jakovetić. A unification and generalization of exact distributed first-order methods. IEEE Transactions on Signal and Information Processing over Networks, 5(1):31–46, 2019.

D. Jakovetić, J. Xavier, and J. M. F. Moura. Fast distributed gradient methods. IEEE Transactions on Automatic Control, 59(5):1131–1146, 2014.

S. Kia, J. Cortés, and S. Martínez. Distributed convex optimization via continuous-time coordination algorithms with discrete-time communication. Automatica, 55:254–264, 2015.

T. Kraska, A. Talwalkar, J. C. Duchi, R. Griffith, M. Franklin, and M. I. Jordan. MLbase: A distributed machine-learning system. In CIDR, 2013.

L. Liu, C. Luo, and F. Shen. Multi-agent formation control with target tracking and navigation. In 2017 IEEE International Conference on Information and Automation (ICIA), pages 98–103, 2017. doi: 10.1109/ICInfA.2017.8078889.

I. Lobel, A. Ozdaglar, and D. Feijer. Distributed multi-agent optimization with state-dependent communication. Mathematical Programming, 129:255–284, 2011.

A. Nedic and A. Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.

A. Nedic, A. Ozdaglar, and P. Parrilo. Constrained consensus and optimization in multi-agent networks. IEEE Transactions on Automatic Control, 55:922–938, 2010.

A. S. Nemirovsky and D. B. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley & Sons, 1983.

V. Novitskii and A. Gasnikov. Improved exploiting higher order smoothness in derivative-free optimization and continuous bandit. arXiv preprint arXiv:2101.03821, 2021.

A. Olshevsky. Linear time average consensus on fixed graphs and implications for decentralized optimization and multi-agent control. arXiv preprint, 2014.

A. Olshevsky and J. Tsitsiklis. Convergence speed in distributed consensus and control. SIAM Journal on Control and Optimization, 48(1):33–55, 2009.

J. Park, S. Samarakoon, A. Elgabli, J. Kim, M. Bennis, S.-L. Kim, and M. Debbah. Communication-efficient and distributed learning over wireless networks: Principles and applications, 2020.
B. T. Polyak and A. B. Tsybakov. Optimal order of accuracy of search algorithms in stochastic optimization. Problems of Information Transmission, 26(2):45–53, 1990.

S. Pu, W. Shi, J. Xu, and A. Nedić. Push-pull gradient methods for distributed optimization in networks. IEEE Transactions on Automatic Control, 66(1):1–16, 2021.

G. Qu and N. Li. Harnessing smoothness to accelerate distributed optimization. IEEE Transactions on Control of Network Systems, 5(5):1245–1260, 2018.

A. Sahu, D. Jakovetic, D. Bajovic, and S. Kar. Communication-efficient distributed strongly convex stochastic optimization: Non-asymptotic rates. arXiv preprint arXiv:1809.02920, 2018a.

A. Sahu, D. Jakovetic, D. Bajovic, and S. Kar. Distributed zeroth order optimization over random networks: A Kiefer-Wolfowitz stochastic approximation approach. pages 4951–4958, 2018b.

M. O. Sayin, N. D. Vanli, S. S. Kozat, and T. Başar. Stochastic subgradient algorithms for strongly convex optimization over distributed networks. IEEE Transactions on Network Science and Engineering, 4(4):248–260, 2017.

K. Scaman, F. Bach, S. Bubeck, Y. T. Lee, and L. Massoulié. Optimal convergence rates for convex distributed optimization in networks. Journal of Machine Learning Research, 20:1–31, 2019.

O. Shamir. On the complexity of bandit and derivative-free stochastic convex optimization. In Proc. 30th Annual Conference on Learning Theory, pages 1–22, 2013.

O. Shamir. An optimal algorithm for bandit and zero-order convex optimization with two-point feedback. Journal of Machine Learning Research, 18(1):1703–1713, 2017.

W. Shi, G. Wu, and W. Yin. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2014.

Y. Tang, J. Zhang, and N. Li. Distributed zero-order algorithms for nonconvex multi-agent optimization. arXiv preprint arXiv:1908.11444v3, 2019.

J. Tsitsiklis, D. Bertsekas, and M. Athans. Distributed asynchronous deterministic and stochastic gradient optimization algorithms. IEEE Transactions on Automatic Control, 31(9):803–812, 1986.

Z. Yu, D. W. C. Ho, and D. Yuan. Distributed randomized gradient-free mirror descent algorithm for constrained optimization. arXiv preprint arXiv:1903.04157, 2019.