# Distributed Saddle-Point Problems Under Similarity

Aleksandr Beznosikov (MIPT, HSE University and Yandex, Russia) anbeznosikov@gmail.com
Gesualdo Scutari (Purdue University, USA) gscutari@purdue.edu
Alexander Rogozin (MIPT and HSE University, Russia) aleksandr.rogozin@phystech.edu
Alexander Gasnikov (MIPT, HSE University and ISP RAS, Russia) gasnikov@yandex.ru

We study solution methods for (strongly-)convex-(strongly-)concave Saddle-Point Problems (SPPs) over networks of two types: master/workers (thus centralized) architectures and mesh (thus decentralized) networks. The local functions at each node are assumed to be similar, due to statistical data similarity or otherwise. We establish lower complexity bounds for a fairly general class of algorithms solving the SPP. We show that a given suboptimality ε > 0 is achieved over master/workers networks in Ω(Δ · (δ/µ) · log(1/ε)) rounds of communications, where δ > 0 measures the degree of similarity of the local functions, µ is their strong convexity constant, and Δ is the diameter of the network. The lower communication complexity bound over mesh networks reads Ω((1/√ρ) · (δ/µ) · log(1/ε)), where ρ is the (normalized) eigengap of the gossip matrix used for the communication between neighbouring nodes. We then propose algorithms matching the lower bounds over either type of network (up to log-factors). We assess the effectiveness of the proposed algorithms on a robust regression problem.

1 Introduction

We study smooth (strongly-)convex-(strongly-)concave SPPs over a network of M agents:

min_{x∈X} max_{y∈Y} f(x, y) := (1/M) Σ_{m=1}^M f_m(x, y),   (P)

where X, Y ⊆ R^d are convex and compact sets common to all the agents, and f_m(x, y) is the loss function of agent m, known only to that agent. Problem (P) has found a wide range of applications, including game theory [42, 10], image deconvolution problems [7], adversarial training [3, 12], and statistical learning [1]; see Sec.
2 for some motivating examples in the distributed setting. We are particularly interested in learning problems, where each f_m is the empirical risk that measures the mismatch between the model to be learned and the local dataset owned by agent m. Since the functions f_m can be accessed only locally and routing local data to other agents is infeasible or highly inefficient, solving (P) calls for the design of distributed algorithms that alternate between a local computation procedure at each agent's side and a round of communication among (suitably chosen) neighboring nodes. We address such a design considering explicitly two types of computational architectures, namely: (i) master/workers networks: these are centralized systems suitable for parallel computing; for instance, they are the typical computational architecture arising from federated learning applications (e.g., [17]), where data are split across multiple workers and computations are performed in parallel, coordinated by the master node(s); and (ii) mesh networks: these are distributed systems with no special topology (modeled just as undirected graphs), which capture scenarios wherein there is no hierarchical structure (e.g., master nodes) and each node can communicate only with its immediate neighbors.

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

Function similarity: Motivated in particular by machine learning applications, our design and analysis pertain to distributed algorithms for SPPs (P) where the local functions f_m are related: quantities such as the gradients and the second-derivative matrices of the f_m's differ only by a finite quantity δ > 0; we will term such SPPs δ-related SPPs. For instance, this is the typical situation in the aforementioned distributed empirical risk minimization setting [2, 14, 47]: when data are i.i.d.
among machines, the f_m's reflect statistical similarities in the data residing at different nodes, resulting in δ = Õ(1/√n), where n is the local sample size (Õ hides log-factors and the dependence on d). While SPPs have been extensively studied in the centralized setting (e.g., [10, 29, 18, 30, 5]) and more recently over mesh networks [23, 27, 22, 26, 36, 4, 6], we are not aware of any analysis or (distributed) algorithm, whether lower complexity bounds or upper bounds, that explicitly exploits function similarity to boost communication efficiency. On the other hand, recent works on sum-utility minimization problems over networks (e.g., [2, 38, 35, 45, 43, 11, 47, 14, 39, 20]) show that employing some form of statistical preconditioning in the algorithm design provably reduces communication complexity. Whether such improvements are possible/achievable for δ-related SPPs in the form (P) remains unclear. This paper provides a positive answer to this open problem.

Major contributions: Our major results are summarized next. (a) Lower complexity bounds: Under mild structural assumptions on the algorithmic oracle (satisfied by a variety of methods), we establish lower complexity bounds for the δ-related SPP (P) with µ-strongly-convex-strongly-concave, L-smooth (twice-differentiable) local functions: an ε-precision on the optimality gap over master/workers systems is achieved in Ω(Δ · (δ/µ) · log(1/ε)) communication steps, where Δ is the diameter of the network. The lower complexity bound over mesh networks reads Ω((1/√ρ) · (δ/µ) · log(1/ε)) rounds of communications, where ρ is the (normalized) eigengap of the gossip matrix used for the communication between neighbouring nodes. These new lower bounds show a more favorable dependence on the optimization parameters (via δ/µ) than that of distributed oracles for SPPs ignoring function similarity [5, 36], whose communication complexity, e.g., over mesh networks, reads Ω((1/√ρ) · (L/µ) · log(1/ε)). The latter provides a pessimistic prediction when δ/µ ≪ L/µ.
This is the typical situation in ill-conditioned problems, such as many learning problems where the regularization parameter that is optimal for test predictive performance is so small that a scaling with L/µ is no longer practical while a scaling with δ/µ is (see, e.g., [25, 14]). (b) Near-optimal algorithms: We propose algorithms for such SPPs over master/workers and mesh networks that match the lower bounds up to logarithmic factors. They are provably faster than existing solution methods for µ-strongly-convex-strongly-concave, L-smooth SPPs, which do not exploit function similarity. Preliminary numerical results on distributed robust logistic regression support our theoretical findings.

1.1 Related works

Methods for SPPs ignoring function similarity: (Strongly-)convex-(strongly-)concave SPPs have been extensively studied in the optimization literature, also as special instances of (strongly) monotone Variational Inequalities (VIs) [10, 16]. Several algorithms are available in the centralized setting, some directly imported from the VI literature; representative examples include the mirror-proximal algorithm [29], the Extragradient method [18], and the scheme in [30]; they are readily implementable on master/workers architectures as well. For SPPs with a µ-strongly-convex-strongly-concave, L-smooth loss, all these schemes achieve an iteration complexity of O((L/µ) log(1/ε)), which has been shown to be optimal for first-order methods solving such a class of SPPs [46, 34]. Lower bounds and optimal algorithms in the distributed setting for SPPs without similarity have been studied in [5]. Note that none of the above lower (and upper) complexity bounds or (centralized or distributed) algorithmic designs capture function similarity. As a consequence, the convergence rates certified in the aforementioned works, when applicable to δ-related SPPs in the form (P), provide quite pessimistic predictions in the setting 1 + δ/µ ≪ L/µ.
Methods for sum-utility minimization exploiting function similarity: Several works exploit the idea of statistical preconditioning to provably improve the communication complexity of solution methods for the minimization of the sum of δ-related, µ-strongly convex and L-smooth functions over master/workers networks. Lower complexity bounds are established in [2], and read Ω(√(δ/µ) · log(1/ε)), which contrasts with the O(√(L/µ) · log(1/ε)) achievable by first-order (Nesterov) accelerated methods [31], thus certifying faster rates whenever δ/µ < L/µ. Solution methods exploiting function similarity are mirror-proximal-like schemes, and include [38, 35, 45] (for quadratic losses), [47] (for self-concordant losses), [43], and [11] (for composite optimization), with [14] employing acceleration. None of these methods is implementable over mesh networks, because they rely on a centralized (master) node. To our knowledge, Network-DANE [20] and SONATA [39] are the only two methods that leverage statistical similarity to enhance convergence of distributed methods over mesh networks; [20] studies strongly convex quadratic losses while [39] considers general objectives, achieving a communication complexity of Õ((1/√ρ) · (δ/µ) · log(1/ε)), where Õ hides logarithmic factors. None of the methods above, however, is applicable to the δ-related SPP (P).

1.2 Notation

Given a positive integer M, we define [M] = {1, . . . , M}. We use ⟨x, y⟩ := Σ_{i=1}^d x_i y_i to denote the standard inner product of x, y ∈ R^d. It induces the ℓ2-norm in R^d as ‖x‖ := √⟨x, x⟩. We also introduce proj_Z(z) = argmin_{u∈Z} ‖u − z‖, the Euclidean projection onto Z. We order the eigenvalues of any symmetric matrix A ∈ R^{m×m} in nonincreasing fashion, i.e., λ_max(A) = λ₁(A) ≥ . . . ≥ λ_m(A) = λ_min(A), with λ_max(·) [resp. λ_min(·)] denoting the largest (resp. smallest) eigenvalue.

2 Setup and Background

Problem setting: We begin introducing the main assumptions underlying Problem (P) and some useful notation.
Let us stack the x- and y-variables in the tuple z = (x, y); accordingly, define Z = X × Y and the vector-functions F_m, F : Z → R^{2d}:

F_m(z) := ( ∇_x f_m(x, y), −∇_y f_m(x, y) ),  and  F(z) := (1/M) Σ_{m=1}^M F_m(z).   (1)

The following conditions are standard for strongly convex-strongly concave SPPs.

Assumption 1 Given (P), the following hold: (i) ∅ ≠ Z is a convex set; (ii) each f_m : R^{2d} → R is twice differentiable on (an open set containing) Z, with L-Lipschitz gradient: ‖F_m(z₁) − F_m(z₂)‖ ≤ L‖z₁ − z₂‖, for all z₁, z₂ ∈ Z; (iii) f(z) is µ-strongly convex-strongly concave on Z, i.e., ⟨F(z₁) − F(z₂), z₁ − z₂⟩ ≥ µ‖z₁ − z₂‖², for all z₁, z₂ ∈ Z; (iv) each f_m(z) is convex-concave on Z, i.e., 0-strongly convex-strongly concave.

We are interested in finding the solution z* = (x*, y*) of Problem (P) under function similarity.

Assumption 2 (δ-related f_m's) The local functions are δ-related: for all (x, y) ∈ Z,

‖∇²_{xx} f_m(x, y) − ∇²_{xx} f(x, y)‖ ≤ δ,  ‖∇²_{xy} f_m(x, y) − ∇²_{xy} f(x, y)‖ ≤ δ,  ‖∇²_{yy} f_m(x, y) − ∇²_{yy} f(x, y)‖ ≤ δ.

The interesting case is when 1 + δ/µ ≪ L/µ. When the f_m's are empirical loss functions over local datasets of size n, under standard assumptions on data distributions and learning model (e.g., [47, 14]), δ = Õ(1/√n) with high probability (Õ hides log-factors and the dependence on d); some motivating examples falling in this category are discussed in Sec. 2.1 below. While such examples represent important applications, we point out that our (lower and upper) complexity bounds are valid in all scenarios wherein Assumption 2 holds, not necessarily due to statistical arguments.

Network setting: The communication network is modeled as a fixed, connected, undirected graph G ≜ (V, E), where V ≜ {1, . . . , M} denotes the vertex set (the set of agents) while E ≜ {(i, j) | i, j ∈ V} represents the set of edges (the communication links); (i, j) ∈ E iff there exists a communication link between agents i and j. We denote by Δ the diameter of the graph. When it comes to distributed algorithms over mesh networks, we leverage neighbouring communications among adjoining nodes.
Communications of d-dimensional vectors are modeled as multiplication by a matrix W (a.k.a. the gossip matrix). The following assumptions on W are standard to establish convergence of distributed algorithms over mesh networks.

Assumption 3 The matrix W ∈ R^{M×M} satisfies the following: (a) it is compliant with G, that is, (i) w_ii > 0, for all i ∈ [M]; (ii) w_ij > 0 if {j, i} ∈ E; and (iii) w_ij = 0 otherwise; (b) it is symmetric and stochastic, that is, W1 = 1 (and thus also 1ᵀW = 1ᵀ).

Notice that a direct consequence of Assumption 3 (along with the fact that G is connected) is that

ρ ≜ 1 − max{ λ₂(W), |λ_min(W)| } > 0,   (2)

where ρ is the (normalized) eigengap between the first and second largest (in magnitude) eigenvalues of W. Roughly speaking, ρ measures how fast the network mixes information (the larger, the faster).

2.1 Motivating examples

Several problems of interest can be cast in the SPP (P), for which function similarity arises naturally; some are briefly discussed next.

Robust Regression: Consider the robust instance of the linear regression problem in its Lagrangian form:

min_w max_r (1/N) Σ_{i=1}^N ( wᵀ(x_i + r) − y_i )² + (λ/2)‖w‖² − (β/2)‖r‖²,   (3)

where w are the weights of the model, {(x_i, y_i)}_{i=1}^N are the pairs of training data, r models the noise, and λ and β are the regularization parameters. Let n be the local sample size (thus N = nm, with m the number of machines). The typical regularization parameter that is optimal for test predictive performance is λ = O(1/√N). Assuming β of the same order as λ and invoking function similarity δ = Õ(1/√n) [25, 14] yields a condition number of the problem κ = L/µ = O(√(mn)), while δ/µ = O(√m). This implies that first-order methods applied to (3) slow down as the local sample size n grows; a rate scaling with δ/µ would instead be independent of the local sample size.

Adversarial robustness of neural networks: Recent works have demonstrated that deep neural networks are vulnerable to adversarial examples: inputs that are almost indistinguishable from natural data and yet classified incorrectly by the network [40, 13].
To improve resistance to a variety of adversarial inputs, a widely studied approach is to leverage robust optimization and formulate the training as a saddle-point problem [24, 32]:

min_w max_r (1/N) Σ_{i=1}^N ℓ( f(w, x_i + r), y_i ) + (λ/2)‖w‖² − (β/2)‖r‖²,

where w are the weights of the model, {(x_i, y_i)}_{i=1}^N are the pairs of training data, r is the so-called adversarial noise, which models a perturbation in the data, and λ and β are the regularizers.

Other optimization problems: a) Other instances of the SPP are the (online) transport and Wasserstein Barycenter (WB) problems, see [15, 9]; this representation comes from the dual view of the transportation polytope. b) Another example is Lagrangian-based optimization problems. For instance, consider the minimization of the sum of loss functions, each one associated with one agent, subject to some (common) constraints. The problem can be equivalently rewritten as a saddle-point problem using Lagrangian multipliers. It is easy to check that if the agents' functions are δ-related, then so is the resulting saddle-point problem.

3 Lower Complexity Bounds

In this section we establish lower complexity bounds for centralized (i.e., master/workers-based) and distributed (gossip-based) algorithms. We begin by introducing the black-box procedure describing the class of algorithms these lower bounds pertain to.

3.1 Optimization/communication oracle

Our procedure models a fairly general class of (centralized and distributed) algorithms over graphs, whereby nodes perform local computation and communication tasks. Computations at each node are based on linear operations involving current or past iterates, gradients, and vector products with local Hessians and their inverses, as well as solving local optimization problems involving such quantities. During communications, the nodes can share (compatibly with the graph topology) any of the vectors they have computed up until that time. The black-box procedure can be formally described as follows.
Definition 1 (Oracle) Each agent m has its own local memories M_m^x and M_m^y for the x- and y-variables, respectively, with initialization M_m^x = M_m^y = {0}. M_m^x and M_m^y are updated as follows.

Local computation: Between communication rounds, each agent m computes and adds to its M_m^x and M_m^y a finite number of points x, y, each satisfying

αx + β∇_x f_m(x, y) ∈ span{ x′, ∇_x f_m(x′, y′), (∇²_{xx} f_m(x″, y″) + D)x′, (∇²_{xx} f_m(x″, y″) + D)∇_x f_m(x′, y′), (∇²_{xx} f_m(x″, y″) + D)⁻¹x′, (∇²_{xx} f_m(x″, y″) + D)⁻¹∇_x f_m(x′, y′), ∇²_{xy} f_m(x″, y″) y′, ∇²_{xy} f_m(x″, y″) ∇_y f_m(x′, y′) },

γy + φ∇_y f_m(x, y) ∈ span{ y′, ∇_y f_m(x′, y′), (∇²_{yy} f_m(x″, y″) + D)y′, (∇²_{yy} f_m(x″, y″) + D)∇_y f_m(x′, y′), (∇²_{yy} f_m(x″, y″) + D)⁻¹y′, (∇²_{yy} f_m(x″, y″) + D)⁻¹∇_y f_m(x′, y′), (∇²_{xy} f_m(x″, y″))ᵀ x′, (∇²_{xy} f_m(x″, y″))ᵀ ∇_x f_m(x′, y′) }   (4)

for given x′, x″ ∈ M_m^x and y′, y″ ∈ M_m^y; some α, β, γ, φ ≥ 0 such that α + β > 0 and γ + φ > 0; and some diagonal matrix D (such that all the inverse matrices exist).

Communication: During communication rounds among neighbouring nodes, the memories M_m^x and M_m^y are updated according to

M_m^x := span{ ∪_{(i,m)∈E} M_i^x },  M_m^y := span{ ∪_{(i,m)∈E} M_i^y }.   (5)

Output: The final global output is calculated as

x^K ∈ span{ ∪_{m=1}^M M_m^x },  y^K ∈ span{ ∪_{m=1}^M M_m^y }.

The above oracle captures a gamut of existing centralized and distributed algorithms. For instance, local computations model either inexact local solutions, e.g., based on single/multiple steps of gradient or Newton-like updates, which correspond to setting α = γ = 1 and β = φ = 0, or exact solutions of the agents' subproblems (via some subroutine algorithm), corresponding to α = γ = 0 and β = φ = 1. Multiple rounds of computations (resp. communications) can be performed between communication rounds (resp. computation tasks). Notice that the proposed oracle builds on [37, 2] for minimization problems over networks: the former modeling only gradient updates and the latter considering only centralized optimization (master/workers systems).
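To make the span condition (4) concrete, here is a minimal numerical sketch (our own toy quadratic local loss and hypothetical dimensions, not from the paper) checking that a plain gradient step, i.e., α = 1 and β = 0 in (4), lands in the allowed subspace span{x′, ∇_x f_m(x′, y′)}:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
A = rng.standard_normal((d, d))
A = A @ A.T + np.eye(d)              # strongly convex x-block of a toy quadratic
B = rng.standard_normal((d, d))      # coupling block

def grad_x(x, y):
    # gradient in x of f_m(x, y) = 0.5 x'Ax + x'By (y-only terms omitted)
    return A @ x + B @ y

x_prev = rng.standard_normal(d)      # a point from the memory M_m^x
y_prev = rng.standard_normal(d)
gamma = 0.1
x_new = x_prev - gamma * grad_x(x_prev, y_prev)   # one gradient step

# express x_new in the basis {x', grad_x f_m(x', y')} allowed by (4)
basis = np.stack([x_prev, grad_x(x_prev, y_prev)], axis=1)   # d x 2
coef, *_ = np.linalg.lstsq(basis, x_new, rcond=None)
residual = np.linalg.norm(basis @ coef - x_new)
```

The least-squares fit recovers the combination exactly (coefficients 1 and −γ, residual at machine precision), confirming that the gradient step is an instance of the oracle.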
3.2 Lower complexity bounds

We are now in the position to state our main results on lower communication complexity: Theorem 1 pertains to algorithms over master/workers systems, while Theorem 2 deals with mesh networks.

Theorem 1 For any L, µ, δ > 0 and any connected graph G with diameter Δ > 0, there exists a SPP in the form (P) (satisfying Assumption 1) with Z = R^{2d} (where d is sufficiently large), x* ≠ 0, y* ≠ 0, and local functions f_m being L-smooth, µ-strongly-convex-strongly-concave and δ-related (Assumption 2), such that any centralized algorithm satisfying Definition 1 produces the following estimate on the global output z^K = (x^K, y^K) after K communication rounds:

‖z^K − z*‖² = Ω( exp( −c (µ/δ) K/Δ ) ‖y*‖² ),  for some absolute constant c > 0.

Corollary 1 In the setting of Theorem 1, the number of communication rounds required to obtain an ε-solution is lower bounded by

Ω( Δ · (δ/µ) · log(1/ε) ).   (6)

Theorem 2 For any L, µ, δ > 0 and ρ ∈ (0, 1], there exists a SPP in the form (P) (satisfying Assumption 1) with Z = R^{2d} (where d is sufficiently large), x* ≠ 0, y* ≠ 0, and local functions f_m being L-smooth, µ-strongly-convex-strongly-concave and δ-related (Assumption 2), and a gossip matrix W over the connected graph G, satisfying Assumption 3 and with eigengap ρ, such that any decentralized algorithm satisfying Definition 1 and using the gossip matrix W in the communication steps (5) produces the following estimate on the global output z^K = (x^K, y^K) after K communication rounds:

‖z^K − z*‖² = Ω( exp( −c (µ/δ) √ρ K ) ‖y*‖² ),  for some absolute constant c > 0.

Corollary 2 In the setting of Theorem 2, the number of communication rounds required to obtain an ε-solution is lower bounded by

Ω( (1/√ρ) · (δ/µ) · log(1/ε) ).   (7)

These lower complexity bounds show the expected dependence on the optimization parameters and network quantities. Specifically, the number of communications scales proportionally to δ/µ; this generalizes existing lower bounds [5] that do not account for such similarity, resulting instead in the more pessimistic dependence on L/µ (typically δ ≪ L).
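To get a feel for the gap between the two scalings, here is a back-of-the-envelope comparison (hypothetical problem sizes; hidden constants and log-factors are ignored) using the robust-regression regime of Sec. 2.1, where L/µ = O(√(mn)) and δ/µ = O(√m):

```python
import math

m, n, eps = 25, 10_000, 1e-6   # machines, local samples, target accuracy

# communication rounds suggested by each bound (constants dropped)
rounds_no_similarity = math.sqrt(m * n) * math.log(1 / eps)  # ~ (L/mu) log(1/eps)
rounds_similarity = math.sqrt(m) * math.log(1 / eps)         # ~ (delta/mu) log(1/eps)
```

The similarity-aware count is smaller by exactly a factor √n here, i.e., the saving grows with the local sample size.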
The network impact is captured by the diameter Δ of the network for master/workers architectures (Δ communication steps are required in the worst case to transmit a message between two nodes) and by the eigengap ρ of the matrix W when arbitrary graph topologies are considered; 1/√ρ can be bounded as O(T), where T is the largest hitting time of the Markov chain with probability transition matrix W [33]. For instance, for fully connected networks Δ = 1/√ρ = 1, while for star networks the diameter is constant and 1/√ρ grows with M. For general graphs, 1/√ρ can be larger than Δ; see [28] for more details. To certify the tightness of the derived lower bounds, the next section designs algorithms that attain them.

4 Optimal algorithms

4.1 Centralized case (master/workers systems)

Our first optimal algorithm is for SPPs over master/workers architectures, or more generally networked systems where a spanning tree (with the root as master node) is preliminarily set; it is formally described in Algorithm 1. We assume w.l.o.g. that the master node owns the function f₁. Some insights on the genesis of this method are discussed next. Consider for a moment the minimization problem min_{x∈X} f(x) := (1/M) Σ_{m=1}^M f_m(x), under Assumption 2. Following [38], we can solve it invoking the mirror descent algorithm, which reads

x^{k+1} = argmin_{x∈X} { γ ⟨∇f(x^k), x⟩ + D_φ(x, x^k) },   (8)

where D_φ(x, y) = φ(x) − φ(y) − ⟨∇φ(y), x − y⟩ is the Bregman divergence associated with the function φ(x) = f₁(x) + (δ/2)‖x‖². It is shown that one can take stepsize γ = 1 ([48, 14]). Therefore, (8) can be rewritten as

x^{k+1} = argmin_{x∈X} { (1/δ) f₁(x) + (1/2) ‖ x − x^k + (1/δ)(∇f(x^k) − ∇f₁(x^k)) ‖² }.   (9)

Noting that in Algorithm 1 γ ≈ 1/δ (see Appendix B.1), one infers the connection between (9) and the updates in lines 3(i) and 3(ii). The extra step in line 3(iii) is due to the fact that Algorithm 1 solves a SPP (and not a classical minimization problem as postulated above): gradient-descent-like methods such as (8) are not optimal for SPPs; in fact, they might diverge when applied to general convex-concave SPPs.
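This divergence, and how a look-ahead (extragradient) correction fixes it, can be seen already on the simplest bilinear SPP f(x, y) = xy (a self-contained sketch; the toy problem and stepsize are our illustrative choices, not the paper's):

```python
import numpy as np

# For f(x, y) = x*y, the SPP operator is F(z) = (d_x f, -d_y f) = (y, -x) = J z
J = np.array([[0.0, 1.0], [-1.0, 0.0]])

def gda(z, gamma):
    # simultaneous gradient descent-ascent: z <- z - gamma * F(z)
    return z - gamma * (J @ z)

def extragradient(z, gamma):
    # look-ahead point, then update using the gradient at the look-ahead point
    z_half = z - gamma * (J @ z)
    return z - gamma * (J @ z_half)

gamma = 0.5
z_gda = np.array([1.0, 1.0])
z_eg = np.array([1.0, 1.0])
for _ in range(100):
    z_gda = gda(z_gda, gamma)
    z_eg = extragradient(z_eg, gamma)
```

Each GDA step multiplies the distance to the saddle point (0, 0) by √(1 + γ²) > 1, so the iterates spiral outward, while the extragradient iterates contract toward the saddle.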
Our approach is then to employ the Forward-Backward-Forward algorithm [41] or the Extragradient method [18], which leads to the step in line 3(iii). Another interpretation of the proposed algorithm comes from looking at Problem (P) as a composite minimization problem with objective function h₁(x, y) + h₂(x, y), where h₁(x, y) = f₁(x, y) and h₂(x, y) = (1/M) Σ_{m=1}^M (f_m(x, y) − f₁(x, y)). The first function h₁ is L-smooth and convex-concave, while h₂ is δ-smooth and, in general, non-convex-non-concave. Such problems can be solved invoking sliding techniques [19, 36].

Algorithm 1 (Star Min-Max Data Similarity Algorithm)
Parameters: stepsize γ, accuracy ē;
Initialization: choose z⁰ = (x⁰, y⁰) ∈ Z, z⁰_m = z⁰, for all m ∈ [M];
1: for k = 0, 1, 2, . . . do
2:   Each worker m computes F_m(z^k) and sends it to the master;
3:   The master node:
     (i) computes v^k = z^k − γ( F(z^k) − F₁(z^k) );
     (ii) finds u^k such that ‖u^k − û^k‖² ≤ ē, where û^k is the solution of

         min_{u_x∈X} max_{u_y∈Y} { γ f₁(u_x, u_y) + (1/2)‖u_x − v_x^k‖² − (1/2)‖u_y − v_y^k‖² };   (10)

     (iii) updates z^{k+1} = proj_Z( u^k + γ( F(z^k) − F₁(z^k) − F(u^k) + F₁(u^k) ) ) and broadcasts z^{k+1} to the workers;
4: end for

It is not difficult to check that Algorithm 1 is an instance of the oracle introduced in Definition 1. It accommodates either exact solutions of the strongly convex subproblems (10) (corresponding to ē = 0) or inexact ones (up to tolerance ē > 0); the latter can be computed, e.g., using the Extragradient method [16], which is optimal in this case. The communication complexity of the method is proved in the next theorem, which certifies that the proposed algorithm is optimal, i.e., achieves the lower bound (6) on the number of required communications; we refer to Appendix B.1 in the supplementary material for a detailed description of the algorithmic tuning, as well as a study of the computational complexity when the Extragradient method is employed to solve the subproblems (10) (up to a suitably chosen tolerance).

Theorem 3 Consider Problem (P) under Assumptions 1-2 over a connected graph G with a master node.
Let {z^k} be the sequence generated by Algorithm 1 with tuning as described in Appendix B.1 (cf. the supplementary material). Then, given ε > 0, the number of communication rounds needed for ‖z^k − z*‖² ≤ ε is Õ( (δ/µ) · log(1/ε) ).

4.2 Distributed case (mesh networks)

We now consider mesh networks. Because of the lack of a master node, each agent m owns local estimates u_m and v_m of the common variables u and v, respectively, which are iteratively updated. At each iteration, a node is selected uniformly at random to play the role of the master node, performing the update of its own local variables, followed by some rounds of communications via accelerated (inexact) gossip protocols [21, 44]; the latter are instrumental to propagate the updates of the u-, v-variables and the gradients across the network. The algorithm is formally introduced in Algorithm 2, with the accelerated gossip procedure described in Algorithm 3.

Algorithm 2 (Distributed Min-Max Data Similarity Algorithm)
Parameters: stepsize γ, accuracies ē, e₀, e₁, communication rounds H₀, H₁;
Initialization: choose z⁰ = (x⁰, y⁰) ∈ Z, z⁰_m = z⁰, for all m ∈ [M];
1: for k = 0, 1, 2, . . . do
2:   Communications: F̄^k_1, . . . , F̄^k_M = AccGossip( F₁(z^k_1), . . . , F_M(z^k_M); H₀ );
3:   Local computations: choose an index m_k ∈ [M] uniformly at random; then node m_k
     (i) computes v^k_{m_k} = z^k_{m_k} − γ( F̄^k_{m_k} − F_{m_k}(z^k_{m_k}) );
     (ii) finds u^k_{m_k} such that ‖u^k_{m_k} − û^k_{m_k}‖² ≤ ē, where û^k_{m_k} is the solution of

         min_{u_x∈X} max_{u_y∈Y} { γ f_{m_k}(u_x, u_y) + (1/2)‖u_x − v^k_{m_k,x}‖² − (1/2)‖u_y − v^k_{m_k,y}‖² };

4:   Communications: run accelerated gossip to propagate u^k_{m_k} and update the gradient variables:
         ū^k_1, . . . , ū^k_M = M · AccGossip( 0, . . . , 0, u^k_{m_k}, 0, . . . , 0; H₁ ),
         F̄^{k+1/2}_1, . . . , F̄^{k+1/2}_M = AccGossip( F₁(ū^k_1), . . . , F_M(ū^k_M); H₀ );
5:   Update of the z_{m_k}-variable: node m_k performs
         z^{k+1}_{m_k} = ū^k_{m_k} + γ( F̄^k_{m_k} − F_{m_k}(z^k_{m_k}) − F̄^{k+1/2}_{m_k} + F_{m_k}(ū^k_{m_k}) );
6:   Communications: run accelerated gossip to propagate z^{k+1}_{m_k}:
         ẑ^{k+1}_1, . . . , ẑ^{k+1}_M = M · AccGossip( 0, . . . , 0, z^{k+1}_{m_k}, 0, . . . , 0; H₁ );
7:   Each worker m updates z^{k+1}_m = proj_Z( ẑ^{k+1}_m );
8: end for

Algorithm 3 (AccGossip)
Input: z₁, . . . , z_M ∈ R^{2d} and H > 0 (communication rounds);
Initialization: construct the matrix Z with rows z₁ᵀ, . . . , z_Mᵀ; set Z^{−1} = Z, Z⁰ = Z, and η = (1 − √(1 − λ₂(W)²)) / (1 + √(1 − λ₂(W)²));
1: for t = 0, 1, 2, . . . , H do
2:   Z^{t+1} = (1 + η) W Z^t − η Z^{t−1};
3: end for
Output: rows of Z^{H+1}.

Convergence of the method is established in Theorem 4 below; we refer to Appendix B.2 in the supplementary material for a detailed description of the algorithmic tuning [choice of the stepsize γ, precision ē, numbers of communication rounds H₀, H₁, and the algorithm used to solve the subproblems, cf. (10)].

Theorem 4 Consider Problem (P) under Assumptions 1-2 over a connected graph G. Let {(z^k_m)_{m∈[M]}} be the sequence generated by Algorithm 2 with tuning as described in Appendix B.2 (cf. the supplementary material) and a gossip matrix W satisfying Assumption 3. Then, given ε > 0, the number of communication rounds for ‖z̄^k − z*‖² ≤ ε reads Õ( (1/√ρ) · (δ/µ) · log(1/ε) ), where z̄^k = (1/M) Σ_{m=1}^M z^k_m.

While the algorithm achieves the lower bound (7) up to log-factors (which now, however, depend on ε as well), there is room for improvement. In fact, selecting only one agent at a time to perform the updates does not fully exploit the potential computational speedup offered by the networking setting. Also, the use of gossip protocols to propagate the updates of a single agent across the entire network seems to be not quite efficient. Designing alternative distributed algorithms overcoming these limitations is a challenging open problem.

5 Numerical Results

We simulate the Robust Linear Regression problem, defined as

min_w max_{‖r‖ ≤ R_r} (1/N) Σ_{i=1}^N ( wᵀ(x_i + r) − y_i )² + (λ/2)‖w‖² − (β/2)‖r‖²,   (12)

where w are the model weights, {(x_i, y_i)}_{i=1}^N is the training dataset, and r is the artificially added noise; we use ℓ2-regularization on both w and r. We solve the problem over a master/workers topology; we consider a network with 25 workers. We test Algorithm 1, wherein the subproblems (10) at the master node are solved with high accuracy using the Extragradient method.
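For reference, the accelerated gossip subroutine (Algorithm 3) admits a compact implementation; below is a minimal NumPy sketch on a hypothetical ring network of M = 10 nodes (the ring topology and its weights are our illustrative choices, not the paper's setup). After H rounds every row is close to the network-wide average, which is what Algorithm 2 relies on for the F̄- and ū-variables:

```python
import numpy as np

def acc_gossip(Z, W, H):
    """Algorithm 3: Chebyshev-accelerated gossip averaging of the rows of Z."""
    vals = np.linalg.eigvalsh(W)                 # eigenvalues in ascending order
    lam2 = max(vals[-2], abs(vals[0]))           # second-largest magnitude
    s = np.sqrt(1.0 - lam2 ** 2)
    eta = (1.0 - s) / (1.0 + s)
    Z_prev, Z_cur = Z.copy(), Z.copy()           # Z^{-1} = Z^0 = Z
    for _ in range(H + 1):                       # t = 0, 1, ..., H
        Z_prev, Z_cur = Z_cur, (1.0 + eta) * W @ Z_cur - eta * Z_prev
    return Z_cur                                 # Z^{H+1}

# gossip matrix of a ring of M nodes (symmetric, doubly stochastic)
M = 10
W = np.zeros((M, M))
for i in range(M):
    W[i, i] = 0.5
    W[i, (i + 1) % M] = W[i, (i - 1) % M] = 0.25

rng = np.random.default_rng(1)
Z = rng.random((M, 2))                           # one 2-d vector per node
Z_out = acc_gossip(Z, W, H=40)
err = np.max(np.abs(Z_out - Z.mean(axis=0)))     # distance to the exact average
```

With the mixing parameter η set from λ₂(W) as in Algorithm 3, the error contracts roughly at rate √η per round instead of λ₂ for plain gossip, which is the source of the 1/√ρ (rather than 1/ρ) factor in the communication complexity.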
A description of the tuning of the algorithm parameters can be found in Appendix C. The algorithms are implemented in Python 3.7 (source code: https://github.com/alexrogozin12/data_sim_sp).

Figure 1: Centralized case, simulated data, 25 workers, ambient dimension = 40.
Figure 2: Decentralized case, Alg. 2 with different noise.
Figure 3: Centralized case, a9a dataset.

Our first experiment uses synthetic data, which allows us to control the factor δ measuring the statistical similarity of the functions at the different nodes. Specifically, all local datasets have size n = 100. The dataset {x̂_i, ŷ_i}_{i=1}^n at the master node is generated randomly, with each entry of x̂_i and ŷ_i, i = 1, . . . , n, drawn from the standard Gaussian distribution. The datasets at the workers' side, m = 2, . . . , M, are obtained by perturbing {x̂_i, ŷ_i}_{i=1}^n with random noise of controlled variance.

Figure 1 compares the performance of Algorithm 1 and the Centralized Extragradient method [5] applied to Problem (12), under different levels of noise added to the local datasets (levels of similarity) and two different problem and network dimensions; we plot the distance of the iterates from the solution versus the number of communications. It can be seen that Algorithm 1 consistently outperforms the Extragradient method in terms of the number of communications: the smaller the noise (the more similar the local functions), the larger the gap between the two algorithms (in favor of Algorithm 1). On the other hand, at high noise (amplitude 10.0) the performance of the Extragradient method and Algorithm 1 becomes comparable. In addition, Figure 2 compares the performance of Algorithm 2 under different noise levels over networks with different topologies.

Figure 4: Decentralized case, a9a dataset, grid graph.

Our second experiment uses real data, specifically LIBSVM datasets [8]. In this scenario, we do not add noise, but we can still control the data similarity by choosing the number of workers.
The larger the number of workers, the less similar the local functions (less data at each node). Figure 3 compares Algorithm 1 and the Extragradient method: we plot the distance of the iterates from the solution vs. the number of communications. Quite interestingly, Algorithm 1 compares favorably even when the number of workers becomes large. Figure 4 compares Algorithm 2 with the Decentralized Extragradient method (EGD) [5] and the Extragradient method with gradient tracking (EGD-GT) [27]. The simulations are carried out with parameters tuned according to the theoretical results in the corresponding papers.

6 Conclusion

We studied distributed SPPs over networks, under data similarity. Such problems arise naturally in many applications, including machine learning and signal processing. We first derived lower complexity bounds for such problems for solution methods implementable either on star networks or on general topologies (modeled as undirected, static graphs), and then proposed algorithms for both settings. These algorithms are optimal, in the sense that they achieve the lower bounds, up to log-factors. The implementation of the proposed method over general networks, however, is improvable: by selecting only one agent at a time to perform the updates, it does not fully exploit the potential computational speedup offered by the parallelism of the networking setting. Also, the use of gossip protocols to propagate the updates of a single agent across the entire network is not very efficient. Another interesting extension would be designing methods that take into account the asymmetry of the function f with respect to the variables x and y (for example, different strong-convexity constants µ_x and µ_y). Finally, it would be interesting to combine the proposed methods with stochastic/variance-reduction techniques to alleviate the cost of local gradient computations.

Acknowledgments and Disclosure of Funding

The research of A. Rogozin was supported by the Russian Science Foundation (project No. 21-71-30005). The work of G.
Scutari is supported by the Office of Naval Research, under Grant # N00014-21-1-2673. The paper was prepared within the framework of the HSE University Basic Research Program.

References

[1] S. S. Abadeh, P. M. Esfahani, and D. Kuhn. Distributionally robust logistic regression. In Advances in Neural Information Processing Systems (NeurIPS), pages 1576-1584, 2015.
[2] Yossi Arjevani and Ohad Shamir. Communication complexity of distributed convex learning and optimization. In Proc. of the 28th International Conference on Neural Information Processing Systems (NIPS), volume 1, pages 1756-1764, December 2015.
[3] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning (ICML), 70(1):214-223, 2017.
[4] Aleksandr Beznosikov, Pavel Dvurechensky, Anastasia Koloskova, Valentin Samokhin, Sebastian U. Stich, and Alexander Gasnikov. Decentralized local stochastic extra-gradient for variational inequalities. arXiv preprint arXiv:2106.08315, 2021.
[5] Aleksandr Beznosikov, Valentin Samokhin, and Alexander Gasnikov. Distributed saddle-point problems: Lower bounds, optimal algorithms and federated GANs. arXiv preprint arXiv:2010.13112, 2021.
[6] Aleksandr Beznosikov, Vadim Sushko, Abdurakhmon Sadiev, and Alexander Gasnikov. Decentralized personalized federated min-max problems. arXiv preprint arXiv:2106.07289, 2021.
[7] Antonin Chambolle and Thomas Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40(1):120-145, 2011.
[8] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):1-27, 2011.
[9] Darina Dvinskikh and Daniil Tiapkin. Improved complexity bounds in Wasserstein barycenter problem. In International Conference on Artificial Intelligence and Statistics, pages 1738-1746. PMLR, 2021.
[10] F.
Facchinei and J. S. Pang. Finite-Dimensional Variational Inequalities and Complementarity Problems. Springer Series in Operations Research and Financial Engineering. Springer New York, 2007.
[11] Jianqing Fan, Yongyi Guo, and Kaizheng Wang. Communication-efficient accurate statistical estimation. arXiv preprint arXiv:1906.04870, 2019.
[12] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems (NeurIPS), pages 2672–2680, 2014.
[13] Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
[14] Hadrien Hendrikx, Lin Xiao, Sebastien Bubeck, Francis Bach, and Laurent Massoulie. Statistically preconditioned accelerated gradient method for distributed optimization. In Proc. of the 37th International Conference on Machine Learning, volume 119, pages 4203–4227, 13–18 Jul 2020.
[15] Arun Jambulapati, Aaron Sidford, and Kevin Tian. A direct Õ(1/ε) iteration parallel algorithm for optimal transport. Advances in Neural Information Processing Systems, 32:11359–11370, 2019.
[16] Anatoli Juditsky, Arkadii S. Nemirovskii, and Claire Tauvel. Solving variational inequalities with stochastic mirror-prox algorithm, 2008.
[17] Peter Kairouz, H. Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Keith Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977, 2019.
[18] G. M. Korpelevich. The extragradient method for finding saddle points and other problems, 1976.
[19] Guanghui Lan. Gradient sliding for composite optimization. Mathematical Programming, 159(1):201–235, 2016.
[20] Boyue Li, Shicong Cen, Yuxin Chen, and Yuejie Chi. Communication-efficient distributed optimization in networks with gradient tracking and variance reduction.
Journal of Machine Learning Research, 21(180):1–51, Sept. 2020.
[21] Ji Liu and A. Stephen Morse. Accelerated linear iterations for distributed averaging. Annual Reviews in Control, 35(2):160–165, 2011.
[22] Mingrui Liu, Wei Zhang, Youssef Mroueh, Xiaodong Cui, Jerret Ross, Tianbao Yang, and Payel Das. A decentralized parallel algorithm for training generative adversarial nets. arXiv preprint arXiv:1910.12999, 2019.
[23] Weijie Liu, Aryan Mokhtari, Asuman Ozdaglar, Sarath Pattathil, Zebang Shen, and Nenggan Zheng. A decentralized proximal point-type method for saddle point problems. arXiv preprint arXiv:1910.14380, 2019.
[24] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
[25] Ulysse Marteau-Ferey, Francis Bach, and Alessandro Rudi. Globally convergent Newton methods for ill-conditioned generalized self-concordant losses. In Advances in Neural Information Processing Systems, pages 7636–7646, 2019.
[26] David Mateos-Núñez and Jorge Cortés. Distributed subgradient methods for saddle-point problems. In 2015 54th IEEE Conference on Decision and Control (CDC), pages 5462–5467, 2015.
[27] Soham Mukherjee and Mrityunjoy Chakraborty. A decentralized algorithm for large scale min-max problems. In 2020 59th IEEE Conference on Decision and Control (CDC), pages 2967–2972, 2020.
[28] A. Nedić, A. Olshevsky, and M. G. Rabbat. Network topology and communication-computation tradeoffs in decentralized optimization. Proceedings of the IEEE, 106:953–976, 2018.
[29] Arkadi Nemirovski. Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 15:229–251, 01 2004.
[30] Yuri Nesterov. Dual extrapolation and its applications to solving variational inequalities and related problems.
Mathematical Programming, 109(1-2):319–344, 2007.
[31] Yurii Nesterov. Lectures on Convex Optimization, volume 137. Springer, 2018.
[32] Maher Nouiehed, Maziar Sanjabi, Tianjian Huang, Jason D. Lee, and Meisam Razaviyayn. Solving a class of non-convex min-max games using iterative first order methods. arXiv preprint arXiv:1902.08297, 2019.
[33] A. Olshevsky. Linear time average consensus on fixed graphs. In 3rd IFAC Workshop Distrib. Estimation Control Netw. Syst., 2015.
[34] Yuyuan Ouyang and Yangyang Xu. Lower complexity bounds of first-order methods for convex-concave bilinear saddle-point problems. Mathematical Programming, pages 1–35, 2019.
[35] Sashank J. Reddi, Jakub Konečný, Peter Richtárik, Barnabás Póczós, and Alex Smola. AIDE: Fast and communication efficient distributed optimization. arXiv preprint arXiv:1608.06879, 2016.
[36] Alexander Rogozin, Alexander Beznosikov, Darina Dvinskikh, Dmitry Kovalev, Pavel Dvurechensky, and Alexander Gasnikov. Decentralized distributed optimization for saddle point problems. arXiv preprint arXiv:2102.07758, 2021.
[37] Kevin Scaman, Francis Bach, Sébastien Bubeck, Yin Tat Lee, and Laurent Massoulié. Optimal algorithms for smooth and strongly convex distributed optimization in networks. arXiv preprint arXiv:1702.08704, 2017.
[38] Ohad Shamir, Nati Srebro, and Tong Zhang. Communication-efficient distributed optimization using an approximate Newton-type method. In Proc. of the 31st International Conference on Machine Learning (PMLR), volume 32, pages 1000–1008, 2014.
[39] Y. Sun, A. Daneshmand, and G. Scutari. Distributed optimization based on gradient-tracking revisited: Enhancing convergence rate via surrogation. arXiv preprint arXiv:1905.02637, 2019.
[40] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
[41] Paul Tseng.
A modified forward-backward splitting method for maximal monotone mappings. SIAM Journal on Control and Optimization, 38(2):431–446, 2000.
[42] J. von Neumann, O. Morgenstern, and H. W. Kuhn. Theory of Games and Economic Behavior (commemorative edition). Princeton University Press, 2007.
[43] Shusen Wang, Farbod Roosta-Khorasani, Peng Xu, and Michael W. Mahoney. GIANT: Globally improved approximate Newton method for distributed optimization. In Proc. of the 32nd International Conference on Neural Information Processing Systems, volume 37, pages 2338–2348, 2018.
[44] Haishan Ye, Luo Luo, Ziang Zhou, and Tong Zhang. Multi-consensus decentralized accelerated gradient descent. arXiv preprint arXiv:2005.00797, 2020.
[45] Xiao-Tong Yuan and Ping Li. On convergence of distributed approximate Newton methods: Globalization, sharper bounds and beyond. arXiv preprint arXiv:1908.02246, 2019.
[46] Junyu Zhang, Mingyi Hong, and Shuzhong Zhang. On lower iteration complexity bounds for the saddle point problems. arXiv preprint arXiv:1912.07481, 2019.
[47] Yuchen Zhang and Lin Xiao. DiSCO: Distributed optimization for self-concordant empirical loss. In Proc. of the 32nd International Conference on Machine Learning (PMLR), volume 37, pages 362–370, 2015.
[48] Yuchen Zhang and Lin Xiao. Communication-efficient distributed optimization of self-concordant empirical loss. In Large-Scale and Distributed Optimization, pages 289–341. Springer, 2018.
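As a side note, the extragradient method of Korpelevich [18], which serves as the baseline in the experiments above, can be sketched in a few lines. The sketch below runs it on a toy bilinear saddle-point problem min_x max_y xᵀAy, whose unique saddle point is (0, 0); the matrix A, step size, and iteration budget are illustrative choices of ours, not taken from the paper.

```python
import numpy as np

# Toy bilinear SPP: min_x max_y x^T A y, saddle point at (0, 0).
d = 5
A = np.diag(np.linspace(1.0, 2.0, d))  # spectral norm L = ||A||_2 = 2
gamma = 0.5 / np.linalg.norm(A, 2)     # any step gamma < 1/L guarantees convergence
x, y = np.ones(d), np.ones(d)

for _ in range(1000):
    # extrapolation ("leading") step: gradient taken at the current point
    x_lead = x - gamma * (A @ y)
    y_lead = y + gamma * (A.T @ x)
    # main step: gradient evaluated at the leading point
    x = x - gamma * (A @ y_lead)
    y = y + gamma * (A.T @ x_lead)

dist = np.linalg.norm(np.concatenate([x, y]))  # distance to the saddle point
```

On this problem the iterates spiral into the saddle point linearly; a plain gradient descent-ascent step (dropping the leading step) would diverge, which is exactly why extragradient-type updates are the standard building block for SPPs.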