# Local Composite Saddle Point Optimization

Published as a conference paper at ICLR 2024

Site Bai
Department of Computer Science, Purdue University
bai123@purdue.edu

Brian Bullins
Department of Computer Science, Purdue University
bbullins@purdue.edu

ABSTRACT

Distributed optimization (DO) approaches for saddle point problems (SPP) have recently gained in popularity due to the critical role they play in machine learning (ML). Existing works mostly target smooth unconstrained objectives in Euclidean space, whereas ML problems often involve constraints or non-smooth regularization, which results in a need for composite optimization. Moreover, although non-smooth regularization often serves to induce structure (e.g., sparsity), standard aggregation schemes in distributed optimization break this structure. Addressing these issues, we propose Federated Dual Extrapolation (FeDualEx), an extra-step primal-dual algorithm with local updates, which is the first of its kind to encompass both saddle point optimization and composite objectives under the distributed paradigm. Using a generalized notion of Bregman divergence, we analyze its convergence and communication complexity in the homogeneous setting. Furthermore, the empirical evaluation demonstrates the effectiveness of FeDualEx for inducing structure in these challenging settings.

## 1 INTRODUCTION

A notable fraction of machine learning (ML) problems are saddle point problems (SPP), including adversarial robustness (Madry et al., 2018; Chen & Hsieh, 2023), generative adversarial networks (GAN) (Goodfellow et al., 2014), matrix games (Abernethy et al., 2018), and multi-agent reinforcement learning (Wai et al., 2018), among others. These applications call for effective distributed saddle point optimization as their scale evolves beyond centralized learning. In typical distributed optimization (DO) approaches, a central server coordinates collaborative learning among clients through rounds of communication. In each round, clients update a synchronized global model locally without sharing their private data, then send the model to the server for aggregation, usually through averaging (McMahan et al., 2017; Stich, 2019), to produce a new global model. The cost of communication is known to dominate the optimization process (Konečný et al., 2016).

Although preliminary progress has been made in distributed saddle point optimization (Beznosikov et al., 2020; Hou et al., 2021), we note that machine learning problems are commonly associated with task-specific constraints or non-smooth regularization, which results in a need for composite optimization (CO). Moreover, a common purpose of non-smooth regularization is to induce structure. Typical examples include the ℓ1 norm for sparsity and the nuclear norm for low-rankness, which show up in settings spanning from classical LASSO (Tibshirani, 1996) and sparse regression (Hastie et al., 2015) to deep learning, such as adversarial example generation (Moosavi-Dezfooli et al., 2016), sparse GANs (Zhou et al., 2020), convexified learning (Sahiner et al., 2022; Bai et al., 2024), and others. Meanwhile, Yuan et al. (2021) identified the curse of primal averaging in standard aggregation schemes of DO, where the specific regularization-imposed structure on the client models may no longer hold after direct averaging on the server. For instance, each client may be able to obtain a sparse solution, yet averaging the solutions across clients yields a dense solution.
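To see why direct averaging destroys sparsity, consider a minimal sketch (the client solutions here are hypothetical, not from the paper's experiments): each client's solution is sparse on a different support, so the coordinate-wise mean is dense.

```python
import numpy as np

# Hypothetical sparse solutions from three clients, each supported
# on a different coordinate subset.
clients = [
    np.array([1.0, 0.0, 0.0, 0.0]),
    np.array([0.0, 0.8, 0.0, 0.0]),
    np.array([0.0, 0.0, 0.0, 1.2]),
]

avg = np.mean(clients, axis=0)
print(avg)                    # [0.3333 0.2667 0.     0.4   ] -- mostly dense
print(np.count_nonzero(avg))  # 3 non-zero entries, vs. 1 per client
```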
To address this issue for convex optimization, Yuan et al. (2021) adopted the dual averaging technique (Nesterov, 2009), but this approach is not specifically designed for SPP. Even in the sequential deterministic setting, dual averaging and mirror descent (Nemirovskij & Yudin, 1983) achieve only an $O(1/\sqrt{T})$ rate for SPP (Bubeck et al., 2015), whereas extra-step methods achieve an $O(1/T)$ rate (Nemirovski, 2004; Nesterov, 2007). At the same time, existing distributed methods for SPP fail to cover these composite scenarios and address the associated challenges, as summarized in Table 1.

| Task | Method | Composite & Constrained & Non-Euclidean | Convergence Rate | Convexity Assumption |
|---|---|---|---|---|
| CO | FedAvg (Khaled et al., 2020) | ✗ | $\frac{\beta B}{RK} + \frac{\sigma B^{1/2}}{M^{1/2}R^{1/2}K^{1/2}} + \frac{\beta^{1/3}\sigma^{2/3}B^{2/3}}{K^{1/3}R^{2/3}}$ | convex |
| CO | FedDualAvg (Yuan et al., 2021) | ✓ | $\frac{\beta B}{RK} + \frac{\sigma B^{1/2}}{M^{1/2}R^{1/2}K^{1/2}} + \frac{\beta^{1/3}G^{2/3}B^{2/3}}{R^{2/3}}$ | convex |
| CO | FeDualEx (Ours) | ✓ | $\frac{\beta B}{RK} + \frac{\beta^{1/2}G^{1/2}B^{3/4}}{K^{1/4}R^{3/4}} + \frac{\sigma B^{1/2}}{M^{1/2}R^{1/2}K^{1/2}} + \frac{\beta^{1/3}G^{2/3}B^{2/3}}{R^{2/3}}$ | convex |
| SPP | Extra Step Local SGD (Beznosikov et al., 2020) | ✗ | $B\exp\{-\frac{\alpha KR}{\beta}\} + \frac{\sigma^2}{MKR} + \frac{\beta^2\sigma^2}{\alpha^4 K^2 R^2}$ | α-strongly convex-concave |
| SPP | SCAFFOLD-S (Hou et al., 2021) | ✗ | $\frac{\beta^2}{\alpha^2} B\exp\{-\frac{\alpha KR}{\beta}\} + \frac{\sigma^2}{MKR} + \frac{\beta^2\sigma^2}{\alpha^4 K R^2}$ | α-strongly convex-concave |
| SPP | FeDualEx (Ours) | ✓ | $\frac{\beta B}{RK} + \frac{\beta^{1/2}G^{1/2}B^{3/4}}{K^{1/4}R^{3/4}} + \frac{\sigma B^{1/2}}{M^{1/2}R^{1/2}K^{1/2}} + \frac{\beta^{1/2}G^{1/2}B^{3/4}}{R^{1/2}}$ | convex-concave |

Table 1: We list existing convergence rates for composite convex optimization and smooth saddle point optimization in distributed settings similar to ours. Notation: R: communication rounds; K: local steps; β: smoothness; B: diameter; G: gradient bound; M: clients; σ²: gradient variance. FedAvg is also included as a reference. We further note that none of the works other than ours covers composite SPP; they are included only for completeness.

We present the distributed paradigm for composite saddle point optimization defined in (1). In particular, we propose Federated Dual Extrapolation (FeDualEx) (Algorithm 1), which builds on Nesterov's dual extrapolation (Nesterov, 2007), a classic extra-step algorithm suited for SPP. It carries out a two-step evaluation of a proximal operator (Censor & Zenios, 1992) defined by the Bregman divergence (Bregman, 1967), which allows for SPP beyond the Euclidean space. To adapt to composite regularization, FeDualEx also draws inspiration from recent progress in composite convex optimization (Yuan et al., 2021) and instead adopts the notion of generalized Bregman divergence (Flammarion & Bach, 2017), which merges the regularization into its distance-generating function. With some novel technical accommodations, we provide the convergence rate for FeDualEx under the homogeneous setting, which is, to the best of our knowledge, the first convergence rate for composite saddle point optimization under the DO paradigm.

In support of the proposed method, we conduct numerical evaluations to verify the effectiveness of FeDualEx on composite SPP. To further demonstrate the quality of the induced structure, we include the primal twin of FeDualEx based on mirror prox (Nemirovski, 2004), namely Federated Mirror Prox (FedMiP), as a baseline for comparison in Appendix H. This is in line with the dichotomy between Federated Mirror Descent (FedMiD) and Federated Dual Averaging (FedDualAvg) (Yuan et al., 2021), from which Yuan et al. (2021) identified the curse of primal averaging in DO, i.e., the specific regularization-imposed structure on the client models may no longer hold after primal averaging on the server.
This comparison highlights that FeDualEx naturally inherits the merit of dual aggregation from FedDualAvg. In addition, we analyze FeDualEx for federated composite convex optimization and show that FeDualEx recovers the same convergence rate as FedDualAvg in the convex setting. Last but not least, by reducing the number of clients to one, we show for the sequential version of FeDualEx that the analysis naturally yields a convergence rate for stochastic composite saddle point optimization, which, to our knowledge, is the first such algorithm for non-Euclidean settings and matches the $O(1/\sqrt{T})$ rate in general stochastic saddle point optimization (Mishchenko et al., 2020; Juditsky et al., 2011). Further removing the noise from gradient estimates, FeDualEx still generalizes dual extrapolation to deterministic composite saddle point optimization with an $O(1/T)$ convergence rate that matches the smooth case and also the pioneering composite mirror prox (CoMP) (He et al., 2015), as presented in Table 2.

Our Contributions:

- We propose FeDualEx for distributed learning of SPP with composite, possibly non-smooth regularization (Section 4.1). In support of the proposed algorithm, we provide a convergence rate for FeDualEx under the homogeneous setting (Section 4.2). To the best of our knowledge, FeDualEx is the first of its kind to encompass composite, possibly non-smooth regularization for SPP under a distributed paradigm, as shown in Table 1.
- Additionally, we showcase the structure-preserving (e.g., sparsity) advantage of FeDualEx achieved through dual-space averaging. In particular, we present its primal twin FedMiP as a baseline to highlight this contrast (Appendix H).

| Noise | Rate | Composite SPP | Smooth SPP |
|---|---|---|---|
| Deterministic | $O(1/T)$ | Deterministic FeDualEx (Ours); CoMP (He et al., 2015); Accelerated Proximal Gradient (Tseng, 2008) | Dual Extrapolation (Nesterov, 2007); Mirror Prox (Nemirovski, 2004) |
| Stochastic | $O(1/\sqrt{T})$ | Sequential FeDualEx (Ours); Extragradient (Euclidean) (Mishchenko et al., 2020) | Sequential FeDualEx (Ours); Mirror Prox (Juditsky et al., 2011) |

Table 2: Convergence rates for convex-concave SPP. The deterministic version of FeDualEx generalizes dual extrapolation (DE) to composite SPP, and the sequential version of FeDualEx generalizes DE to both smooth and composite stochastic saddle point optimization.

- FeDualEx produces several byproducts in the CO realm, as demonstrated in Table 2: (1) The sequential version of FeDualEx leads to stochastic dual extrapolation for CO and yields, to our knowledge, the first convergence rate for the stochastic optimization of composite SPP in non-Euclidean settings. (2) Further removing the noise leads to its deterministic version, with rates matching existing ones in smooth and composite saddle point optimization (Section 5).
- We demonstrate experimentally the effectiveness of FeDualEx on various composite saddle point tasks, including bilinear problems on synthetic data with ℓ1 and nuclear norm regularization, as well as the universal adversarial training of logistic regression on MNIST and CIFAR-10 (Section 6).

## 2 RELATED WORK

We provide a brief overview of some related work and defer extended discussion to Appendix B. The distributed optimization paradigm we consider aligns with that of Local SGD (Stich, 2019), which is also the homogeneous setting of Federated Averaging (FedAvg) (McMahan et al., 2017).
Stich (2019) provides the first convergence rate for FedAvg, which has since been improved with tighter analyses and also studied under heterogeneity (e.g., (Khaled et al., 2020; Woodworth et al., 2020b)). Recently, Yuan et al. (2021) extended FedAvg to composite convex optimization and proposed FedDualAvg, which aggregates learned parameters in the dual space and overcomes the curse of primal averaging in federated composite optimization.

For SPP, Beznosikov et al. (2020) investigate the distributed extra-gradient method for strongly-convex-strongly-concave SPP in Euclidean space. Hou et al. (2021) propose FedAvg-S and SCAFFOLD-S based on FedAvg (McMahan et al., 2017) and SCAFFOLD (Karimireddy et al., 2020) for SPP, which yield convergence rates similar to (Beznosikov et al., 2020). In addition, Ramezani-Kebrya et al. (2023) study the problem from the information compression perspective with the measure of communication bits. Yet, the aforementioned works are limited to smooth and unconstrained SPP in Euclidean space. The more general setting of composite SPP is found only in the sequential optimization literature, where the representative composite mirror prox (CoMP) (He et al., 2015) generalizes the classic mirror prox (Nemirovski, 2004) while keeping the $O(1/T)$ convergence rate. In the stochastic setting, Mishchenko et al. (2020) analyzed a variant of stochastic mirror prox (Juditsky et al., 2011) that is capable of handling composite terms in Euclidean space. We will later show that the sequential analysis of our proposed algorithm also yields the same rate for dual extrapolation (Nesterov, 2007) in composite optimization, using different proof techniques. As a result, we focus on the distributed optimization of composite SPP and propose FeDualEx.

## 3 PRELIMINARIES AND DEFINITIONS

We provide some preliminaries and definitions necessary for introducing FeDualEx. More details are included in Appendix C.1. To begin with, we lay out the notation.

Notation. We use $[n]$ to represent the set $\{1, 2, \ldots, n\}$. We use $\|\cdot\|$ to denote an arbitrary norm, $\|\cdot\|_*$ to denote its dual norm, and $\|\cdot\|_2$ to denote the Euclidean norm. We use $\nabla$ for gradients, $\partial$ for subgradients, and $\langle\cdot,\cdot\rangle$ for inner products. Related to the algorithm, we use English letters (e.g., $z$, $x$, $y$) to denote primal variables and Greek letters (e.g., $\omega$, $\varsigma$, $\mu$, $\nu$) to denote dual variables. We use $R$ for communication rounds, $K$ for local updates, $B$ for the diameter bound, $G$ for the gradient bound, $\beta$ for the smoothness constant, $\sigma$ for the standard deviation, and $\xi$ for random samples. We use $h^*$ to denote the convex conjugate of a function $h$.

Composite Saddle Point Optimization. We study composite saddle point optimization, whose objective is formally given in the following definition.

Definition 1 (Composite SPP). The objective of composite saddle point optimization is defined as

$$\min_{x \in \mathcal{X}} \max_{y \in \mathcal{Y}} \ \phi(x, y) = f(x, y) + \psi_1(x) - \psi_2(y) \tag{1}$$

where $f(x, y) = \frac{1}{M}\sum_{m=1}^{M} f_m(x, y)$ and $\psi_1(x)$, $\psi_2(y)$ are possibly non-smooth. It is typically evaluated by the duality gap: $\mathrm{Gap}(\hat{x}, \hat{y}) = \max_{y \in \mathcal{Y}} \phi(\hat{x}, y) - \min_{x \in \mathcal{X}} \phi(x, \hat{y})$.

$$x_t = \mathrm{Prox}^h_{\bar{x}}(\mu_t), \qquad x_{t+1/2} = \mathrm{Prox}^h_{x_t}(\eta g(x_t)), \qquad \mu_{t+1} = \mu_t + \eta g(x_{t+1/2})$$

Figure 1: Dual Extrapolation.

Mirror Prox and Dual Extrapolation. Mirror prox (Nemirovski, 2004) and dual extrapolation (Nesterov, 2007) are classic methods for convex-concave SPP.
Both are proximal algorithms based on the proximal operator defined as $\mathrm{Prox}^h_{x'}(g) = \arg\min_x\{\langle g, x\rangle + V^h_{x'}(x)\}$, in which $V^h_{x'}(x) = h(x) - h(x') - \langle\nabla h(x'), x - x'\rangle$ is the Bregman divergence generated by some closed, strongly convex, and differentiable function $h$. Both algorithms conduct two evaluations of the proximal operator, while dual extrapolation carries out its updates in the dual space. Figure 1 gives a brief illustration of dual extrapolation with the proximal operator as in (Cohen et al., 2021), with details in Appendix C.1.

Generalized Bregman Divergence. Recent advances in composite convex optimization (Yuan et al., 2021) have utilized the generalized Bregman divergence (Flammarion & Bach, 2017) for analyzing composite objectives. It incorporates the composite term into the distance-generating function of the vanilla Bregman divergence and measures the distance in terms of one variable and the dual image of the other, with the key insight that the conjugate of a non-smooth generalized distance-generating function is differentiable.

Definition 2 (Generalized Bregman Divergence (Flammarion & Bach, 2017)). The generalized Bregman divergence is defined to be

$$V^{h_t}_{\mu'}(x) = h_t(x) - h_t(\nabla h_t^*(\mu')) - \langle \mu', x - \nabla h_t^*(\mu')\rangle,$$

where $h_t = h + t\eta\psi$ is a generalized distance-generating function that is closed and strongly convex, $t$ is the current number of iterations, $\eta$ is the step size, $h_t^*$ is the convex conjugate of $h_t$, and $\mu'$ is the dual image of $x'$, i.e., $\mu' \in \partial h_t(x')$ and $x' = \nabla h_t^*(\mu')$.

The generalized Bregman divergence is suitable not only for non-smooth regularization but also for any convex constraint set $C$, taking $\psi(x) = 0$ if $x \in C$ and $+\infty$ otherwise.

## 4 FEDERATED DUAL EXTRAPOLATION (FEDUALEX)

To tackle composite SPP in the DO paradigm, we acknowledge challenges from several aspects. Specifically, the generality afforded by composite and/or saddle point problems results in a need for more sophisticated techniques that work with this additional structure. These concerns are further complicated by the challenges that arise in DO, where communication and aggregation need to be carefully handled under the distributed mechanism. In particular, Yuan et al. (2021) identified the curse of primal averaging in composite federated optimization and advocated for dual aggregation. Dealing with these challenges altogether is rather non-trivial, as the techniques that are naturally suited for one would fail for another. In this regard, we first present FeDualEx (Algorithm 1) and several relevant novel definitions proposed for its adaptation to composite SPP. Then we analyze its convergence rate in the homogeneous setting.

### 4.1 THE FEDUALEX ALGORITHM

FeDualEx builds its core on the classic dual extrapolation, an extra-step algorithm geared for saddle point optimization. Its effectiveness has been widely verified on vanilla smooth convex-concave SPP. Furthermore, its updating sequence lies in the dual space, which naturally inherits the advantage of dual aggregation in composite federated optimization. The challenge remains on the composite optimization side, as relevant work is limited, and the existing composite extension of the extra-step method (He et al., 2015) is quite technically involved. Given that the smooth analysis of dual extrapolation is already non-trivial (Nesterov, 2007), no attempts were previously made to generalize dual extrapolation to the composite optimization realm.
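To make the base method concrete before presenting Algorithm 1, the following is a minimal sketch of the classic dual extrapolation updates of Figure 1 in the Euclidean, unconstrained case, where $h(x) = \frac{1}{2}\|x\|_2^2$ and the proximal operator reduces to $\mathrm{Prox}^h_{x'}(g) = x' - g$. The bilinear game used here is an illustrative assumption, not an example from the paper.

```python
import numpy as np

d = 5
rng = np.random.default_rng(0)

def g(z):
    # Monotone gradient operator of the illustrative bilinear game
    # f(x, y) = <x, y>:  g(z) = (grad_x f, -grad_y f) = (y, -x).
    x, y = z[:d], z[d:]
    return np.concatenate([y, -x])

z_bar = rng.normal(size=2 * d)   # fixed primal point (x-bar in Figure 1)
mu = np.zeros(2 * d)             # dual accumulator mu_t
eta = 0.1
T = 2000
avg = np.zeros(2 * d)

for t in range(T):
    z_t = z_bar - mu                 # Prox^h_{z_bar}(mu_t) in the Euclidean case
    z_half = z_t - eta * g(z_t)      # Prox^h_{z_t}(eta * g(z_t)): the extra step
    mu = mu + eta * g(z_half)        # dual update mu_{t+1} = mu_t + eta * g(z_{t+1/2})
    avg += z_half / T                # the ergodic average is the returned solution

print(np.linalg.norm(avg))  # small: the saddle point of <x, y> is (0, 0)
```

Note that only the dual accumulator `mu` carries state across iterations, which is exactly the property FeDualEx exploits for dual-space aggregation.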
Algorithm 1 FEDERATED-DUAL-EXTRAPOLATION (FeDualEx) for Composite SPP

Input: $\phi(z) = f(x, y) + \psi_1(x) - \psi_2(y) = \frac{1}{M}\sum_{m=1}^{M} f_m(x, y) + \psi_1(x) - \psi_2(y)$: objective function; $\ell(z)$: distance-generating function; $g_m(z) = (\nabla_x f_m(x, y), -\nabla_y f_m(x, y))$: gradient operator.
Hyperparameters: $R$: number of communication rounds; $K$: number of local update iterations; $\eta_s$: server step size; $\eta_c$: client step size.
Dual Initialization: $\varsigma_0 = 0$: initial dual variable; $\bar\varsigma$: fixed point in the dual space.
Output: Approximate solution $z = (x, y)$ to $\min_{x \in \mathcal{X}} \max_{y \in \mathcal{Y}} \phi(x, y)$.

1: for $r = 0, 1, \ldots, R-1$ do
2:   Sample a subset of clients $\mathcal{C}_r \subseteq [M]$
3:   for $m \in \mathcal{C}_r$ in parallel do
4:     $\varsigma^m_{r,0} = \varsigma_r$
5:     for $k = 0, 1, \ldots, K-1$ do
6:       $z^m_{r,k} = \mathrm{Prox}^{\ell_{r,k}}_{\bar\varsigma}(\varsigma^m_{r,k})$  ▷ Two-step evaluation of the generalized proximal operator
7:       $z^m_{r,k+1/2} = \mathrm{Prox}^{\ell_{r,k+1}}_{\bar\varsigma - \varsigma^m_{r,k}}(\eta_c g_m(z^m_{r,k}; \xi^m_{r,k}))$
8:       $\varsigma^m_{r,k+1} = \varsigma^m_{r,k} + \eta_c g_m(z^m_{r,k+1/2}; \xi^m_{r,k+1/2})$  ▷ Dual variable update
9:     end for
10:   end parallel for
11:   $\Delta_r = \frac{1}{|\mathcal{C}_r|}\sum_{m \in \mathcal{C}_r}(\varsigma^m_{r,K} - \varsigma^m_{r,0})$
12:   $\varsigma_{r+1} = \varsigma_r + \eta_s \Delta_r$  ▷ Server dual update
13: end for
14: Return: $\frac{1}{RK}\sum_{r=0}^{R-1}\sum_{k=0}^{K-1}\widehat{z}_{r,k+1/2}$ with $\widehat{z}_{r,k+1/2}$ defined in (4).

Inspired by recent advances in composite convex optimization, we recognize the generalized Bregman divergence (Flammarion & Bach, 2017) as a powerful tool for analyzing proximal methods for composite objectives. Adapting it to the context of composite SPP, we extend the generalized Bregman divergence to saddle functions and provide the definition below.

Definition 3 (Generalized Bregman Divergence for Saddle Functions). The generalized distance-generating function for the optimization of (1) is $\ell_t(z) = \ell(z) + t\eta\psi(z)$, where $\ell(z) = h_1(x) + h_2(y)$, $h_1$ and $h_2$ are distance-generating functions for $x$ and $y$, $\psi(z) = \psi_1(x) + \psi_2(y)$, $\eta$ is the step size, and $t$ is the current number of iterations. It generates the following generalized Bregman divergence:

$$V^{\ell_t}_{\varsigma'}(z) = \ell_t(z) - \ell_t(z') - \langle \varsigma', z - z'\rangle,$$

where $\varsigma'$ is the preimage of $z'$ with respect to the gradient of the conjugate of $\ell_t$, i.e., $z' = \nabla\ell_t^*(\varsigma')$.

Yet, as we notice in previous works (Flammarion & Bach, 2017; Yuan et al., 2021), the generalized Bregman divergence is applied only in the theoretical analysis. In terms of algorithm design, the previous proximal operator for composite convex optimization is based on the vanilla Bregman divergence plus the composite term, specifically, $\arg\min_x\{\langle g, x\rangle + V^h_{x'}(x) + \eta\psi(x)\}$ in (Duchi et al., 2010; He et al., 2015), and $\arg\min_x\{\langle g, x\rangle + h(x) + \eta t\psi(x)\}$ in (Xiao, 2010; Flammarion & Bach, 2017). However, we find this definition insufficient for dual extrapolation, as its dual update and the composite term from the extra step break certain parts of the analysis. In this effort, we propose a novel technical change to the proximal operator, directly replacing the Bregman divergence in the proximal operator with the generalized Bregman divergence.

Definition 4 (Generalized Proximal Operator for Saddle Functions). A proximal operation in the composite setting with the generalized Bregman divergence for saddle functions is defined to be

$$\mathrm{Prox}^{\ell_t}_{\varsigma'}(g) := \arg\min_z \{\langle g, z\rangle + V^{\ell_t}_{\varsigma'}(z)\},$$

where $\varsigma'$ is the dual image of $z'$, i.e., $z' = \nabla\ell_t^*(\varsigma')$, and $\varsigma' \in \partial\ell_t(z') = \nabla\ell(z') + \eta t\,\partial\psi(z')$.

Compared with the vanilla proximal operator in Section 3, this novel design for the composite adaptation of dual extrapolation is quite natural.
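As a concrete instance, suppose (an illustrative assumption, using the Euclidean distance-generating function) $\ell(z) = \frac{1}{2}\|z\|_2^2$ and $\psi(z) = \lambda\|z\|_1$. Then $\ell_t = \frac{1}{2}\|\cdot\|_2^2 + t\eta\lambda\|\cdot\|_1$, the projection $\nabla\ell_t^*$ is coordinate-wise soft-thresholding, and the generalized proximal operator $\mathrm{Prox}^{\ell_t}_{\varsigma'}(g) = \nabla\ell_t^*(\varsigma' - g)$ admits the closed form sketched below.

```python
import numpy as np

def soft_threshold(omega, c):
    # Gradient of the conjugate of l_t(z) = 0.5*||z||^2 + c*||z||_1:
    # each coordinate is shrunk toward zero by c, producing exact zeros.
    return np.sign(omega) * np.maximum(np.abs(omega) - c, 0.0)

def generalized_prox(g, varsigma_prime, t, eta, lam):
    # Prox^{l_t}_{varsigma'}(g) = grad l_t^*(varsigma' - g) under the
    # Euclidean + l1 assumptions stated above.
    return soft_threshold(varsigma_prime - g, t * eta * lam)

# Example: a dual point is mapped to a sparse primal point.
rng = np.random.default_rng(1)
varsigma_prime = rng.normal(size=8)
g = rng.normal(scale=0.1, size=8)
z = generalized_prox(g, varsigma_prime, t=10, eta=0.05, lam=1.0)
print(z)  # many coordinates are exactly zero
```

This sketch also illustrates why dual aggregation preserves structure: the server averages dual variables and only afterwards applies the thresholding projection, so sparsity is imposed after, rather than destroyed by, averaging.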
Our operator differs from previous proximal operators, which after expanding take the form $\arg\min_z\{\langle g - \nabla\ell(z'), z\rangle + \ell_t(z)\}$ (Duchi et al., 2010) or $\arg\min_z\{\langle g, z\rangle + \ell_t(z)\}$ (Xiao, 2010), whereas ours is $\mathrm{Prox}^{\ell_t}_{\varsigma'}(g) = \arg\min_z\{\langle g - \varsigma', z\rangle + \ell_t(z)\}$. These adaptations are necessary for technical reasons, as our algorithm involves prox operators on both the clients and the server to induce structure in the aggregated solution, which would otherwise break the conventional analysis. (Specifically, using these previous notions yields extra composite terms in the analysis that do not cancel out but rather accumulate, thus hindering convergence.)

With the novel definitions above, we can formally present FeDualEx in Algorithm 1. It follows the general structure of DO. For each client, the two-step evaluation of the generalized proximal operator (lines 6-7) and the final dual update (line 8) resemble the classic dual extrapolation updates in Figure 1. To align with our generalized proximal operator, we also move the primal initialization $\bar{x}$ of the original dual extrapolation to the dual space as $\bar\varsigma$. On the server, the dual variables from the clients are first aggregated in the dual space, then projected to the primal with a mechanism later defined in (4).

### 4.2 CONVERGENCE ANALYSIS OF FEDUALEX

In this section, we provide the convergence analysis of FeDualEx for the homogeneous DO of composite SPP. For simplicity, we further assume full client participation in each round, but this condition can be removed with a lengthier analysis. We start by showing the equivalence between the primal-dual projection and the generalized proximal operator, and, for the convenience of analysis, reformulating the updating sequences with another pair of auxiliary dual variables.

Projection Reformulation. Generalized proximal operators can be presented as projections, i.e., as the gradient of the conjugate of the generalized distance-generating function (Appendix C.2). Thus, lines 6 to 8 in Algorithm 1 can be expanded by Definition 4 and rewritten as:

(1) $z^m_{r,k} = \nabla\ell^*_{r,k}(\bar\varsigma - \varsigma^m_{r,k})$;
(2) $z^m_{r,k+1/2} = \nabla\ell^*_{r,k+1}\big((\bar\varsigma - \varsigma^m_{r,k}) - \eta_c g_m(z^m_{r,k}; \xi^m_{r,k})\big)$;
(3) $\varsigma^m_{r,k+1} = \varsigma^m_{r,k} + \eta_c g_m(z^m_{r,k+1/2}; \xi^m_{r,k+1/2})$.

Further define the auxiliary dual variable $\omega^m_{r,k} = \bar\varsigma - \varsigma^m_{r,k}$. It immediately satisfies $z^m_{r,k} = \nabla\ell^*_{r,k}(\omega^m_{r,k})$, in which $\ell^*_{r,k}$ is the conjugate of $\ell_{r,k} = \ell + (\eta_s r K + k)\eta_c\psi$. Also define $\omega^m_{r,k+1/2}$ to be the dual image of the intermediate variable $z^m_{r,k+1/2}$, such that $z^m_{r,k+1/2} = \nabla\ell^*_{r,k+1}(\omega^m_{r,k+1/2})$. Then we get an equivalent updating sequence in the auxiliary dual variables:

$$\omega^m_{r,k+1/2} = \omega^m_{r,k} - \eta_c g_m(z^m_{r,k}; \xi^m_{r,k}), \qquad \omega^m_{r,k+1} = \omega^m_{r,k} - \eta_c g_m(z^m_{r,k+1/2}; \xi^m_{r,k+1/2}).$$

Define their averages across clients, $\bar\omega_{r,k} = \frac{1}{M}\sum_{m=1}^{M}\omega^m_{r,k}$ and $\bar{g}_{r,k} = \frac{1}{M}\sum_{m=1}^{M} g_m(z^m_{r,k}; \xi^m_{r,k})$. Then we can analyze the following averaged dual shadow sequences:

$$\bar\omega_{r,k+1/2} = \bar\omega_{r,k} - \eta_c \bar{g}_{r,k}, \quad (2) \qquad \bar\omega_{r,k+1} = \bar\omega_{r,k} - \eta_c \bar{g}_{r,k+1/2}. \quad (3)$$

In the meantime, their shadow primal projections on the server are defined as

$$\widehat{z}_{r,k} = \nabla\ell^*_{r,k}(\bar\omega_{r,k}), \qquad \widehat{z}_{r,k+1/2} = \nabla\ell^*_{r,k+1}(\bar\omega_{r,k+1/2}). \quad (4)$$

Next, we list the key assumptions. A detailed presentation and additional remarks that ease the understanding of the proofs are provided in Appendix C.3.

Assumptions. For the composite saddle function $\phi(x, y) = \frac{1}{M}\sum_{m=1}^{M} f_m(x, y) + \psi_1(x) - \psi_2(y)$, the gradient operator is given by $g = (\nabla_x f, -\nabla_y f)$ with $g = \frac{1}{M}\sum_{m=1}^{M} g_m$. We assume that

a. (Convexity of f) $\forall m \in [M]$, $f_m(x, y)$ is convex in $x$ and concave in $y$.
b. (Convexity of ψ) $\psi_1(x)$ is convex in $x$, and $\psi_2(y)$ is convex in $y$.
c. (Lipschitzness of g) $g_m(z) = \begin{bmatrix}\nabla_x f_m(x, y)\\ -\nabla_y f_m(x, y)\end{bmatrix}$ is $\beta$-Lipschitz: $\|g_m(z) - g_m(z')\|_* \le \beta\|z - z'\|$.
d. (Unbiased Estimate and Bounded Variance) $\forall m \in [M]$, for a random sample $\xi^m$, $\mathbb{E}_\xi[g_m(z^m; \xi^m)] = g_m(z^m)$ and $\mathbb{E}_\xi\|g_m(z^m; \xi^m) - g_m(z^m)\|_*^2 \le \sigma^2$.
e. (Bounded Gradient) $\forall m \in [M]$, $\|g_m(z^m; \xi^m)\|_* \le G$.
f. The distance-generating function $\ell$ is a Legendre function that is 1-strongly convex, i.e., $\forall z, z'$, $\ell(z') - \ell(z) - \langle\nabla\ell(z), z' - z\rangle \ge \frac{1}{2}\|z' - z\|^2$.
g. The optimization domain $\mathcal{Z}$ is compact w.r.t. the Bregman divergence, i.e., $\forall z, z' \in \mathcal{Z}$, $V^{\ell}_{z'}(z) \le B$.

We note that Assumption e (bounded gradient) is a standard assumption in classic distributed composite optimization (Duchi et al., 2011) and is made in other DO analyses (Stich, 2019; Li et al., 2020b; Yu et al., 2019; Yuan et al., 2021).

Main Theorem. Under the aforementioned assumptions, we present the following theorem, which provides the convergence rate of FeDualEx in terms of the duality gap.

Theorem 1 (Main). Under the above assumptions, the duality gap evaluated at the ergodic sequence generated by the intermediate steps of FeDualEx in Algorithm 1 is bounded by

$$\mathbb{E}\Big[\mathrm{Gap}\Big(\frac{1}{RK}\sum_{r=0}^{R-1}\sum_{k=0}^{K-1}\widehat{z}_{r,k+1/2}\Big)\Big] \le \frac{B}{\eta_c RK} + 20\beta^2\eta_c^3 K^2 G^2 + \frac{5\sigma^2\eta_c}{M} + 2^{3/2}\beta\eta_c K G B^{1/2}.$$

Choosing the step size $\eta_c = \min\Big\{\frac{1}{5^{1/2}\beta},\ \frac{B^{1/4}}{20^{1/4}\beta^{1/2}G^{1/2}K^{3/4}R^{1/4}},\ \frac{B^{1/2}M^{1/2}}{5^{1/2}\sigma R^{1/2}K^{1/2}},\ \frac{B^{1/4}}{2^{3/4}\beta^{1/2}G^{1/2}KR^{1/2}}\Big\}$,

$$\mathbb{E}\Big[\mathrm{Gap}\Big(\frac{1}{RK}\sum_{r=0}^{R-1}\sum_{k=0}^{K-1}\widehat{z}_{r,k+1/2}\Big)\Big] \le \frac{5^{1/2}\beta B}{RK} + \frac{20^{1/4}\beta^{1/2}G^{1/2}B^{3/4}}{K^{1/4}R^{3/4}} + \frac{5^{1/2}\sigma B^{1/2}}{M^{1/2}R^{1/2}K^{1/2}} + \frac{2^{3/4}\beta^{1/2}G^{1/2}B^{3/4}}{R^{1/2}}.$$

To the best of our knowledge, this is the first convergence rate for federated composite saddle point optimization. The $O(\frac{1}{RK})$ and $O(\frac{1}{\sqrt{MRK}})$ terms roughly match previous DO algorithms, where the noise term decays with the number of clients $M$. If $M$ is large enough, the $O(1/R^{1/2})$ term dominates the communication complexity. The convergence analysis further validates the effectiveness of FeDualEx, which thereby advances distributed optimization to a broad class of composite saddle point problems. The complete proof of Theorem 1 can be found in Appendix E.

On Composite Convex Optimization. We also analyze the convergence rate of FeDualEx in the federated composite convex optimization setting. As the following theorem shows, FeDualEx achieves the same $O(1/R^{2/3})$ rate as in (Yuan et al., 2021). The proof is provided in Appendix F.

Theorem 2. Under the convex counterparts of the previous assumptions, choosing the step size $\eta_c = \min\Big\{\frac{1}{5^{1/2}\beta},\ \frac{B^{1/4}}{20^{1/4}\beta^{1/2}G^{1/2}K^{3/4}R^{1/4}},\ \frac{B^{1/2}M^{1/2}}{5^{1/2}\sigma R^{1/2}K^{1/2}},\ \frac{B^{1/3}}{2^{1/3}\beta^{1/3}G^{2/3}KR^{1/3}}\Big\}$, the ergodic intermediate sequence generated by FeDualEx for composite convex objectives satisfies, for any $x \in \mathcal{X}$,

$$\mathbb{E}\Big[\phi\Big(\frac{1}{RK}\sum_{r=0}^{R-1}\sum_{k=0}^{K-1}\widehat{x}_{r,k+1/2}\Big) - \phi(x)\Big] \le \frac{5^{1/2}\beta B}{RK} + \frac{20^{1/4}\beta^{1/2}G^{1/2}B^{3/4}}{K^{1/4}R^{3/4}} + \frac{5^{1/2}\sigma B^{1/2}}{M^{1/2}R^{1/2}K^{1/2}} + \frac{2^{1/3}\beta^{1/3}G^{2/3}B^{2/3}}{R^{2/3}}.$$

Even though this rate is not preserved in composite saddle point optimization, we note that the optimization of SPP is much more general, and convexity itself is a stronger assumption. More specifically, the complicated setting, including the non-smooth term, the primal-dual projection, and the extra-step saddle point optimization, together limits the tools available for analysis.

Remark on Heterogeneity. Even for federated composite optimization (Yuan et al., 2021), the heterogeneous setting presents significant hurdles. Specifically, the treatment of heterogeneity there is limited to quadratic functions, under which assumption the gradient is linear, which simplifies the analysis. It further relies on the norm generated by the Hessian.
For saddle functions, quadraticity (as well as a matrix-induced norm) is less well-defined, as the Jacobian of their gradient operator is not (symmetric) positive semidefinite in general. Such further advancements go beyond the scope of this paper. Thus, we regard Theorem 1 as a significant start for the federated learning of composite SPP.

## 5 FEDUALEX IN SEQUENTIAL SETTINGS

Stochastic Composite Saddle Point Optimization. FeDualEx can be naturally reduced to the sequential stochastic optimization of composite SPP, which we term Sequential FeDualEx or Stochastic Dual Extrapolation. By reducing the number of clients to one, thus eliminating the need for communication, the convergence analysis follows through smoothly and yields the $O(1/\sqrt{T})$ rate (denoting $K$ as $T$) expected of first-order stochastic algorithms. This is the first such rate in the non-Euclidean setting, matching the previous Euclidean rate (Mishchenko et al., 2020) and the non-composite rate (Juditsky et al., 2011). Theorem 3 gives the result, with proof in Appendix G.1.

Theorem 3. Under the sequential versions of the previous assumptions, $\forall z \in \mathcal{Z}$, choosing the step size $\eta = \min\big\{\frac{1}{3^{1/2}\beta}, \frac{B^{1/2}}{3^{1/2}\sigma T^{1/2}}\big\}$, the ergodic intermediate sequence of stochastic dual extrapolation satisfies

$$\mathbb{E}\Big[\mathrm{Gap}\Big(\frac{1}{T}\sum_{t=0}^{T-1} z_{t+1/2}\Big)\Big] \le \frac{3^{1/2}\beta B}{T} + \frac{3^{1/2}\sigma B^{1/2}}{T^{1/2}}.$$

$$\min_{x \in \mathcal{X}} \max_{y \in \mathcal{Y}} \ \langle Ax - b, y\rangle + \lambda\|x\|_1 - \lambda\|y\|_1, \quad A \in \mathbb{R}^{n \times m},\ \mathcal{X} = \{x \in \mathbb{R}^{m} : \|x\|_\infty \le D\},\ b \in \mathbb{R}^{n},\ \mathcal{Y} = \{y \in \mathbb{R}^{n} : \|y\|_\infty \le D\}.$$

Figure 2: The composite SPP with ℓ1 regularization for sparsity (Jiang & Mokhtari, 2022).

$$\min_{X \in \mathcal{X}} \max_{Y \in \mathcal{Y}} \ \mathrm{Tr}\big((AX - B)^\top Y\big) + \lambda\|X\|_* - \lambda\|Y\|_*, \quad A \in \mathbb{R}^{n \times m},\ \mathcal{X} = \{X \in \mathbb{R}^{m \times p} : \|X\|_2 \le D\},\ B \in \mathbb{R}^{n \times p},\ \mathcal{Y} = \{Y \in \mathbb{R}^{n \times p} : \|Y\|_2 \le D\}.$$

Figure 3: The composite SPP with nuclear norm low-rank regularization.

Deterministic Composite Saddle Point Optimization. Further removing the noise in the gradient, FeDualEx reduces to a deterministic algorithm for composite SPP. Even so, it still generalizes the classic dual extrapolation algorithm to CO, and we thus term the algorithm Deterministic FeDualEx or Composite Dual Extrapolation. Following a similar analysis, we obtain the same $O(1/T)$ rate as in previous work on CO (He et al., 2015), as well as for smooth dual extrapolation (Nesterov, 2007). The proof of Theorem 4, given in Appendix G.2, is much simpler and builds on the recently proposed relative Lipschitzness (Cohen et al., 2021).

Theorem 4. Under the basic convexity assumption and $\beta$-Lipschitzness of $g$, $\forall z \in \mathcal{Z}$ and $\eta \le \frac{1}{\beta}$, composite dual extrapolation satisfies $\mathrm{Gap}\big(\frac{1}{T}\sum_{t=0}^{T-1} z_{t+1/2}\big) \le \frac{\beta B}{T}$.

## 6 EXPERIMENTS

To complement our largely theoretical results, we verify in this section the effectiveness of FeDualEx by numerical evaluation. Additional experiments and detailed settings are deferred to Appendix A.

Composite Bilinear SPP. We first test FeDualEx on composite bilinear problems with synthetic data. The problems considered are shown in Figures 2 and 3, in which $m = 600$, $n = 300$, $p = 20$, $\lambda = 0.1$, $D = 0.05$. The corresponding composite terms are ℓ1 regularization with an ℓ∞ ball constraint, and nuclear norm regularization with a spectral norm constraint. The purpose of the ℓ1 regularization is to encourage sparsity, and of the nuclear norm regularization to encourage a low-rank solution. We compare FeDualEx against FedDualAvg, FedMiD (Yuan et al., 2021), and FedMiP, proposed in Algorithm 2 in Appendix H. We note that methods like Extra Step Local SGD (Beznosikov et al., 2020) and SCAFFOLD-S (Hou et al., 2021) are not suited to problems with non-smooth terms, but we include one of them for completeness, given that their rates are similar.
For such a comparison, one can only compute a subgradient instead of the gradient (which does not exist everywhere), and projection needs to be applied as well to account for the constraints.

[Figure 4: Duality gap and sparsity of the solution for the ℓ1-regularized SPP with ℓ∞ constraint, plotted against communication rounds for (a) one local update (K = 1) and (b) ten local updates (K = 10).]

[Figure 5: Duality gap and rank of the solution to the nuclear norm regularized SPP, plotted against communication rounds.]

[Figure 6: Universal adversarial training loss and validation accuracy of logistic regression on unattacked data, for FeDualEx and projected gradient descent ascent, on (a) MNIST and (b) CIFAR-10.]

[Figure 7: Attack generated from the universal-adversarially trained logistic regression on MNIST and CIFAR-10; panel (a) shows FeDualEx.]

We evaluate convergence in terms of the duality gap and also examine the structure of the solution, i.e., sparsity or low-rankness. The duality gap of the problems of interest can be evaluated in closed form, as derived in Appendices A.1 and A.2. Sparsity is measured by the ratio of non-zero entries to the parameter size, where we regard entries with magnitude less than $10^{-5}$ as zeros. For DO, we simulate $M = 100$ clients. For the gradient query of each client in each local update, we inject Gaussian noise from $\mathcal{N}(0, \sigma^2)$ with $\sigma = 0.1$. The evaluation is conducted for two different settings: (a) $K = 1$ local update and (b) $K = 10$ local updates. The results are shown in Figures 4 and 5.

From the duality gap curves in Figure 4, we see that the extra-step methods, i.e., FeDualEx and FedMiP, converge to the order of $10^{-1}$, whereas FedDualAvg and FedMiD stay above $10^{0}$. Thus, it is evident that methods for composite convex optimization are no longer suited for composite saddle point optimization, and FeDualEx provides the first effective solution addressing this challenge. From the sparsity of the solution, we see that the dual methods demonstrate better adherence to the regularization. Among the methods superior in saddle point optimization, FeDualEx reaches a sparsity of around 0.7 while FedMiP stays around 0.95. This aligns with the previous analysis of the advantage of dual aggregation and further validates the effectiveness of FeDualEx for solving composite SPP. In addition, methods for smooth unconstrained optimization like Extra Step Local SGD neither converge for SPP with composite non-smooth terms nor impose any desired structure, e.g., sparsity. We observe similar advantages of FeDualEx in convergence and in inducing low-rankness in Figure 5 as well.

Universal Adversarial Training of Logistic Regression. We also consider the task of universal adversarial training (Shafahi et al., 2020) of logistic regression, i.e., adversarial training against a universal adversarial perturbation (Moosavi-Dezfooli et al., 2017) targeted at all images in the dataset. To encourage sparsity of the attack, we also impose an ℓ1 regularization on the attack. The problem formulation is given in Appendix A.3.
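While the exact formulation is deferred to Appendix A.3, a plausible instantiation of this min-max objective (an illustrative assumption on our part, not the appendix's exact form) is sketched below, where $w$ denotes the model weights, $\delta$ the single shared perturbation, and $\lambda$ the sparsity regularization weight; the ℓ1 penalty on $\delta$ enters with a minus sign, matching the $-\psi_2(y)$ term of the composite SPP template (1).

```python
import numpy as np

def logistic_loss(w, X, y):
    # Binary logistic loss with labels y in {-1, +1}.
    return np.mean(np.log1p(np.exp(-y * (X @ w))))

def universal_adv_objective(w, delta, X, y, lam):
    # min_w max_delta: loss on images shifted by one shared perturbation,
    # minus an l1 penalty encouraging a sparse attack. (Illustrative sketch;
    # the paper's exact formulation lives in its Appendix A.3.)
    return logistic_loss(w, X + delta[None, :], y) - lam * np.abs(delta).sum()

# Tiny usage example with synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = rng.choice([-1, 1], size=100)
w, delta = np.zeros(20), np.zeros(20)
print(universal_adv_objective(w, delta, X, y, lam=0.1))  # log(2) at init
```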
We compare FeDualEx against direct aggregation of projected gradient descent ascent (PGDA), as proposed in Alg. 3 of (Shafahi et al., 2020). We evaluate convergence with the training loss, which is by no means an exact reflection of the duality gap. Still, we observe in Figure 6 that FeDualEx converges faster and delivers a better-hardened model with higher validation accuracy on unattacked data. Meanwhile, the vanilla aggregation of PGDA solutions yields a dense attack, whereas FeDualEx achieves much better sparsity, as visualized in Figure 7. Furthermore, we observe that the attack generated by distributed PGDA is not only dense but also smoothed out to small values close to zero, averaged over the number of clients.

## 7 CONCLUSION AND FUTURE WORK

We advance distributed optimization to the broad class of composite SPP by proposing FeDualEx and providing, to our knowledge, the first convergence rate of its kind. We demonstrate the effectiveness of FeDualEx for inducing structure through empirical evaluation. We also show that the sequential version of FeDualEx provides a solution to composite stochastic saddle point optimization in the non-Euclidean setting. We recognize that further study of the heterogeneous federated setting of composite saddle point optimization would be a challenging direction for future work.

## REFERENCES

Jacob Abernethy, Kevin A. Lai, Kfir Y. Levy, and Jun-Kun Wang. Faster rates for convex-concave games. In Proceedings of the 31st Conference On Learning Theory, volume 75 of Proceedings of Machine Learning Research, pp. 1595-1625. PMLR, 2018. URL https://proceedings.mlr.press/v75/abernethy18a.html.

Kimon Antonakopoulos, Veronica Belmega, and Panayotis Mertikopoulos. An adaptive mirror-prox method for variational inequalities with singular operators. Advances in Neural Information Processing Systems, 32, 2019.

K. J. Arrow, L. Hurwicz, and H. Uzawa. Studies in Linear and Non-Linear Programming. Stanford University Press, 1958.

Jean-François Aujol and Antonin Chambolle. Dual norms and image decomposition models. International Journal of Computer Vision, 63:85-104, 2005.

Necdet Serhat Aybat and Erfan Yazdandoost Hamedani. A primal-dual method for conic constrained distributed optimization problems. Advances in Neural Information Processing Systems, 29, 2016.

Site Bai, Chuyang Ke, and Jean Honorio. On the dual problem of convexified convolutional neural networks. Transactions on Machine Learning Research, 2024. URL https://openreview.net/forum?id=0yMuNezwJ1.

Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167-175, 2003.

Aleksandr Beznosikov and Alexander Gasnikov. Similarity, compression and local steps: Three pillars of efficient communications for distributed variational inequalities. arXiv preprint arXiv:2302.07615, 2023.

Aleksandr Beznosikov, Valentin Samokhin, and Alexander Gasnikov. Distributed saddle-point problems: Lower bounds, optimal and robust algorithms. arXiv preprint arXiv:2010.13112, 2020.

Aleksandr Beznosikov, Gesualdo Scutari, Alexander Rogozin, and Alexander Gasnikov. Distributed saddle-point problems under data similarity. Advances in Neural Information Processing Systems, 34:8172-8184, 2021.

Aleksandr Beznosikov, Pavel Dvurechenskii, Anastasiia Koloskova, Valentin Samokhin, Sebastian U. Stich, and Alexander Gasnikov. Decentralized local stochastic extra-gradient for variational inequalities. Advances in Neural Information Processing Systems, 35:38116-38133, 2022.
Ekaterina Borodich, Vladislav Tominin, Yaroslav Tominin, Dmitry Kovalev, Alexander Gasnikov, and Pavel Dvurechensky. Accelerated variance-reduced methods for saddle-point problems. EURO Journal on Computational Optimization, 10:100048, 2022.

Ekaterina Borodich, Georgiy Kormakov, Dmitry Kovalev, Aleksandr Beznosikov, and Alexander Gasnikov. Optimal algorithm with complexity separation for strongly convex-strongly concave composite saddle point problems. arXiv preprint arXiv:2307.12946, 2023.

Stephen P. Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

Kristian Bredies, Dirk A. Lorenz, and Stefan Reiterer. Minimization of non-smooth, non-convex functionals by iterative thresholding. Journal of Optimization Theory and Applications, 165:78-112, 2015.

L. M. Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7(3):200-217, 1967. URL https://www.sciencedirect.com/science/article/pii/0041555367900407.

Antoni Buades, Bartomeu Coll, and Jean-Michel Morel. A review of image denoising algorithms, with a new one. Multiscale Modeling & Simulation, 4(2):490-530, 2005.

Sébastien Bubeck et al. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231-357, 2015.

Brian Bullins and Kevin A. Lai. Higher-order methods for convex-concave min-max optimization and monotone variational inequalities. SIAM Journal on Optimization, 32(3):2208-2229, 2022.

Brian Bullins, Kshitij Patel, Ohad Shamir, Nathan Srebro, and Blake E. Woodworth. A stochastic Newton algorithm for distributed convex optimization. Advances in Neural Information Processing Systems, 34:26818-26830, 2021.

Jian-Feng Cai, Emmanuel J. Candès, and Zuowei Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956-1982, 2010.

Xuanyu Cao, Tamer Başar, Suhas Diggavi, Yonina C. Eldar, Khaled B. Letaief, H. Vincent Poor, and Junshan Zhang. Communication-efficient distributed learning: An overview. IEEE Journal on Selected Areas in Communications, 2023.

Y. Censor and S. A. Zenios. Proximal minimization algorithm with D-functions. Journal of Optimization Theory and Applications, 73(3):451-464, 1992.

Antonin Chambolle and Thomas Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40:120-145, 2011.

Antonin Chambolle and Thomas Pock. On the ergodic convergence rates of a first-order primal-dual algorithm. Mathematical Programming, 159(1-2):253-287, 2016.

Cheng Chen, Luo Luo, Weinan Zhang, and Yong Yu. Efficient projection-free algorithms for saddle point problems. Advances in Neural Information Processing Systems, 33:10799-10808, 2020.

Pin-Yu Chen and Cho-Jui Hsieh. Chapter 12 - Adversarial training. In Adversarial Robustness for Machine Learning, pp. 119-125. Academic Press, 2023. URL https://www.sciencedirect.com/science/article/pii/B9780128240205000235.

Michael B. Cohen, Aaron Sidford, and Kevin Tian. Relative Lipschitzness in extragradient methods and a direct recipe for acceleration. In 12th Innovations in Theoretical Computer Science Conference, ITCS 2021, volume 185 of LIPIcs, pp. 62:1-62:18. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2021. URL https://doi.org/10.4230/LIPIcs.ITCS.2021.62.
Patrick L. Combettes and Jean-Christophe Pesquet. Primal-dual splitting algorithm for solving inclusions with mixtures of composite, Lipschitzian, and parallel-sum type monotone operators. Set-Valued and Variational Analysis, 20(2):307-330, 2012.

Alexandros G. Dimakis, Anand D. Sarwate, and Martin J. Wainwright. Geographic gossip: Efficient aggregation for sensor networks. In Proceedings of the 5th International Conference on Information Processing in Sensor Networks, pp. 69-76, 2006.

John C. Duchi, Shai Shalev-Shwartz, Yoram Singer, and Ambuj Tewari. Composite objective mirror descent. In Conference on Learning Theory (COLT), volume 10, pp. 14-26. Citeseer, 2010.

John C. Duchi, Alekh Agarwal, and Martin J. Wainwright. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592-606, 2011.

Nicolas Flammarion and Francis Bach. Stochastic composite least-squares regression with convergence rate O(1/n). In Conference on Learning Theory (COLT), pp. 831-875. PMLR, 2017.

Margalit R. Glasgow, Honglin Yuan, and Tengyu Ma. Sharp bounds for federated averaging (Local SGD) and continuous perspective. In International Conference on Artificial Intelligence and Statistics, pp. 9050-9090. PMLR, 2022.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014. URL https://proceedings.neurips.cc/paper_files/paper/2014/file/5ca3e9b122f61f8f06494c97b1afccf3-Paper.pdf.

Vipul Gupta, Avishek Ghosh, Michał Dereziński, Rajiv Khanna, Kannan Ramchandran, and Michael W. Mahoney. LocalNewton: Reducing communication rounds for distributed learning. In Proceedings of the Thirty-Seventh Conference on Uncertainty in Artificial Intelligence, volume 161 of Proceedings of Machine Learning Research, pp. 632-642. PMLR, 2021. URL https://proceedings.mlr.press/v161/gupta21a.html.

Farzin Haddadpour, Mohammad Mahdi Kamani, Mehrdad Mahdavi, and Viveck Cadambe. Local SGD with periodic averaging: Tighter analysis and adaptive synchronization. Advances in Neural Information Processing Systems, 32, 2019.

Trevor Hastie, Robert Tibshirani, and Martin Wainwright. Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press, 2015.

Niao He, Anatoli Juditsky, and Arkadi Nemirovski. Mirror prox algorithm for multi-term composite minimization and semi-separable problems. Computational Optimization and Applications, 61:275-319, 2015.

Yunlong He and Renato D. C. Monteiro. Accelerating block-decomposition first-order methods for solving composite saddle-point and two-player Nash equilibrium problems. SIAM Journal on Optimization, 25(4):2182-2211, 2015.

Yunlong He and Renato D. C. Monteiro. An accelerated HPE-type algorithm for a class of composite convex-concave saddle-point problems. SIAM Journal on Optimization, 26(1):29-56, 2016.
Jean-Baptiste Hiriart-Urruty and Claude Lemaréchal. Fundamentals of Convex Analysis. Springer Science & Business Media, 2004.

Charlie Hou, Kiran K. Thekumparampil, Giulia Fanti, and Sewoong Oh. Efficient algorithms for federated saddle point optimization. arXiv preprint arXiv:2102.06333, 2021.

Ruichen Jiang and Aryan Mokhtari. Generalized optimistic methods for convex-concave saddle point problems. arXiv preprint arXiv:2202.09674, 2022.

Anatoli Juditsky, Arkadi Nemirovski, and Claire Tauvel. Solving variational inequalities with stochastic mirror-prox algorithm. Stochastic Systems, 1(1):17-58, 2011.

Peter Kairouz, H. Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. Advances and open problems in federated learning. Foundations and Trends in Machine Learning, 14(1-2):1-210, 2021.

Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian Stich, and Ananda Theertha Suresh. SCAFFOLD: Stochastic controlled averaging for federated learning. In International Conference on Machine Learning, pp. 5132-5143. PMLR, 2020.

Ahmed Khaled, Konstantin Mishchenko, and Peter Richtárik. Tighter theory for local SGD on identical and heterogeneous data. In International Conference on Artificial Intelligence and Statistics, pp. 4519-4529. PMLR, 2020.

Jakub Konečný, H. Brendan McMahan, Felix X. Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. Federated learning: Strategies for improving communication efficiency. NeurIPS Private Multi-Party Machine Learning Workshop, 2016.

G. M. Korpelevich. The extragradient method for finding saddle points and other problems. Ekonomika i Matematicheskie Metody, 12:747-756, 1976.

Georgios Kotsalis, Guanghui Lan, and Tianjiao Li. Simple and optimal methods for stochastic variational inequalities, I: Operator extrapolation. SIAM Journal on Optimization, 32(3):2041-2073, 2022.

Dmitry Kovalev, Elnur Gasanov, Peter Richtárik, and Alexander V. Gasnikov. Lower bounds and optimal algorithms for smooth and strongly convex decentralized optimization over time-varying networks. In Neural Information Processing Systems, 2021a.

Dmitry Kovalev, Egor Shulgin, Peter Richtárik, Alexander V. Rogozin, and Alexander Gasnikov. ADOM: Accelerated decentralized optimization method for time-varying networks. In International Conference on Machine Learning, pp. 5784-5793. PMLR, 2021b.

Dmitry Kovalev, Aleksandr Beznosikov, Ekaterina Borodich, Alexander Gasnikov, and Gesualdo Scutari. Optimal gradient sliding and its application to optimal distributed optimization under similarity. Advances in Neural Information Processing Systems, 35:33494-33507, 2022.

Sucheol Lee and Donghwan Kim. Fast extra gradient methods for smooth structured nonconvex-nonconcave minimax problems. Advances in Neural Information Processing Systems, 34:22588-22600, 2021.

Guoyin Li and Ting Kei Pong. Global convergence of splitting methods for nonconvex composite optimization. SIAM Journal on Optimization, 25(4):2434-2460, 2015.

Li Li, Yuxi Fan, Mike Tse, and Kuo-Yi Lin. A review of applications in federated learning. Computers & Industrial Engineering, 149:106854, 2020a.

Xiang Li, Kaixuan Huang, Wenhao Yang, Shusen Wang, and Zhihua Zhang. On the convergence of FedAvg on non-IID data. In International Conference on Learning Representations, 2020b. URL https://openreview.net/forum?id=HJxNAnVtDS.
Tianyi Lin, Chi Jin, and Michael I. Jordan. Near-optimal algorithms for minimax optimization. In Conference on Learning Theory, pp. 2738-2779. PMLR, 2020.

Changxin Liu, Zirui Zhou, Jian Pei, Yong Zhang, and Yang Shi. Decentralized composite optimization in stochastic networks: A dual averaging approach with linear convergence. IEEE Transactions on Automatic Control, 2022.

Weijie Liu, Aryan Mokhtari, Asuman Ozdaglar, Sarath Pattathil, Zebang Shen, and Nenggan Zheng. A decentralized proximal point-type method for saddle point problems. OPT2020: 12th Annual Workshop on Optimization for Machine Learning, 2020.

Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=rJzIBfZAb.

Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pp. 1273-1282. PMLR, 2017. URL https://proceedings.mlr.press/v54/mcmahan17a.html.

Panayotis Mertikopoulos, Bruno Lecouat, Houssam Zenati, Chuan-Sheng Foo, Vijay Chandrasekhar, and Georgios Piliouras. Optimistic mirror descent in saddle-point problems: Going the extra(-gradient) mile. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Bkg8jjC9KQ.

Konstantin Mishchenko, Dmitry Kovalev, Egor Shulgin, Peter Richtárik, and Yura Malitsky. Revisiting stochastic extragradient. In International Conference on Artificial Intelligence and Statistics, pp. 4573-4582. PMLR, 2020.

Konstantin Mishchenko, Grigory Malinovsky, Sebastian Stich, and Peter Richtárik. ProxSkip: Yes! Local gradient steps provably lead to communication acceleration! Finally! In International Conference on Machine Learning, pp. 15750-15769. PMLR, 2022.

Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. DeepFool: A simple and accurate method to fool deep neural networks. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2574-2582, 2016. doi: 10.1109/CVPR.2016.282.

Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Omar Fawzi, and Pascal Frossard. Universal adversarial perturbations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1765-1773, 2017.

Angelia Nedich et al. Convergence rate of distributed averaging dynamics and optimization in networks. Foundations and Trends in Systems and Control, 2(1):1-100, 2015.

Arkadi Nemirovski. Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 15(1):229-251, 2004.

Arkadij Semenovič Nemirovskij and David Borisovich Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley-Interscience, 1983.

Yu Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103:127-152, 2005.

Yurii Nesterov. Dual extrapolation and its applications to solving variational inequalities and related problems. Mathematical Programming, 109(2-3):319-344, 2007.

Yurii Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Programming, 120(1):221-259, 2009.
Yuyuan Ouyang and Yangyang Xu. Lower complexity bounds of first-order methods for convex-concave bilinear saddle-point problems. Mathematical Programming, 185(1-2):1-35, 2021.

Leonid Denisovich Popov. A modification of the Arrow-Hurwicz method for search of saddle points. Mathematical Notes of the Academy of Sciences of the USSR, 28:845-848, 1980.

Michael Rabbat. Multi-agent mirror descent for decentralized stochastic optimization. In 2015 IEEE 6th International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), pp. 517-520. IEEE, 2015.

Ali Ramezani-Kebrya, Kimon Antonakopoulos, Igor Krawczuk, Justin Deschenaux, and Volkan Cevher. Distributed extra-gradient with optimal complexity and communication guarantees. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=b3itJyarLM0.

R. Tyrrell Rockafellar. Convex Analysis. Princeton Landmarks in Mathematics and Physics. Princeton University Press, 1970.

Alexander Rogozin, Aleksandr Beznosikov, Darina Dvinskikh, Dmitry Kovalev, Pavel Dvurechensky, and Alexander Gasnikov. Decentralized distributed optimization for saddle point problems. arXiv preprint arXiv:2102.07758, 2021.

Mher Safaryan, Rustem Islamov, Xun Qian, and Peter Richtárik. FedNL: Making Newton-type methods applicable to federated learning. In International Conference on Machine Learning, pp. 18959-19010. PMLR, 2022.

Arda Sahiner, Tolga Ergen, Batu Ozturkler, Burak Bartan, John M. Pauly, Morteza Mardani, and Mert Pilanci. Hidden convexity of Wasserstein GANs: Interpretable generative models with closed-form solutions. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=e2Lle5cij9D.

Ali Shafahi, Mahyar Najibi, Zheng Xu, John Dickerson, Larry S. Davis, and Tom Goldstein. Universal adversarial training. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 5636-5643, 2020.

Ohad Shamir, Nati Srebro, and Tong Zhang. Communication-efficient distributed optimization using an approximate Newton-type method. In International Conference on Machine Learning, pp. 1000-1008. PMLR, 2014.

Pranay Sharma, Rohan Panda, Gauri Joshi, and Pramod Varshney. Federated minimax optimization: Improved convergence analyses and algorithms. In International Conference on Machine Learning, pp. 19683-19730. PMLR, 2022.

Pranay Sharma, Rohan Panda, and Gauri Joshi. Federated minimax optimization with client heterogeneity. arXiv preprint arXiv:2302.04249, 2023.

Yan Shen, Jian Du, Han Zhao, Benyu Zhang, Zhanghexuan Ji, and Mingchen Gao. FedMM: Saddle point optimization for federated adversarial domain adaptation. In The 22nd International Conference on Autonomous Agents and Multiagent Systems (AAMAS), 2023.

Zhan Shi, Xinhua Zhang, and Yaoliang Yu. Bregman divergence for stochastic variance reduction: Saddle-point and adversarial prediction. Advances in Neural Information Processing Systems, 30, 2017.

Mircea Sofonea and Andaluzia Matei. Variational Inequalities with Applications: A Study of Antiplane Frictional Contact Problems, volume 18. Springer Science & Business Media, 2009.

Mikhail V. Solodov and Benar Fux Svaiter. A hybrid approximate extragradient proximal point algorithm using the enlargement of a maximal monotone operator. Set-Valued Analysis, 7:323-345, 1999.

Chaobing Song, Zhengyuan Zhou, Yichao Zhou, Yong Jiang, and Yi Ma. Optimistic dual extrapolation for coherent non-monotone variational inequalities. Advances in Neural Information Processing Systems, 33:14303-14314, 2020.
Sebastian U. Stich. Local SGD converges fast and communicates little. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=S1g2JnRcFX.

Gilbert Strang. Linear Algebra and Its Applications. Belmont, CA: Thomson, Brooks/Cole, 2006.

Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), 58(1):267-288, 1996. URL http://www.jstor.org/stable/2346178.

Vladislav Tominin, Yaroslav Tominin, Ekaterina Borodich, Dmitry Kovalev, Alexander Gasnikov, and Pavel Dvurechensky. On accelerated methods for saddle-point problems with composite structure. arXiv preprint arXiv:2103.09344, 2021.

Qianqian Tong, Guannan Liang, Tan Zhu, and Jinbo Bi. Federated nonconvex sparse learning. arXiv preprint arXiv:2101.00052, 2020.

Quoc Tran Dinh, Nhan H. Pham, Dzung Phan, and Lam Nguyen. FedDR - randomized Douglas-Rachford splitting algorithms for nonconvex federated composite optimization. Advances in Neural Information Processing Systems, 34:30326-30338, 2021.

Paul Tseng. On accelerated proximal gradient methods for convex-concave optimization. Submitted to SIAM Journal on Optimization, 2(3), 2008.

Hoi-To Wai, Zhuoran Yang, Zhaoran Wang, and Mingyi Hong. Multi-agent reinforcement learning via double averaging primal-dual optimization. Advances in Neural Information Processing Systems, 31, 2018.

Jianyu Wang, Zachary Charles, Zheng Xu, Gauri Joshi, H. Brendan McMahan, Maruan Al-Shedivat, Galen Andrew, Salman Avestimehr, Katharine Daly, Deepesh Data, et al. A field guide to federated optimization. arXiv preprint arXiv:2107.06917, 2021.

Blake Woodworth, Kumar Kshitij Patel, Sebastian Stich, Zhen Dai, Brian Bullins, Brendan McMahan, Ohad Shamir, and Nathan Srebro. Is local SGD better than minibatch SGD? In International Conference on Machine Learning, pp. 10334-10343. PMLR, 2020a.

Blake E. Woodworth, Kumar Kshitij Patel, and Nati Srebro. Minibatch vs local SGD for heterogeneous distributed learning. Advances in Neural Information Processing Systems, 33:6281-6292, 2020b.

Lin Xiao. Dual averaging methods for regularized stochastic learning and online optimization. The Journal of Machine Learning Research, 11:2543-2596, 2010.

Tesi Xiao, Xuxing Chen, Krishnakumar Balasubramanian, and Saeed Ghadimi. A one-sample decentralized proximal algorithm for non-convex stochastic composite optimization. In Proceedings of the Thirty-Ninth Conference on Uncertainty in Artificial Intelligence, volume 216 of Proceedings of Machine Learning Research, pp. 2324-2334. PMLR, 2023. URL https://proceedings.mlr.press/v216/xiao23a.html.

Jinming Xu, Ye Tian, Ying Sun, and Gesualdo Scutari. Distributed algorithms for composite optimization: Unified framework and convergence analysis. IEEE Transactions on Signal Processing, 69:3555-3570, 2021. doi: 10.1109/TSP.2021.3086579.

Yonggui Yan, Jie Chen, Pin-Yu Chen, Xiaodong Cui, Songtao Lu, and Yangyang Xu. Compressed decentralized proximal stochastic gradient method for nonconvex composite problems with heterogeneous data. In International Conference on Machine Learning, 2023.

Hao Yu, Sen Yang, and Shenghuo Zhu. Parallel restarted SGD with faster convergence and less communication: Demystifying why model averaging works for deep learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 5693-5700, 2019.
Parallel restarted sgd with faster convergence and less communication: Demystifying why model averaging works for deep learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 5693 5700, 2019. Honglin Yuan and Tengyu Ma. Federated accelerated stochastic gradient descent. Advances in Neural Information Processing Systems, 33:5332 5344, 2020. Honglin Yuan, Manzil Zaheer, and Sashank Reddi. Federated composite optimization. In International Conference on Machine Learning, pp. 12253 12266. PMLR, 2021. Fan Zhou and Guojing Cong. On the convergence properties of a k-step averaging stochastic gradient descent algorithm for nonconvex optimization. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, pp. 3219 3227. International Joint Conferences on Artificial Intelligence Organization, 7 2018. doi: 10.24963/ijcai.2018/447. URL https://doi.org/10.24963/ijcai.2018/447. Kang Zhou, Shenghua Gao, Jun Cheng, Zaiwang Gu, Huazhu Fu, Zhi Tu, Jianlong Yang, Yitian Zhao, and Jiang Liu. Sparse-gan: Sparsity-constrained generative adversarial network for anomaly detection in retinal oct image. In 2020 IEEE 17th International Symposium on Biomedical Imaging (ISBI), pp. 1227 1231. IEEE, 2020. Martin Zinkevich, Markus Weimer, Lihong Li, and Alex Smola. Parallelized stochastic gradient descent. Advances in neural information processing systems, 23, 2010. Published as a conference paper at ICLR 2024 In Appendix A, we provide details on experiment settings and additional experiments on the universal adversarial training of non-convex convolutional neural networks. In Appendix B, an extended literature review on various related subfields is included. Appendix C and D provide additional theoretical background, including relevant preliminaries, definitions, remarks, and technical lemmas. Appendix E, F, and G provide the convergence rates and complete proofs for Fe Dual Ex in federated composite saddle point optimization, federated composite convex optimization, sequential stochastic composite optimization, and sequential deterministic composite optimization respectively. Finally, the algorithm of Fed Mi P is presented in Appendix H. A Additional Experiments and Setup Details 18 A.1 Setup Details for Saddle Point Optimization with Sparsity Regularization . . . . . 18 A.2 Saddle Point Optimization with Low-Rank Regularization . . . . . . . . . . . . . 19 A.3 Universal Adversarial Training of Logistic Regression . . . . . . . . . . . . . . . . 20 A.4 Universal Adversarial Training of Neural Networks . . . . . . . . . . . . . . . . . 21 B Extended Literature Review 21 B.1 Distributed Optimization / Federated Learning . . . . . . . . . . . . . . . . . . . . 21 B.2 Saddle Point Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 B.3 Composite Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 B.4 Other Tangentially Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 C Additional Preliminaries, Definitions, and Remarks on Assumptions 23 C.1 Additional Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 C.1.1 Mirror Descent and Dual Averaging . . . . . . . . . . . . . . . . . . . . . 24 C.1.2 Mirror Prox and Dual Extrapolation . . . . . . . . . . . . . . . . . . . . . 25 C.2 Additional Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 C.3 Formal Assumptions and Remarks . . . . . . . . . . . . . . . . . . . . . . . . 
. . 26 D Additional Technical Lemmas 27 E Complete Analysis of Fe Dual Ex for Composite Saddle Point Problems 29 E.1 Main Theorem and Proof . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 E.2 Helping Lemmas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 F Complete Analysis of Fe Dual Ex for Composite Convex Optimization 36 G Fe Dual Ex in Other Settings 39 G.1 Stochastic Dual Extrapolation for Composite Saddle Point Optimization . . . . . . 39 G.2 Deterministic Dual Extrapolation for Composite Saddle Point Optimization . . . . 41 H Federated Mirror Prox 42 Published as a conference paper at ICLR 2024 A ADDITIONAL EXPERIMENTS AND SETUP DETAILS A.1 SETUP DETAILS FOR SADDLE POINT OPTIMIZATION WITH SPARSITY REGULARIZATION We provide additional details for the SPP with the sparsity regularization demonstrated in the main text. We start by restating its formulation below: min x X max y Y Ax b, y + λ x 1 λ y 1 A Rn m, X = {Rm : x D}, b Rn, Y = {Rn : y D}. Soft-Thresholding Operator for ℓ1 Norm Regularization. By choosing the distance-generating function to be ℓ= 1 2 x 2 2 + 1 2 y 2 2, the projection ℓ r,k( ) instantiates to the following element-wise soft-thresholding operator (Hastie et al., 2015; Jiang & Mokhtari, 2022): (|ω| λ ) sgn(ω) if λ < |ω| λ + D D sgn(ω) otherwise , in which λ = ληc(ηsr K + k). Closed-Form Duality Gap. The closed-form duality gap is given by Gap(x, y) = D (|Ax b| λ)+ 1 + λ x 1 + D (|A y| λ)+ 1 + b, y + λ y 1, where | | and ()+ = max{ , 0} are element-wise. We provide a brief derivation below. Since a constraint is equivalent to an indicator regularization, we move the ℓ constraint into the objective and denote g1( ) = 1, g2( ) = 0 if D otherwise . By the definitions of duality gap in Definition 1 and convex conjugate in Definition 9, the duality gap equals to Gap(x, y) = max y λ{ 1 λ(Ax b), y g1(y) g2(y) + x 1} λ(A y), x + g1(x) + g2(x) y 1 1 = λ(g1 + g2) ( 1 λ(Ax b)) + λ(g1 + g2) ( 1 λ(A y)) + λ x 1 + λ y 1 + b y = inf u {λg 1(u) + λg 2( 1 λ(Ax b) u)} + inf v {λg 1(v) + λg 2( 1 + λ x 1 + λ y 1 + b y, in which the last equality holds by Theorem 2.3.2, namely infimal convolution, in Chapter E of Hiriart-Urruty & Lemaréchal (2004). By definition of the convex conjugate, the convex conjugate of a norm g( ) = p is defined to be g ( ) = 0 if q 1 otherwise , in which q is the dual norm of p. Given that ℓ1 and ℓ are dual norms to each other, g 1( ) = 0 if 1 otherwise , g 2( ) = D 1. Therefore the infimum is achieved when i [m], j [n], λ(Ax b)i if | 1 λ(Ax b)i| 1 sgn( 1 λ(Ax b)i) otherwise , vj = 1 λ(A y)j if | 1 λ(A y)j| 1 sgn( 1 λ(A y)j) otherwise , which yields the closed-form duality gap. Additional Experiment Details. We generate a fixed pair of A and b with each entry independently following the uniform distribution U[ 1,1]. Each entry of the variables x and y is initialized independently from the distribution U[ D,D]. As in (Jiang & Mokhtari, 2022), we take m = 600, n = 300, λ = 0.1, D = 0.05. For DO, we simulate M = 100 clients. For the gradient query of each client Published as a conference paper at ICLR 2024 in each local update, we inject a Gaussian noise from N(0, σ2). All M = 100 clients participate in each round; noise on each client is i.i.d. with σ = 0.1. We only tune the global step size ηs and the local step size ηc. For all experiments, the parameters are searched from the combination of ηs {1, 3e 1, 1e 1, 3e 2, 1e 2} and ηc {1, 3e 1, 1e 1, 3e 2, 1e 2, 3e 3, 1e 3}. 
We run each setting for 10 different random seeds and report the mean and standard deviation in Figure 4. A.2 SADDLE POINT OPTIMIZATION WITH LOW-RANK REGULARIZATION We test Fe Dual Ex on the following SPP with nuclear norm regularization for low-rankness, in which we overuse the notation for the matrix nuclear norm and 2 for the matrix spectral norm. We use Tr( ) to denote the trace of a square matrix. And for the purpose of feasibility and convenience, we impose spectral norm constraints on the variables as well. min X X max Y Y Tr (AX B) Y + λ X λ Y A Rn m, X = {Rm p : X 2 D}, B Rn p, Y = {Rn p : Y 2 D}. Soft-Thresholding Operator for Nuclear Norm Regularization. By choosing the distancegenerating function to be ℓ= 1 2 X 2 F + 1 2 Y 2 F where F denotes the Frobenius norm, the projection ℓ r,k( ) instantiates to the following element-wise singular value soft-thresholding operator (Cai et al., 2010): Tλ (W) := UTλ (Σ)V , Tλ (Σ) = diag(sgn(σi(W)) min{max{σi(W) λ , 0}, D}), in which λ = ληc(ηsr K + k), W = UΣV is the singular value decomposition (SVD) of W, and we overuse the notation σi( ) to represent the singular values. Closed-Form Duality Gap. The closed-form duality gap is given by Gap(X, Y) = D diag (|σi(AX B)| λ)+ + λ X + D diag (|σj(A Y)| λ)+ + Tr B Y + λ Y , We provide a brief derivation below. Since a constraint is equivalent to an indicator regularization, we move the spectral norm constraint into the objective and denote g1( ) = , g2( ) = 0 if 2 D otherwise . By the definitions of duality gap in Definition 1 and convex conjugate in Definition 9, the duality gap equals to Gap(X, Y) = max Y λ{Tr 1 λ(AX B) Y g1(Y) g2(Y) + X } min X λ{{Tr 1 λ(A Y) X + g1(X) + g2(X) Y 1 = λ(g1 + g2) ( 1 λ(AX B)) + λ(g1 + g2) ( 1 + λ X + λ Y + Tr B Y = inf P {λg 1(P) + λg 2( 1 λ(AX B) P)} + inf Q {λg 1(Q) + λg 2( 1 + λ X + λ Y + Tr B Y , in which the last equality holds by Theorem 2.3.2, namely infimal convolution, in Chapter E of Hiriart-Urruty & Lemaréchal (2004). By definition of the dual norm, we know that the nuclear norm and the spectral norm are dual norms to each other. Therefore, g 1( ) = 0 if 2 1 otherwise , Published as a conference paper at ICLR 2024 g 2( ) = D . And the infimum is achieved when σi(P) = σi 1 λ(Ax B) if |σi 1 λ(Ax B) | 1 sgn σi 1 λ(Ax B) otherwise , σj(Q) = σj 1 λ(A y) if |σj 1 λ(A y) | 1 sgn σj 1 λ(A y) otherwise , which yields the closed-form duality gap. Experiment Settings. We generate a fixed pair of A and B. Each entry of A and half of the columns in B follows the uniform distribution U[ 1,1] independently. Each entry of the variables X and Y is initialized independently from the distribution U[ 1,1]. We take m = 600, n = 300, p = 20, λ = 0.1, D = 0.05. For DO, we simulate M = 100 clients. For the gradient query of each client in each local update, we inject a Gaussian noise from N(0, σ2). All M = 100 clients participate in each round; noise on each client is i.i.d. with σ = 0.1. We only tune the global step size ηs and the local step size ηc. For all experiments, the parameters are searched from the combination of ηs {1, 3e 1, 1e 1, 3e 2, 1e 2} and ηc {10, 3, 1, 3e 1, 1e 1, 3e 2, 1e 2, 3e 3, 1e 3}. We run each setting for 10 different random seeds and plot the mean and the standard deviation. We evaluate the convergence in terms of the duality gap and also demonstrate the rank of the solution, for both X and Y. For the feasibility of low-rankness, we generate B to be of rank p 2, i.e. half of the columns of B is linearly dependent on the other half. 
With p = 20, the optimal rank for the solution would most likely be 10. The evaluation is conducted for two different settings: (a) K = 1 local update for R = 100 rounds; (b) K = 10 local updates for R = 20 rounds. The results are demonstrated in Figure 5 correspondingly. Discussions. From Figure 5, we can see that in the setting for low-rankness regularization, dual methods tend to perform better both in minimizing the duality gap and in encouraging a low-rank solution. In particular, Fe Dual Ex, as a method geared for saddle point optimization, demonstrates better convergence in the duality gap than Fed Dual Avg. In the meantime, the solution given by Fe Dual Ex quickly reaches the optimal rank of 10. This further reveals the potential of Fe Dual Ex in coping with a variety of regularization and constraints. A.3 UNIVERSAL ADVERSARIAL TRAINING OF LOGISTIC REGRESSION We provide the problem formulation and detailed experiment setting for the universal adversarial training of logistic regression demonstrated in the main text. Problem Formulation. As introduced, we impose an l1 regularization on the attack to encourage sparsity in addition to the ball constraint. The problem can be formulated as the following SPP: min w Rd max δ D 1 n i=1 ℓ(w (xi + δ), yi) + λ δ 1 in which ℓis the cross-entropy loss for multiclass logistic regression; w Rd is the parameter; xi Rd is the data and yi is the label; δ Rd is the attack. Experiment Settings. The training data for MNIST is evenly distributed across M = 100 clients, each possessing 600. The client makes K = 5 local updates and communicates for R = 20 rounds. For the CIFAR-10 experiments, each of the 100 clients holds 500 of the training data. The client makes K = 5 local updates and communicates for R = 40 rounds. D = 0.05 for data normalized between 0 and λ = 0.1. Validation is done on the whole validation dataset on the server with unattacked data. As before, the hyper-parameters are searched from the combination of ηs {1, 3e 1, 1e 1, 3e 2, 1e 2} and ηc {10, 3, 1, 3e 1, 1e 1, 3e 2, 1e 2, 3e 3, 1e 3}. We run each setting for 10 different random seeds and plot the mean and the standard deviation in Figure 6. Attack Visualization. The attack for MNIST has only one channel and is directly visualized with the color map from blue to red rescaled between the range of the attack, with blue being negative, red being positive, and purple being zero. The attack for CIFAR-10 contains 3 channels and can be directly visualized with RGB mode rescaled between 0 and 255. For the attack to be visible, we divide the value by its maximum then times the result by 4. Published as a conference paper at ICLR 2024 0 25 50 75 Communication Rounds 1.4 100 1.6 100 1.8 100 2 100 2.2 100 0 25 50 75 Communication Rounds Projected Gradient Descent Ascent Fe Dual Ex Figure 8: Training loss and validation accuracy of 3-layer CNN on unattacked data. A.4 UNIVERSAL ADVERSARIAL TRAINING OF NEURAL NETWORKS Even though the theoretical result is derived with respect to convex functions, we experimentally demonstrate the convergence Fe Dual Ex for non-convex functions with the adversarial training of neural networks on CIFAR-10. The model tested is a 3-layer convolutional neural network (CNN) with 16, 32, and 64 filters of size 3 3, each layer followed by a relu activation and a 2 2 max-pooling. The performance is demonstrated in Figure 8. 
The loss value is by no means an exact reflection of the duality gap, nevertheless, Fe Dual Ex also converges for non-convex functions, yielding faster numerical convergence and better-hardened models in terms of validation accuracy on unattacked data. In addition, the sparsity of the attack generated by Fe Dual Ex is 50.38%, whereas that by the vanilla distributed version of projected gradient descent ascent is 99.31%. B EXTENDED LITERATURE REVIEW B.1 DISTRIBUTED OPTIMIZATION / FEDERATED LEARNING In recent years, distributed learning has received increasing attention in practice and theory. Earlier works in the field were known as parallel (Zinkevich et al., 2010) or local (Zhou & Cong, 2018; Stich, 2019), which are later recognized as the homogeneous case, where data across clients are assumed to be balanced and i.i.d. (independent and identically distributed), of federated learning (FL), specifically, Federated Averaging (Fed Avg) (Mc Mahan et al., 2017), DO or FL has been found appealing in various applications (Li et al., 2020a). On the theoretical front, Stich (2019) provides the first convergence rate for Local SGD, or, Fed Avg under the homogeneous setting. The distributed optimization paradigm we consider aligns with that in Local SGD (Stich, 2019). The rate for Local SGD has been improved with tighter analysis (Haddadpour et al., 2019; Khaled et al., 2020; Woodworth et al., 2020a; Glasgow et al., 2022) and acceleration techniques (Yuan & Ma, 2020; Mishchenko et al., 2022). Others also analyze Fed Avg under heterogeneity (Haddadpour et al., 2019; Khaled et al., 2020; Woodworth et al., 2020b) and non-i.i.d. data (Li et al., 2020b) or in light propose improvements (Karimireddy et al., 2020). Recently, the idea of DO is further extended to higher-order methods (Bullins et al., 2021; Gupta et al., 2021; Safaryan et al., 2022). Due to the page limit, we refer the readers to (Cao et al., 2023; Wang et al., 2021; Kairouz et al., 2021) for more comprehensive reviews of DO and FL. In the meantime, we point out that none of the work mentioned above covers saddle point problems or non-smooth composite or constrained problems. For distributed saddle point optimization and federated composite optimization, we defer to the following subsections. B.2 SADDLE POINT OPTIMIZATION The study of Saddle Point Optimization dates back to the very early gradient descent ascent (Arrow et al., 1958). It was later improved by the important ideas of extra-gradient (Korpelevich, 1976) and optimism (Popov, 1980). In light of these ideas, many algorithms were proposed for SPP (Solodov & Svaiter, 1999; Nemirovski, 2004; Nesterov, 2007; Chambolle & Pock, 2011; Mertikopoulos et al., 2019; Jiang & Mokhtari, 2022). Among them, in the convex-concave setting in particular, the most relevant and prominent ones are Nemirovski s mirror prox Nemirovski (2004) and Nesterov s dual extrapolation Nesterov (2007). They generalize respectively Mirror Descent (Nemirovskij & Yudin, 1983) and Dual Averaging (Nesterov, 2009) from convex optimization to monotone variational inequalities (VIs) which include SPP as one realization. Along with Tseng s Accelerated Proximal Published as a conference paper at ICLR 2024 Gradient (Tseng, 2008), they are the three methods that converge to an ϵ-approximate solution in terms of duality gap at O( 1 T ), the known best rate for a general convex-concave SPP (Ouyang & Xu, 2021; Lin et al., 2020). 
Mirror prox inspired many papers (Antonakopoulos et al., 2019; Chen et al., 2020) and is later extended to the stochastic setting (Juditsky et al., 2011; Mishchenko et al., 2020), the higher-order setting (Bullins & Lai, 2022), and even the composite setting (He et al., 2015), whose introduction we defer to the review of composite optimization. Dual extrapolation is later extended to non-monotone VIs (Song et al., 2020), yet its stochastic and composite versions are, to the best of our knowledge, not found. Kotsalis et al. (2022) recently studied optimal methods for stochastic variational inequalities, yet their result is limited to smooth VIs, not composite ones. From the perspective of distributed optimization, several works have made preliminary progress for smooth and unconstrained SPP in the Euclidean space. Beznosikov et al. (2020) investigate the distributed extra-gradient method under various conditions and provide upper and lower bounds under strongly-convex strongly-concave and non-convex non-concave assumptions. Hou et al. (2021) propose Fed Avg-S and SCAFFOLD-S based on Fed Avg (Mc Mahan et al., 2017) and SCAFFOLD (Karimireddy et al., 2020) for SPP, which achieves similar convergence rate to the distributed extragradient algorithm (Beznosikov et al., 2020) under the strong-convexity-concavity assumption. In addition, (Ramezani-Kebrya et al., 2023) studies the problem from the information compression perspective with the measure of communication bits. The topic of distributed or federated saddle point optimization is also found in recent applications of interest, e.g. adversarial domain adaptation (Shen et al., 2023). Yet, none of the existing works includes the study for SPP with constraints or composite possibly non-smooth regularization. Outside of our setting, Borodich et al. (2023) also studies composite SPP, but assumes composite terms to be smooth as well. B.3 COMPOSITE OPTIMIZATION Composite optimization has been an important topic due to its reflection of real-world complexities. Representative works include composite mirror descent (Duchi et al., 2010) and regularized dual averaging (Xiao, 2010; Flammarion & Bach, 2017) that generalize mirror descent (Nemirovskij & Yudin, 1983) and dual averaging (Nesterov, 2009) in the context of composite convex optimization. Composite saddle point optimization, in comparison, appears dispersedly in early-day problems in practice (Buades et al., 2005; Aujol & Chambolle, 2005), often as a primal-dual reformulation of composite convex problems. Solving techniques such as smoothing (Nesterov, 2005) and primal-dual splitting (Combettes & Pesquet, 2012) were proposed, and numerical speed-ups were studied (He & Monteiro, 2015; 2016), while systematic convergence analysis on general composite SPP came later in time (He et al., 2015; Chambolle & Pock, 2016; Jiang & Mokhtari, 2022). Recently, Tominin et al. (2021); Borodich et al. (2022) also proposed acceleration techniques for composite SPP. Most related among them, the pioneering composite mirror prox (Co MP) (He et al., 2015) constructs auxiliary variables for the composite regularization terms as an upper bound and thus moves the non-smooth term into the problem domain. Observing that the gradient operator for the auxiliary variable is constant, Co MP operates as if there were no composite components at all (He et al., 2015), and exhibits a O( 1 T ) convergence rate that matches its smooth version (Nemirovski, 2004). In the stochastic setting, Mishchenko et al. 
(2020) analyzed a variant of stochastic mirror prox (Juditsky et al., 2011), which is then capable of handling composite terms in the Euclidean space. In this paper, we take a different approach that utilizes the generalized Bregman divergence and get the same rate for composite dual extrapolation. For distributed composite optimization with local updates, Yuan et al. (2021) study Federated Mirror Descent, a natural extension of Fed Avg that adapts to composite optimization under the convex setting. Along the way, they identified the curse of primal averaging specific to composite optimization in the DO paradigm, where the regularization-imposed structure on the client models may no longer hold after server primal averaging. To resolve this issue, they further proposed Federated Dual Averaging which brings the averaging step to the dual space. Tran Dinh et al. (2021) proposes a federated Douglas-Rachford splitting algorithm for nonconvex composite optimization. On the less related constrained optimization topic, Tong et al. (2020) proposed a federated learning algorithm for nonconvex sparse learning under ℓ0 constraint. To the best of our knowledge, the field of distributed optimization for composite SPP remains blank, which we regard as the main focus of this paper. Published as a conference paper at ICLR 2024 B.4 OTHER TANGENTIALLY RELATED WORK Decentralized Optimization. Parallel to FL or DO with local updates, there is another line of work that studies decentralized optimization or consensus optimization over networks, in which machines communicate directly with each other based on their topological connectivity (Nedich et al., 2015). Classic algorithms mentioned previously are widely applied as well under this paradigm, for example, decentralized mirror descent (Rabbat, 2015) and decentralized (composite) dual averaging over networks (Duchi et al., 2011; Liu et al., 2022). Further in the context of composite optimization, Yan et al. (2023); Xiao et al. (2023) focus on composite non-convex objectives under the decentralized setting. Saddle point optimization has also been studied for decentralized optimization, including for proximal point-type methods (Liu et al., 2020) and extra-gradient methods (Rogozin et al., 2021; Beznosikov et al., 2021; 2022). In particular, Rogozin et al. (2021) studies decentralized mirror prox in the Euclidean space. We would like to point out that mirror prox in the Euclidean space reduces to vanilla extra-gradient methods. In addition, Aybat & Yazdandoost Hamedani (2016); Xu et al. (2021) study the saddle point reformulation for composite convex objectives over decentralized networks, which essentially focus on composite convex optimization. In the general context of distributed learning of composite SPP, by the judgment of the authors, we came across no paper in decentralized optimization similar to ours. More importantly, decentralized optimization focuses on topics like time-varying network topology (Kovalev et al., 2021a;b) or gossip schema (Dimakis et al., 2006), which are fundamentally different from our setting in terms of motivations, communication protocols, and techniques (Kairouz et al., 2021). Nonconvex-Nonconcave Saddle Point Problems. For nonconvex-nonconcave SPP, several distributed learning methods have recently been proposed, including extra-gradient methods (Lee & Kim, 2021) and the Local Stochastic Gradient Descent Ascent (Local SGDA) (Sharma et al., 2022; 2023). 
Yet we emphasize that our object of analysis is composite SPP with possibly non-smooth regularization, and as remarked by Yuan et al. (2021), non-convex optimization for composite possibly non-smooth functions is in itself intricate even for sequential optimization, involving additional assumptions and sophisticated algorithm design (Li & Pong, 2015; Bredies et al., 2015), let alone distributed learning of SPP. Thus we focus on convex-concave analysis in this paper. Finite-Sum Optimization with Function Similarity. Another line of work considers finite-sum optimization with function similarity, following the setting similar to DANE (Shamir et al., 2014). In this setting, each machine is assumed to maintain a fixed set of data so that the functions across machines can be δ-similar with a high probability by large-sample concentration inequalities. In the context of distributed saddle points optimization, examples include (Kovalev et al., 2022) and (Beznosikov & Gasnikov, 2023). This setting is significantly different from ours because we do not consider δ-similarity, and our optimization procedure is presented in an online scheme. In (Beznosikov & Gasnikov, 2023) in particular, local steps are also considered, but we would note that Beznosikov & Gasnikov (2023) require the server to take local steps instead of the clients, which also requires the presence of data on the server. This is done by making the first client the server and is in line with the setting in DANE (Shamir et al., 2014). In contrast, the server in our setting only aggregates the model and does not access any data, which also aligns with the privacy-preserving purpose in FL. C ADDITIONAL PRELIMINARIES, DEFINITIONS, AND REMARKS ON ASSUMPTIONS In this section, we provide supplementary theoretical backgrounds for the algorithm and the convergence analysis of Fe Dual Ex. We start by providing a more detailed introduction to the related algorithms, then list additional definitions necessary for the analysis. Before moving on to the main proof for Fe Dual Ex, we state formally the assumptions made and provide additional remarks on the assumptions that better link them to their usage in the proof. C.1 ADDITIONAL PRELIMINARIES To make this paper as self-contained as possible, in this section, we provide a brief overview of mirror descent, dual averaging, and their advancement in saddle point optimization, i.e., mirror prox and Published as a conference paper at ICLR 2024 dual extrapolation. More comprehensive introductions can be found in the original papers and in (Bubeck et al., 2015; Cohen et al., 2021). We slide into mirror descent from the simple and widely known projected gradient descent, namely vanilla gradient descent with constraint, therefore plus another projection of the updated sequence back to the feasible set. C.1.1 MIRROR DESCENT AND DUAL AVERAGING We start by introducing projected gradient descent. Projected gradient descent first takes the gradient update, then projects the updated point back to the constraint by finding a feasible solution within the constraint that minimizes its Euclidean distance to the current point. The updating sequence is given below: t [T], xt X whereas not necessarily for x t, x t+1 = xt ηg(xt) xt+1 = arg min x X 1 2 x x t+1 2 2. Mirror Descent (Nemirovskij & Yudin, 1983). Mirror descent generalizes projected gradient descent to non-Euclidean space with the Bregman divergence (Bregman, 1967). We provide the definition of the Bregman divergence below. 
Definition 5 (Bregman Divergence (Bregman, 1967)). Let h : Rd R { } be a prox function or a distance-generating function that is closed, strictly convex, and differentiable in int dom h. The Bregman divergence for x dom h and y int dom h is defined to be V h y (x) = h(x) h(y) h(y), x y . Mirror descent regards h as a mirror map to the dual space, and follows the procedure below: h(x t+1) = h(xt) ηg(xt) xt+1 = arg min x X V h x t+1(x). By choosing h( ) = 1 2 2 2 in the Euclidean space whose dual space is itself, mirror descent reduces to projected gradient descent. Mirror descent can be presented from a proximal point of view, or in the online setting as in Beck & Teboulle (2003): xt+1 = arg min x X ηg(xt), x + V h xt(x). Such proximal operation with Bregman divergence is studied by others (Censor & Zenios, 1992), and is recently represented by a neatly defined proximal operator (Cohen et al., 2021). Definition 6 (Proximal Operator (Cohen et al., 2021)). The Bregman divergence defined proximal operator is given by Prox h x ( ) := arg min x X { , x + V h x (x)}. In this spirit, the mirror descent algorithm can be written with one proximal operation: xt+1 = Prox h xt(ηg(xt)). Composite Mirror Descent (Duchi et al., 2010). Mirror descent was later generalized to composite convex functions, i.e., the ones with regularization. The key modification is to include the regularization term in the proximal operator, yet not linearize the regularization term, since it could be non-smooth and thus non-differentiable. The updating sequence is given by xt+1 = arg min x X ηg(xt), x + V h xt(x) + ηψ(x). It can also be represented with a composite mirror map as in (Yuan et al., 2021): xt+1 = (h + ηψ) ( h(xt) ηg(xt)). Published as a conference paper at ICLR 2024 Dual Averaging (Nesterov, 2009). Compared with mirror descent, dual averaging moves the updating sequence to the dual space. The procedure of dual averaging is as follows (Bubeck et al., 2015): h(x t+1) = h(x t) ηg(xt) xt+1 = arg min x X V h x t+1(x), or equivalently as presented in (Nesterov, 2009) with the sequence of dual variables: t [T], xt X, µt X , µt+1 = µt ηg(xt) xt+1 = h (µt+1). This can be further simplified to xt+1 = arg min x X η τ=0 g(xt), x + h(x). Composite Dual Averaging (Xiao, 2010). Around the same time as composite mirror descent, composite dual averaging, also known as regularized dual averaging, was proposed with a similar idea of including the regularization term in the proximal operator. As presented in the original paper (Xiao, 2010): xt+1 = arg min x X η τ=0 g(xτ), x + ηβth(x) + tηψ(x), in which {βt}t 1 is a non-negative and non-decreasing input sequence. Flammarion & Bach (2017) adopted the case with constant sequence βt = 1 xt+1 = arg min x X η τ=0 g(xτ), x + h(x) + tηψ(x), and equivalently with composite mirror map: µt+1 = µt ηg(xt) xt+1 = (h + tηψ) (µt+1), which is also presented in (Yuan et al., 2021). C.1.2 MIRROR PROX AND DUAL EXTRAPOLATION Mirror Prox (Nemirovski, 2004). Mirror prox generalizes the extra-gradient method to non Euclidean space as mirror descent compared with projected gradient descent. It was proposed for variational inequalities (VIs), including SPP. We first present the corresponding Bregman divergence in the saddle point setting, whose definition was not included in detail in (Nemirovski, 2004) but was later more clearly stated in (Nesterov, 2007; Shi et al., 2017). Definition 7 (Bregman Divergence for Saddle Functions (Nesterov, 2007)). 
Let ℓ: X Y R { } be a distance-generating function that is closed, strictly convex, and differentiable in int dom ℓ. For z = (x, y) Z = X Y, the function and its gradient are defined as ℓ(z) = h1(x) + h2(y), ℓ(z) = xh1(x) yh2(y) The Bregman divergence for z = (x, y) dom ℓand z = (x , y ) int dom ℓis defined to be V ℓ z (z) := ℓ(z) ℓ(z ) ℓ(z ), z z . Notice that our notion of ℓis not a saddle function, slightly different from that in Shi et al. (2017), but the Bregman divergence defined is the same as Eq. (6) in Shi et al. (2017) and Eq. (4.9) in Nesterov (2007). Published as a conference paper at ICLR 2024 Mirror prox can also be viewed as an extra-step mirror descent. Most intuitively, by introducing an intermediate variable zt+1/2, its procedure is as follows: h(z t+1/2) = h(zt) ηg(zt) zt+1/2 = arg min z Z V h z t+1/2(z) h(z t+1) = h(zt) ηg(zt+1/2) zt+1 = arg min z Z V h z t+1(z). And it can be represented with the proximal operator in Definition 6 as well. Following (Cohen et al., 2021), t [T], zt, zt+1/2 Z, zt+1/2 = Prox ℓ zt(ηg(zt)) zt+1 = Prox ℓ zt(ηg(zt+1/2)). Dual Extrapolation (Nesterov, 2007). As in dual averaging, dual extrapolation moves the updating sequence of mirror prox to the dual space. Slightly different from a two-step dual averaging, dual extrapolation further initialize a fixed point in the primal space z, and as presented in (Cohen et al., 2021), its procedure is as follows: t [T], zt, zt+1/2 Z, ωt Z , zt = Prox ℓ z(ωt) zt+1/2 = Prox ℓ zt(ηg(zt)) ωt+1 = ωt + ηg(zt+1/2). The updating sequence presented above is equivalent to that defined in the original paper (Nesterov, 2007), simply replacing the arg max with arg min, and the dual variables with its additive inverse in the dual space. C.2 ADDITIONAL DEFINITIONS In this subsection, we list additional definitions involved in the theoretical analysis in subsequent sections. Definition 8 (Legendre function (Rockafellar, 1970)). A proper, convex, closed function h : Rd R { } is called a Legendre function or a function of Legendre-type if (a) h is strictly convex; (b) h is essentially smooth, namely h is differentiable on int dom h, and h(xt) for every sequence {xt} t=0 int dom h converging to a boundary point of dom h as t . Definition 9 (Convex Conjugate or Legendre Fenchel Transformation (Boyd & Vandenberghe, 2004)). The convex conjugate of a function h is defined as h(s) = sup z { s, z h(z)}. Definition 10 (Differentiability of the conjugate of strictly convex function (Chapter E, Theorem 4.1.1 in Hiriart-Urruty & Lemaréchal (2004))). For a strictly convex function h, int dom h = and h is continuously differentiable on int dom h , with gradient defined as: h (s) = arg min z { s, z + h(z)} (5) C.3 FORMAL ASSUMPTIONS AND REMARKS In this subsection, we state the assumptions formally and provide additional remarks that may help in understanding the theoretical analysis. Assumption 1 (Assumptions on the objective function). For the composite saddle function ϕ(z) = f(x, y) + ψ1(x) ψ2(y) = 1 M PM m=1 fm(x, y) + ψ1(x) ψ2(y), we assume that a.(Local Convexity of f) m [M], fm(x, y) is convex in x and concave in y. b.(Convexity of ψ) ψ1(x) is convex in x, and ψ2(y) is convex in y. Published as a conference paper at ICLR 2024 Assumption 2 (Assumptions on the gradient operator). For f in the objective function, its gradient operator is given by g = h xf yf i . 
By the linearity of gradient operators, g = 1 M PM m=1 gm, and we assume that a.(Local Lipschitzness of g) m [M], gm(z) = h xfm(x,y) yfm(x,y) i is β-Lipschitz: gm(z) gm(z ) β z z b.(Local Unbiased Estimate and Bounded Variance) For any client m [M], the local gradient queried by some local random sample ξm is unbiased and also bounded in variance, i.e., Eξ[gm(zm; ξm)] = gm(zm), and Eξ gm(zm; ξm) gm(zm) 2 σ2 c.(Bounded Gradient) m [M], gm(zm; ξm) G Assumption 3 (Assumption on the distance-generating function). The distance-generating function h is a Legendre function that is 1-strongly convex, i.e., x, y, h(y) h(x) h(x), y x 1 Assumption 4. The domain of the optimization problem Z is compact in terms of Bregman Divergence, i.e., z, z Z, V ℓ z (z) B. Remark 1. An immediate result of Assumption 1a is that, z = (x, y), z = (x , y ) Z f(x , y ) f(x, y ) xf(x , y ), x x , f(x , y) f(x , y ) yf(x , y ), y y . Summing them up, f(x , y) f(x, y ) g(z ), z z . Remark 2. For any sequence of i.i.d. random variables ξm 0,0, ξm 0,1/2, ..., ξm 1,0, ξm 1,1/2, ..., ξm r,k, ξm r,k+1/2, let Fr,k denote the σ-field generated by the set {ξm j,t : m [M] and ((j = r, t k) or (j < r, k {0, 1/2, ..., K 1, K 1/2}))}. Then any ξm r,k is independent of Fr,k 1/2, and Assumption 2b implies EFr,k gm(zm r,k; ξm r,k) gm(zm r,k) 2 | Fr,k 1/2 σ2. Remark 3 (Corollary 23.5.1. and Theorem 26.5. in Rockafellar (1970)). For a closed convex (not necessarily differentiable) function h, h is the inverse of h in the sense of multi-valued mappings, i.e., z h (ς) if and only if ς h(z). Furthermore, if h is of Legendre-type, meaning it is essentially strictly convex and essentially smooth, then h yields a well-defined h that acts as a bijection, i.e., ( h) 1 = h . Remark 4. Assumption 3 and Remark 3 also trivially hold for ℓfrom Definition 7 in the saddle point setting, and eventually, the generalized distance-generating function ℓt from Definition 3. Due to the strong convexity of ℓt, ℓ t is well-defined as noted in Definition 10. Together with the potential non-smoothness of ℓt, Remark 3 implies that z = ℓ t (ς) if and only if ς ℓt(z). D ADDITIONAL TECHNICAL LEMMAS In this section, we list some technical lemmas that are referenced in the proofs of the main theorem and its helping lemmas. Lemma 4 (Jensen s inequality). For a convex function φ(x), variables x1, ..., xn in its domain, and positive weights a1, ..., an, φ Pn i=1 aixi Pn i=1 ai Pn i=1 aiφ(xi) Pn i=1 ai , and the inequality is reversed if φ(x) is concave. Published as a conference paper at ICLR 2024 Lemma 5 (Cauchy-Schwarz inequality (Strang, 2006)). For any x and y in an inner product space, Lemma 6 (Young s inequality (Lemma 1.45. in Sofonea & Matei (2009))). Let p, q R be two conjugate exponents, that is 1 < p < , and 1 q = 1. Then a, b 0, Lemma 7 (AM-QM inequality). For any set of positive integers x1, ..., xn, i=1 x2 i . (6) Lemma 8 (Lemma 2.3 in Jiang & Mokhtari (2022)). Suppose Assumption 1 and 2 hold, then z = (x, y), z1, ..., z T Z and θ1, ..., θT 0 with PT t=1 θt = 1, we have t=1 θtxt, y) ϕ(x, t=1 θt[ g(zt), zt z + ψ(zt) ψ(z)], in which ψ(z) = ψ1(x) + ψ2(y). Proof. For ψ(z) = ψ1(x) + ψ2(y), ϕ(xt, y) ϕ(x, yt) = f(xt, y) + ψ1(xt) ψ2(y) f(x, yt) ψ1(x) + ψ2(yt) = f(xt, y) f(x, yt) + ψ(zt) ψ(z) g(zt), zt z + ψ(zt) ψ(z), where the inequality holds by convexity-concavity of f(x, y), i.e. Remark 1. Then sum the inequality over t = 1, ..., T, t=1 ϕ(θtxt, y) t=1 ϕ(x, θtyt) g(zt), zt z + ψ(zt) ψ(z) . 
Finally, by Jensen s inequality in Lemma 4, t=1 ϕ(θtxt, y) ϕ T X t=1 θtxt, y , t=1 ϕ(x, θtyt) ϕ x, which completes the proof. Lemma 9 (Theorem 4.2.1 in Hiriart-Urruty & Lemaréchal (2004)). The conjugate of an α-strongly convex function is 1 α-smooth. That is, for h that is strongly convex with modulus α > 0, x, x , h (x) h (x ) 1 Lemma 10 (Lemma 2 in Flammarion & Bach (2017)). Generalized Bregman divergence upperbounds the Bregman divergence. That is, under Assumption 1 and 3, x dom h, µ int dom h t where ht = h + tηψ, V ht µ (x) V h x (x), in which x = h t (µ ). Published as a conference paper at ICLR 2024 E COMPLETE ANALYSIS OF FEDUALEX FOR COMPOSITE SADDLE POINT PROBLEMS We begin by reformulating the updating sequences with another pair of auxiliary dual variables. Expand the prox operator in Algorithm 1 line 6 to 8 by Definition 4, and rewrite by the gradient of the conjugate function in Definition 10, zm r,k = arg min z { ςm r,k ς, z + ℓr,k(z)} = ℓ r,k( ς ςm r,k) zm r,k+1/2 = arg min z { ηcgm(zm r,k; ξm r,k) ( ς ςm r,k), z + ℓr,k+1(z)} = ℓ r,k+1(( ς ςm r,k) ηcgm(zm r,k; ξm r,k)) ςm r,k+1 = ςm r,k + ηcgm(zm r,k+1/2; ξm r,k+1/2) Define auxiliary dual variable ωm r,k = ς ςm r,k. It satisfies immediately that zm r,k = ℓ r,k(ωm r,k), in which ℓ r,k is the conjugate of ℓr,k = ℓ+ (ηsr K + k)ηcψ. And define ωm r,k+1/2 to be the dual image of the intermediate variable zm r,k+1/2 such that zm r,k+1/2 = ℓ r,k+1(ωm r,k+1/2). Then from the above updating sequence, we get an equivalent updating sequence for the auxiliary dual variables. ωm r,k+1/2 = ωm r,k ηgm(zm r,k; ξm r,k) ωm r,k+1 = ωm r,k ηgm(zm r,k+1/2; ξm r,k+1/2) Now we analyze the following shadow sequences. Define m=1 ωm r,k, gr,k = 1 m=1 gm(zm r,k; ξm r,k), ωr,k+1/2 = ωr,k ηcgr,k, (2) ωr,k+1 = ωr,k ηcgr,k+1/2. (3) In the meantime, d zr,k = ℓ r,k(ωr,k), \ zr,k+1/2 = ℓ r,k+1(ωr,k+1/2). (4) E.1 MAIN THEOREM AND PROOF Theorem 1 (Main). Under assumptions, the duality gap evaluated with the ergodic sequence generated by the intermediate steps of Fe Dual Ex in Algorithm 1 is bounded by k=0 \ zr,k+1/2 i B ηc RK + 20β2(ηc)3K2G2 + 5σ2ηc M + 2 3 2 βηc KGB 1 2 . Choosing step size ηc = min{ 1 5 1 2 β , B 1 4 20 1 4 β 1 2 G 1 2 K 3 4 R 1 4 , B 1 2 M 1 2 5 1 2 σR 1 2 K 1 2 , B 1 4 2 3 4 β 1 2 G 1 2 KR 1 2 }, k=0 \ zr,k+1/2 i 5 1 2 βB RK + 20 1 4 β 1 2 G 1 2 B 3 4 K 1 4 R 3 4 + 5 1 2 σB 1 2 M 1 2 R 1 2 K 1 2 + 2 3 4 β 1 2 G 1 2 B 3 4 Proof. The proof of the main theorem relies on Lemma 1, the bound for the non-smooth term, and Lemma 2, the bound for the smooth term. These two lemmas are combined in Lemma 3 and then yield the per-step progress for Fe Dual Ex. The three lemmas are listed and proved right after this theorem. Here, we finish proving the main theorem from the per-step progress. Published as a conference paper at ICLR 2024 Starting from Lemma 3, we telescope for all local updates k {0, ..., K 1} after the same communication round r. ηc E h K 1 X g( \ zr,k+1/2), \ zr,k+1/2 z + ψ( \ zr,k+1/2) ψ(z) i V ℓr,0 ωr,0 (z) V ℓr,K ωr,K (z) + 5σ2(ηc)2K k=0 β2(ηc)4(k + 1)2G2 + 2 3 2 k=0 β(ηc)2(k + 1)GB 1 2 V ℓr,0 ωr,0 (z) V ℓr,K ωr,K (z) + 5σ2(ηc)2K k=0 β2(ηc)4K2G2 + 2 3 2 k=0 β(ηc)2KGB 1 2 V ℓr,0 ωr,0 (z) V ℓr,K ωr,K (z) + 5σ2(ηc)2K M + 20β2(ηc)4K3G2 + 2 3 2 β(ηc)2K2GB 1 2 . As we initialize the local dual updates on all clients after each communication with the dual average of the previous round s last update, r {1, ..., R}, the first variable in this round ωr,0 is the same as the last variable ωr 1,0 in the previous round. 
As a result, taking the server step size ηs = 1, we can further telescope across all rounds and have ηc E h R 1 X g( \ zr,k+1/2), \ zr,k+1/2 z + ψ( \ zr,k+1/2) ψ(z) i ω0,0 (z) V ℓR,K ωR,K (z) + 5σ2(ηc)2KR M + 20β2(ηc)4K3RG2 + 2 3 2 β(ηc)2K2RGB 1 2 . Notice that the generalized Bregman divergence V ℓ0,0 ω0,0 (z) = V ℓ0,0 ς ς0(z) = V ℓ ς (z) = V ℓ z0(z), where z0 = ℓ ( ς). Thus, by Assumption 4, V ℓ0,0 ω0,0 (z) B. Dividing ηc KR on both sides of the equation, we get g( \ zr,k+1/2), \ zr,k+1/2 z + ψ( \ zr,k+1/2) ψ(z) i B ηc RK + 5σ2ηc M + 20β2(ηc)3K2G2 + 2 3 2 βηc KGB 1 2 . Finally, applying Lemma 8 completes the proof. Lemma 1 (Bounding the Regularization Term). z, ηc ψ( \ zr,k+1/2) ψ(z) = V ℓr,k ωr,k (z) V ℓr,k+1 ωr,k+1 (z) V ℓr,k ωr,k ( \ zr,k+1/2) V ℓr,k+1 ωr,k+1/2(\ zr,k+1) + ηc gr,k+1/2 gr,k, \ zr,k+1/2 \ zr,k+1 + ηc gr,k+1/2, z \ zr,k+1/2 Proof. By the definition of generalized Bregman divergence and the updating sequence in Eq. (2), z, ωr,k+1/2(z) = ℓr,k+1(z) ℓr,k+1( \ zr,k+1/2) ωr,k+1/2, z \ zr,k+1/2 = ℓr,k+1(z) ℓr,k+1( \ zr,k+1/2) ωr,k ηcgr,k, z \ zr,k+1/2 = ℓr,k(z) ℓr,k( \ zr,k+1/2) + ηc ψ(z) ψ( \ zr,k+1/2) ωr,k, z \ zr,k+1/2 + ηc gr,k, z \ zr,k+1/2 . (7) Similarly, we can have for the updating sequence in Eq. (3) that z, ωr,k+1 (z) = ℓr,k(z) ℓr,k(\ zr,k+1) + ηc ψ(z) ψ(\ zr,k+1) ωr,k, z \ zr,k+1 + ηc gr,k+1/2, z \ zr,k+1 . (8) Plug z = \ zr,k+1 into Eq. (7), ωr,k+1/2(\ zr,k+1) = ℓr,k(\ zr,k+1) ℓr,k( \ zr,k+1/2) + ηc ψ(\ zr,k+1) ψ( \ zr,k+1/2) ωr,k, \ zr,k+1 \ zr,k+1/2 + ηc gr,k, \ zr,k+1 \ zr,k+1/2 . Published as a conference paper at ICLR 2024 Add this up with Eq. (8), ωr,k+1/2(\ zr,k+1) + V ℓr,k+1 ωr,k+1 (z) = ℓr,k(z) ℓr,k( \ zr,k+1/2) ωr,k, z \ zr,k+1/2 | {z } A1 + ηc ψ(z) ψ( \ zr,k+1/2) + ηc gr,k, \ zr,k+1 \ zr,k+1/2 + ηc gr,k+1/2, z \ zr,k+1 | {z } A2 For A1 we have A1 = ℓr,k(z) ℓr,k(d zr,k) ωr,k, z d zr,k ℓr,k( \ zr,k+1/2) + ℓr,k(d zr,k) + ωr,k, \ zr,k+1/2 d zr,k = V ℓr,k ωr,k (z) V ℓr,k ωr,k ( \ zr,k+1/2). For A2 we have A2 = ηc gr,k, \ zr,k+1 \ zr,k+1/2 + ηc gr,k+1/2, \ zr,k+1/2 \ zr,k+1 + ηc gr,k+1/2, z \ zr,k+1/2 = ηc gr,k+1/2, z \ zr,k+1/2 + ηc gr,k+1/2 gr,k, \ zr,k+1/2 \ zr,k+1 Plug A1 and A2 back in completes the proof. For the purpose of clarity, we demonstrate how we generate the terms to be separately bounded for the smooth part with the following Lemma 2, which holds trivially by the linearity of the gradient operator g = 1 M PM m=1 gm and then direct cancellation. Lemma 2 (Bounding the Smooth Term). z, g( \ zr,k+1/2), \ zr,k+1/2 z = gr,k+1/2, \ zr,k+1/2 z + 1 m=1 gm(zm r,k+1/2) gr,k+1/2, \ zr,k+1/2 z m=1 [gm( \ zr,k+1/2) gm(zm r,k+1/2)], \ zr,k+1/2 z Based on the previous two lemmas, we arrive at the following lemma that bounds the per-step progress of Fe Dual Ex. Lemma 3 (Per-step Progress for Fe Dual Ex in Saddle Point Setting). For ηc 1 ηc E g( \ zr,k+1/2), \ zr,k+1/2 z + ψ( \ zr,k+1/2) ψ(z) ωr,k (z) V ℓr,k+1 ωr,k+1 (z) + 5σ2(ηc)2 M + 20β2(ηc)4(k + 1)2G2 + 2 3 2 β(ηc)2(k + 1)GB 1 2 . Proof. Based on the previous two lemmas, we can get the following simply by summing them up, in which we denote the left-hand side as LHS for simplicity. 
LHS := ηc g( \ zr,k+1/2), \ zr,k+1/2 z + ψ( \ zr,k+1/2) ψ(z) ωr,k (z) V ℓr,k+1 ωr,k+1 (z) V ℓr,k ωr,k ( \ zr,k+1/2) V ℓr,k+1 ωr,k+1/2(\ zr,k+1) | {z } A3 + ηc gr,k+1/2 gr,k, \ zr,k+1/2 \ zr,k+1 m=1 gm(zm r,k+1/2) gr,k+1/2, \ zr,k+1/2 z m=1 [gm( \ zr,k+1/2) gm(zm r,k+1/2)], \ zr,k+1/2 z Published as a conference paper at ICLR 2024 For the two generalized Bregman divergence terms in A3, we bound them by Lemma 10 and the strong convexity of ℓin Remark 4, A3 V ℓ d zr,k( \ zr,k+1/2) V ℓ \ zr,k+1/2(\ zr,k+1) 2 d zr,k \ zr,k+1/2 2 1 2 \ zr,k+1/2 \ zr,k+1 2 As a result, ωr,k (z) V ℓr,k+1 ωr,k+1 (z) 1 2 d zr,k \ zr,k+1/2 2 2 \ zr,k+1/2 \ zr,k+1 2 + ηc gr,k+1/2 gr,k, \ zr,k+1/2 \ zr,k+1 | {z } A4 m=1 gm(zm r,k+1/2) gr,k+1/2, \ zr,k+1/2 z m=1 [gm( \ zr,k+1/2) gm(zm r,k+1/2)], \ zr,k+1/2 z . A4 can be bounded with Cauchy-Schwarz (Lemma 5) inequality and Young s inequality (Lemma 6). 2 \ zr,k+1/2 \ zr,k+1 2 + ηc gr,k+1/2 gr,k \ zr,k+1/2 \ zr,k+1 2 \ zr,k+1/2 \ zr,k+1 2 + (ηc)2 2 gr,k+1/2 gr,k 2 + 1 2 \ zr,k+1/2 \ zr,k+1 2 2 gr,k+1/2 gr,k 2 . Then we have ηc ϕ( \ zr,k+1/2) ϕ(z) V ℓr,k ωr,k (z) V ℓr,k+1 ωr,k+1 (z) 1 2 d zr,k \ zr,k+1/2 2 + (ηc)2 2 gr,k+1/2 gr,k 2 m=1 gm(zm r,k+1/2) gr,k+1/2, \ zr,k+1/2 z m=1 [gm( \ zr,k+1/2) gm(zm r,k+1/2)], \ zr,k+1/2 z . Taking expectations on both sides we get ηc E ϕ( \ zr,k+1/2) ϕ(z) V ℓr,k ωr,k (z) V ℓr,k+1 2E d zr,k \ zr,k+1/2 2 2 E gr,k+1/2 gr,k 2 m=1 gm(zm r,k+1/2) gr,k+1/2, \ zr,k+1/2 z m=1 [gm( \ zr,k+1/2) gm(zm r,k+1/2)], \ zr,k+1/2 z Published as a conference paper at ICLR 2024 B2 is bounded in Lemma 14. Therefore, we have B1 + B2 (ηc)2 M + 40β2(ηc)2(k + 1)2G2 2 E h \ zr,k+1/2 d zr,k 2i 1 2E d zr,k \ zr,k+1/2 2 M + 40β2(ηc)2(k + 1)2G2 + 5(ηc)2β2 1 2 E h \ zr,k+1/2 d zr,k 2i M + 20β2(ηc)4(k + 1)2G2, B3 is zero after taking the expectation as shown in Lemma 11. B4 is bounded in Lemma 13. Plugging the bounds for B1 + B2, B3, and B4 back in completes the proof. E.2 HELPING LEMMAS In this section, we list the helping lemmas that were referenced in the proof of Lemma 1, 2, and 3. Lemma 11 (Unbiased Gradient Estimate). Under Assumption 1 and 2, ηc EFr,k+1/2 1 m=1 gm(zm r,k+1/2) gr,k+1/2, \ zr,k+1/2 z = 0 Proof. By the unbiased gradient estimate in Assumption 2b and its following Remark 2, ηc EFr,k+1/2 1 m=1 gm(zm r,k+1/2) gr,k+1/2, \ zr,k+1/2 z = ηc EFr,k EFr,k+1/2 1 m=1 gm(zm r,k+1/2) gr,k+1/2, \ zr,k+1/2 z Fr,k Lemma 12 (Bounded Client Drift under Assumption 2c). m [M], k {0, ..., K 1}, \ zr,k+1/2 zm r,k+1/2 2ηc(k + 1)G d zr,k zm r,k 2ηck G Proof. By the smoothness of the conjugate of a strongly convex function, i.e., Lemma 9, \ zr,k+1/2 zm r,k+1/2 = ℓ r,k(ωr,k+1/2) ℓ r,k(ωm r,k+1/2) ωr,k+1/2 ωm r,k+1/2 After the same round of communication, by the updating sequence, we have m [M]: ωm r,k+1/2 = ωm r,k ηcgm(zm r,k; ξm r,k) ℓ=0 gm(zm r,ℓ+1/2; ξm r,ℓ+1/2) ηcgm(zm r,k; ξm r,k) Immediately after each round of communication, all machines are synchronized, i.e., m1, m2 [M], ωm1 r,0 = ωm2 r,0 . Therefore, k {0, ..., K 1}, ωm1 r,k+1/2 ωm2 r,k+1/2 = ηc k 1 X ℓ=0 gm1(zm1 r,ℓ+1/2; ξm1 r,ℓ+1/2) ηcgm1(zm1 r,k ; ξm1 r,k ) ℓ=0 gm2(zm2 r,ℓ+1/2; ξm2 r,ℓ+1/2) + ηcgm2(zm2 r,k ; ξm2 r,k ) Published as a conference paper at ICLR 2024 Then m1, m2 [M], k {0, ..., K 1}, by triangle inequality, Jensen s inequality, and the bounded gradient Assumption 2c, ωm1 r,k+1/2 ωm2 r,k+1/2 ηc k 1 X ℓ=0 gm1(zm1 r,ℓ+1/2; ξm1 r,ℓ+1/2) + gm1(zm1 r,k ; ξm1 r,k ) ℓ=0 gm2(zm2 r,ℓ+1/2; ξm2 r,ℓ+1/2) + gm2(zm2 r,k ; ξm2 r,k ) 2ηc(k + 1)G. 
As a result, \ zr,k+1/2 zm r,k+1/2 ωr,k+1/2 ωm r,k+1/2 sup m1,m2 ωm1 r,k+1/2 ωm2 r,k+1/2 2ηc(k + 1)G. Similarly, we can show that d zr,k zm r,k 2ηck G. Lemma 13. Under Assumption 1-4, m=1 [gm( \ zr,k+1/2) gm(zm r,k+1/2)], \ zr,k+1/2 z 2 3 2 β(ηc)2(k + 1)GB 1 2 . Proof. The proof of this lemma relies on the bounded client drift in Lemma 12. We start by splitting the inner product using Cauchy-Schwarz inequality in Lemma 5, and state the reference for the following derivation in the parenthesis. m=1 [gm( \ zr,k+1/2) gm(zm r,k+1/2)], \ zr,k+1/2 z m=1 [gm( \ zr,k+1/2) gm(zm r,k+1/2)] \ zr,k+1/2 z m=1 gm( \ zr,k+1/2) gm(zm r,k+1/2) \ zr,k+1/2 z (Jensen s) m=1 β \ zr,k+1/2 zm r,k+1/2 \ zr,k+1/2 z (Smoothness) m=1 2βηc(k + 1)G \ zr,k+1/2 z (Lemma 12) ηc E 2βηc(k + 1)G q 2V ℓz ( \ zr,k+1/2) (Strong-convexity of ℓ) 2 3 2 β(ηc)2(k + 1)GB 1 2 (Assumption 4) Lemma 14 (Difference of Gradient and Extra-gradient). Under Assumption 1-4, E gr,k+1/2 gr,k 2 10σ2 M + 40β2(ηc)2(k + 1)2G2 + 5β2E h \ zr,k+1/2 d zr,k 2i . Published as a conference paper at ICLR 2024 Proof. By Lemma 7, EFr,k+1/2 gr,k+1/2 gr,k 2 m=1 gm(zm r,k+1/2) m=1 gm(zm r,k) gr,k + 1 gm(zm r,k+1/2) gm( \ zr,k+1/2) gm(d zr,k) gm(zm r,k) + 1 gm( \ zr,k+1/2) gm(d zr,k) 2 5 E h gr,k+1/2 1 m=1 gm(zm r,k+1/2) 2 i m=1 gm(zm r,k) gr,k 2 i gm(zm r,k+1/2) gm( \ zr,k+1/2) 2 i gm(d zr,k) gm(zm r,k) 2 i gm( \ zr,k+1/2) gm(d zr,k) 2 i For C1, by Assumption 2b and its following Remark 2, C1 = EFr,k+1/2 h 1 m=1 gm(zm r,k+1/2; ξm r,k+1/2) 1 m=1 gm(zm r,k+1/2) 2 i = 1 M 2 EFr,k+1/2 h gm(zm r,k+1/2; ξm r,k+1/2) gm(zm r,k+1/2) 2 i = 1 M 2 Var Fr,k+1/2 h M X gm(zm r,k+1/2; ξm r,k+1/2) gm(zm r,k+1/2) i m=1 Var Fr,k+1/2 h gm(zm r,k+1/2; ξm r,k+1/2) gm(zm r,k+1/2) i (i.i.d.) m=1 EFr,k+1/2 h gm(zm r,k+1/2; ξm r,k+1/2) gm(zm r,k+1/2) 2 i m=1 EFr,k h EFr,k+1/2 gm(zm r,k+1/2; ξm r,k+1/2) gm(zm r,k+1/2) 2 Fr,k i M Similarly, we have C2 σ2 For C3, by Lemma 7, β-smoothness of fm, and finally Lemma 12, we have m=1 gm(zm r,k+1/2) gm( \ zr,k+1/2) 2 i m=1 E h zm r,k+1/2 \ zr,k+1/2 2i 4β2(ηc)2(k + 1)2G2 Published as a conference paper at ICLR 2024 Similarly for C4, we have C4 4β2(ηc)2k2G2. For C5, by Lemma 7, β-smoothness of fm from Assumption 2a, and finally Lemma 12, gm( \ zr,k+1/2) gm(d zr,k) 2 i m=1 gm( \ zr,k+1/2)) gm(d zr,k) 2 i β2E h \ zr,k+1/2 d zr,k 2i . Plugging the bounds for C1, C2, C3, C4, and C5 back in completes the proof. F COMPLETE ANALYSIS OF FEDUALEX FOR COMPOSITE CONVEX OPTIMIZATION In this section, we reduce the problem to composite convex optimization in the following form: min x X ϕ(x) = f(x) + ψ(x) (9) where f(x) = 1 M PM m=1 fm(x). The analysis builds upon the strong-convexity of the distancegenerating function h in Assumption 3 and the following set of assumptions in the convex optimization setting: Assumption 5. We make the following assumptions: a.(Convexity of f) m [M], fm is convex. That is, x, x X, fm(x) fm(x ) fm(x), x x . b.(Local Smoothness of f) m [M], fm is β-smooth: x, x X, fm(x) fm(x ) + fm(x ), x x + β c.(Convexity of ψ) ψ(x) is convex. d.(Local Unbiased Estimate and Bounded Variance) For any client m [M], the local gradient queried by some local random sample ξm is unbiased and also bounded in variance, i.e., Eξ[gm(xm; ξm)] = gm(xm) and Eξ[ gm(xm; ξm) gm(xm) 2 ] σ2. e.(Bounded Gradient) m [M], gm(xm; ξm) G. 
Federated dual extrapolation for composite convex optimization is to replace the part of Algorithm 1 highlighted in green with the following updating sequence, where we overuse ς now as the notation for dual variables in the convex setting as well. ςm r,0 = ςr for k = 0, 1, . . . , K 1 do xm r,k = Prox hr,k ς (ςm r,k) xm r,k+1/2 = Prox hr,k+1 ς ςm r,k(ηcgm(xm r,k; ξm r,k)) ςm r,k+1 = ςm r,k + ηcgm(xm r,k+1/2; ξm r,k+1/2) end for For the proximal operator defined by hr,k, reformulating from its Definition 4 to h r,k in Definition 10 yields xm r,k = arg min x { ςm r,k ς, x + hr,k(x)} = h r,k( ς ςm r,k) xm r,k+1/2 = arg min x { ηcgm(xm r,k; ξm r,k) ( ς ςm r,k), x + hr,k+1(x)} = h r,k+1(( ς ςm r,k) ηcgm(xm r,k; ξm r,k)) ςm r,k+1 = ςm r,k + ηcgm(xm r,k+1/2; ξm r,k+1/2) Published as a conference paper at ICLR 2024 Similarly, we define auxiliary dual variable µm r,k = ς ςm r,k and µm r,k+1/2 the dual image of xm r,k+1/2. Then by definition, xm r,k = h r,k(µm r,k) and xm r,k+1/2 = h r,k+1(µm r,k+1/2). The updating sequence is equivalent to µm r,k+1/2 = µm r,k ηgm(xm r,k; ξm r,k) followed by µm r,k+1 = µm r,k ηgm(xm r,k+1/2; ξm r,k+1/2). For the shadow sequence of averaged variables µr,k = 1 M PM m=1 µm r,k and gr,k = 1 M PM m=1 gm(xm r,k; ξm r,k), µr,k+1/2 = µr,k ηcgr,k, (10) µr,k+1 = µr,k ηcgr,k+1/2. (11) Finally, the projections of the averaged dual back to the primal space are d xr,k = h r,k(µr,k) and \ xr,k+1/2 = h r,k+1(µr,k+1/2) Theorem 2. Under Assumption 5, the ergodic intermediate sequence generated by Fe Dual Ex for composite convex objectives satisfies k=0 \ xr,k+1/2) ϕ(x) B ηc RK + 20β2(ηc)3K2G2 + 5σ2ηc M + 2β(ηc)3K2G2. Choosing step size ηc = min{ 1 5 1 2 β , B 1 4 20 1 4 β 1 2 G 1 2 K 3 4 R 1 4 , B 1 2 M 1 2 5 1 2 σR 1 2 K 1 2 , B 1 3 2 1 3 β 1 3 G 2 3 KR 1 3 } further yields the following convergence rate: k=0 \ xr,k+1/2) ϕ(x) 5 1 2 βB RK + 20 1 4 β 1 2 G 1 2 B 3 4 K 1 4 R 3 4 + 5 1 2 σB 1 2 M 1 2 R 1 2 K 1 2 + 2 1 3 β 1 3 G 2 3 B 2 3 Proof. As the proof for Theorem 1, the proof for this theorem depends on Lemma 15 and Lemma 16, which further yield Lemma 17. These lemmas are presented and proved right after this theorem. Here, we start from Lemma 17. Telescoping over all k {0, ..., K 1} and all r {0, ..., R 1} assuming ηs = 1 yields k=0 ϕ( \ xr,k+1/2) RKϕ(x) V h0,0 µ0,0 (x) V h R,K µR,K (x) + 5σ2(ηc)2KR + 20β2(ηc)4K3RG2 + 2β(ηc)3K3RG2. By Assumption 4, V h0,0 µ0,0 (x) = V h x0(x) B, where x0 = h ( ς). Dividing both sides by ηc KR followed by applying Jensen s inequality (Lemma 4) completes the proof. Lemma 15 (Bounding the Regularization Term). x, ηc ψ( \ xr,k+1/2) ψ(x) = V hr,k µr,k (x) V hr,k+1 µr,k+1 (x) V hr,k µr,k ( \ xr,k+1/2) V hr,k+1 µr,k+1/2( \ xr,k+1) + ηc gr,k+1/2 gr,k, \ xr,k+1/2 \ xr,k+1 + ηc gr,k+1/2, x \ xr,k+1/2 Proof. The proof of this Lemma is almost identical to the proof of Lemma 1 with a mere change of variables and distance-generating function from saddle point setting to convex setting. The following Lemma highlights the primary difference in the analysis of convex optimization and saddle point optimization. The smoothness of fm provides an alternative presentation to gradient Lipschitzness that establishes the connection between \ xr,k+1/2, the primal projection of averaged dual on the central server, and xm r,k+1/2 on each client. Lemma 16 (Bounding the Smooth Term). x, f( \ xr,k+1/2) f(x) gr,k+1/2, \ xr,k+1/2 x + 1 m=1 gm(xm r,k+1/2) gr,k+1/2, \ xr,k+1/2 x m=1 \ xr,k+1/2 xm r,k+1/2 2. Published as a conference paper at ICLR 2024 Proof. 
By the smoothness fm in the form of Assumption 5b and then the convexity of fm in the form of Assumption 5a, fm( \ xr,k+1/2) fm(xm r,k+1/2) + gm(xm r,k+1/2), \ xr,k+1/2 xm r,k+1/2 + β 2 \ xr,k+1/2 xm r,k+1/2 2 fm(xm r,k+1/2) + gm(xm r,k+1/2), \ xr,k+1/2 xm r,k+1/2 + β 2 \ xr,k+1/2 xm r,k+1/2 2 + fm(x) fm(xm r,k+1/2) + gm(xm r,k+1/2), xm r,k+1/2 x fm(x) + gm(xm r,k+1/2), \ xr,k+1/2 x + β 2 \ xr,k+1/2 xm r,k+1/2 2 Then for function f = 1 M PM m=1 fm, f( \ xr,k+1/2) f(x) 1 fm( \ xr,k+1/2) fm(x) m=1 gm(xm r,k+1/2), \ xr,k+1/2 x + 1 2 \ xr,k+1/2 xm r,k+1/2 2 = gr,k+1/2, \ xr,k+1/2 x + 1 m=1 gm(xm r,k+1/2) gr,k+1/2, \ xr,k+1/2 x m=1 \ xr,k+1/2 xm r,k+1/2 2. Now we are ready to present the main lemma that combines Lemma 15 and Lemma 16. For the proof, we utilize again Lemma 11, Lemma 12, and Lemma 14, all of which we claim to hold trivially in the composite convex optimization setting. Lemma 17 (Main Lemma for Fe Dual Ex in Composite Convex Optimization). Under Assumption 5, ηc E ϕ( \ xr,k+1/2) ϕ(x) V hr,k µr,k (x) V hr,k+1 µr,k+1 (x) + 5σ2ηc M + 10β2(ηc)3(2k2 + 2k + 1)G2 2M(1 ηc) + 2β(ηc)3(k + 1)2G2. Proof. Summing the results in Lemma 15 and Lemma 16: ηc ϕ( \ xr,k+1/2) ϕ(x) V hr,k µr,k (x) V hr,k+1 µr,k+1 (x) V hr,k µr,k ( \ xr,k+1/2) V hr,k+1 µr,k+1/2( \ xr,k+1) + ηc gr,k+1/2 gr,k, \ xr,k+1/2 \ xr,k+1 + ηcβ m=1 \ xr,k+1/2 xm r,k+1/2 2 m=1 gm(xm r,k+1/2) gr,k+1/2, \ xr,k+1/2 x . Published as a conference paper at ICLR 2024 For the latter two generalized Bregman divergence terms V hr,k µr,k ( \ xr,k+1/2) V hr,k+1 µr,k+1/2( \ xr,k+1), we bound them by Lemma 10 and the strong convexity of h in Assumption 3. As a result, ηc ϕ( \ xr,k+1/2) ϕ(x) V hr,k µr,k (x) V hr,k+1 µr,k+1 (x) 1 2 d xr,k \ xr,k+1/2 2 2 \ xr,k+1/2 \ xr,k+1 2 + ηc gr,k+1/2 gr,k, \ xr,k+1/2 \ xr,k+1 | {z } A m=1 gm(xm r,k+1/2) gr,k+1/2, \ xr,k+1/2 x m=1 \ xr,k+1/2 xm r,k+1/2 2. A can be bounded with Cauchy-Schwarz inequality (Lemma 5) and Young s inequality (Lemma 6). 2 \ xr,k+1/2 \ xr,k+1 2 + ηc gr,k+1/2 gr,k \ xr,k+1/2 \ xr,k+1 2 \ xr,k+1/2 \ xr,k+1 2 + (ηc)2 2 gr,k+1/2 gr,k 2 + 1 2 \ xr,k+1/2 \ xr,k+1 2 2 gr,k+1/2 gr,k 2 . Taking expectations on both sides we get ηc E ϕ( \ xr,k+1/2) ϕ(x) V hr,k µr,k (x) V hr,k+1 2E d xr,k \ xr,k+1/2 2 2 E gr,k+1/2 gr,k 2 m=1 gm(xm r,k+1/2) gr,k+1/2, \ xr,k+1/2 x m=1 E \ xr,k+1/2 xm r,k+1/2 2 B2 is bounded in Lemma 14. Therefore, for ηc 1 B1 + B2 5σ2(ηc)2 M + 20β2(ηc)4(k + 1)2G2. B3 is zero after taking the expectation by Lemma 11. B4 is bounded in Lemma 12. Plugging the bounds for B1 + B2, B3, and B4 back in completes the proof. G FEDUALEX IN OTHER SETTINGS In this section, we provide the algorithm along with the convergence rate for sequential versions of Fe Dual Ex. The proofs in this section rely only on the Lipschitzness of the gradient operator. As a result, the analysis applies to both composite saddle point optimization and composite convex optimization. G.1 STOCHASTIC DUAL EXTRAPOLATION FOR COMPOSITE SADDLE POINT OPTIMIZATION The sequential version of Fe Dual Ex immediately yields Algorithm 3, stochastic dual extrapolation for Composite SPP. This algorithm generalizes dual extrapolation to both composite and smooth Published as a conference paper at ICLR 2024 Algorithm 3 STOCHASTIC-DUAL-EXTRAPOLATION for Composite SPP Input: ϕ(z) = f(x, y) + ψ1(x) ψ2(y): objective function; ℓ(z): distance-generating function; g(z) = ( xf(x, y), yf(x, y)): gradient operator. Hyperparameters: T: number of iterations; η: step size. 
Theorem 3. Under the sequential version of Assumptions 1–4, namely with $M = 1$, $\forall z \in Z$, the ergodic intermediate sequence generated by Algorithm 3 satisfies

$$\mathbb{E}\Big[\mathrm{Gap}\Big(\frac{1}{T}\sum_{t=0}^{T-1} z_{t+1/2}\Big)\Big] \le \frac{B}{\eta T} + 3\eta\sigma^2.$$

Choosing step size $\eta = \min\big\{\frac{1}{3^{1/2}\beta},\ \frac{B^{1/2}}{3^{1/2}\sigma T^{1/2}}\big\}$ further yields the following convergence rate:

$$\mathbb{E}\Big[\mathrm{Gap}\Big(\frac{1}{T}\sum_{t=0}^{T-1} z_{t+1/2}\Big)\Big] \lesssim \frac{3^{1/2}\beta B}{T} + \frac{3^{1/2}\sigma B^{1/2}}{T^{1/2}}.$$

Proof. By a proof similar to that of Lemma 1, we have

$$\eta\big(\psi(z_{t+1/2}) - \psi(z)\big) \le V^{\ell_t}_{\omega_t}(z) - V^{\ell_{t+1}}_{\omega_{t+1}}(z) - V^{\ell_t}_{\omega_t}(z_{t+1/2}) - V^{\ell_{t+1}}_{\omega_{t+1/2}}(z_{t+1}) + \eta\big\langle g_{t+1/2} - g_t,\ z_{t+1/2} - z_{t+1}\big\rangle + \eta\big\langle g_{t+1/2},\ z - z_{t+1/2}\big\rangle$$

$$\le V^{\ell_t}_{\omega_t}(z) - V^{\ell_{t+1}}_{\omega_{t+1}}(z) - \frac12\|z_t - z_{t+1/2}\|^2 - \frac12\|z_{t+1/2} - z_{t+1}\|^2 + \underbrace{\eta\big\langle g_{t+1/2} - g_t,\ z_{t+1/2} - z_{t+1}\big\rangle}_{A} + \underbrace{\eta\big\langle g(z_{t+1/2}) - g_{t+1/2},\ z_{t+1/2} - z\big\rangle}_{B} - \eta\big\langle g(z_{t+1/2}),\ z_{t+1/2} - z\big\rangle,$$

where the inequality holds by Lemma 10 and the strong convexity of $\ell$ in Remark 4, followed by simply expanding the last term to build a connection between the stochastic gradient and the true gradient. By the Cauchy–Schwarz inequality (Lemma 5), Young's inequality (Lemma 6), and Lemma 7,

$$-\frac12\|z_t - z_{t+1/2}\|^2 - \frac12\|z_{t+1/2} - z_{t+1}\|^2 + A \le -\frac12\|z_t - z_{t+1/2}\|^2 - \frac12\|z_{t+1/2} - z_{t+1}\|^2 + \frac{\eta^2}{2}\|g_{t+1/2} - g_t\|^2 + \frac12\|z_{t+1/2} - z_{t+1}\|^2$$

$$= -\frac12\|z_t - z_{t+1/2}\|^2 + \frac{\eta^2}{2}\big\|[g_{t+1/2} - g(z_{t+1/2})] + [g(z_t) - g_t] + [g(z_{t+1/2}) - g(z_t)]\big\|^2$$

$$\le -\frac12\|z_t - z_{t+1/2}\|^2 + \frac{3\eta^2}{2}\|g(z_{t+1/2}) - g(z_t)\|^2 + \frac{3\eta^2}{2}\|g_{t+1/2} - g(z_{t+1/2})\|^2 + \frac{3\eta^2}{2}\|g(z_t) - g_t\|^2$$

$$\le -\Big(\frac12 - \frac{3\eta^2\beta^2}{2}\Big)\|z_t - z_{t+1/2}\|^2 + \frac{3\eta^2}{2}\|g_{t+1/2} - g(z_{t+1/2})\|^2 + \frac{3\eta^2}{2}\|g(z_t) - g_t\|^2,$$

where the last inequality holds by the $\beta$-Lipschitzness of the gradient operator. After taking expectations, the last two terms are bounded by the variance of the gradient $\sigma^2$, and $B$ becomes zero by a proof similar to that of Lemma 11. Therefore, for $\eta \le \frac{1}{3^{1/2}\beta}$,

$$\eta\,\mathbb{E}\big[\langle g(z_{t+1/2}),\ z_{t+1/2} - z\rangle + \psi(z_{t+1/2}) - \psi(z)\big] \le V^{\ell_t}_{\omega_t}(z) - V^{\ell_{t+1}}_{\omega_{t+1}}(z) + 3\eta^2\sigma^2.$$

Telescoping over all $t \in \{0, \ldots, T-1\}$ and dividing both sides by $\eta T$ completes the proof.

G.2 DETERMINISTIC DUAL EXTRAPOLATION FOR COMPOSITE SADDLE POINT OPTIMIZATION

Further removing the data-dependent noise in the gradient, we present the deterministic sequential version of FeDualEx, which still generalizes Nesterov's dual extrapolation (Nesterov, 2007) to composite saddle point optimization. As a result, we term this algorithm composite dual extrapolation, as presented in Algorithm 4. We also provide a convergence analysis, which shows that composite dual extrapolation achieves the same $O(\frac{1}{T})$ convergence rate as its original non-composite smooth version (Nesterov, 2007), as well as composite mirror prox (CoMP) (He et al., 2015).

Algorithm 4 COMPOSITE-DUAL-EXTRAPOLATION
Input: $\phi(z) = f(x, y) + \psi_1(x) - \psi_2(y)$: objective function; $\ell(z)$: distance-generating function; $g(z) = (\nabla_x f(x, y), -\nabla_y f(x, y))$: gradient operator.
Hyperparameters: $T$: number of iterations; $\eta$: step size.
Dual Initialization: $\varsigma_0 = 0$: initial dual variable; $\bar\varsigma \in S$: fixed point in the dual space.
Output: Approximate solution $z = (x, y)$ to $\min_{x \in X}\max_{y \in Y} \phi(x, y)$.
for $t = 0, 1, \ldots, T-1$ do
  $z_t = \mathrm{Prox}^{\ell_t}_{\bar\varsigma}(\varsigma_t)$  ▷ Two-step evaluation of the generalized proximal operator
  $z_{t+1/2} = \mathrm{Prox}^{\ell_{t+1}}_{\bar\varsigma - \varsigma_t}\big(\eta g(z_t)\big)$
  $\varsigma_{t+1} = \varsigma_t + \eta g(z_{t+1/2})$  ▷ Dual variable update
end for
Return: $\frac{1}{T}\sum_{t=0}^{T-1} z_{t+1/2}$.

We do so with a very simple proof based on the recently proposed notion of relative Lipschitzness (Cohen et al., 2021). We start by introducing the definition of relative Lipschitzness and a relevant lemma.

Definition 11 (Relative Lipschitzness (Definition 1 in Cohen et al. (2021))). For a convex distance-generating function $h : Z \to \mathbb{R}$, we call an operator $g : Z \to Z^*$ $\lambda$-relatively Lipschitz with respect to $h$ if $\forall z, w, u \in Z$,

$$\langle g(w) - g(z),\ w - u\rangle \le \lambda\big(V^h_z(w) + V^h_w(u)\big).$$

Lemma 18 (Lemma 1 in Cohen et al. (2021)). If $g$ is $\beta$-Lipschitz and $h$ is $\alpha$-strongly convex, then $g$ is $\frac{\beta}{\alpha}$-relatively Lipschitz with respect to $h$.
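For intuition, the short argument behind Lemma 18 (our paraphrase, written for the Euclidean-norm case) combines Cauchy–Schwarz, Young's inequality, and strong convexity:

\begin{align*}
\langle g(w) - g(z),\ w - u\rangle
&\le \|g(w) - g(z)\|\,\|w - u\| \le \beta\,\|w - z\|\,\|w - u\| \\
&\le \frac{\beta}{2}\big(\|w - z\|^2 + \|u - w\|^2\big)
\le \frac{\beta}{\alpha}\big(V^h_z(w) + V^h_w(u)\big),
\end{align*}

where the last step uses $V^h_z(w) \ge \frac{\alpha}{2}\|w - z\|^2$ and $V^h_w(u) \ge \frac{\alpha}{2}\|u - w\|^2$.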
Theorem 4. Under the basic convexity assumption and the $\beta$-Lipschitzness of $g$, $\forall z \in Z$ and step size $\eta \le \frac{1}{\beta}$, composite dual extrapolation satisfies

$$\mathrm{Gap}\Big(\frac{1}{T}\sum_{t=0}^{T-1} z_{t+1/2}\Big) \le \frac{B}{\eta T};$$

in particular, $\eta = \frac{1}{\beta}$ yields the $\frac{\beta B}{T}$ rate.

Proof. By a proof similar to that of Lemma 1, we have

$$\eta\big(\psi(z_{t+1/2}) - \psi(z)\big) \le V^{\ell_t}_{\omega_t}(z) - V^{\ell_{t+1}}_{\omega_{t+1}}(z) - V^{\ell_t}_{\omega_t}(z_{t+1/2}) - V^{\ell_{t+1}}_{\omega_{t+1/2}}(z_{t+1}) + \eta\big\langle g(z_{t+1/2}) - g(z_t),\ z_{t+1/2} - z_{t+1}\big\rangle + \eta\big\langle g(z_{t+1/2}),\ z - z_{t+1/2}\big\rangle.$$

By Lemma 18, we know that $g$ is $\beta$-relatively Lipschitz with respect to $\ell$ under the $\beta$-Lipschitzness assumption on $g$ and the 1-strong convexity assumption on $\ell$. Then by Definition 11, we have

$$\eta\big(\psi(z_{t+1/2}) - \psi(z) + \langle g(z_{t+1/2}),\ z_{t+1/2} - z\rangle\big) \le V^{\ell_t}_{\omega_t}(z) - V^{\ell_{t+1}}_{\omega_{t+1}}(z) - V^{\ell_t}_{\omega_t}(z_{t+1/2}) - V^{\ell_{t+1}}_{\omega_{t+1/2}}(z_{t+1}) + \eta\big\langle g(z_{t+1/2}) - g(z_t),\ z_{t+1/2} - z_{t+1}\big\rangle$$

$$\le V^{\ell_t}_{\omega_t}(z) - V^{\ell_{t+1}}_{\omega_{t+1}}(z) - V^{\ell_t}_{\omega_t}(z_{t+1/2}) - V^{\ell_{t+1}}_{\omega_{t+1/2}}(z_{t+1}) + \eta\beta\big(V^{\ell}_{z_t}(z_{t+1/2}) + V^{\ell}_{z_{t+1/2}}(z_{t+1})\big)$$

$$\le V^{\ell_t}_{\omega_t}(z) - V^{\ell_{t+1}}_{\omega_{t+1}}(z),$$

where the last inequality holds for $\eta \le \frac{1}{\beta}$ by Lemma 10. Telescoping over all $t \in \{0, \ldots, T-1\}$ and dividing both sides by $\eta T$ completes the proof.

H FEDERATED MIRROR PROX

We present Federated Mirror Prox (FedMiP) here in Algorithm 2 as a baseline. The part highlighted in green resembles the mirror prox algorithm introduced in Section C.1.2. We use the composite mirror map representation introduced in Section C.1.1 to avoid confusion, as the composite proximal operator we propose for FeDualEx is slightly different from the one used in composite mirror descent, as discussed in Section 4.1.

Algorithm 2 FEDERATED-MIRROR-PROX (FedMiP) for Composite SPP
Input: $\phi(z) = f(x, y) + \psi_1(x) - \psi_2(y) = \frac{1}{M}\sum_{m=1}^M f^m(\cdot) + \psi_1(x) - \psi_2(y)$: objective function; $\ell(z)$: distance-generating function; $g^m(z) = (\nabla_x f^m(x, y), -\nabla_y f^m(x, y))$: gradient operator.
Hyperparameters: $R$: number of rounds of communication; $K$: number of local update iterations; $\eta^s$: server step size; $\eta^c$: client step size.
Primal Initialization: $z_0$: initial primal variable.
Output: Approximate solution $z = (x, y)$ to $\min_{x \in X}\max_{y \in Y} \phi(x, y)$.
1: for $r = 0, 1, \ldots, R-1$ do
2:   Sample a subset of clients $C_r \subseteq [M]$
3:   for $m \in C_r$ in parallel do
4:     $z^m_{r,0} = z_r$
5:     for $k = 0, 1, \ldots, K-1$ do
6:       $z^m_{r,k+1/2} = \nabla(\ell + \eta^c\psi)^*\big(\nabla\ell(z^m_{r,k}) - \eta^c g^m(z^m_{r,k}; \xi^m_{r,k})\big)$
7:       $z^m_{r,k+1} = \nabla(\ell + \eta^c\psi)^*\big(\nabla\ell(z^m_{r,k}) - \eta^c g^m(z^m_{r,k+1/2}; \xi^m_{r,k+1/2})\big)$
8:     end for
9:   end parallel for
10:  $\Delta_r = \frac{1}{|C_r|}\sum_{m \in C_r}\big(z^m_{r,K} - z^m_{r,0}\big)$
11:  $z_{r+1} = \nabla(\ell + \eta^s\eta^c K\psi)^*\big(\nabla\ell(z_r) + \eta^s\Delta_r\big)$
12: end for
13: Return: $\frac{1}{RK}\sum_{r=0}^{R-1}\sum_{k=0}^{K-1} \bar z_{r,k+1/2}$
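As with FeDualEx above, the FedMiP steps admit a closed form in the Euclidean $\ell_1$ case: for $\ell(z) = \frac12\|z\|^2$ and $\psi = \lambda\|\cdot\|_1$, the composite mirror map $\nabla(\ell + \tau\psi)^*$ is soft-thresholding with a fixed threshold $\tau\lambda$. The following sketch of one communication round is our illustration under those assumptions; the `client_grads` oracles and helper names are hypothetical.

```python
import numpy as np

def comp_mirror_map(v, tau):
    # grad (l + tau*||.||_1)^*(v) for l(z) = 0.5*||z||^2: soft-thresholding
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def fedmip_round(z_r, client_grads, eta_s, eta_c, lam, K):
    """One communication round of FedMiP (sketch).
    z_r          : synchronized primal variable from the server
    client_grads : list of stochastic gradient oracles g^m(z; xi), one per
                   sampled client in C_r
    """
    deltas = []
    for grad in client_grads:                 # run in parallel in practice
        z = z_r.copy()
        for _ in range(K):
            # two mirror-prox half steps through the composite mirror map
            # (lines 6-7 of Algorithm 2)
            z_half = comp_mirror_map(z - eta_c * grad(z), eta_c * lam)
            z = comp_mirror_map(z - eta_c * grad(z_half), eta_c * lam)
        deltas.append(z - z_r)
    delta_r = np.mean(deltas, axis=0)         # line 10: average primal deltas
    # line 11: server step through the composite mirror map
    return comp_mirror_map(z_r + eta_s * delta_r, eta_s * eta_c * K * lam)
```

Note that line 10 averages primal iterates directly; this is exactly where the curse of primal averaging can destroy regularization-imposed sparsity, which is the behavior the FedMiP baseline is meant to expose in the comparison with FeDualEx.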