Stabilized Proximal-Point Methods for Federated Optimization

Xiaowen Jiang (Saarland University and CISPA, xiaowen.jiang@cispa.de), Anton Rodomanov (CISPA, anton.rodomanov@cispa.de), Sebastian U. Stich (CISPA, stich@cispa.de)

CISPA Helmholtz Center for Information Security, Saarbrücken, Germany. 38th Conference on Neural Information Processing Systems (NeurIPS 2024).

Abstract. In developing efficient optimization algorithms, it is crucial to account for communication constraints, a significant challenge in modern federated learning. The best-known communication complexity among non-accelerated algorithms is achieved by DANE, a distributed proximal-point algorithm that solves local subproblems at each iteration and that can exploit second-order similarity among individual functions. However, to achieve such communication efficiency, the algorithm requires solving local subproblems sufficiently accurately, resulting in a slightly sub-optimal local complexity. Inspired by the hybrid-projection proximal-point method, in this work we propose a novel distributed algorithm S-DANE. Compared to DANE, this method uses an auxiliary sequence of prox-centers while maintaining the same deterministic communication complexity. Moreover, the accuracy condition for solving the subproblem is milder, leading to enhanced local computation efficiency. Furthermore, S-DANE supports partial client participation and arbitrary stochastic local solvers, making it attractive in practice. We further accelerate S-DANE and show that the resulting algorithm achieves the best-known communication complexity among all existing methods for distributed convex optimization, while still enjoying the good local computation efficiency of S-DANE. Finally, we propose adaptive variants of both methods using line search, obtaining the first provably efficient adaptive algorithms that can exploit local second-order similarity without prior knowledge of any parameters.

1 Introduction

Federated learning is a rapidly emerging large-scale machine learning framework that allows training from decentralized data sources (e.g., mobile phones or hospitals) while preserving basic privacy and security [43, 24, 31]. Developing efficient federated optimization algorithms has become one of the central focuses due to its direct impact on the effectiveness of global machine learning models. One of the key challenges in modern federated optimization is to tackle communication bottlenecks [32]. Large model sizes, coupled with relatively limited network capacity and unstable client participation, often make communication highly expensive. Therefore, the primary efficiency metric of a federated optimization algorithm is the total number of communication rounds required to reach a desired accuracy level. If two algorithms share equivalent communication complexity, their local computation efficiency becomes the second important metric.

The seminal algorithm DANE [57] is an outstanding distributed optimization method. It achieves the best-known deterministic communication complexity among existing non-accelerated algorithms (on the server side) [22]. This efficiency primarily hinges upon a mild precondition regarding the second-order dissimilarity $\delta$. In numerous scenarios, such as statistical learning for generalized linear models [19] and semi-supervised learning [7], $\delta$ tends to be relatively small.
However, to ensure such fast convergence, DANE requires the subproblem at each iteration to be solved with sufficiently high accuracy. This leads to sub-optimal local computation effort across the communication rounds, which is inefficient in practice. FEDRED [22] improves on this weakness by using double regularization. However, this technique is only effective when using gradient descent as the local solver and cannot easily be combined with other optimization methods: for instance, applying local accelerated gradient or second-order methods does not improve its local computation efficiency. Moreover, it is also unclear how to extend this method to the partial client participation setting relevant to federated learning.

On the other hand, the communication complexities achieved by the current accelerated methods typically cannot be directly compared with those attained by DANE, as they either depend on sub-optimal constants or on additional quantities such as the number of clients $n$. The most relevant state-of-the-art algorithm ACC-EXTRAGRADIENT [33] achieves a better complexity in terms of the accuracy $\varepsilon$, but with a dependency on the maximum second-order dissimilarity $\delta_{\max}$, which can in principle be much larger than $\delta$ (see Remark 2). Unlike most federated learning algorithms, such as FEDAVG [43], this method requires communication with all the devices at each round to compute the full gradient and then assigns one device for local computation. In contrast, FEDAVG and similar algorithms perform local computations in parallel and use standard averaging to compute the global model. The follow-up work ACCSVRS [40] applies variance reduction to ACC-EXTRAGRADIENT, which results in less frequent full gradient updates. However, the communication complexity incurs a dependency on $n$, which is prohibitive for the cross-device setting [24] where the number of clients can be very large. Thus, there exists no accelerated federated algorithm that is uniformly better than DANE in terms of communication complexity.

Contributions. In this work, we aim to develop federated optimization algorithms that achieve the best communication complexity while retaining efficient local computation. To this end, we first revisit the simple proximal-point method on a single machine. The accuracy requirement for solving the subproblem defined in this algorithm is slightly sub-optimal. Drawing inspiration from the hybrid projection-proximal point method for finding zeroes of a maximal monotone operator [59], we observe that using a more stabilized prox-center relaxes this accuracy requirement. We make the following contributions:

- We develop a novel federated optimization algorithm S-DANE that achieves the best-known communication complexity (for non-accelerated methods) while also enjoying improved local computation efficiency over DANE [57].
- We develop an accelerated version of S-DANE based on the Monteiro-Svaiter acceleration [46]. The resulting algorithm ACC-S-DANE achieves the best-known communication complexity among all existing methods for distributed convex optimization.
- Both algorithms support partial client participation and arbitrary stochastic local solvers, making them attractive in practice for federated optimization.
- We provide a simple analysis for both algorithms. We derive convergence estimates that are continuous in the strong convexity parameter $\mu$.
- We propose adaptive variants of both algorithms using line search in the full client participation setting. The resulting methods achieve the same communication complexity (up to a logarithmic factor) as their non-adaptive counterparts, without requiring knowledge of the similarity constant.
- We illustrate the strong practical performance of our proposed methods in experiments.

See also Table 1 for a summary of the main complexity results in the full-participation setting.

Related Work. Moreau first proposed the notion of the proximal approximation of a function [47]. Based on this operation, Martinet developed the first proximal-point method [42]. This method was first accelerated by Güler [17], drawing inspiration from Nesterov's fast gradient method [49]. Later, Lin et al. [41] introduced the celebrated CATALYST framework that builds upon Güler's acceleration. Using CATALYST acceleration, a large class of optimization algorithms can directly achieve faster convergence. In a similar spirit, Doikov and Nesterov [12] proposed contracting proximal methods that can accelerate higher-order tensor methods. While Güler's acceleration has been successfully applied to many settings, its local computation is sub-optimal. Specifically, when minimizing smooth convex functions, a logarithmic dependence on the final accuracy is incurred in its local computation complexity [12].

| Algorithm | # vectors comm. per round | # comm. rounds (general convex) | # local gradient queries (general convex) | # comm. rounds ($\mu$-strongly convex) | # local gradient queries ($\mu$-strongly convex) | Guarantee |
|---|---|---|---|---|---|---|
| SCAFFNEW [44] (a)(b) | $n$ | $\frac{LD^2}{\varepsilon}$ | $1$ | $\sqrt{\frac{L}{\mu}}\log\frac{D^2 + H_0^2/(\mu L)}{\varepsilon}$ | $\sqrt{\frac{L}{\mu}}$ | in expectation |
| SONATA [60] (c) | $n$ | unknown | unknown | $\frac{\delta_{\max}}{\mu}\log\frac{D^2}{\varepsilon}$ | n/a (d) | deterministic |
| DANE [57] | $n$ | $\frac{\delta D^2}{\varepsilon}$ | $\frac{L}{\delta}\log\frac{LD^2}{\varepsilon}$ | $\frac{\delta}{\mu}\log\frac{D^2}{\varepsilon}$ | $\frac{L}{\delta}\log\frac{LD^2}{\varepsilon}$ | deterministic |
| FEDRED [22] | $n$ | $\frac{\delta D^2}{\varepsilon}$ | $\frac{L}{\delta}$ | $\frac{\delta}{\mu}\log\frac{D^2}{\varepsilon}$ | $\frac{L}{\delta}$ | in expectation |
| S-DANE (this work, Alg. 1) | $n$ | $\frac{\delta D^2}{\varepsilon}$ | $\sqrt{\frac{L}{\delta}}$ | $\frac{\delta}{\mu}\log\frac{D^2}{\varepsilon}$ | $\sqrt{\frac{L}{\delta}}$ | deterministic |
| Inexact ACC-SONATA [61] (c) | $n$ | unknown | unknown | $\sqrt{\frac{\delta_{\max}}{\mu}}\log\frac{D^2}{\varepsilon}$ | n/a | deterministic |
| ACC-EXTRAGRADIENT [33] (c) | $n$ | $\sqrt{\frac{\delta_{\max}D^2}{\varepsilon}}$ | $\sqrt{\frac{L}{\delta_{\max}}}$ | $\sqrt{\frac{\delta_{\max}}{\mu}}\log\frac{D^2}{\varepsilon}$ | $\sqrt{\frac{L}{\delta_{\max}}}$ | deterministic |
| CATALYZED SVRP [28] (e) | $1$ w.p. $1-\frac{1}{n}$; $n$ w.p. $\frac{1}{n}$ | unknown | unknown | $\big(n + n^{3/4}\sqrt{\frac{\delta}{\mu}}\big)\log\frac{1}{\varepsilon}$ (f) | n/a | in expectation |
| ACCSVRS [40] (c)(e) | $1$ w.p. $1-\frac{1}{n}$; $n$ w.p. $\frac{1}{n}$ | unknown | unknown | $\big(n + n^{3/4}\sqrt{\frac{\delta}{\mu}}\big)\log\frac{1}{\varepsilon}$ | $\sqrt{\frac{L}{\delta}}$ | in expectation |
| ACC-S-DANE (this work, Alg. 2) | $n$ | $\sqrt{\frac{\delta D^2}{\varepsilon}}$ | $\sqrt{\frac{L}{\delta}}$ | $\sqrt{\frac{\delta}{\mu}}\log\frac{D^2}{\varepsilon}$ | $\sqrt{\frac{L}{\delta}}$ | deterministic |

(a) For SCAFFNEW and FEDRED, the column "# comm. rounds" represents the expected number of total communications required to reach $\varepsilon$-accuracy, and the column "# local gradient queries" is replaced with the expected number of local steps between two communications.
(b) The general convex result for SCAFFNEW is established in Theorem 11 of the RANDPROX paper [8]. We assume that $h_{i,0} = \nabla f_i(x_0)$ and estimate $H_0^2 := \frac{1}{n}\sum_{i=1}^n \|h_{i,0} - \nabla f_i(x^\star)\|^2 \le L^2 D^2$. Then the best $p$ is of order 1.
(c) SONATA, INEXACT ACC-SONATA, ACC-EXTRAGRADIENT and ACCSVRS only need to assume strong convexity of $f$.
(d) Exact proximal local steps are used in SONATA.
(e) CATALYZED SVRP and ACCSVRS aim at minimizing a different measure, namely the total amount of information transmitted between the server and the clients. Their iteration complexity is equivalent to the communication rounds in our notation. We refer to Remark 7 for details.
(f) Khaled and Jin [28] assume exact evaluations of the proximal operator for convenience of analysis.

Table 1: Summary of the worst-case convergence behaviors of the considered distributed optimization methods (in Big-O notation), assuming each $f_i$ is $L$-smooth and $\mu$-convex with $\mu \le \Theta(\delta)$, where $\delta$, $\delta_{\max}$ and $\zeta$ are defined in Definition 2, Remark 2 and Definition 3, and $D := \|x_0 - x^\star\|$. The "# local gradient queries" column represents the number of gradient oracle queries required between two communication rounds to achieve the corresponding complexity, assuming the most efficient local first-order algorithms are used. The "Guarantee" column indicates whether the convergence guarantee holds in expectation or deterministically. The suboptimality $\varepsilon$ is defined via $\|\hat{x}_R - x^\star\|^2 \le \varepsilon$ and $f(\hat{x}_R) - f^\star \le \varepsilon$ for strongly convex and general convex functions, respectively, where $\hat{x}_R$ is a certain output produced by the algorithm after $R$ communication rounds.
Figure 1: Comparison of S-DANE and ACC-S-DANE with DANE for solving a convex quadratic minimization problem (communication rounds and total number of gradient oracle calls; estimated local smoothness and local dissimilarity for Adaptive S-DANE). All three methods use GD as the local solver. S-DANE has better local computation efficiency than DANE, while ACC-S-DANE further improves the communication complexity. Finally, the adaptive variants can leverage local dissimilarity to achieve better performance. (The definitions of local smoothness and dissimilarity can be found in Section 6.)

Solodov and Svaiter [59] proposed a HYBRID PROJECTION-PROXIMAL POINT method that allows a significant relaxation of the accuracy condition for the proximal-point subproblems. More recent works such as ADAPTIVE CATALYST [21] and RECAPP [5] also successfully remove the additional logarithmic factor for accelerated proximal-point methods. Another type of acceleration, based on the proximal extra-gradient method, was introduced by Monteiro and Svaiter [46]. This method is more general in the sense that it allows arbitrary local solvers, and the convergence rates depend on these solvers. For instance, under convexity and a Lipschitz second-order derivative, the rate can be accelerated to $O(1/k^{3.5})$ by using a Newton-type method. Moreover, when the gradient method is used, the Monteiro-Svaiter acceleration recovers the rate of Güler's acceleration, and its accuracy requirement for the inexact solution is weaker: for minimizing smooth convex functions, one gradient step is enough to approximately solve the local subproblem [48]. This technique has been applied to centralized composite optimization, known as gradient sliding [35, 36, 33]. A comprehensive study of acceleration can be found in [14]. We defer the literature review on distributed and federated optimization algorithms to Appendix A.

2 Problem Setup and Background

We consider the following distributed minimization problem:

$$\min_{x \in \mathbb{R}^d} \Big\{ f(x) := \frac{1}{n}\sum_{i=1}^n f_i(x) \Big\}, \qquad (1)$$

where each function $f_i : \mathbb{R}^d \to \mathbb{R}$ is $\mu$-strongly convex² for some $\mu \ge 0$. We focus on the standard federated setting where the functions $\{f_i\}$ are distributed among $n$ devices. The server coordinates the global optimization procedure among the devices. In each communication round, the server broadcasts certain information to the clients. The clients, in turn, perform local computation in parallel based on their own data and transmit the resulting local models back to the server to update the global model.

² If $\mu = 0$, then $f_i$ is assumed to be simply convex.

Main objective: Given the high cost of establishing connections between the server and the clients, our paramount objective is to minimize the number of communication rounds required to achieve the desired accuracy level. This represents a central metric in federated contexts, as outlined in references such as [43, 27].

Secondary objective: Efficiency in local computation represents another pivotal metric for optimization algorithms.
For instance, if two algorithms share equivalent communication complexity, the algorithm with less local computational complexity is the more favorable choice.

Notation: We abbreviate $[n] := \{1, 2, \dots, n\}$. For a set $A$ and an integer $1 \le s \le |A|$, we use $\binom{A}{s}$ to denote the collection of all $s$-element subsets of $A$. Everywhere in this paper, $\|\cdot\|$ denotes the standard Euclidean norm (or the corresponding spectral norm for matrices). We assume problem (1) has a solution, which we denote by $x^\star$; the corresponding optimal value is denoted by $f^\star$. For a set $S \in \binom{[n]}{s}$, we use $f^S := \frac{1}{s}\sum_{i \in S} f_i$ to denote the average function over this set. We use $r$ to denote the index of the communication round and $k$ to denote the index of the local step. Finally, we use distinct scripts to denote the global and local models; for instance, $x_r$ represents the global model at round $r$, while $x_{i,r}$ is the local model computed by device $i$ at round $r$.

2.1 Proximal-Point Methods on a Single Machine

In this section, we provide brief background on proximal-point methods [47, 42, 54, 51], which are the foundation for many distributed optimization algorithms.

Proximal-Point Method. Given an iterate $x_k$, the method defines $x_{k+1}$ to be an (approximate) minimizer of the proximal-point subproblem:

$$x_{k+1} \approx \arg\min_{x \in \mathbb{R}^d} \Big\{ F_k(x) := f(x) + \frac{\lambda}{2}\|x - x_k\|^2 \Big\}, \qquad (2)$$

for an appropriately chosen parameter $\lambda \ge 0$. This parameter trades off the complexity of each iteration against the rate of convergence. If $\lambda = 0$, the subproblem in each iteration is as difficult as the original problem because no regularization is applied. As $\lambda$ increases, more regularization is added, simplifying the subproblem. For example, for a convex function $f$, the proximal-point method guarantees $f(\bar{x}_K) - f^\star \le O\big(\frac{\lambda}{K}\|x_0 - x^\star\|^2\big)$, where $\bar{x}_K := \frac{1}{K}\sum_{k=1}^K x_k$ [51, 5]. However, to achieve such a convergence rate, the subproblem (2) has to be solved to fairly high accuracy [54, 5]. For instance, the accuracy condition should either depend on the target accuracy $\varepsilon$, or tighten with $k$: $\|\nabla F_k(x_{k+1})\| = O\big(\frac{\lambda}{k}\|x_{k+1} - x_k\|\big)$ [58]. Indeed, when $f$ is Lipschitz-smooth and standard gradient descent is used as a local solver, the number of gradient steps required to solve the subproblem has a logarithmic dependence on the iteration counter $k$. The same issue also arises for accelerated proximal-point methods [12, 17].

Stabilized Proximal-Point Method. One of the key insights that we use in this work is the observation that using a different prox-center makes the accuracy condition of the subproblem weaker. The stabilized proximal-point method defines

$$x_{k+1} \approx \arg\min_{x} \Big\{ F_k(x) := f(x) + \frac{\lambda}{2}\|x - v_k\|^2 \Big\}, \qquad v_{k+1} = \arg\min_{x} \Big\{ \langle \nabla f(x_{k+1}), x\rangle + \frac{\mu}{2}\|x - x_{k+1}\|^2 + \frac{\lambda}{2}\|x - v_k\|^2 \Big\}, \qquad (3)$$

where $\lambda \ge 0$ is a parameter of the method and $\mu \ge 0$ is the strong-convexity constant of $f$. This algorithm updates the prox-center $v_k$ by performing an additional gradient step in each iteration. For instance, when $\mu = 0$, the prox-center is updated as $v_{k+1} = v_k - \frac{1}{\lambda}\nabla f(x_{k+1})$, which is often referred to as an extra-gradient update.
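To make the update rule (3) concrete, the following is a minimal NumPy sketch of the stabilized proximal-point method, in which the subproblem is solved inexactly by a few gradient steps. The quadratic test function, step size, and iteration counts are illustrative assumptions, not part of the method's specification.

```python
import numpy as np

def stabilized_ppm(grad, x0, lam, mu=0.0, rounds=50, inner_steps=3):
    """Sketch of the stabilized proximal-point method (3).

    grad: gradient oracle of f; lam: prox parameter; mu: strong-convexity
    constant of f.  The subproblem F_k(x) = f(x) + lam/2 ||x - v_k||^2 is
    solved inexactly by a few gradient steps, which is enough to satisfy
    the relaxed condition ||grad F_k(x_{k+1})|| <= O(lam ||x_{k+1} - v_k||).
    """
    v = x0.copy()
    lr = 1.0 / (2.0 * lam)           # illustrative step size for the subproblem
    for _ in range(rounds):
        x = v.copy()                 # warm start at the prox-center
        for _ in range(inner_steps):
            x -= lr * (grad(x) + lam * (x - v))
        # stabilized prox-center update; for mu = 0 this is the extra-gradient
        # step v <- v - (1/lam) grad f(x_{k+1})
        v = (lam * v + mu * x - grad(x)) / (lam + mu)
    return x

# toy run on the quadratic f(x) = 1/2 ||A x - b||^2
rng = np.random.default_rng(0)
A, b = rng.standard_normal((20, 10)), rng.standard_normal(20)
x = stabilized_ppm(lambda z: A.T @ (A @ z - b), np.zeros(10),
                   lam=np.linalg.norm(A, 2) ** 2)
```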
The stabilized proximal-point method has the same convergence rate as the original method (2), but requires only that $\|\nabla F_k(x_{k+1})\| \le O(\lambda\|x_{k+1} - v_k\|)$. As a result, there is no extra logarithmic factor in $k$ in the oracle complexity estimate when $f$ is $L$-smooth. Specifically, by setting $\lambda = \Theta(L)$, the previous condition can be satisfied by choosing $x_{k+1}$ as the result of one gradient step from $v_k$ [48]. This shows that the stabilized proximal-point method has a better overall oracle complexity than the standard proximal-point method (cf. Theorem 1 for the special case $n = 1$). It is worth noting that the former algorithm originates from the hybrid projection-proximal point algorithm [59] designed for solving the more general problem of finding zeroes of a monotone operator. In this work, we apply this algorithm in the distributed setting ($n \ge 2$).

2.2 Distributed Proximal-Point Methods

The proximal-point method can be adapted to solve the distributed optimization problem (1). This is the idea behind FEDPROX [39]. It replaces the global proximal step (2) by $n$ subproblems defined as $x_{i,r+1} := \arg\min_x \{f_i(x) + \frac{\lambda}{2}\|x - x_r\|^2\}$, which can be solved independently on each device, followed by the averaging step $x_{r+1} = \frac{1}{n}\sum_{i=1}^n x_{i,r+1}$. Here we switch the notation from $k$ to $r$ to highlight that one iteration of the proximal-point method corresponds to a communication round in this setting. To ensure convergence, FEDPROX has to use a large $\lambda$ that depends on the target accuracy as well as the heterogeneity among $\{f_i\}$, which slows down the communication efficiency [39]. DANE [57] improves on this by incorporating a drift-correction term into the subproblem:

$$x_{i,r+1} := \arg\min_{x} \Big\{ F_{i,r}(x) := f_i(x) + \langle \nabla f(x_r) - \nabla f_i(x_r), x\rangle + \frac{\lambda}{2}\|x - x_r\|^2 \Big\}. \qquad (4)$$

Consequently, DANE allows choosing a much smaller $\lambda$. Moreover, it can exploit second-order similarity and achieves the best-known communication complexity among non-accelerated methods [22]. However, as in the original proximal-point method, the subproblem needs to be solved sufficiently accurately, leading to an extra logarithmic factor in the oracle complexity estimate. To overcome this problem, we propose new algorithms described in the following section.

3 Stabilized DANE

We now describe S-DANE (Alg. 1), our proposed federated proximal-point method that employs stabilized prox-centers in its subproblems.

Algorithm 1 S-DANE: Stabilized DANE
1: Input: $\lambda > 0$, $\mu \ge 0$, $s \in [n]$, $x_0 = v_0 \in \mathbb{R}^d$.
2: for $r = 0, 1, 2, \dots$ do
3:   Sample $S_r \in \binom{[n]}{s}$ uniformly at random without replacement.
4:   for each device $i \in S_r$ in parallel do
5:     $x_{i,r+1} \approx \arg\min_{x \in \mathbb{R}^d} \big\{ F_{i,r}(x) := f_i(x) + \langle \nabla f^{S_r}(v_r) - \nabla f_i(v_r), x\rangle + \frac{\lambda}{2}\|x - v_r\|^2 \big\}$.
6:   $x_{r+1} = \frac{1}{s}\sum_{i \in S_r} x_{i,r+1}$.
7:   $v_{r+1} := \arg\min_{x \in \mathbb{R}^d} \big\{ \frac{1}{s}\sum_{i \in S_r}\big[\langle \nabla f_i(x_{i,r+1}), x\rangle + \frac{\mu}{2}\|x - x_{i,r+1}\|^2\big] + \frac{\lambda}{2}\|x - v_r\|^2 \big\}$.

During each communication round $r$, the server samples a subset of clients uniformly at random and sends $v_r$ to these clients. Then the server collects $\nabla f_i(v_r)$ from these clients, computes $\nabla f^{S_r}(v_r)$, and sends $\nabla f^{S_r}(v_r)$ back to them. Each device in the set then calls an arbitrary local solver (which can be different on each device) to approximately solve its local subproblem. Finally, each device transmits $\nabla f_i(x_{i,r+1})$ and $x_{i,r+1}$ back to the server, which then aggregates these points and computes the new global model.

Like DANE, S-DANE can achieve a communication speed-up if the functions on different devices are similar to each other. This is formally captured by the following assumption.

Definition 1 (Second-order Dissimilarity). Let $f_1, \dots, f_n : \mathbb{R}^d \to \mathbb{R}$ be functions, and let $s \in [n]$, $\delta_s \ge 0$. Then $\{f_i\}_{i=1}^n$ are said to have $\delta_s$-SOD (of size $s$) if, for any $x, y \in \mathbb{R}^d$ and any $S \in \binom{[n]}{s}$, it holds that

$$\frac{1}{s}\sum_{i \in S} \|\nabla h_i^S(x) - \nabla h_i^S(y)\|^2 \le \delta_s^2 \|x - y\|^2, \qquad (5)$$

where $h_i^S := f^S - f_i$ and $f^S := \frac{1}{s}\sum_{i \in S} f_i$. Definition 1 quantifies the dissimilarity between any $s$ functions and their average, i.e., the internal variation among any $s$ functions.
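For quadratic functions, the gradients of $h_i^S$ are linear, so the best constant in (5) for a fixed $S$ is an eigenvalue of an averaged squared Hessian deviation, which makes Definition 1 easy to explore numerically. The brute-force sketch below (with an assumed toy setup of random PSD Hessians and small $n$) is only for intuition.

```python
import numpy as np
from itertools import combinations

def sod_constant(hessians, s):
    """Exact delta_s of Definition 1 for quadratic f_i (constant Hessians H_i).

    For quadratics, grad h_i^S(x) - grad h_i^S(y) = (H^S - H_i)(x - y), so the
    tightest constant for a fixed S is lambda_max((1/s) sum_i (H^S - H_i)^2)^{1/2};
    delta_s is the maximum over all s-element subsets S.
    """
    n = len(hessians)
    worst = 0.0
    for S in combinations(range(n), s):
        H_S = sum(hessians[i] for i in S) / s
        M = sum((H_S - hessians[i]) @ (H_S - hessians[i]) for i in S) / s
        worst = max(worst, np.linalg.eigvalsh(M)[-1])  # largest eigenvalue
    return np.sqrt(worst)

rng = np.random.default_rng(1)
Hs = []
for _ in range(5):
    B = rng.standard_normal((8, 8))
    Hs.append(B @ B.T)                                 # random PSD Hessians
print([sod_constant(Hs, s) for s in range(1, 6)])      # delta_1 = 0, as expected
```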
Clearly, $\delta_1 = 0$, and, when $s = n$, we recover the standard notion of second-order dissimilarity introduced in prior works:

Definition 2 ($\delta$-SOD [28, 40, 22]). $\delta$-SOD := $\delta_n$-SOD of size $n$.

When each function $f_i$ is twice continuously differentiable, a simple sufficient condition for (5) is that $\frac{1}{s}\sum_{i \in S} \|\nabla^2 h_i^S(x)\|^2 \le \delta_s^2$ for any $x \in \mathbb{R}^d$. However, this is not a necessary condition (see [22] for more details). The quantity $V(x,y)$ on the left-hand side of (5) can be interpreted as the variance of the gradient-difference estimator $\nabla f_{\hat{i}}(x) - \nabla f_{\hat{i}}(y)$, where $\hat{i}$ is chosen uniformly at random from $S$. In particular, it can be rewritten as $V(x,y) = \frac{1}{s}\sum_{i \in S}\|\nabla f_i(x) - \nabla f_i(y)\|^2 - \|\nabla f^S(x) - \nabla f^S(y)\|^2$. If each function $f_i$ is $L_i$-smooth, then $\delta_s \le \big(\frac{1}{s}\sum_{i \in S} L_i^2\big)^{1/2}$ for any $s \in [n]$. However, in general, condition (5) is weaker than assuming that each $f_i$ is Lipschitz-smooth.

Full Client Participation. We first consider the cross-silo setting where all the clients are highly reliable ($s = n$). This is typically the case for organizations and institutions with strong computing resources and stable network connections [24].

Theorem 1. Consider Algorithm 1 with $s = n$. Let $f_i : \mathbb{R}^d \to \mathbb{R}$ be $\mu$-convex with $\mu \ge 0$ for any $i \in [n]$. Assume that $\{f_i\}_{i=1}^n$ have $\delta$-SOD. Let $\lambda = 2\delta$ and suppose that, for any $r \ge 0$, we have

$$\sum_{i=1}^n \|\nabla F_{i,r}(x_{i,r+1})\|^2 \le \frac{\lambda^2}{4}\sum_{i=1}^n \|x_{i,r+1} - v_r\|^2. \qquad (6)$$

Then, for any $R \ge 1$, it holds that³

$$f(\bar{x}_R) - f^\star \le \frac{\mu D^2}{2\big[\big(1 + \frac{\mu}{2\delta}\big)^R - 1\big]} \le \frac{\delta D^2}{R},$$

where $\bar{x}_R := \frac{1}{\sum_{r=1}^R p^r}\sum_{r=1}^R p^r x_r$ for $p := 1 + \frac{\mu}{\lambda}$, and $D := \|x_0 - x^\star\|$. To obtain $f(\bar{x}_R) - f^\star \le \varepsilon$ for a given $\varepsilon > 0$, it thus suffices to perform $R = O\big(\frac{\delta + \mu}{\mu}\log\big(1 + \frac{\mu D^2}{\varepsilon}\big)\big)$ communication rounds.

³ Here, for $\mu = 0$, the expression after the first inequality should be understood as the corresponding limit as $\mu \to 0$, which is exactly the expression after the final inequality. The same remark applies to all other similar results.

Theorem 1 provides the convergence guarantee for S-DANE in terms of the number of communication rounds. Note that the rate is continuous in $\mu$.

Remark 2. Some previous works express complexity estimates in terms of another constant, $\delta_{\max}$, defined by the inequality $\|\nabla h_i(x) - \nabla h_i(y)\| \le \delta_{\max}\|x - y\|$ holding for any $x, y \in \mathbb{R}^d$ and any $i \in [n]$, where $h_i := f - f_i$ (see, for instance, the second line in Table 1). Note that our $\delta$ is never larger than $\delta_{\max}$ and can in principle be much smaller (up to $\sqrt{n}$ times).

The proven communication complexity is the same as that of DANE [22]. However, the accuracy condition is milder. Specifically, to achieve the same guarantee, DANE requires $\sum_{i=1}^n \|\nabla F_{i,r}(x_{i,r+1})\|^2 \le O\big(\frac{\delta^2}{r^2}\sum_{i=1}^n \|x_{i,r+1} - x_r\|^2\big)$, where $F_{i,r}$ is defined as in (4); this incurs an $r^2$ overhead in the denominator, in line with the general discussion on proximal-point methods in Section 2.1. The next corollary shows that local computation in S-DANE can be very efficient.

Corollary 3. Consider the same setting as in Theorem 1. Further, assume that each $f_i$ is $L$-smooth. To ensure (6) with a certain first-order algorithm, each device $i$ needs to perform at most $O(\sqrt{L/\delta})$ computations of $\nabla f_i$ at each round $r$.

Remark 4. Particular examples of algorithms that achieve the result of Corollary 3 are OGM-OG by Kim and Fessler [30] and the accumulative regularization method by Lan et al. [37], both designed for fast minimization of the gradient norm. For the standard Gradient Method (GM), the required number of oracle calls is $O(L/\delta)$. The standard Fast Gradient Method (FGM) [49] can decrease this complexity to $O\big(\sqrt{\frac{L}{\mu + \delta}}\log\frac{L}{\delta}\big)$ (see Remark 18 for details). Thus, each device can run a constant number of standard local (F)GM steps to approximately solve its subproblem in S-DANE.
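For illustration, here is a minimal NumPy sketch of full-participation S-DANE ($s = n$) with plain gradient descent as the local solver. The step size, the number of local steps, and the per-client gradient-oracle interface are illustrative assumptions rather than part of the algorithm's specification.

```python
import numpy as np

def s_dane_gd(grads, x0, lam, mu=0.0, rounds=100, local_steps=5, lr=None):
    """Sketch of full-participation S-DANE (Algorithm 1 with s = n) with
    gradient descent as the local solver.  grads is a list of per-client
    gradient oracles grad f_i; lam should be of order 2*delta."""
    n = len(grads)
    v = x0.copy()
    x_bar = x0.copy()
    if lr is None:
        lr = 1.0 / (2.0 * lam)       # illustrative local step size; tune per problem
    for _ in range(rounds):
        g_avg = sum(g(v) for g in grads) / n              # server: grad f(v_r)
        xs = []
        for g in grads:                                   # clients, in parallel
            corr = g_avg - g(v)                           # drift correction
            x = v.copy()
            for _ in range(local_steps):
                # gradient of F_{i,r}(x) = f_i(x) + <corr, x> + lam/2 ||x - v||^2
                x -= lr * (g(x) + corr + lam * (x - v))
            xs.append(x)
        x_bar = sum(xs) / n                               # line 6: aggregation
        g_bar = sum(g(x) for g, x in zip(grads, xs)) / n
        v = (lam * v + mu * x_bar - g_bar) / (lam + mu)   # line 7 in closed form
    return x_bar
```

In the spirit of Corollary 3, a handful of local steps per round is typically enough to satisfy a condition like (6) on smooth problems, though the exact count depends on the local solver and problem constants.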
Partial Client Participation. Next, we turn our attention to the cross-device setting, where a large number of clients (typically mobile phones) have either unstable network connections or weak computational power [24]. In such scenarios, the server typically cannot expect all the clients to participate in the communication at each round. Furthermore, the clients may be asked to communicate only once during the whole training and are stateless [27]. Therefore, we now consider S-DANE with partial client participation and without storing any states on devices. To prove convergence, it is necessary to assume a certain level of dissimilarity among clients. Here, we use the same assumption as in [27] to measure the gradient variance.

Definition 3 (Bounded Gradient Variance [27]). Let $f_1, \dots, f_n : \mathbb{R}^d \to \mathbb{R}$ be functions, and let $\zeta \ge 0$. We say that $\{f_i\}_{i=1}^n$ have $\zeta$-BGV if, for any $x \in \mathbb{R}^d$ and $f := \frac{1}{n}\sum_{i=1}^n f_i$, it holds that

$$\frac{1}{n}\sum_{i=1}^n \|\nabla f_i(x) - \nabla f(x)\|^2 \le \zeta^2. \qquad (7)$$

Definition 3 is similar to the classical notion of uniformly bounded variance used in the context of classical stochastic gradient methods [3]. We also need the following assumption, which complements Definition 1.

Definition 4 (External Dissimilarity). Let $f_1, \dots, f_n : \mathbb{R}^d \to \mathbb{R}$ be functions, and let $s \in [n]$, $\Delta_s \ge 0$. Then $\{f_i\}_{i=1}^n$ are said to have $\Delta_s$-ED (of size $s$) if, for any $x, y \in \mathbb{R}^d$ and any $S \in \binom{[n]}{s}$, we have

$$\|\nabla m^S(x) - \nabla m^S(y)\| \le \Delta_s \|x - y\|, \qquad (8)$$

where $m^S := f - f^S$ and $f^S := \frac{1}{s}\sum_{i \in S} f_i$. Compared to Definition 1, Definition 4 quantifies the external variation of any $s$ functions w.r.t. the original function $f$. When each $f_i$ is twice continuously differentiable, (8) is equivalent to $\|\nabla^2 m^S(x)\| \le \Delta_s$ for any $x \in \mathbb{R}^d$. If each $f_i$ is $L$-smooth, then $\Delta_s \le L$ for any $s \in [n]$. Therefore, using both Definitions 1 and 4 together is still weaker than assuming that each $f_i$ is $L$-smooth. In what follows, we work with a new second-order dissimilarity measure defined as the sum $\delta_s + \Delta_s$. Note that $\delta_1 + \Delta_1 = \delta_{\max}$ and $\delta_n + \Delta_n = \delta$.

Theorem 5. Consider Algorithm 1. Let $f_i : \mathbb{R}^d \to \mathbb{R}$ be $\mu$-convex with $\mu \ge 0$ for any $i \in [n]$, and let $n \ge 2$. Assume that $\{f_i\}_{i=1}^n$ have $\delta_s$-SOD, $\Delta_s$-ED and $\zeta$-BGV. Let

$$\lambda = \frac{4(n-s)\zeta^2}{s(n-1)\varepsilon} + 2(\delta_s + \Delta_s),$$

and suppose that, for any $r \ge 0$, we have

$$\sum_{i \in S_r} \|\nabla F_{i,r}(x_{i,r+1})\|^2 \le \frac{\lambda^2}{4}\sum_{i \in S_r} \|x_{i,r+1} - v_r\|^2. \qquad (9)$$

Then, to ensure that $\mathbb{E}[f(\bar{x}_R)] - f(x^\star) \le \varepsilon$ for a given $\varepsilon > 0$, it suffices to perform at most the following number of communication rounds:

$$R = \Theta\Big(\Big(\frac{\delta_s + \Delta_s + \mu}{\mu} + \frac{(n-s)\zeta^2}{s(n-1)\mu\varepsilon}\Big)\log\Big(1 + \frac{\mu D^2}{\varepsilon}\Big)\Big) \;\overset{\mu \to 0}{=}\; \Theta\Big(\frac{(\delta_s + \Delta_s)D^2}{\varepsilon} + \frac{(n-s)\zeta^2 D^2}{s(n-1)\varepsilon^2}\Big),$$

where $\bar{x}_R := \frac{1}{\sum_{r=1}^R p^r}\sum_{r=1}^R p^r x_r$ with $p := 1 + \frac{\mu}{\lambda}$, and $D := \|x_0 - x^\star\|$.

Theorem 5 provides the communication complexity of S-DANE with client sampling and arbitrary (deterministic) local solvers. The rate is again continuous in $\mu$. Compared with the previous case $s = n$, the efficiency now depends on the gradient variance $\zeta$. Note that this error term shrinks as $s$ increases. Specifically, to achieve the $O(\log\frac{1}{\varepsilon})$ and $O(\frac{1}{\varepsilon})$ rates, it suffices to ensure that $s = \Omega\big(\frac{n\zeta^2}{\zeta^2 + n\varepsilon(\delta_s + \Delta_s)}\big)$. Notably, the algorithm can reach any target accuracy even when $n \to \infty$. Observe that the accuracy requirement (9) is the same as (6); therefore, the discussions there remain valid in the partial-participation setting as well. In particular, if each $f_i$ is $L$-smooth, then the number of oracle calls to $\nabla f_i$ required at each round can be as small as $O(\sqrt{L/\lambda})$ (see Corollary 3). At the same time, it is also possible to use a stochastic optimization algorithm as a local solver (for more details, see Section C.3.1).
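A sketch of a single partial-participation round follows, under the same illustrative interface as the earlier full-participation snippet (per-client gradient oracles and a gradient-descent local solver); only the sampled clients in $S_r$ compute or communicate.

```python
import numpy as np

def s_dane_round_partial(grads, v, lam, mu, s, local_steps, lr, rng):
    """One round of S-DANE with client sampling (lines 3-7 of Algorithm 1)."""
    idx = rng.choice(len(grads), size=s, replace=False)   # S_r, uniform w/o repl.
    g_S = sum(grads[i](v) for i in idx) / s               # grad f^{S_r}(v_r)
    xs, gs = [], []
    for i in idx:
        corr = g_S - grads[i](v)                          # drift correction
        x = v.copy()
        for _ in range(local_steps):
            x -= lr * (grads[i](x) + corr + lam * (x - v))
        xs.append(x)
        gs.append(grads[i](x))
    x_bar, g_bar = sum(xs) / s, sum(gs) / s
    v_new = (lam * v + mu * x_bar - g_bar) / (lam + mu)   # new prox-center
    return x_bar, v_new

# usage sketch: rng = np.random.default_rng(0); call once per communication round
```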
4 Accelerated S-DANE

In this section, we present the accelerated version of S-DANE, ACC-S-DANE (Alg. 2), which achieves a better communication complexity than the basic method. For simplicity, we only consider the full-participation setting here and defer partial participation to Appendix D.3.

Algorithm 2 ACC-S-DANE
1: Input: $\lambda > 0$, $\mu \ge 0$, $x_0 = v_0 \in \mathbb{R}^d$, $s \in [n]$.
2: Set $A_0 = 0$, $B_0 = 1$.
3: for $r = 0, 1, 2, \dots$ do
4:   Find $a_{r+1} > 0$ from the equation $a_{r+1}^2\lambda = (A_r + a_{r+1})(B_r + \mu a_{r+1})$.
5:   $A_{r+1} = A_r + a_{r+1}$, $B_{r+1} = B_r + \mu a_{r+1}$.
6:   $y_r = \frac{A_r}{A_{r+1}}x_r + \frac{a_{r+1}}{A_{r+1}}v_r$.
7:   Sample $S_r \in \binom{[n]}{s}$ uniformly at random without replacement.
8:   for each device $i \in S_r$ in parallel do
9:     $x_{i,r+1} \approx \arg\min_{x \in \mathbb{R}^d}\big\{F_{i,r}(x) := f_i(x) + \langle \nabla f^{S_r}(y_r) - \nabla f_i(y_r), x\rangle + \frac{\lambda}{2}\|x - y_r\|^2\big\}$.
10:  $x_{r+1} = \frac{1}{s}\sum_{i \in S_r} x_{i,r+1}$.
11:  $v_{r+1} = \arg\min_{x \in \mathbb{R}^d}\big\{a_{r+1}\cdot\frac{1}{s}\sum_{i \in S_r}\big[\langle \nabla f_i(x_{i,r+1}), x\rangle + \frac{\mu}{2}\|x - x_{i,r+1}\|^2\big] + \frac{B_r}{2}\|x - v_r\|^2\big\}$.

Theorem 6. Consider Algorithm 2 with $s = n$. Let $f_i : \mathbb{R}^d \to \mathbb{R}$ be $\mu$-convex with $\mu \ge 0$ for any $i \in [n]$. Assume that $\{f_i\}_{i=1}^n$ have $\delta$-SOD ($\delta > 0$). Let $\lambda = 2\delta$ and suppose that, for any $r \ge 0$, we have $\sum_{i=1}^n \|\nabla F_{i,r}(x_{i,r+1})\|^2 \le \delta^2 \sum_{i=1}^n \|x_{i,r+1} - y_r\|^2$. If $\mu \le 8\delta$, then, for any $R \ge 1$,

$$f(x_R) - f^\star \le \frac{\mu D^2}{2\big[\big(1 + \sqrt{\mu/(8\delta)}\big)^R - 1\big]^2} \le \frac{4\delta D^2}{R^2},$$

where $D := \|x_0 - x^\star\|$. Otherwise, $f(x_R) - f^\star \le \frac{4\delta D^2}{(1 + \sqrt{\mu/(8\delta)})^{2(R-1)}}$ for any $R \ge 1$. To ensure that $f(x_R) - f^\star \le \varepsilon$ for a given $\varepsilon > 0$, it thus suffices to perform $R = O\big(\sqrt{\frac{\delta}{\mu}}\log\big(1 + \frac{\mu D^2}{\varepsilon}\big)\big)$ communication rounds when $\mu > 0$, and $R = O\big(\sqrt{\frac{\delta D^2}{\varepsilon}}\big)$ rounds when $\mu = 0$.

Let us consider the most interesting regime $\mu \le 8\delta$. Comparing Theorems 1 and 6, we see that ACC-S-DANE essentially takes the square root of the corresponding communication complexity of S-DANE, improving it from $O(\frac{\delta}{\mu})$ to $O(\sqrt{\frac{\delta}{\mu}})$ when $\mu > 0$, and from $O(\frac{\delta D^2}{\varepsilon})$ to $O(\sqrt{\frac{\delta D^2}{\varepsilon}})$ when $\mu = 0$, while maintaining the same accuracy condition for solving the subproblem. Compared with ACC-EXTRAGRADIENT, the complexity depends on the better constant $\delta$ instead of $\delta_{\max}$. Note that we can satisfy the accuracy condition in Theorem 6 in exactly the same way as in Corollary 3. In particular, if each $f_i$ is $L$-smooth, each device $i$ needs at most $O(\sqrt{L/\delta})$ computations of $\nabla f_i$ at each round $r$ when using a fast algorithm for gradient-norm minimization.

Finally, let us highlight that Algorithm 2 gives a distributed framework for a generic acceleration scheme that applies to a large class of local optimization methods, in the same spirit as the famous CATALYST framework [41] for the case $n = 1$. However, in contrast to CATALYST, this stabilized version removes the logarithmic overhead present in the original method. Specifically, when applying Theorem 6 with $n = 1$ and $\lambda = L$ for a smooth convex function $f$, we recover the same rate as CATALYST. The accuracy condition $\|\nabla F_r(x_{r+1})\| \le L\|x_{r+1} - y_r\|$, or equivalently $\langle \nabla f(x_{r+1}), y_r - x_{r+1}\rangle \ge \frac{1}{2L}\|\nabla f(x_{r+1})\|^2$, can be achieved with one gradient step $x_{r+1} := y_r - \frac{1}{L}\nabla f(y_r)$ (see Lemma 5 in [48]).
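The coefficient recursion in lines 4-5 is straightforward to implement. The sketch below solves the coefficient equation stated above, $a_{r+1}^2\lambda = (A_r + a_{r+1})(B_r + \mu a_{r+1})$ (as reconstructed in this text), in closed form, assuming $\lambda > \mu$. For $\mu = 0$, the resulting $A_R$ grows like $R^2/(4\lambda)$, which is the source of the accelerated $O(1/R^2)$ rate.

```python
import numpy as np

def acc_coefficients(lam, mu, rounds):
    """Coefficient recursion of ACC-S-DANE (lines 4-5 of Algorithm 2), assuming
    the equation a^2 * lam = (A_r + a)(B_r + mu * a) and lam > mu."""
    A, B, out = 0.0, 1.0, []
    for _ in range(rounds):
        # (lam - mu) a^2 - (B + mu A) a - A B = 0; take the positive root:
        q = B + mu * A
        a = (q + np.sqrt(q * q + 4.0 * (lam - mu) * A * B)) / (2.0 * (lam - mu))
        A, B = A + a, B + mu * a
        out.append((a, A, B))
    return out

# with mu = 0, A_R grows quadratically in R
for a, A, B in acc_coefficients(lam=1.0, mu=0.0, rounds=5):
    print(f"a = {a:.3f}, A = {A:.3f}")
```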
5 Dynamic Estimation of Similarity Constant by Line Search

One drawback of Algorithms 1 and 2 is that they require knowledge of the similarity constant $\delta$ to choose an appropriate value of $\lambda$. This constant is typically unknown in practice and might be difficult to estimate. One effective solution is to dynamically adjust the coefficient $\lambda$ inside the algorithms by using the classical technique of line search. The basic idea is as follows. The server first picks an arbitrary, sufficiently small constant $\lambda$ as an initial approximation of the unknown correct value $\lambda = 2\delta$. Then, at every round, the server sends the current estimate of $\lambda$ to each client, asking them to approximately solve their local subproblems. After receiving the corresponding local solutions, the server checks a certain inequality based on the obtained information. If this inequality is satisfied, the server accepts the resulting aggregated solution and moves on to the next round while halving $\lambda$ (so as to be more optimistic in the future). Otherwise, it doubles $\lambda$, asks the clients to solve their subproblems with the new value of $\lambda$, and checks the inequality again. The precise versions of Algorithms 1 and 2 with line search for the full-participation setting are presented in Algorithms 3 and 4. Importantly, our adaptive schemes are not mere heuristics but are provably efficient. Specifically, their complexity estimates (in terms of the total number of communication rounds) are exactly the same as those given by Theorems 1 and 6, respectively, up to an extra additive logarithmic term of $\log\frac{2\delta}{\lambda_0}$, where $\lambda_0$ denotes the initial estimate (see Theorems 27 and 28). Another significant advantage of our adaptive algorithms is their ability to exploit local similarity, resulting in much stronger practical performance compared to the methods with fixed $\lambda$. We will demonstrate this in the next section.
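In pseudocode, one round of this backtracking scheme might look as follows; `solve_round` and `accept` are hypothetical stand-ins for a full communication round with a given $\lambda$ and for the acceptance inequality checked by the server (the precise tests are in Algorithms 3 and 4).

```python
def line_search_round(solve_round, accept, lam):
    """Sketch of the backtracking scheme of Section 5.  solve_round(lam) runs one
    communication round with the given lam; accept(result, lam) is a hypothetical
    stand-in for the inequality checked by the server."""
    while True:
        result = solve_round(lam)
        if accept(result, lam):
            return result, lam / 2.0   # accepted: halve lam (be optimistic)
        lam *= 2.0                     # rejected: double lam and re-solve
```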
6 Numerical Experiments

In this section, we illustrate the performance of our methods in numerical experiments. The implementation can be found at https://github.com/mlolab/S-DANE.

Convex quadratic minimization. We first illustrate the properties of our algorithms when minimizing a simple quadratic function: $f(x) := \frac{1}{n}\sum_{i=1}^n f_i(x)$, where $f_i(x) := \frac{1}{m}\sum_{j=1}^m \frac{1}{2}\langle A_{i,j}(x - b_{i,j}), x - b_{i,j}\rangle$ with $b_{i,j} \in \mathbb{R}^d$ and $A_{i,j} \in \mathbb{R}^{d \times d}$. The experimental details can be found in Appendix F.1. From Figure 1, we see that S-DANE converges as fast as DANE in terms of communication rounds, but with far fewer local gradient oracle calls. ACC-S-DANE achieves the best performance among the three methods. We also test S-DANE and DANE with the same fixed number of local steps; the result can be seen in Figure E.1, where S-DANE is again more efficient. Finally, we report the strong performance of the two adaptive variants (Algorithms 3 and 4 with initial $\lambda = 10^{-3}$). We see from Figure 1 that the methods automatically adjust $\lambda$ to adapt to the local second-order dissimilarity. (We use $\frac{\|\nabla f(v_{r+1}) - \nabla f(v_r)\|}{\|v_{r+1} - v_r\|}$ and $\sqrt{\frac{1}{n}\sum_{i=1}^n \frac{\|\nabla h_i(v_{r+1}) - \nabla h_i(v_r)\|^2}{\|v_{r+1} - v_r\|^2}}$ to approximate the local smoothness and dissimilarity, respectively.)

Strongly-convex polyhedron feasibility problem. We now consider the problem of finding a feasible point $x$ inside a polyhedron $P = \bigcap_{i=1}^n P_i$, where $P_i = \{x : \langle a_{i,j}, x\rangle \le b_{i,j},\ j = 1, \dots, m_i\}$ with $a_{i,j} \in \mathbb{R}^d$ and $b_{i,j} \in \mathbb{R}$. Each individual function is defined as $f_i(x) := \frac{n}{m}\sum_{j=1}^{m_i}[\langle a_{i,j}, x\rangle - b_{i,j}]_+^2$, where $\sum_{i=1}^n m_i = m$. We use $m = 10^5$ and $d = 10^3$. We first generate $x^\star$ randomly from a sphere with radius $10^6$. We then follow [55] to generate $(a_{i,j}, b_{i,j})$ such that $x^\star$ is a feasible point of $P$ and the initial point of all optimizers is outside the polyhedron. We choose the best $\lambda$ from $\{10^i\}_{i=-3}^3$. We first consider the full client participation setting and use $n = s = 10$. We compare our proposed methods with GD, DANE with GD [57], SCAFFOLD with the control variate of option I [26], SCAFFNEW [44], FEDPROX with GD [39] and ACC-EXTRAGRADIENT [33].

Figure 2: Comparisons of different algorithms for solving the polyhedron feasibility problem (full participation, and client sampling with $n = 100$ and $s \in \{80, 40, 10\}$).

The result is shown in the first plot of Figure 2, where our proposed methods are consistently the best among all these algorithms. We next experiment with client sampling and use $n = 100$. We decrease the number of sampled clients from $s = 80$ to $s = 10$. In addition to our methods, we also report the performance of SCAFFOLD and FEDPROX with client sampling. From the same figure, we see that the improvement of ACC-S-DANE over S-DANE gradually disappears as $s$ decreases.

Adaptive choice of λ. We consider standard regularized logistic regression: $f(x) = \frac{1}{n}\sum_{i=1}^n f_i(x)$ with $f_i(x) := \frac{n}{M}\sum_{j=1}^{m_i}\log(1 + \exp(-y_{i,j}\langle a_{i,j}, x\rangle)) + \frac{1}{2M}\|x\|^2$, where $(a_{i,j}, y_{i,j}) \in \mathbb{R}^{d+1}$ are features and labels, and $M := \sum_{i=1}^n m_i$ is the total number of data points in the training dataset. We use the ijcnn dataset from LIBSVM [6]. We split the dataset into 10 subsets according to the Dirichlet distribution with $\alpha = 2$ (i.i.d.) and $\alpha = 0.2$ (non-i.i.d.). From Figure 4, Adaptive (ACC-)S-DANE (Algorithms 3 and 4) converge much faster than the other best-tuned algorithms in both cases. (We set the initial $\lambda = 10^{-4}$ for the non-i.i.d. and $\lambda = 10^{-5}$ for the i.i.d. split, respectively.)

Figure 4: Illustration of the impact of the adaptive λ used in (ACC-)S-DANE on the convergence for a regularized logistic regression problem on the ijcnn dataset [6] (i.i.d. and non-i.i.d. splittings, together with the estimated local smoothness and dissimilarity).

Deep learning task. Finally, we consider multi-class classification on CIFAR10 [34] using ResNet-18 [18]. The details can be found in Appendix F.2. From Figure 3, we see that S-DANE (DL) reaches 90% test accuracy within 50 communication rounds, while all the other methods are still below 90% after 80 epochs. The effectiveness of S-DANE for training other deep learning models, such as Transformers, requires further exploration.

Figure 3: Comparison of S-DANE without control variate against other popular optimizers on multi-class classification with CIFAR10 using ResNet-18.

7 Conclusion

We have proposed new federated optimization methods (both basic and accelerated) that simultaneously achieve the best-known communication and local computation complexities. The new methods allow partial participation and arbitrary stochastic local solvers, making them attractive in practice. We further equip both algorithms with line search, and the resulting schemes can adapt to the local dissimilarity without knowing the corresponding similarity constant. However, we assume that each function $f_i$ is $\mu$-strongly convex in all the theorems.
This is stronger than assuming only $\mu$-strong convexity of $f$, which is used in some prior works. Possible directions for future research include the consideration of weaker assumptions as well as empirical and theoretical analyses for non-convex problems.

Acknowledgments. The authors are grateful to Adrien Taylor and Thomas Pethick for the reference to [59]. The authors are thankful to the anonymous reviewers for their valuable comments and suggestions.

References

[1] Durmus Alp Emre Acar, Yue Zhao, Ramon Matas, Matthew Mattina, Paul Whatmough, and Venkatesh Saligrama. Federated learning based on dynamic regularization. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=B7v4QMR6Z9w.

[2] Aleksandr Beznosikov, Martin Takáč, and Alexander Gasnikov. Similarity, compression and local steps: Three pillars of efficient communications for distributed variational inequalities. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=Rvk1wdwz1L.

[3] Léon Bottou, Frank E. Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223-311, 2018.

[4] Sébastien Bubeck et al. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231-357, 2015.

[5] Yair Carmon, Arun Jambulapati, Yujia Jin, and Aaron Sidford. RECAPP: Crafting a more efficient catalyst for convex optimization. In International Conference on Machine Learning, pages 2658-2685. PMLR, 2022.

[6] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1-27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

[7] El Mahdi Chayti and Sai Praneeth Karimireddy. Optimization with access to auxiliary information. arXiv preprint arXiv:2206.00395, 2022.

[8] Laurent Condat and Peter Richtárik. RandProx: Primal-dual optimization algorithms with randomized proximal updates. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=cB4N3G5udUS.

[9] Laurent Condat, Grigory Malinovsky, and Peter Richtárik. TAMUNA: Accelerated federated learning with local training and partial participation. arXiv preprint arXiv:2302.09832, 2023.

[10] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. Advances in Neural Information Processing Systems, 27, 2014.

[11] Olivier Devolder. Stochastic first order methods in smooth convex optimization. CORE Discussion Papers, 2011/70, 2011.

[12] Nikita Doikov and Yurii Nesterov. Contracting proximal methods for smooth convex optimization. SIAM Journal on Optimization, 30(4):3146-3169, 2020.

[13] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(61):2121-2159, 2011. URL http://jmlr.org/papers/v12/duchi11a.html.

[14] Alexandre d'Aspremont, Damien Scieur, Adrien Taylor, et al. Acceleration methods. Foundations and Trends in Optimization, 5(1-2):1-245, 2021.

[15] Liang Gao, Huazhu Fu, Li Li, Yingwen Chen, Ming Xu, and Cheng-Zhong Xu. FedDC: Federated learning with non-iid data via local drift decoupling and correction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10112-10121, 2022.
[16] Michał Grudzień, Grigory Malinovsky, and Peter Richtárik. Can 5th generation local training methods support client sampling? Yes! In International Conference on Artificial Intelligence and Statistics, pages 1055-1092. PMLR, 2023.

[17] Osman Güler. New proximal point algorithms for convex minimization. SIAM Journal on Optimization, 2(4):649-664, 1992. doi: 10.1137/0802032. URL https://doi.org/10.1137/0802032.

[18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770-778, 2016. doi: 10.1109/CVPR.2016.90.

[19] Hadrien Hendrikx, Lin Xiao, Sebastien Bubeck, Francis Bach, and Laurent Massoulie. Statistically preconditioned accelerated gradient method for distributed optimization. In International Conference on Machine Learning, pages 4203-4227. PMLR, 2020.

[20] Zhengmian Hu and Heng Huang. Tighter analysis for ProxSkip. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 13469-13496. PMLR, 2023. URL https://proceedings.mlr.press/v202/hu23a.html.

[21] Anastasiya Ivanova, Dmitry Pasechnyuk, Dmitry Grishchenko, Egor Shulgin, Alexander Gasnikov, and Vladislav Matyukhin. Adaptive catalyst for smooth convex optimization. In International Conference on Optimization and Applications, pages 20-37. Springer, 2021.

[22] Xiaowen Jiang, Anton Rodomanov, and Sebastian U. Stich. Federated optimization with doubly regularized drift correction. arXiv preprint arXiv:2404.08447, 2024.

[23] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 1, NIPS'13, pages 315-323. Curran Associates Inc., 2013.

[24] Peter Kairouz, H. Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Keith Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, Rafael G. L. D'Oliveira, Salim El Rouayheb, David Evans, Josh Gardner, Zachary Garrett, Adrià Gascón, Badih Ghazi, Phillip B. Gibbons, Marco Gruteser, Zaid Harchaoui, Chaoyang He, Lie He, Zhouyuan Huo, Ben Hutchinson, Justin Hsu, Martin Jaggi, Tara Javidi, Gauri Joshi, Mikhail Khodak, Jakub Konečný, Aleksandra Korolova, Farinaz Koushanfar, Sanmi Koyejo, Tancrède Lepoint, Yang Liu, Prateek Mittal, Mehryar Mohri, Richard Nock, Ayfer Özgür, Rasmus Pagh, Mariana Raykova, Hang Qi, Daniel Ramage, Ramesh Raskar, Dawn Song, Weikang Song, Sebastian U. Stich, Ziteng Sun, Ananda Theertha Suresh, Florian Tramèr, Praneeth Vepakomma, Jianyu Wang, Li Xiong, Zheng Xu, Qiang Yang, Felix X. Yu, Han Yu, and Sen Zhao. Advances and open problems in federated learning. Foundations and Trends in Machine Learning, 14(1-2):1-210, 2021. URL https://arxiv.org/pdf/1912.04977.pdf.

[25] Avetik Karagulyan, Egor Shulgin, Abdurakhmon Sadiev, and Peter Richtárik. SPAM: Stochastic proximal point method with momentum variance reduction for non-convex cross-device federated learning. arXiv preprint arXiv:2405.20127, 2024.

[26] Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian U. Stich, and Ananda Theertha Suresh. SCAFFOLD: Stochastic controlled averaging for federated learning. In International Conference on Machine Learning, pages 5132-5143. PMLR, 2020.
[27] Sai Praneeth Karimireddy, Martin Jaggi, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian U. Stich, and Ananda Theertha Suresh. Breaking the centralized barrier for cross-device federated learning. In Advances in Neural Information Processing Systems, 2021.

[28] Ahmed Khaled and Chi Jin. Faster federated optimization under second-order similarity. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=ElC6LYO4MfD.

[29] Ahmed Khaled, Konstantin Mishchenko, and Peter Richtárik. First analysis of local GD on heterogeneous data. arXiv preprint arXiv:1909.04715, 2019.

[30] Donghwan Kim and Jeffrey A. Fessler. Generalizing the optimized gradient method for smooth convex minimization. SIAM Journal on Optimization, 28(2):1920-1950, 2018.

[31] Jakub Konečný, H. Brendan McMahan, Daniel Ramage, and Peter Richtárik. Federated optimization: Distributed machine learning for on-device intelligence. arXiv preprint arXiv:1610.02527, 2016.

[32] Jakub Konečný, H. Brendan McMahan, Felix X. Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016.

[33] Dmitry Kovalev, Aleksandr Beznosikov, Ekaterina Borodich, Alexander Gasnikov, and Gesualdo Scutari. Optimal gradient sliding and its application to optimal distributed optimization under similarity. Advances in Neural Information Processing Systems, 35:33494-33507, 2022.

[34] Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. CIFAR-10 (Canadian Institute for Advanced Research). URL http://www.cs.toronto.edu/~kriz/cifar.html.

[35] Guanghui Lan. Gradient sliding for composite optimization. Mathematical Programming, 159:201-235, 2016.

[36] Guanghui Lan and Yuyuan Ouyang. Accelerated gradient sliding for structured convex optimization. arXiv preprint arXiv:1609.04905, 2016.

[37] Guanghui Lan, Yuyuan Ouyang, and Zhe Zhang. Optimal and parameter-free gradient minimization methods for smooth optimization. arXiv preprint arXiv:2310.12139, 2023.

[38] Bo Li, Mikkel N. Schmidt, Tommy S. Alstrøm, and Sebastian U. Stich. On the effectiveness of partial variance reduction in federated learning with heterogeneous data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3964-3973, 2023.

[39] Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. Proceedings of Machine Learning and Systems, 2:429-450, 2020.

[40] Dachao Lin, Yuze Han, Haishan Ye, and Zhihua Zhang. Stochastic distributed optimization under average second-order similarity: Algorithms and analysis. Advances in Neural Information Processing Systems, 36, 2024.

[41] Hongzhou Lin, Julien Mairal, and Zaid Harchaoui. A universal catalyst for first-order optimization. Advances in Neural Information Processing Systems, 28, 2015.

[42] B. Martinet. Perturbation des méthodes d'optimisation. Applications. RAIRO. Analyse numérique, 12(2):153-171, 1978. URL http://www.numdam.org/item/M2AN_1978__12_2_153_0/.

[43] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pages 1273-1282. PMLR, 2017.

[44] Konstantin Mishchenko, Grigory Malinovsky, Sebastian Stich, and Peter Richtárik. ProxSkip: Yes! Local gradient steps provably lead to communication acceleration! Finally!
In International Conference on Machine Learning, pages 15750-15769. PMLR, 2022.

[45] Konstantin Mishchenko, Rui Li, Hongxiang Fan, and Stylianos Venieris. Federated learning under second-order data heterogeneity, 2024. URL https://openreview.net/forum?id=jkhVrIllKg.

[46] Renato D. C. Monteiro and B. F. Svaiter. An accelerated hybrid proximal extragradient method for convex optimization and its implications to second-order methods. SIAM Journal on Optimization, 23(2):1092-1125, 2013. doi: 10.1137/110833786. URL https://doi.org/10.1137/110833786.

[47] J. J. Moreau. Proximité et dualité dans un espace hilbertien. Bulletin de la Société Mathématique de France, 93:273-299, 1965. doi: 10.24033/bsmf.1625. URL http://www.numdam.org/articles/10.24033/bsmf.1625/.

[48] Yu. Nesterov. Gradient methods for minimizing composite functions. Mathematical Programming, 140(1):125-161, 2013. doi: 10.1007/s10107-012-0629-5. URL https://doi.org/10.1007/s10107-012-0629-5.

[49] Yurii Nesterov. Lectures on Convex Optimization. Springer Publishing Company, Incorporated, 2nd edition, 2018. ISBN 3319915770.

[50] Yurii Nesterov and Sebastian U. Stich. Efficiency of the accelerated coordinate descent method on structured optimization problems. SIAM Journal on Optimization, 27(1):110-123, 2017. doi: 10.1137/16M1060182. URL https://doi.org/10.1137/16M1060182.

[51] Neal Parikh and Stephen Boyd. Proximal algorithms. Foundations and Trends in Optimization, 1(3):127-239, 2014. ISSN 2167-3888. doi: 10.1561/2400000003. URL http://dx.doi.org/10.1561/2400000003.

[52] Kumar Kshitij Patel, Lingxiao Wang, Blake E. Woodworth, Brian Bullins, and Nati Srebro. Towards optimal communication complexity in distributed non-convex optimization. In Advances in Neural Information Processing Systems, volume 35, pages 13316-13328. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/56bd21259e28ebdc4d7e1503733bf421-Paper-Conference.pdf.

[53] Sashank J. Reddi, Jakub Konečný, Peter Richtárik, Barnabás Póczós, and Alex Smola. AIDE: Fast and communication efficient distributed optimization. arXiv preprint arXiv:1608.06879, 2016.

[54] R. Tyrrell Rockafellar. Monotone operators and the proximal point algorithm. SIAM Journal on Control and Optimization, 14(5):877-898, 1976.

[55] Anton Rodomanov, Xiaowen Jiang, and Sebastian Stich. Universality of AdaGrad stepsizes for stochastic optimization: Inexact oracle, acceleration and variance reduction. arXiv preprint arXiv:2406.06398, 2024.

[56] Abdurakhmon Sadiev, Dmitry Kovalev, and Peter Richtárik. Communication acceleration of local gradient methods via an accelerated primal-dual algorithm with an inexact prox. In Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=W72rB0wwLVu.

[57] Ohad Shamir, Nati Srebro, and Tong Zhang. Communication-efficient distributed optimization using an approximate Newton-type method. In International Conference on Machine Learning, pages 1000-1008. PMLR, 2014.

[58] M. V. Solodov and B. F. Svaiter. A unified framework for some inexact proximal point algorithms. Numerical Functional Analysis and Optimization, 22(7-8):1013-1035, 2001. doi: 10.1081/NFA-100108320. URL https://doi.org/10.1081/NFA-100108320.

[59] M. V. Solodov and B. F. Svaiter. A hybrid projection-proximal point algorithm. Journal of Convex Analysis, 6(1):59-70, 1999.
URL http://eudml.org/doc/120958.

[60] Ying Sun, Gesualdo Scutari, and Amir Daneshmand. Distributed optimization based on gradient tracking revisited: Enhancing convergence rate via surrogation. SIAM Journal on Optimization, 32(2):354-385, 2022.

[61] Ye Tian, Gesualdo Scutari, Tianyu Cao, and Alexander Gasnikov. Acceleration in distributed optimization under similarity. In International Conference on Artificial Intelligence and Statistics, pages 5721-5756. PMLR, 2022.

[62] Farshid Varno, Marzie Saghayi, Laya Rafiee Sevyeri, Sharut Gupta, Stan Matwin, and Mohammad Havaei. AdaBest: Minimizing client drift in federated learning via adaptive bias estimation. In European Conference on Computer Vision, pages 710-726. Springer, 2022.

[63] Blake E. Woodworth, Kumar Kshitij Patel, and Nati Srebro. Minibatch vs local SGD for heterogeneous distributed learning. Advances in Neural Information Processing Systems, 33:6281-6292, 2020.

[64] Honglin Yuan and Tengyu Ma. Federated accelerated stochastic gradient descent. Advances in Neural Information Processing Systems, 33:5332-5344, 2020.

A More Related Work

In the first several years of the development of federated learning algorithms, convergence guarantees were focused on the smoothness parameter $L$. The de facto standard algorithm for federated learning is FEDAVG. It reduces the communication frequency by performing multiple SGD steps on the available clients before each communication, which works well in practice [43].
However, in theory, if the heterogeneity among clients is large, then FEDAVG suffers from the so-called client-drift phenomenon [26] and might be worse than centralized mini-batch SGD [63, 29]. Numerous efforts have been made to mitigate this drift. FEDPROX adds an additional regularizer to the subproblem of each client, based on the idea of the centralized proximal-point method, to limit the drift of each client. However, the communication complexity still depends on the heterogeneity. The celebrated algorithm SCAFFOLD applies drift correction (similar to variance reduction) to the update of FEDAVG, which successfully removes the impact of heterogeneity. The idea of drift correction was afterwards employed in many other works [15, 62, 38, 1]. SCAFFNEW uses a special choice of control variate [44] and first illustrated the usefulness of taking standard local gradient steps under strong convexity; it was followed by more advanced methods with refined analyses and additional features such as client sampling and compression [20, 9, 8]. 5GCS [16] uses an approximate proximal-point step at each iteration, derives a convergence rate as good as that of SCAFFNEW, and also supports client sampling. Later, Sadiev et al. [56] proposed APDA WITH INEXACT PROX, which retains the same communication complexity as SCAFFNEW but further provably reduces the local computation complexity. FEDAC [64] applies Nesterov's acceleration in the local steps and shows provably better convergence than FEDAVG under certain assumptions.

More recent works develop algorithms with guarantees that rely on a potentially smaller constant than $L$. SCAFFOLD first illustrated the usefulness of taking local steps for quadratics under bounded Hessian dissimilarity $\delta_{\max}$ [26]. SONATA [60] and its accelerated version [61] prove an explicit communication reduction in terms of $\delta_{\max}$ under strong convexity. MIME [27] and CE-LGD [52] work in the non-convex setting and show communication improvements in terms of $\delta_{\max}$; the latter achieves the min-max optimal rates. ACCELERATED EXTRAGRADIENT SLIDING [33] applies the gradient sliding technique [33] and shows a communication reduction in terms of $\delta_{\max}$ for strongly convex and convex functions; its local computation is also efficient, without a logarithmic dependence on the target accuracy. DANE with inexact local solvers [57, 22, 53] has recently been shown to achieve a communication dependency on $\delta$ under convexity and on $\delta_{\max}$ under non-convexity. For solving convex problems, its local computation efficiency depends on the target accuracy $\varepsilon$; otherwise, the accuracy condition for the subproblem has to tighten across the communication rounds. Hendrikx et al. [19] proposed SPAG, an accelerated method, and prove a better uniform concentration bound on the condition number when solving strongly convex problems. SVRP and CATALYZED SVRP [28] transfer the idea of the centralized proximal-point method to the distributed setting and achieve communication complexity (with a different notion) w.r.t. $\delta$. Lin et al. [40] further improve these two methods, either with a better rate or with weaker assumptions, based on the ACCELERATED EXTRAGRADIENT SLIDING method. Beznosikov et al. [2] use compression to reduce the number of bits required to communicate and consider the more general problem of variational inequalities. Under the same settings but for non-convex optimization, SABER [45] achieves a communication complexity reduction with better dependency on $\delta_{\max}$ and $n$. Karagulyan et al.
Karagulyan et al. [25] proposed SPAM, which allows partial client participation.
Remark 7. Khaled and Jin [28] and Lin et al. [40] consider the total amount of information transmitted between the server and clients as the main metric, which is similar to reducing the total number of stochastic oracle calls in centralized learning settings. This is a particularly meaningful setting if the server prefers to, or has to, receive/transmit vectors one by one and can set up communications very fast. The term client sampling in these works refers to sampling one client to do the local computation; however, all the clients still need to participate in the communication from time to time to provide the full gradient information. This is orthogonal to the setup of this work, since we assume each device can do its computation in parallel. In scenarios where the number of devices is so large that receiving all the updates becomes problematic, we consider instead the standard partial participation setting.

B Technical Preliminaries
B.1 Basic Definitions
We use the following definitions throughout the paper.
Definition 5. A differentiable function f : R^d → R is called µ-convex for some µ ≥ 0 if for all x, y ∈ R^d,
f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (µ/2)‖x − y‖². (B.1)
If f is µ-convex, then, for any x, y ∈ R^d, we have (Nesterov [49], Theorem 2.1.10):
µ‖x − y‖ ≤ ‖∇f(x) − ∇f(y)‖. (B.2)
Definition 6. A differentiable function f : R^d → R is called L-smooth for some L ≥ 0 if for all x, y ∈ R^d,
‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖. (B.3)
If f is L-smooth, then, for any x, y ∈ R^d, we have (Nesterov [49], Lemma 1.2.3)
f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (L/2)‖y − x‖². (B.4)
Lemma 8 (Nesterov [49], Theorem 2.1.5). Let f : R^d → R be convex and L-smooth. Then, for any x, y ∈ R^d, we have
(1/L)‖∇f(x) − ∇f(y)‖² ≤ ⟨∇f(x) − ∇f(y), x − y⟩. (B.5)

B.2 Useful Lemmas
We frequently use the following helpful lemmas in the proofs.
Lemma 9. For any x, y ∈ R^d and any γ > 0, we have
⟨x, y⟩ ≤ (γ/2)‖x‖² + (1/(2γ))‖y‖², (B.6)
‖x + y‖² ≤ (1 + γ)‖x‖² + (1 + 1/γ)‖y‖². (B.7)
Lemma 10 (Jiang et al. [22], Lemma 14). Let {x_i}_{i=1}^n be a set of vectors in R^d and let x̄ := (1/n)Σ_{i=1}^n x_i. Let v ∈ R^d be an arbitrary vector. Then,
(1/n)Σ_{i=1}^n ‖x_i − v‖² = ‖x̄ − v‖² + (1/n)Σ_{i=1}^n ‖x_i − x̄‖². (B.8)
Lemma 11. Let (F_k)_{k≥1} and (D_k)_{k≥0} be two non-negative sequences such that, for any k ≥ 0, it holds that F_{k+1} + D_{k+1} ≤ q D_k + ε, where q ∈ (0, 1] and ε ≥ 0 are some constants. Then, for all K ≥ 1 and S_K := Σ_{k=1}^K 1/q^k, we have
(1/S_K) Σ_{k=1}^K F_k/q^k + ((1 − q)/(1 − q^K)) D_K ≤ ((1 − q)/((1/q)^K − 1)) D_0 + ε.
Proof. Indeed, for any k ≥ 0, we have F_{k+1}/q^{k+1} + D_{k+1}/q^{k+1} ≤ D_k/q^k + ε/q^{k+1}. Summing up from k = 0 to k = K − 1, we get Σ_{k=1}^K F_k/q^k + D_K/q^K ≤ D_0 + S_K ε. Dividing both sides by S_K and substituting S_K = (1/(1 − q))((1/q)^K − 1), we get the claim.
Lemma 12 (cf. Lemma 2.2.4 in [49]). Let (A_r)_{r≥0} be a non-negative non-decreasing sequence such that A_0 = 0 and, for any r ≥ 0, A_{r+1}(1 + µA_r) ≤ c(A_{r+1} − A_r)², where c > 0 and µ ≥ 0 are some constants. If µ ≤ 4c, then for any R ≥ 0, we have
A_R ≥ (1/(4µ))[(1 + √(µ/(4c)))^R − (1 − √(µ/(4c)))^R]². (B.9)
Otherwise, for any R ≥ 1, it holds that
A_R ≥ (1/(4c))(1 + √(µ/(4c)))^{2(R−1)}. (B.10)
Proof. Denote C_r := √(µA_r). For any r ≥ 0, it holds that
µC²_{r+1}(1 + C²_r) ≤ c(C²_{r+1} − C²_r)² = c[(C_{r+1} − C_r)(C_{r+1} + C_r)]² ≤ c[2(C_{r+1} − C_r)C_{r+1}]² = 4c(C_{r+1} − C_r)²C²_{r+1}.
Therefore, for any r ≥ 0:
C_{r+1} ≥ C_r + √(µ/(4c)) √(1 + C²_r).
When µ ≤ 4c, by induction, one can show that, for any R ≥ 0 (see the proof of Theorem 1 in [50] for details):
C_R ≥ (1/2)[(1 + √(µ/(4c)))^R − (1 − √(µ/(4c)))^R].
When µ > 4c, we have C_{r+1} ≥ C_r + √(µ/(4c)) C_r = (1 + √(µ/(4c)))C_r. It follows that, for any R ≥ 1,
C_R ≥ (1 + √(µ/(4c)))^{R−1} C_1, with C_1 ≥ √(µ/(4c)) (since C_0 = 0).
Plugging in the definition of C_R, we get the claims.
Lemma 13. Let {x_i}_{i=1}^n be vectors in R^d with n ≥ 2. Let s ∈ [n] and let S ∈ ([n] choose s) be sampled uniformly at random without replacement. Let x̄ := (1/n)Σ_{i=1}^n x_i, ζ² := (1/n)Σ_{i=1}^n ‖x_i − x̄‖², and x̄_S := (1/s)Σ_{j∈S} x_j. Then,
E[x̄_S] = x̄ and E[‖x̄_S − x̄‖²] = ((n − s)/(s(n − 1))) ζ². (B.11)
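Before the proof, the identity (B.11) is easy to check numerically; a minimal sanity-check sketch (sizes and seed are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
n, s, d, trials = 10, 4, 3, 100_000
x = rng.normal(size=(n, d))
xbar = x.mean(axis=0)
zeta2 = np.mean(np.sum((x - xbar) ** 2, axis=1))   # ζ² = (1/n) Σ_i ‖x_i − x̄‖²

# Monte Carlo estimate of E‖x̄_S − x̄‖² over uniform subsets S of size s
est = np.mean([
    np.sum((x[rng.choice(n, size=s, replace=False)].mean(axis=0) - xbar) ** 2)
    for _ in range(trials)
])
print(est, (n - s) / (s * (n - 1)) * zeta2)  # the two values should agree closely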
Proof. Let (n choose m) := n!/(m!(n − m)!) denote the binomial coefficient for any n ≥ m ≥ 1. By the definition of x̄_S, we have
x̄_S = (1/s)Σ_{i=1}^n 1[i ∈ S] x_i,
where 1[E] denotes the {0, 1}-indicator of the event E. Taking the expectation on both sides, we get E[x̄_S] = (1/s)Σ_{i=1}^n Pr[i ∈ S] x_i = (1/s)(s/n)Σ_{i=1}^n x_i = x̄. Further,
E[‖x̄_S − x̄‖²] = E[(1/s²)Σ_{i,j∈S}⟨x_i − x̄, x_j − x̄⟩]
= (1/s²) E[Σ_{i∈S}‖x_i − x̄‖² + Σ_{i,j∈S, i≠j}⟨x_i − x̄, x_j − x̄⟩]
= (1/s²)[Σ_{i=1}^n Pr[i ∈ S]‖x_i − x̄‖² + Σ_{i,j∈[n], i≠j} Pr[i, j ∈ S]⟨x_i − x̄, x_j − x̄⟩]
= (1/s²)[(s/n)Σ_{i=1}^n ‖x_i − x̄‖² + (s(s − 1)/(n(n − 1)))Σ_{i,j∈[n], i≠j}⟨x_i − x̄, x_j − x̄⟩].
Note that
Σ_{i,j∈[n], i≠j}⟨x_i − x̄, x_j − x̄⟩ = ‖Σ_{i=1}^n (x_i − x̄)‖² − Σ_{i=1}^n ‖x_i − x̄‖² = −nζ².
Therefore,
E[‖x̄_S − x̄‖²] = ζ²/s − ((s − 1)/(s(n − 1)))ζ² = ((n − s)/(s(n − 1)))ζ².
Lemma 14. Suppose {f_i}_{i=1}^n satisfy δ̄_s-ED of size s ∈ [n] and ζ-BGV, with n ≥ 2. Let f := (1/n)Σ_{i=1}^n f_i and f_S := (1/s)Σ_{i∈S} f_i, where S ∈ ([n] choose s) is sampled uniformly at random without replacement. Further, let y ∈ R^d be a fixed point, and let x_S ∈ R^d be a random point defined by a deterministic function of S. Then, for any γ > 0, it holds that
E_S[f(x_S) − f_S(x_S)] ≤ (γ/2)((n − s)/(s(n − 1)))ζ² + (1/(2γ) + δ̄_s/2) E_S[‖x_S − y‖²]. (B.12)
Proof. Let h_S := f − f_S. Since {f_i} satisfy δ̄_s-ED (Definition 4), we have, in view of inequality (B.4),
h_S(x_S) ≤ h_S(y) + ⟨∇h_S(y), x_S − y⟩ + (δ̄_s/2)‖x_S − y‖²
(B.6) ≤ h_S(y) + (γ/2)‖∇h_S(y)‖² + (1/(2γ))‖x_S − y‖² + (δ̄_s/2)‖x_S − y‖².
Rearranging and taking the expectation on both sides, we get, for any γ > 0,
E_S[h_S(x_S) − h_S(y)] ≤ (γ/2)E_S[‖∇h_S(y)‖²] + (1/(2γ) + δ̄_s/2)E_S[‖x_S − y‖²] ≤ (γ/2)((n − s)/(s(n − 1)))ζ² + (1/(2γ) + δ̄_s/2)E_S[‖x_S − y‖²],
where the last inequality is due to (B.11) (applied to the vectors {∇f_i(y)}) together with ζ-BGV. Using the fact that E_S[f(y) − f_S(y)] = 0, we get the claim.
Lemma 15. Suppose {f_i}_{i=1}^n satisfy δ_s-SOD of size s ∈ [n]. Let f_S := (1/s)Σ_{i∈S} f_i, where s ∈ [n] and S ∈ ([n] choose s). Let v ∈ R^d be a fixed point, λ > δ_s, and let
F_i(x) := f_i(x) + ⟨∇h^S_i(v), x⟩ + (λ/2)‖x − v‖², where h^S_i := f_S − f_i.
Let {x_i}_{i∈S} be a set of points in R^d (such that x_i ≈ argmin_x F_i(x) in the sense that ‖∇F_i(x_i)‖ is sufficiently small), and let x̄_S := (1/s)Σ_{i∈S} x_i. Then,
(1/s)Σ_{i∈S}⟨∇f_i(x_i) + ∇h^S_i(x̄_S), v − x_i⟩ − (1/(2λ))‖(1/s)Σ_{i∈S}∇f_i(x_i)‖² ≥ ((λ − δ_s)/2)(1/s)Σ_{i∈S}‖v − x_i‖² − (1/λ)(1/s)Σ_{i∈S}‖∇F_i(x_i)‖².
Proof. Using the definition of F_i, we get ∇F_i(x_i) = ∇f_i(x_i) + ∇h^S_i(v) + λ(x_i − v). Therefore,
⟨∇f_i(x_i) + ∇h^S_i(x̄_S), v − x_i⟩ = λ‖v − x_i‖² + ⟨∇h^S_i(x̄_S) − ∇h^S_i(v), v − x_i⟩ + ⟨∇F_i(x_i), v − x_i⟩.
Taking the average over i ∈ S in the gradient identity, and using Σ_{i∈S}∇h^S_i = 0, we have
(1/s)Σ_{i∈S}∇f_i(x_i) = λ(v − x̄_S) + (1/s)Σ_{i∈S}∇F_i(x_i). (B.13)
Hence,
‖(1/s)Σ∇f_i(x_i)‖² = λ²‖v − x̄_S‖² + 2λ⟨(1/s)Σ∇F_i(x_i), v − x̄_S⟩ + ‖(1/s)Σ∇F_i(x_i)‖².
It follows that
(1/s)Σ⟨∇f_i(x_i) + ∇h^S_i(x̄_S), v − x_i⟩ − (1/(2λ))‖(1/s)Σ∇f_i(x_i)‖²
= λ(1/s)Σ‖v − x_i‖² − (λ/2)‖v − x̄_S‖² + (1/s)Σ⟨∇h^S_i(x̄_S) − ∇h^S_i(v), x̄_S − x_i⟩ + (1/s)Σ⟨∇F_i(x_i), x̄_S − x_i⟩ − (1/(2λ))‖(1/s)Σ∇F_i(x_i)‖²
(B.8) = (λ/2)‖v − x̄_S‖² + λ(1/s)Σ‖x_i − x̄_S‖² + (1/s)Σ⟨∇h^S_i(x̄_S) − ∇h^S_i(v), x̄_S − x_i⟩ + (1/s)Σ⟨∇F_i(x_i), x̄_S − x_i⟩ − (1/(2λ))‖(1/s)Σ∇F_i(x_i)‖²,
where in the first equality we used the fact that (1/s)Σ_{i∈S}[∇h^S_i(x̄_S) − ∇h^S_i(v)] = 0 (which allows replacing v − x_i by x̄_S − x_i). By (B.6) and δ_s-SOD (inequality (5)),
(1/s)Σ⟨∇h^S_i(x̄_S) − ∇h^S_i(v), x̄_S − x_i⟩ ≥ −(1/(2δ_s))(1/s)Σ‖∇h^S_i(x̄_S) − ∇h^S_i(v)‖² − (δ_s/2)(1/s)Σ‖x_i − x̄_S‖² ≥ −(δ_s/2)‖x̄_S − v‖² − (δ_s/2)(1/s)Σ‖x_i − x̄_S‖²,
and, similarly,
(1/s)Σ⟨∇F_i(x_i), x̄_S − x_i⟩ ≥ −(1/(2λ))(1/s)Σ‖∇F_i(x_i)‖² − (λ/2)(1/s)Σ‖x_i − x̄_S‖².
Combining the above and using ‖(1/s)Σ∇F_i(x_i)‖² ≤ (1/s)Σ‖∇F_i(x_i)‖², we obtain
… ≥ ((λ − δ_s)/2)‖v − x̄_S‖² + ((λ − δ_s)/2)(1/s)Σ‖x_i − x̄_S‖² − (1/λ)(1/s)Σ‖∇F_i(x_i)‖²
(B.8) = ((λ − δ_s)/2)(1/s)Σ_{i∈S}‖v − x_i‖² − (1/λ)(1/s)Σ_{i∈S}‖∇F_i(x_i)‖².
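A single round of S-DANE built around the subproblem F_i from Lemma 15 can be sketched as follows. This is a minimal full-participation simulation with µ = 0: the GD local solver, its stepsize, and the step cap are illustrative assumptions, while the stopping rule is the one used in Algorithm 3 and Appendix F.1, and the prox-center update is the explicit µ = 0 solution of the v-update (cf. line 13 of Algorithm 3):

import numpy as np

def sdane_round(v, grads, lam, lr, max_steps=1000):
    # grads[i](z) returns ∇f_i(z); one communication collects {∇f_i(v_r)}.
    grads_at_v = [g(v) for g in grads]
    grad_f_v = np.mean(grads_at_v, axis=0)            # ∇f(v_r)
    xs = []
    for g, g_v in zip(grads, grads_at_v):
        shift = grad_f_v - g_v                        # ∇h_i(v_r) = ∇f(v_r) − ∇f_i(v_r)
        z = v.copy()
        for _ in range(max_steps):
            grad_F = g(z) + shift + lam * (z - v)     # ∇F_{i,r}(z)
            if np.linalg.norm(grad_F) <= 0.5 * lam * np.linalg.norm(z - v):
                break                                 # inexactness criterion met
            z -= lr * grad_F                          # local GD step
        xs.append(z)
    x_next = np.mean(xs, axis=0)                      # aggregated model x_{r+1}
    v_next = v - np.mean([g(x) for g, x in zip(grads, xs)], axis=0) / lam
    return x_next, v_next                             # stabilized prox-center v_{r+1}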
C Proofs for S-DANE (Algorithm 1)
C.1 One-Step Recurrence
Lemma 16. Consider Algorithm 1. Let f_i : R^d → R be µ-convex with µ ≥ 0 for any i ∈ [n]. Assume that {f_i}_{i=1}^n have δ_s-SOD. Then, for any r ≥ 0, we have
(1/λ)[f_{S_r}(x_{r+1}) − f_{S_r}(x*)] + ((1 + µ/λ)/2)‖v_{r+1} − x*‖² ≤ (1/2)‖v_r − x*‖² − ((1 − δ_s/λ)/2)(1/s)Σ_{i∈S_r}‖v_r − x_{i,r+1}‖² + (1/λ²)(1/s)Σ_{i∈S_r}‖∇F_{i,r}(x_{i,r+1})‖².
Proof. By µ-convexity of f_i, for any r ≥ 0, it holds that
(1/λ)f_{S_r}(x*) + (1/2)‖v_r − x*‖² = (1/λ)(1/s)Σ_{i∈S_r} f_i(x*) + (1/2)‖v_r − x*‖²
≥ (1/λ)(1/s)Σ_{i∈S_r}[f_i(x_{i,r+1}) + ⟨∇f_i(x_{i,r+1}), x* − x_{i,r+1}⟩ + (µ/2)‖x_{i,r+1} − x*‖²] + (1/2)‖v_r − x*‖².
Recall that v_{r+1} is the minimizer of the final expression in x*. This expression is a (1 + µ/λ)-strongly convex function of x*. We can then estimate it by:
≥ (1/λ)(1/s)Σ_{i∈S_r}[f_i(x_{i,r+1}) + ⟨∇f_i(x_{i,r+1}), v_{r+1} − x_{i,r+1}⟩ + (µ/2)‖x_{i,r+1} − v_{r+1}‖²] + (1/2)‖v_r − v_{r+1}‖² + ((1 + µ/λ)/2)‖v_{r+1} − x*‖².
Using the convexity of f_i and dropping the non-negative term (µ/(2λ))(1/s)Σ_{i∈S_r}‖x_{i,r+1} − v_{r+1}‖², we further get
(1/λ)f_{S_r}(x*) + (1/2)‖v_r − x*‖²
≥ (1/λ)(1/s)Σ_{i∈S_r}[f_i(x_{r+1}) + ⟨∇f_i(x_{r+1}), x_{i,r+1} − x_{r+1}⟩] + ((1 + µ/λ)/2)‖v_{r+1} − x*‖² + (1/λ)(1/s)Σ_{i∈S_r}⟨∇f_i(x_{i,r+1}), v_r − x_{i,r+1}⟩ + (1/λ)⟨(1/s)Σ_{i∈S_r}∇f_i(x_{i,r+1}), v_{r+1} − v_r⟩ + (1/2)‖v_{r+1} − v_r‖²
(B.6) ≥ (1/λ)f_{S_r}(x_{r+1}) + ((1 + µ/λ)/2)‖v_{r+1} − x*‖² + (1/λ)[(1/s)Σ_{i∈S_r}⟨∇f_i(x_{r+1}), x_{i,r+1} − x_{r+1}⟩ + (1/s)Σ_{i∈S_r}⟨∇f_i(x_{i,r+1}), v_r − x_{i,r+1}⟩ − (1/(2λ))‖(1/s)Σ_{i∈S_r}∇f_i(x_{i,r+1})‖²].
Denote h^r_i := f_{S_r} − f_i. Note that
Σ_{i∈S_r}⟨∇f_i(x_{r+1}), x_{i,r+1} − x_{r+1}⟩ = Σ_{i∈S_r}⟨∇h^r_i(x_{r+1}), v_r − x_{i,r+1}⟩,
where we have used x_{r+1} = (1/s)Σ_{i∈S_r} x_{i,r+1} and Σ_{i∈S_r}∇h^r_i(x_{r+1}) = 0. It follows that
(1/λ)f_{S_r}(x*) + (1/2)‖v_r − x*‖² ≥ (1/λ)f_{S_r}(x_{r+1}) + ((1 + µ/λ)/2)‖v_{r+1} − x*‖² + (1/λ)[(1/s)Σ_{i∈S_r}⟨∇f_i(x_{i,r+1}) + ∇h^r_i(x_{r+1}), v_r − x_{i,r+1}⟩ − (1/(2λ))‖(1/s)Σ_{i∈S_r}∇f_i(x_{i,r+1})‖²].
We now apply Lemma 15 (with x_i = x_{i,r+1}, v = v_r, S = S_r and x̄_S = x_{r+1}) to get
(1/s)Σ_{i∈S_r}⟨∇f_i(x_{i,r+1}) + ∇h^r_i(x_{r+1}), v_r − x_{i,r+1}⟩ − (1/(2λ))‖(1/s)Σ_{i∈S_r}∇f_i(x_{i,r+1})‖² ≥ ((λ − δ_s)/2)(1/s)Σ_{i∈S_r}‖v_r − x_{i,r+1}‖² − (1/λ)(1/s)Σ_{i∈S_r}‖∇F_{i,r}(x_{i,r+1})‖².
Substituting this lower bound into the previous display, we get the claim.

C.2 Full Client Participation (Proof of Theorem 1)
Proof. Applying Lemma 16 (with S_r = [n], so that f_{S_r} = f) and using Σ_{i=1}^n‖∇F_{i,r}(x_{i,r+1})‖² ≤ δ²Σ_{i=1}^n‖x_{i,r+1} − v_r‖² and λ = 2δ, for any r ≥ 0, we have
(1/λ)[f(x_{r+1}) − f*] + ((1 + µ/λ)/2)‖v_{r+1} − x*‖² ≤ (1/2)‖v_r − x*‖² − ((1 − δ/λ)/2)(1/n)Σ_{i=1}^n‖v_r − x_{i,r+1}‖² + (1/λ²)(1/n)Σ_{i=1}^n‖∇F_{i,r}(x_{i,r+1})‖² ≤ (1/2)‖v_r − x*‖² − (1/4 − 1/4)(1/n)Σ_{i=1}^n‖v_r − x_{i,r+1}‖² = (1/2)‖v_r − x*‖².
Rearranging, we get
(2q/λ)[f(x_{r+1}) − f*] + ‖v_{r+1} − x*‖² ≤ q‖v_r − x*‖², where q := 1/(1 + µ/λ).
Applying Lemma 11 with ε = 0 and using convexity of f, we obtain
(2q/λ)[f(x̄_R) − f*] + ((1 − q)/(1 − q^R))‖v_R − x*‖² ≤ ((1 − q)/((1/q)^R − 1))‖v_0 − x*‖² = ((1 − q)/((1/q)^R − 1))D².
Dropping the non-negative term ((1 − q)/(1 − q^R))‖v_R − x*‖² and rearranging, we get
f(x̄_R) − f* ≤ ((1 − q)λ)/(2q) · D²/((1/q)^R − 1).
Plugging in the choice of λ and the definition of q, we get the claim.
Corollary 17. Under the same setting as in Theorem 1, to achieve f(x̄_R) − f* ≤ ε, we need at most the following number of communication rounds:
R = O( ((µ + δ)/µ) log(1 + µD²/ε) ).
Proof. Using the fact that (1 + q)^k ≥ exp(qk/(1 + q)) for any q ≥ 0 and k > 0, we get
f(x̄_R) − f* ≤ µD²/(2[(1 + µ/(2δ))^R − 1]) ≤ µD²/(2[exp(µR/(µ + 2δ)) − 1]) ≤ ε.
Rearranging, we get the claim.
Proof of Corollary 3. To achieve Σ_{i=1}^n‖∇F_{i,r}(x_{i,r+1})‖² ≤ δ²Σ_{i=1}^n‖x_{i,r+1} − v_r‖², it is sufficient, for each i ∈ [n], to ensure that ‖∇F_{i,r}(x_{i,r+1})‖ ≤ δ‖x_{i,r+1} − v_r‖. Let x*_{i,r} := argmin_x F_{i,r}(x). Since
‖x_{i,r+1} − v_r‖ ≥ ‖v_r − x*_{i,r}‖ − ‖x_{i,r+1} − x*_{i,r}‖ (B.2) ≥ ‖v_r − x*_{i,r}‖ − (1/λ)‖∇F_{i,r}(x_{i,r+1})‖
and λ = 2δ, it suffices to ensure that
‖∇F_{i,r}(x_{i,r+1})‖ ≤ (2δ/3)‖v_r − x*_{i,r}‖ (C.1)
for any i ∈ [n]. According to Theorem 2 from [33] (or Theorem 3.2 from [37]), there exists an algorithm such that, when started from the point v_r, after K queries to ∇F_{i,r} it generates a point x_{i,r+1} with
‖∇F_{i,r}(x_{i,r+1})‖ ≤ O( (L + λ)‖v_r − x*_{i,r}‖/K² ) = O( L‖v_r − x*_{i,r}‖/K² )
(recall that δ ≤ L). Setting K = Θ(√(L/δ)) concludes the proof.
Remark 18. Recall that F_{i,r} is (L + λ)-smooth and (µ + λ)-convex, λ = Θ(δ) and δ ≤ L. Suppose worker i uses standard GD to approximately solve the local subproblem at round r, starting at v_r, for K steps, and returns the last point. Then, by Lemma 19, we have
‖∇F_{i,r}(x_{i,r+1})‖² ≤ O( (L + λ)²‖v_r − x*_{i,r}‖²/K² ).
To satisfy the accuracy condition (C.1), it is sufficient to make K = Θ(L/δ) local steps. Suppose instead that worker i uses the fast gradient method; then, by Theorem 3.18 from [4],
‖∇F_{i,r}(x_{i,r+1})‖² ≤ 2(L + λ)[F_{i,r}(x_{i,r+1}) − F_{i,r}(x*_{i,r})] ≤ O( (L + λ)² exp(−√((µ + λ)/(L + λ)) K)‖v_r − x*_{i,r}‖² ).
To satisfy the accuracy condition (C.1), it suffices to make K = Θ(√((L + δ)/(µ + δ)) log((L + δ)/δ)) = Θ(√(L/(µ + δ)) log(L/δ)) gradient oracle calls.
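The fast-gradient local solver from Remark 18 can be sketched as follows (a standard constant-momentum variant of Nesterov's method for a µ_F-convex and L_F-smooth subproblem; the function handle grad_F and the explicit iteration count K are illustrative assumptions):

import numpy as np

def solve_local_agd(grad_F, z0, mu_F, L_F, K):
    # Fast gradient method with constant momentum for strongly convex problems.
    kappa = L_F / mu_F
    beta = (np.sqrt(kappa) - 1.0) / (np.sqrt(kappa) + 1.0)  # momentum coefficient
    z_prev = z0.copy()
    z = z0.copy()
    for _ in range(K):
        y = z + beta * (z - z_prev)      # extrapolation
        z_prev = z
        z = y - grad_F(y) / L_F          # gradient step at the extrapolated point
    return z

For the subproblem F_{i,r}, one would take µ_F = µ + λ and L_F = L + λ.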
Lemma 19 (Theorem 2.2.5 in [49]). Let f : R^d → R be a convex and L-smooth function. Consider the gradient method with constant stepsize: x_{k+1} = x_k − (1/L)∇f(x_k), k ≥ 0, started from some x_0 ∈ R^d. Then, for any K ≥ 1, it holds that ‖∇f(x_K)‖ ≤ O( L‖x_0 − x*‖/K ).
Proof. By Theorem 2.2.5 in [49], we have
min_{k∈[K]} ‖∇f(x_k)‖ ≤ O( L‖x_0 − x*‖/K ).
It remains to note that the algorithm generates non-increasing ‖∇f(x_k)‖, since
‖∇f(x_{k+1})‖² = ‖∇f(x_{k+1}) − ∇f(x_k) + ∇f(x_k)‖² = ‖∇f(x_{k+1}) − ∇f(x_k)‖² + 2⟨∇f(x_{k+1}) − ∇f(x_k), ∇f(x_k)⟩ + ‖∇f(x_k)‖²
= ‖∇f(x_{k+1}) − ∇f(x_k)‖² − 2L⟨∇f(x_k) − ∇f(x_{k+1}), x_k − x_{k+1}⟩ + ‖∇f(x_k)‖²
(B.5) ≤ ‖∇f(x_k)‖² − ‖∇f(x_{k+1}) − ∇f(x_k)‖² ≤ ‖∇f(x_k)‖².

C.3 Partial Client Participation (Proof of Theorem 5)
The following theorem is a slight extension of Theorem 5, which includes the use of stochastic local solvers.
Theorem 20. Consider Algorithm 1. Let f_i : R^d → R be µ-convex with µ ≥ 0 for any i ∈ [n], and let n ≥ 2. Assume that {f_i}_{i=1}^n have δ_s-SOD, δ̄_s-ED and ζ-BGV. Let λ = 4ζ²(n − s)/(s(n − 1)ε) + 2(δ_s + δ̄_s). For any r ≥ 0, suppose we have
(1/s)Σ_{i∈S_r} E_{ξ_{i,r}}[‖∇F_{i,r}(x_{i,r+1})‖²] ≤ (λ²/4)(1/s)Σ_{i∈S_r} E_{ξ_{i,r}}[‖x_{i,r+1} − v_r‖²] + λε/4 (C.3)
for some ε > 0, where ξ_{i,r} denotes the randomness coming from device i when solving its subproblem at round r. We assume that {ξ_{i,r}} are independent random variables. To reach E[f(x̄_R) − f*] ≤ ε, we need at most the following number of communication rounds:
R = Θ( ( (δ_s + δ̄_s + µ)/µ + ζ²(n − s)/(s(n − 1)µε) ) log(1 + µD²/ε) ),
which for µ = 0 reads R = Θ( (δ_s + δ̄_s)D²/ε + ζ²(n − s)D²/(s(n − 1)ε²) ),
where x̄_R := Σ_{r=1}^R p^r x_r / Σ_{r=1}^R p^r, p := 1 + µ/λ, and D := ‖x_0 − x*‖.
Proof. According to Lemma 16, we have, for any r ≥ 0,
(1/λ)[f_{S_r}(x_{r+1}) − f_{S_r}(x*)] + ((1 + µ/λ)/2)‖v_{r+1} − x*‖² ≤ (1/2)‖v_r − x*‖² − ((1 − δ_s/λ)/2)(1/s)Σ_{i∈S_r}‖v_r − x_{i,r+1}‖² + (1/λ²)(1/s)Σ_{i∈S_r}‖∇F_{i,r}(x_{i,r+1})‖².
According to Lemma 14 (with S = S_r, x_S = x_{r+1} and y = v_r), for any γ > 0, we have
E_{S_r}[f(x_{r+1}) − f_{S_r}(x_{r+1})] ≤ (γ/2)((n − s)/(s(n − 1)))ζ² + (1/(2γ) + δ̄_s/2) E_{S_r}[‖x_{r+1} − v_r‖²] ≤ (γ/2)((n − s)/(s(n − 1)))ζ² + (1/(2γ) + δ̄_s/2) E_{S_r}[(1/s)Σ_{i∈S_r}‖x_{i,r+1} − v_r‖²].
Adding (1/λ)f(x_{r+1}) to both sides of the first display, taking the expectation over S_r on both sides (note E_{S_r}[f_{S_r}(x*)] = f*), substituting the previous upper bound and setting γ = s(n − 1)ε/(2ζ²(n − s)), we get
(1/λ)E_{S_r}[f(x_{r+1}) − f*] + ((1 + µ/λ)/2)E_{S_r}[‖v_{r+1} − x*‖²] ≤ (1/2)‖v_r − x*‖² − (1/λ)((λ − δ_s − δ̄_s)/2 − 1/(2γ))E_{S_r}[(1/s)Σ_{i∈S_r}‖x_{i,r+1} − v_r‖²] + (1/λ²)E_{S_r}[(1/s)Σ_{i∈S_r}‖∇F_{i,r}(x_{i,r+1})‖²] + ε/(4λ).
Denote all the randomness {ξ_{i,r}}_{i∈S_r} by ξ_r. Since ξ_{i,r} is independent of the choice of S_r for any i ∈ [n], taking the expectation over ξ_r on both sides of the previous display and using our assumption (C.3), we obtain
(1/λ)E_{S_r,ξ_r}[f(x_{r+1}) − f*] + ((1 + µ/λ)/2)E_{S_r,ξ_r}[‖v_{r+1} − x*‖²] ≤ (1/2)‖v_r − x*‖² − (1/λ)((λ − δ_s − δ̄_s)/2 − 1/(2γ) − λ/4)E[(1/s)Σ_{i∈S_r}‖x_{i,r+1} − v_r‖²] + ε/(2λ).
By our choice of λ, the coefficient in front of E[(1/s)Σ_{i∈S_r}‖x_{i,r+1} − v_r‖²] is non-positive. Taking the full expectation on both sides, we get
(1/λ)E[f(x_{r+1}) − f*] + ((1 + µ/λ)/2)E[‖v_{r+1} − x*‖²] ≤ (1/2)E[‖v_r − x*‖²] + ε/(2λ).
According to Lemma 11 and the fact that ‖v_0 − x*‖ = D, we get
(2/(µ + λ))E[f(x̄_R) − f*] + ((1 − q)/(1 − q^R))E[‖v_R − x*‖²] ≤ ((1 − q)/((1/q)^R − 1))D² + ε/(µ + λ),
where q := 1/(1 + µ/λ). Rearranging and dropping the non-negative term E[‖v_R − x*‖²], we get, for any R ≥ 1,
E[f(x̄_R) − f*] ≤ µD²/(2[(1 + µ/λ)^R − 1]) + ε/2 ≤ µD²/(2[exp(µR/(µ + λ)) − 1]) + ε/2.
To reach ε-accuracy, it suffices to let µD²/(2[exp(µR/(µ + λ)) − 1]) ≤ ε/2. Rearranging gives the claim.
C.3.1 Stochastic Local Solver
Note that there exist many stochastic optimization algorithms that can also achieve the accuracy condition (C.3), such as variance-reduction methods [23, 10], adaptive SGD methods [13], etc. Here, we take the simplest algorithm, SGD with constant stepsize, as an example.
Corollary 21. Consider Algorithm 1 under the same settings as in Theorem 20. Further assume that each f_i is L-smooth and each device has access to a mini-batch stochastic gradient g_i(x, ξ_i) such that
E_{ξ_i}[g_i(x, ξ_i)] = ∇f_i(x), E_{ξ_i}[‖g_i(x, ξ_i) − ∇f_i(x)‖²] ≤ σ².
Suppose that, for any r ≥ 0, each device i ∈ S_r solves its subproblem approximately by using mini-batch SGD:
z_{k+1} = z_k − (1/H)[g_i(z_k, ξ^r_{i,k}) − ∇f_i(v_r) + ∇f_{S_r}(v_r) + λ(z_k − v_r)], k = 0, 1, …, K − 1,
where z_0 = v_r and H > L + λ is the stepsize coefficient. Let ξ_{i,r} denote (ξ^r_{i,k})_k. To achieve the accuracy condition (C.3) with an appropriately chosen H, each device i requires at most the following number of stochastic mini-batch oracle calls:
K = Θ( ( (L + λ)/(µ + λ) + (L + λ)σ²/((µ + λ)λε) ) log(1 + (L + λ)(µ + λ)/λ²) ).
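Before turning to the proof, the local update of Corollary 21 can be sketched as follows (a minimal version with µ = 0 in the averaging weights; the oracle sgrad_i and all names are illustrative assumptions):

import numpy as np

def solve_local_sgd(sgrad_i, grad_fi_v, grad_fS_v, v, lam, H, K, rng):
    # Mini-batch SGD on the S-DANE subproblem:
    #   z ← z − (1/H)[g_i(z, ξ) − ∇f_i(v_r) + ∇f_S(v_r) + λ(z − v_r)],
    # returning the weighted average iterate of Lemma 22 (weights q^{-k}).
    q = (H - lam) / H                     # q = (H − µ − λ)/H with µ = 0
    z = v.copy()
    z_sum, w_sum, w = np.zeros_like(v), 0.0, 1.0
    for _ in range(K):
        g = sgrad_i(z, rng) - grad_fi_v + grad_fS_v + lam * (z - v)
        z = z - g / H
        w /= q                            # weight q^{-k} of the k-th iterate
        z_sum += w * z
        w_sum += w
    return z_sum / w_sum                  # weighted average iterate z̄_K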
Proof. To get (C.3), it suffices to ensure that, for any i ∈ S_r, we have
E_{ξ_{i,r}}[‖∇F_{i,r}(x_{i,r+1})‖²] ≤ (λ²/4)E_{ξ_{i,r}}[‖x_{i,r+1} − v_r‖²] + λε/4.
For this, it suffices to ensure that
E_{ξ_{i,r}}[‖∇F_{i,r}(x_{i,r+1})‖²] ≤ (λ²/10)‖v_r − x*_{i,r}‖² + λε/5, (C.4)
where x*_{i,r} := argmin_x F_{i,r}(x). Indeed, suppose (C.4) holds; then we have
‖x_{i,r+1} − v_r‖ ≥ ‖v_r − x*_{i,r}‖ − ‖x_{i,r+1} − x*_{i,r}‖ (B.2) ≥ ‖v_r − x*_{i,r}‖ − (1/λ)‖∇F_{i,r}(x_{i,r+1})‖,
hence
‖v_r − x*_{i,r}‖² ≤ (2/λ²)‖∇F_{i,r}(x_{i,r+1})‖² + 2‖x_{i,r+1} − v_r‖².
Plugging this inequality into (C.4) and taking the expectation w.r.t. ξ_{i,r} on both sides, we get
E_{ξ_{i,r}}[‖∇F_{i,r}(x_{i,r+1})‖²] ≤ (1/5)E_{ξ_{i,r}}[‖∇F_{i,r}(x_{i,r+1})‖²] + (λ²/5)E_{ξ_{i,r}}[‖x_{i,r+1} − v_r‖²] + λε/5.
Rearranging gives the weaker condition. We next consider the number of mini-batch stochastic gradient oracle calls required for SGD to achieve (C.4). Since F_{i,r} is (L + λ)-smooth and (µ + λ)-convex, according to Lemma 22, we have
E_{ξ_{i,r}}[‖∇F_{i,r}(z̄_K)‖²] ≤ 2(L + λ)E_{ξ_{i,r}}[F_{i,r}(z̄_K) − F*_{i,r}] ≤ 2(L + λ)[ (µ + λ)‖v_r − x*_{i,r}‖²/(2[exp((µ + λ)K/H) − 1]) + σ²/(2(H − L − λ)) ],
where z̄_K := (1/Σ_{k=1}^K q^{-k}) Σ_{k=1}^K z_k q^{-k} and q = (H − µ − λ)/H. Choosing now H = (L + λ) + 5(L + λ)σ²/(λε), and letting the coefficient of the first part in the previous display be at most λ²/10, we get the claim.
Lemma 22. Let f be a µ-convex and L-smooth function. Consider SGD with constant stepsize H > L:
x_{k+1} := argmin_{x∈R^d} { ⟨g_k, x⟩ + (H/2)‖x − x_k‖² },
where g_k := g(x_k, ξ_k) with E_ξ[g(x, ξ)] = ∇f(x) and E_ξ[‖g(x, ξ) − ∇f(x)‖²] ≤ σ² for any x ∈ R^d. Then, for any K ≥ 1, we have
E[f(x̄_K)] − f* ≤ µ‖x_0 − x*‖²/(2[exp(µK/H) − 1]) + σ²/(2(H − L)), (C.5)
where x̄_K := (1/Σ_{k=1}^K q^{-k}) Σ_{k=1}^K x_k q^{-k} and q = (H − µ)/H.
Proof. Indeed, for any k ≥ 0, we have
f(x_k) + ⟨g_k, x* − x_k⟩ + (H/2)‖x* − x_k‖² ≥ f(x_k) + ⟨g_k, x_{k+1} − x_k⟩ + (H/2)‖x_{k+1} − x_k‖² + (H/2)‖x_{k+1} − x*‖²
(B.4) ≥ f(x_{k+1}) + ⟨g_k − ∇f(x_k), x_{k+1} − x_k⟩ + ((H − L)/2)‖x_{k+1} − x_k‖² + (H/2)‖x_{k+1} − x*‖²
(B.6) ≥ f(x_{k+1}) − ‖g_k − ∇f(x_k)‖²/(2(H − L)) + (H/2)‖x_{k+1} − x*‖².
Taking the expectation on both sides and using µ-convexity of f, we get
E[f(x_{k+1}) − f*] + (H/2)E[‖x_{k+1} − x*‖²] ≤ ((H − µ)/2)E[‖x_k − x*‖²] + σ²/(2(H − L)).
Applying Lemma 11, we have for any K ≥ 1:
E[f(x̄_K) − f*] ≤ µ‖x_0 − x*‖²/(2[(1/q)^K − 1]) + σ²/(2(H − L)) ≤ µ‖x_0 − x*‖²/(2[exp(µK/H) − 1]) + σ²/(2(H − L)).

D Proofs for Accelerated S-DANE (Algorithm 2)
D.1 One-Step Recurrence
Lemma 23. Consider Algorithm 2. Let f_i : R^d → R be µ-convex with µ ≥ 0 for any i ∈ [n]. Assume that {f_i}_{i=1}^n have δ_s-SOD. Then, for any r ≥ 0, we have
A_r f_{S_r}(x_r) + a_{r+1} f_{S_r}(x*) + (B_r/2)‖v_r − x*‖² ≥ A_{r+1} f_{S_r}(x_{r+1}) + (B_{r+1}/2)‖v_{r+1} − x*‖² + A_{r+1}[ ((λ − δ_s)/2)(1/s)Σ_{i∈S_r}‖x_{i,r+1} − y_r‖² − (1/λ)(1/s)Σ_{i∈S_r}‖∇F_{i,r}(x_{i,r+1})‖² ],
where now F_{i,r}(x) := f_i(x) + ⟨∇h^r_i(y_r), x⟩ + (λ/2)‖x − y_r‖² and h^r_i := f_{S_r} − f_i.
Proof. By µ-convexity of f_i, for any r ≥ 0, it holds that
A_r f_{S_r}(x_r) + a_{r+1} f_{S_r}(x*) + (B_r/2)‖v_r − x*‖²
≥ A_r (1/s)Σ_{i∈S_r}[f_i(x_{i,r+1}) + ⟨∇f_i(x_{i,r+1}), x_r − x_{i,r+1}⟩] + a_{r+1}(1/s)Σ_{i∈S_r}[f_i(x_{i,r+1}) + ⟨∇f_i(x_{i,r+1}), x* − x_{i,r+1}⟩ + (µ/2)‖x_{i,r+1} − x*‖²] + (B_r/2)‖v_r − x*‖².
Recall that v_{r+1} is the minimizer of the final expression in x*. This expression is a (µa_{r+1} + B_r)-strongly convex function of x*. By convexity and using the fact that A_{r+1} = A_r + a_{r+1} and B_{r+1} = µa_{r+1} + B_r, we obtain
≥ A_{r+1}(1/s)Σ_{i∈S_r} f_i(x_{i,r+1}) + (µa_{r+1}/2)(1/s)Σ_{i∈S_r}‖x_{i,r+1} − v_{r+1}‖² + (B_r/2)‖v_r − v_{r+1}‖² + (1/s)Σ_{i∈S_r}⟨∇f_i(x_{i,r+1}), A_r x_r + a_{r+1}v_{r+1} − A_{r+1}x_{i,r+1}⟩ + (B_{r+1}/2)‖v_{r+1} − x*‖².
Recall that y_r = (A_r/A_{r+1})x_r + (a_{r+1}/A_{r+1})v_r. Therefore,
(B_r/2)‖v_r − v_{r+1}‖² + (1/s)Σ_{i∈S_r}⟨∇f_i(x_{i,r+1}), A_r x_r + a_{r+1}v_{r+1} − A_{r+1}x_{i,r+1}⟩
= (B_r/2)‖v_r − v_{r+1}‖² + a_{r+1}⟨(1/s)Σ_{i∈S_r}∇f_i(x_{i,r+1}), v_{r+1} − v_r⟩ + A_{r+1}(1/s)Σ_{i∈S_r}⟨∇f_i(x_{i,r+1}), y_r − x_{i,r+1}⟩
(B.6) ≥ −(a²_{r+1}/(2B_r))‖(1/s)Σ_{i∈S_r}∇f_i(x_{i,r+1})‖² + A_{r+1}(1/s)Σ_{i∈S_r}⟨∇f_i(x_{i,r+1}), y_r − x_{i,r+1}⟩.
Substituting this lower bound, using convexity of f_i, and dropping the non-negative term (µa_{r+1}/2)(1/s)Σ_{i∈S_r}‖x_{i,r+1} − v_{r+1}‖², we further get
A_r f_{S_r}(x_r) + a_{r+1} f_{S_r}(x*) + (B_r/2)‖v_r − x*‖²
(B.1) ≥ A_{r+1}(1/s)Σ_{i∈S_r}[f_i(x_{r+1}) + ⟨∇f_i(x_{r+1}), x_{i,r+1} − x_{r+1}⟩] + (B_{r+1}/2)‖v_{r+1} − x*‖² + A_{r+1}(1/s)Σ_{i∈S_r}⟨∇f_i(x_{i,r+1}), y_r − x_{i,r+1}⟩ − (a²_{r+1}/(2B_r))‖(1/s)Σ_{i∈S_r}∇f_i(x_{i,r+1})‖².
Denote h^r_i := f_{S_r} − f_i.
Substituting
Σ_{i∈S_r}⟨∇f_i(x_{r+1}), x_{i,r+1} − x_{r+1}⟩ = Σ_{i∈S_r}⟨∇h^r_i(x_{r+1}), y_r − x_{i,r+1}⟩
into the previous display (using x_{r+1} = (1/s)Σ_{i∈S_r} x_{i,r+1} and Σ_{i∈S_r}∇h^r_i(x_{r+1}) = 0), we get
A_r f_{S_r}(x_r) + a_{r+1} f_{S_r}(x*) + (B_r/2)‖v_r − x*‖² ≥ A_{r+1} f_{S_r}(x_{r+1}) + (B_{r+1}/2)‖v_{r+1} − x*‖² + A_{r+1}(1/s)Σ_{i∈S_r}⟨∇f_i(x_{i,r+1}) + ∇h^r_i(x_{r+1}), y_r − x_{i,r+1}⟩ − (a²_{r+1}/(2B_r))‖(1/s)Σ_{i∈S_r}∇f_i(x_{i,r+1})‖².
We now apply Lemma 15 (with x_i = x_{i,r+1}, v = y_r, S = S_r and x̄_S = x_{r+1}) to get
(1/s)Σ_{i∈S_r}⟨∇f_i(x_{i,r+1}) + ∇h^r_i(x_{r+1}), y_r − x_{i,r+1}⟩ − (1/(2λ))‖(1/s)Σ_{i∈S_r}∇f_i(x_{i,r+1})‖² ≥ ((λ − δ_s)/2)(1/s)Σ_{i∈S_r}‖y_r − x_{i,r+1}‖² − (1/λ)(1/s)Σ_{i∈S_r}‖∇F_{i,r}(x_{i,r+1})‖².
Substituting this lower bound into the previous display and using A_{r+1} = a²_{r+1}λ/B_r, we get the claim.

D.2 Full Client Participation (Proof of Theorem 6)
Proof. Applying Lemma 23 and using Σ_{i=1}^n‖∇F_{i,r}(x_{i,r+1})‖² ≤ δ²Σ_{i=1}^n‖x_{i,r+1} − y_r‖² and λ = 2δ, for any r ≥ 0, we have
A_r f(x_r) + a_{r+1} f* + (B_r/2)‖v_r − x*‖² ≥ A_{r+1} f(x_{r+1}) + (B_{r+1}/2)‖v_{r+1} − x*‖² + A_{r+1}((λ − δ)/2 − δ²/λ)(1/n)Σ_{i=1}^n‖x_{i,r+1} − y_r‖² = A_{r+1} f(x_{r+1}) + (B_{r+1}/2)‖v_{r+1} − x*‖².
Subtracting A_{r+1} f* on both sides, we get
A_{r+1}[f(x_{r+1}) − f*] + (B_{r+1}/2)‖v_{r+1} − x*‖² ≤ A_r[f(x_r) − f*] + (B_r/2)‖v_r − x*‖².
Recursively applying the previous display from r = 0 to r = R − 1, we get
A_R[f(x_R) − f*] + (B_R/2)‖v_R − x*‖² ≤ A_0[f(x_0) − f*] + (1/2)‖v_0 − x*‖² = (1/2)D².
It remains to apply Lemma 12 and plug in the estimate of the growth of A_R.
Corollary 24. Under the same setting as in Theorem 6, to achieve f(x_R) − f* ≤ ε, we need at most the following number of communication rounds:
R = O( (1 + √(δ/µ)) log(1 + √(min{µ, δ}D²/ε)) ).
Proof. When µ ≤ 8δ, by using (1 + q)^k ≥ exp(qk/(1 + q)), we get
f(x_R) − f* ≤ 2µD²/[(1 + √(µ/(8δ)))^R − 1]² ≤ 2µD²/[exp(√µ R/(√(8δ) + √µ)) − 1]².
Making the right-hand side at most ε and rearranging, we get the claim. When µ ≥ 8δ, it suffices to ensure that
4δD²/(1 + √(µ/(8δ)))^{2(R−1)} ≤ ε.

D.3 Partial Client Participation
It is well known that accelerated stochastic gradient methods are not able to improve the complexity of the stochastic part compared with the basic methods [11]. A similar result also holds for our accelerated distributed method.
Theorem 25. Consider Algorithm 2 under the same setting as in Theorem 20. Let λ = Θ( (δ_s + δ̄_s) + ζ²(n − s)R/(s(n − 1)ε) ) and suppose that, for any r ≥ 0, we have
(1/s)Σ_{i∈S_r} E_{ξ_{i,r}}[‖∇F_{i,r}(x_{i,r+1})‖²] ≤ O( λ²(1/s)Σ_{i∈S_r} E_{ξ_{i,r}}[‖x_{i,r+1} − y_r‖²] + λε/R ).
Denote D := ‖x_0 − x*‖. Then, to ensure that E[f(x_R)] − f* ≤ ε for some ε > 0, we need to perform at most the following number of communication rounds:
R = Θ( √((δ_s + δ̄_s + µ)/µ) log(1 + min{µ, λ}D²/ε) + (ζ²(n − s)/(snµε)) log²(1 + min{µ, λ}D²/ε) ).
The error term that depends on ζ² and ε is of the same order as for S-DANE, i.e., O(ζ²/ε) when µ > 0 and O(ζ²/ε²) when µ = 0. Nevertheless, when s is large enough that this second error term becomes no larger than the first optimization term, ACC-S-DANE can still be faster than S-DANE.
Proof of Theorem 25. According to Lemma 23, we have, for any r ≥ 0,
A_r f_{S_r}(x_r) + a_{r+1} f_{S_r}(x*) + (B_r/2)‖v_r − x*‖² ≥ A_{r+1} f_{S_r}(x_{r+1}) + (B_{r+1}/2)‖v_{r+1} − x*‖² + A_{r+1}((λ − δ_s)/2)(1/s)Σ_{i∈S_r}‖x_{i,r+1} − y_r‖² − A_{r+1}(1/λ)(1/s)Σ_{i∈S_r}‖∇F_{i,r}(x_{i,r+1})‖².
According to Lemma 14 (with S = S_r, x_S = x_{r+1} and y = y_r), for any γ > 0, we have
E_{S_r}[f(x_{r+1}) − f_{S_r}(x_{r+1})] ≤ (γ/2)((n − s)/(s(n − 1)))ζ² + (1/(2γ) + δ̄_s/2) E_{S_r}[(1/s)Σ_{i∈S_r}‖x_{i,r+1} − y_r‖²].
Adding A_{r+1} f(x_{r+1}) to both sides of the first display, taking the expectation over S_r on both sides, substituting the previous upper bound, and setting γ = s(n − 1)ε̃/(2ζ²(n − s)) with ε̃ > 0, we get
A_r f(x_r) + a_{r+1} f* + (B_r/2)‖v_r − x*‖² ≥ A_{r+1} E_{S_r}[f(x_{r+1})] + (B_{r+1}/2)E_{S_r}[‖v_{r+1} − x*‖²] + A_{r+1}((λ − δ_s − δ̄_s)/2 − 1/(2γ))E_{S_r}[(1/s)Σ_{i∈S_r}‖x_{i,r+1} − y_r‖²] − A_{r+1}(1/λ)E_{S_r}[(1/s)Σ_{i∈S_r}‖∇F_{i,r}(x_{i,r+1})‖²] − A_{r+1}ε̃/4.
Denote all the randomness {ξ_{i,r}}_{i∈S_r} by ξ_r. Since ξ_{i,r} is independent of the choice of S_r for any i ∈ [n], taking the expectation over ξ_r on both sides of the previous display and using the assumption that
E_{S_r,ξ_r}[(1/s)Σ_{i∈S_r}‖∇F_{i,r}(x_{i,r+1})‖²] ≤ E_{S_r,ξ_r}[(λ²/4)(1/s)Σ_{i∈S_r}‖x_{i,r+1} − y_r‖²] + λε̃/4,
we obtain
A_r f(x_r) + a_{r+1} f* + (B_r/2)‖v_r − x*‖² ≥ A_{r+1} E_{S_r,ξ_r}[f(x_{r+1})] + (B_{r+1}/2)E_{S_r,ξ_r}[‖v_{r+1} − x*‖²] + A_{r+1}((λ − δ_s − δ̄_s)/2 − 1/(2γ) − λ/4)E[(1/s)Σ_{i∈S_r}‖x_{i,r+1} − y_r‖²] − A_{r+1}ε̃/2.
By choosing λ = 4ζ²(n − s)/(s(n − 1)ε̃) + 2(δ_s + δ̄_s), we have (λ − δ_s − δ̄_s)/2 − 1/(2γ) − λ/4 ≥ 0.
Taking the full expectation on both sides, we get
A_r E[f(x_r)] + a_{r+1} f* + (B_r/2)E[‖v_r − x*‖²] ≥ A_{r+1} E[f(x_{r+1})] + (B_{r+1}/2)E[‖v_{r+1} − x*‖²] − A_{r+1}ε̃/2.
Subtracting A_{r+1} f* on both sides, summing up from r = 0 to r = R − 1, and using the facts that A_0 = 0, v_0 = x_0 and B_0 = 1, we get
A_R E[f(x_R) − f*] + (B_R/2)E[‖v_R − x*‖²] ≤ (1/2)‖x_0 − x*‖² + (ε̃/2)Σ_{r=1}^R A_r.
Dividing both sides by A_R, setting ε̃ = ε/R, and using the fact that the sequence {A_r} is non-decreasing (so that Σ_{r=1}^R A_r ≤ R A_R), we get
E[f(x_R)] − f* ≤ (1/(2A_R))‖x_0 − x*‖² + ε/2.
We now apply Lemma 12 with c = λ to get
A_R ≥ (1/(4µ))[(1 + √(µ/(4λ)))^R − (1 − √(µ/(4λ)))^R]² when µ ≤ 4λ, and A_R ≥ (1/(4λ))(1 + √(µ/(4λ)))^{2(R−1)} when µ ≥ 4λ.
Letting these lower bounds be large enough that (1/(2A_R))D² ≤ ε/2, we see that it suffices to take
R = Θ( √((µ + λ)/µ) log(1 + min{µ, λ}D²/ε) ).
Plugging λ = Ω( ζ²(n − s)R/(snε) + (δ_s + δ̄_s) ) into the last display and rearranging (R appears on both sides), we get the condition for R.
D.3.1 Stochastic Local Solver
Corollary 26. Consider Algorithm 2 under the same settings as in Theorem 25. Consider the same stochastic local solver as in Corollary 21. To achieve
(1/s)Σ_{i∈S_r} E_{ξ_{i,r}}[‖∇F_{i,r}(x_{i,r+1})‖²] ≤ O( λ²(1/s)Σ_{i∈S_r} E_{ξ_{i,r}}[‖x_{i,r+1} − y_r‖²] + λε/R ),
each device i requires at most the following number of stochastic mini-batch oracle calls:
K = Θ( ( (L + λ)/(µ + λ) + (L + λ)σ²R/((µ + λ)λε) ) log(1 + (L + λ)(µ + λ)/λ²) ).
Proof. The proof is the same as that for Corollary 21.

E Dynamic Estimation of Similarity Constant by Line Search
Theorem 27. Consider Algorithm 3. Suppose that each function f_i is µ-convex for some µ ≥ 0, and {f_i}_{i=1}^n have δ-SOD for some δ > 0. Let λ ≤ 2δ. Then, for any R ≥ 1, it holds that
f(x̄_R) − f* ≤ µD²/(2[(1 + µ/(4δ))^R − 1]),
which for µ = 0 reads 2δD²/R, where x̄_R := argmin_{x∈{x_1,…,x_R}} f(x). To ensure that f(x̄_R) − f* ≤ ε for any given ε > 0, it suffices to set
R = Θ( ((δ + µ)/µ) log(1 + µD²/ε) ),
where D := ‖x_0 − x*‖. Furthermore, the total number of communication rounds spent inside the r- and k-loops since the start of the algorithm and up to the moment x̄_R has been computed is
Σ_{r=0}^{R−1}(k_r + 1) ≤ O( R + log(2δ/λ) ).

Algorithm 3 S-DANE with line search
1: Input: λ > 0, µ ≥ 0, x_0 = v_0 ∈ R^d. Let h_i := f − f_i.
2: Set λ_{0,0} = λ.
3: for r = 0, 1, 2, … do
4:  for k = 0, 1, … do
5:   for each device i ∈ [n] in parallel do
6:    x_{i,r+1,k} ≈ argmin_x F_{i,r,k}(x) := f_i(x) + ⟨∇h_i(v_r), x⟩ + (λ_{r,k}/2)‖x − v_r‖²
7:    (stop running the local solver once ‖∇F_{i,r,k}(x_{i,r+1,k})‖ ≤ (λ_{r,k}/2)‖x_{i,r+1,k} − v_r‖)
8:   Aggregate local models: x_{r+1,k} = (1/n)Σ_{i=1}^n x_{i,r+1,k}.
9:   if (1/n)Σ_{i=1}^n⟨∇f_i(x_{i,r+1,k}) + ∇h_i(x_{r+1,k}), v_r − x_{i,r+1,k}⟩ ≥ (1/(2λ_{r,k}))‖(1/n)Σ_{i=1}^n∇f_i(x_{i,r+1,k})‖² then
10:    k_r = k and break the loop.
11:   λ_{r,k+1} = 2λ_{r,k}.
12:  λ_r = λ_{r,k_r}, x_{i,r+1} = x_{i,r+1,k_r}, x_{r+1} = x_{r+1,k_r}, λ_{r+1,0} = λ_r/2.
13:  v_{r+1} = argmin_x (1/n)Σ_{i=1}^n[⟨∇f_i(x_{i,r+1}), x⟩ + (µ/2)‖x − x_{i,r+1}‖²] + (λ_r/2)‖x − v_r‖².

Proof. According to Lemma 15 (with x_i = x_{i,r+1}, v = v_r, S = [n] and x̄_S = x_{r+1}) and our requirement on ‖∇F_{i,r}(x_{i,r+1})‖, whenever λ_{r,k} ≥ 2δ, we can estimate
(1/n)Σ_{i=1}^n⟨∇f_i(x_{i,r+1}) + ∇h_i(x_{r+1}), v_r − x_{i,r+1}⟩ − (1/(2λ_{r,k}))‖(1/n)Σ_{i=1}^n∇f_i(x_{i,r+1})‖² ≥ ((λ_{r,k} − δ)/2)(1/n)Σ_{i=1}^n‖v_r − x_{i,r+1}‖² − (1/λ_{r,k})(1/n)Σ_{i=1}^n‖∇F_{i,r}(x_{i,r+1})‖² ≥ (λ_{r,k}/4 − δ/2)(1/n)Σ_{i=1}^n‖v_r − x_{i,r+1}‖² ≥ 0.
Hence, at any iteration of the r-loop, the corresponding k-loop eventually terminates. Further, since λ_{0,0} ≤ 2δ, we can easily prove by induction that the quantities λ_{r,k} stay reasonably bounded:
λ_{r,0} ≤ 2δ, λ_r = λ_{r,k_r} ≤ 4δ =: λ_max, for all r ≥ 0. (E.1)
Proceeding exactly in the same way as in the proof of Lemma 16 and using the termination condition of the k-loop, we conclude that, for any r ≥ 0, it holds that
(1/λ_r)[f(x_{r+1}) − f*] + ((1 + µ/λ_r)/2)‖v_{r+1} − x*‖² ≤ (1/2)‖v_r − x*‖².
In view of (E.1), this means that, for any r ≥ 0,
(1/λ_max)[f(x_{r+1}) − f*] + ((1 + µ/λ_max)/2)‖v_{r+1} − x*‖² ≤ (1/2)‖v_r − x*‖².
Following the same proof as in Appendix C.2, but with λ replaced by λ_max, we obtain the first two claims. It remains to estimate the total number of communication rounds required to construct the point x̄_R.
In order to carry out the k-loop, the server needs to compute ∇f(v_r) and send this vector, as well as v_r and λ_{r,0}, to each client, which requires O(1) communication rounds. Every iteration of the k-loop also requires O(1) communication rounds, and the total number of such iterations is k_r. Thus, every iteration of the r-loop requires O(k_r + 1) communication rounds. Furthermore, during the corresponding rounds, the server may additionally compute the function value f(x_{r+1}) needed for updating the output point x̄_{r+1}; this can be done, e.g., inside the k-loop, alongside the computation of the gradient ∇f(x_{r+1,k}) needed to evaluate the if-condition. Thus, x̄_R can indeed be computed after O(1)·Σ_{r=0}^{R−1}(k_r + 1) communication rounds. To estimate the latter sum, observe that, by construction, for any r ≥ 0, we have λ_{r+1,0} = (1/2)λ_{r,k_r} = 2^{k_r − 1}λ_{r,0}. Taking logarithms, we see that k_r = 1 + log₂(λ_{r+1,0}/λ_{r,0}) for any r ≥ 0. Thus,
Σ_{r=0}^{R−1}(k_r + 1) = Σ_{r=0}^{R−1}[2 + log₂(λ_{r+1,0}/λ_{r,0})] = 2R + log₂(λ_{R,0}/λ_{0,0}) ≤ 2R + log₂(2δ/λ),
where the final inequality is due to (E.1) and our choice of λ_{0,0}.

Algorithm 4 ACC-S-DANE with line search
1: Input: λ > 0, µ ≥ 0, x_0 = v_0 ∈ R^d. Let h_i := f − f_i.
2: Set A_0 = 0, B_0 = 1, λ_{0,0} = λ.
3: for r = 0, 1, 2, … do
4:  for k = 0, 1, … do
5:   Find a_{r+1,k} > 0 from the equation λ_{r,k} = (A_r + a_{r+1,k})B_r/a²_{r+1,k}. Set A_{r+1,k} = A_r + a_{r+1,k}.
6:   y_{r,k} = (A_r/A_{r+1,k})x_r + (a_{r+1,k}/A_{r+1,k})v_r.
7:   for each device i ∈ [n] in parallel do
8:    x_{i,r+1,k} ≈ argmin_x F_{i,r,k}(x) := f_i(x) + ⟨∇h_i(y_{r,k}), x⟩ + (λ_{r,k}/2)‖x − y_{r,k}‖².
9:    (stop running the local solver once ‖∇F_{i,r,k}(x_{i,r+1,k})‖ ≤ (λ_{r,k}/2)‖x_{i,r+1,k} − y_{r,k}‖)
10:   Aggregate local models: x_{r+1,k} = (1/n)Σ_{i=1}^n x_{i,r+1,k}.
11:   if (1/n)Σ_{i=1}^n⟨∇f_i(x_{i,r+1,k}) + ∇h_i(x_{r+1,k}), y_{r,k} − x_{i,r+1,k}⟩ ≥ (1/(2λ_{r,k}))‖(1/n)Σ_{i=1}^n∇f_i(x_{i,r+1,k})‖² then
12:    k_r = k and break the loop.
13:   λ_{r,k+1} = 2λ_{r,k}.
14:  λ_r = λ_{r,k_r}, x_{i,r+1} = x_{i,r+1,k_r}, x_{r+1} = x_{r+1,k_r}, a_{r+1} = a_{r+1,k_r}, λ_{r+1,0} = λ_r/2.
15:  A_{r+1} = A_r + a_{r+1}, B_{r+1} = B_r + µa_{r+1}.
16:  v_{r+1} = argmin_x a_{r+1}(1/n)Σ_{i=1}^n[⟨∇f_i(x_{i,r+1}), x⟩ + (µ/2)‖x − x_{i,r+1}‖²] + (B_r/2)‖x − v_r‖².

Theorem 28. Consider Algorithm 4. Suppose that each function f_i is µ-convex for some µ ≥ 0, and {f_i}_{i=1}^n have δ-SOD for some δ > 0. Let λ ≤ 2δ. If µ ≤ 16δ, then, for any R ≥ 1, it holds that
f(x_R) − f* ≤ 2µD²/[(1 + √(µ/(16δ)))^R − (1 − √(µ/(16δ)))^R]²,
where D := ‖x_0 − x*‖. Otherwise, f(x_R) − f* ≤ 8δD²/(1 + √(µ/(16δ)))^{2(R−1)} for any R ≥ 1. To ensure that f(x_R) − f* ≤ ε for any given ε > 0, it suffices to set
R = Θ( (1 + √(δ/µ)) log(1 + √(min{µ, δ}D²/ε)) ).
Furthermore, the total number of communication rounds spent inside the r- and k-loops since the start of the algorithm and up to the moment x_R has been computed is
Σ_{r=0}^{R−1}(k_r + 1) ≤ O( R + log(2δ/λ) ).
Proof. Using the same reasoning as in the proof of Theorem 27, we can justify that, at any iteration of the r-loop, the corresponding k-loop eventually terminates, and λ_{r,k} stays uniformly bounded as in (E.1). Next, we proceed in the same way as in the proof of Lemma 23 and use the termination condition of the k-loop to obtain that, for any r ≥ 0,
A_{r+1}[f(x_{r+1}) − f*] + (B_{r+1}/2)‖v_{r+1} − x*‖² ≤ A_r[f(x_r) − f*] + (B_r/2)‖v_r − x*‖².
This shows that, for any R ≥ 1, f(x_R) − f* ≤ D²/(2A_R). To estimate the rate of growth of A_R, we use the equation for a_{r+1} = a_{r+1,k_r} and the bound on λ_r from (E.1). This gives us, for any r ≥ 0, the inequality
a²_{r+1} = A_{r+1}B_r/λ_r ≥ A_{r+1}B_r/λ_max, where B_r = 1 + µA_r.
Invoking Lemma 12 (with c = λ_max), we get a lower bound on A_R, and the first claim follows. The bound on R via ε can be justified by the same argument as in the proof of Corollary 24. To estimate the total number of communication rounds, we can follow exactly the same argument as in the proof of Theorem 27.
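The doubling k-loop shared by Algorithms 3 and 4 can be sketched as follows (one round of Algorithm 3 with µ = 0; local_solve is a hypothetical handle that approximately minimizes F_i for a given λ, e.g. GD run until the line-7 stopping rule holds):

import numpy as np

def sdane_line_search_round(v, grads, lam0, local_solve):
    # grads[i](z) returns ∇f_i(z).
    n = len(grads)
    grad_f_v = np.mean([g(v) for g in grads], axis=0)
    lam = lam0
    while True:
        xs = [local_solve(i, grad_f_v - grads[i](v), lam) for i in range(n)]
        xbar = np.mean(xs, axis=0)
        gs = [grads[i](xs[i]) for i in range(n)]
        gbar = np.mean(gs, axis=0)
        grad_f_xbar = np.mean([g(xbar) for g in grads], axis=0)
        # line-9 test: (1/n) Σ ⟨∇f_i(x_i) + ∇h_i(x̄), v − x_i⟩ ≥ (1/(2λ)) ‖(1/n) Σ ∇f_i(x_i)‖²
        lhs = np.mean([np.dot(gs[i] + (grad_f_xbar - grads[i](xbar)), v - xs[i])
                       for i in range(n)])
        if lhs >= np.dot(gbar, gbar) / (2.0 * lam):
            break
        lam *= 2.0                  # similarity estimate too small: double λ
    v_next = v - gbar / lam         # explicit v-update for µ = 0 (line 13)
    return xbar, v_next, lam / 2.0  # the next round starts from λ_r / 2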
[Figure E.1: Comparison of S-DANE against DANE for solving a convex quadratic minimization problem with the same number of local steps. Axes: communication rounds vs. total number of gradient oracle calls; curves: DANE-GD and S-DANE-GD (ours).]

Algorithm 5 S-DANE (DL)
1: Input: λ > 0, η > 0, γ ∈ [0, 1], x_0 = v_0 ∈ R^d, s ∈ [n]
2: for r = 0, 1, 2, … do
3:  Sample S_r ∈ ([n] choose s) uniformly at random without replacement
4:  for each device i ∈ S_r in parallel do
5:   Set x_{i,r+1} ≈ argmin_x F_{i,r}(x), where
6:    F_{i,r}(x) := f_i(x) − ⟨x, ∇f_i(v_r) − ∇f_{S_r}(v_r)⟩ + (λ/2)‖x − v_r‖² (option 1)
7:    F_{i,r}(x) := f_i(x) + (λ/2)‖x − v_r‖² (option 2)
8:  Set x_{r+1} = (1/s)Σ_{i∈S_r} x_{i,r+1}
9:  Set v_{r+1} = γx_{r+1} + (1 − γ)v_r − η(1/s)Σ_{i∈S_r}∇f_i(x_{i,r+1})

F Additional Details on Experiments
F.1 Convex Quadratics
We generate random vectors {b_{i,j}} and diagonal matrices {A_{i,j}} in the same way as in [22], such that max_{i,j}‖A_{i,j}‖ = 100 and δ ≈ 5. We use n = 10, m = 5 and d = 1000. We compare S-DANE and ACC-S-DANE with DANE. We use standard gradient descent (GD) with constant stepsize 1/200 ≤ 1/(2L) as the local solver for all three methods, where L is the smoothness constant of each f_i. We use λ = 5 for all three methods. We use the stopping criterion ‖∇F_{i,r}(x_{i,r+1})‖ ≤ (λ/2)‖x_{i,r+1} − v_r‖ for our methods (v_r becomes y_r for the accelerated method). We use ‖∇F_{i,r}(x_{i,r+1})‖ ≤ (λ/(r + 1))‖x_{i,r+1} − x_r‖ for DANE.
F.2 Deep Learning Tasks
We simulate the experiment on one NVIDIA DGX A100. We split the training dataset into n = 10 parts according to the Dirichlet distribution with α = 0.5. We use SGD with a batch size of 512 as the local solver for each device. For all the methods considered in Figure 3, we choose the best number of local steps among {10, 20, …, 80} (for SCAFFNEW, this becomes the inverse of the probability) and the best learning rate among {0.02, 0.05, 0.1}. For this particular task, it is often observed that using a control variate makes the training less efficient [38]. The possible issue comes from the fact that the local smoothness is often much smaller than the local dissimilarity for this task. We therefore remove the control-variate term −⟨x, ∇f_i(v_r) − ∇f_{S_r}(v_r)⟩ on line 6 of S-DANE. Moreover, writing the explicit formula for v_{r+1} on line 9, it becomes
v_{r+1} = γx_{r+1} + (1 − γ)v_r − η(1/s)Σ_{i∈S_r}∇f_i(x_{i,r+1})
with γ ∈ [0, 1] and η > 0. We set γ = 0.99 and η to be the local learning rate in our experiment. The method can be found in Algorithm 5. Note that the only difference between it and FEDPROX is the choice of the prox-center. The best number of local steps for the algorithms without control variates is 70, while for the others it is 10 (otherwise, the training loss explodes).
F.3 Implementation
To implement Algorithms 1 and 2 (Algorithm 5 with option 2 is the same as Algorithm 1), each device has the freedom to employ any efficient optimization algorithm, depending on its computational power and local data size. At each communication round r, these local algorithms are called to approximately solve the subproblems defined by {F_{i,r}} until the gradient norm satisfies the accuracy condition stated in the corresponding theorems. The server only needs to perform basic vector operations. Note that G_r defined in those algorithms has a unique solution, so that v_{r+1} can be derived explicitly (in the same form as line 9 of Algorithm 5).
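The resulting server-side update of Algorithm 5 is a one-liner; a minimal sketch (the γ and η values follow Appendix F.2, where η is set to the local learning rate):

import numpy as np

def sdane_dl_server_update(x_next, v, local_grads, gamma=0.99, eta=0.05):
    # Line 9 of Algorithm 5:
    #   v_{r+1} = γ x_{r+1} + (1 − γ) v_r − η (1/s) Σ_{i∈S_r} ∇f_i(x_{i,r+1})
    return gamma * x_next + (1.0 - gamma) * v - eta * np.mean(local_grads, axis=0)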
G Impact Statement
This paper presents work that aims to advance the field of distributed Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

NeurIPS Paper Checklist
1. Claims. Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? Answer: [Yes]. Justification: In the abstract and the introduction, we first state clearly the goals of this study, which are to achieve the best-known communication complexity and to maintain the overall computation efficiency. We then discuss the performance of the previous state-of-the-art algorithms. Finally, we compare our methods with them (for instance, in Table 1) and show the contribution and scope of this paper.
2. Limitations. Question: Does the paper discuss the limitations of the work performed by the authors? Answer: [Yes]. Justification: In Section 7, we discuss the limitations of our work, in terms of the slightly stronger assumption on µ-convexity and the lack of a non-convex analysis. On top of that, we also discuss potential future work building on our results.
3. Theory Assumptions and Proofs. Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof? Answer: [Yes]. Justification: We make clear assumptions and provide complete and concise proofs in the Appendix for all the claims made in the main text.
4. Experimental Result Reproducibility. Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: [Yes]. Justification: We disclose all the necessary information to reproduce our experimental results in Section 6 and Appendix F.
5. Open access to data and code. Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes]. Justification: We provide the GitHub link to the code.
6. Experimental Setting/Details. Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results? Answer: [Yes]. Justification: The details needed to reproduce our experimental results can be found in Section 6 and Appendix F.
7. Experiment Statistical Significance. Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments? Answer: [No]. Justification: Two sets of our experiments are defined in a totally deterministic setting (no randomness). We ran the other two sets of experiments three times with different random seeds; the results are almost indistinguishable.
8. Experiments Compute Resources. Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [Yes]. Justification: We provide the information on the computing resources at the start of Appendix F for the deep-learning experiment. The other experiments are run on a MacBook Pro laptop.
9. Code of Ethics. Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics (https://neurips.cc/public/EthicsGuidelines)? Answer: [Yes]. Justification: We have read and understood the Code of Ethics, and we have conducted our research in accordance with it.
10. Broader Impacts. Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed? Answer: [Yes]. Justification: We provide an impact statement in Appendix G.
11. Safeguards. Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)? Answer: [NA]. Justification: Our research does not have any risks for misuse.
12. Licenses for existing assets. Question: Are the creators or original owners of assets (e.g., code, data, models) used in the paper properly credited, and are the license and terms of use explicitly mentioned and properly respected? Answer: [Yes]. Justification: We properly cite and credit the original owners of the open-source datasets LIBSVM and CIFAR-10 in the paper.
13. New Assets. Question: Are new assets introduced in the paper well documented, and is the documentation provided alongside the assets? Answer: [NA]. Justification: We do not release any new assets in the paper.
14. Crowdsourcing and Research with Human Subjects. Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)? Answer: [NA]. Justification: The paper does not involve crowdsourcing nor research with human subjects.
15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects. Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained? Answer: [NA]. Justification: The paper does not involve crowdsourcing nor research with human subjects.