# The Power of Extrapolation in Federated Learning

Hanmin Li (Gen AI Center of Excellence, KAUST, Saudi Arabia; hanmin.li@kaust.edu.sa), Kirill Acharya (Gen AI Center of Excellence, KAUST, Saudi Arabia; acharya.kk@phystech.edu), Peter Richtárik (Gen AI Center of Excellence, KAUST, Saudi Arabia; peter.richtarik@kaust.edu.sa)

Abstract: We propose and study several server-extrapolation strategies for enhancing the theoretical and empirical convergence properties of the popular federated learning optimizer FedProx [Li et al., 2020]. While it has long been known that some form of extrapolation can help in the practice of FL, only a handful of works provide any theoretical guarantees. The phenomenon seems elusive, and our current theoretical understanding remains severely incomplete. In our work, we focus on smooth convex or strongly convex problems in the interpolation regime. In particular, we propose Extrapolated FedProx (FedExProx), and study three extrapolation strategies: a constant strategy (depending on various smoothness parameters and the number of participating devices), and two smoothness-adaptive strategies, one based on the notion of gradient diversity (FedExProx-GraDS), and the other based on the stochastic Polyak stepsize (FedExProx-StoPS). Our theory is corroborated with carefully constructed numerical experiments.

Kirill was a student at MIPT during his internship at KAUST.

38th Conference on Neural Information Processing Systems (NeurIPS 2024).

1 Introduction

Federated learning (FL) is a distributed training approach for machine learning models, where multiple clients collaborate under the guidance of a central server to optimize a loss function [Konečný et al., 2016, McMahan et al., 2017]. This method allows clients to contribute to model training while keeping their data private, as it avoids the need for direct data sharing. Often, federated optimization is formulated as the minimization of a finite-sum objective function,

$$\min_{x \in \mathbb{R}^d} \left\{ f(x) := \frac{1}{n} \sum_{i=1}^{n} f_i(x) \right\}, \qquad (1)$$

where each $f_i : \mathbb{R}^d \to \mathbb{R}$ is the empirical risk of model $x$ associated with the $i$-th client. The federated averaging method (FedAvg) is among the most favored strategies for addressing federated learning problems, as proposed by McMahan et al. [2017], Mangasarian and Solodov [1993]. In FedAvg, the server initiates an iteration by selecting a subset of clients for participation in a given round. Each chosen client then proceeds with local training, employing gradient-based techniques such as gradient descent (GD) or stochastic gradient descent (SGD) with random reshuffling, as discussed by Bubeck et al. [2015], Gower et al. [2019], Moulines and Bach [2011], Sadiev et al. [2022].

Li et al. [2020] proposed replacing the local SGD training of each client in FedAvg with the computation of a proximal term, resulting in the FedProx algorithm:

$$x_{k+1} = \frac{1}{n} \sum_{i=1}^{n} \mathrm{prox}_{\gamma f_i}(x_k), \qquad (2)$$

where $\gamma > 0$ is the step size, and the proximal operator is defined as

$$\mathrm{prox}_{\gamma f_i}(x) := \arg\min_{z \in \mathbb{R}^d} \left\{ f_i(z) + \frac{1}{2\gamma}\|z - x\|^2 \right\}.$$

Contrary to gradient-based methods such as GD and SGD, algorithms based on the proximal operator, such as the proximal point method (PPM) [Rockafellar, 1976, Parikh et al., 2014] and stochastic proximal point methods (SPPM) [Asi and Duchi, 2019, Bertsekas, 2011, Khaled and Jin, 2022, Patrascu and Necoara, 2018, Richtárik and Takáč, 2020], benefit from stability against inaccuracies in learning rate specification [Ryu and Boyd, 2014].
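To make the update (2) concrete, the following is a minimal sketch (an illustration of ours, not the authors' implementation), assuming each client solves its proximal subproblem with a few steps of gradient descent; the inner solver and its parameters are hypothetical, whereas the paper treats the prox as evaluated exactly.

```python
import numpy as np

def prox_step(grad_fi, x, gamma, inner_steps=100, lr=0.01):
    """Approximate prox_{gamma f_i}(x) = argmin_z { f_i(z) + ||z - x||^2 / (2 gamma) }
    by gradient descent on the regularized subproblem (hypothetical inner solver)."""
    z = x.copy()
    for _ in range(inner_steps):
        # gradient of the subproblem: grad f_i(z) + (z - x) / gamma
        z -= lr * (grad_fi(z) + (z - x) / gamma)
    return z

def fedprox_round(grad_fis, x, gamma):
    """One FedProx round (2): average the client prox points."""
    return np.mean([prox_step(g, x, gamma) for g in grad_fis], axis=0)
```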
For GD and SGD, a step size that is excessively large can result in divergence of the algorithm, whereas a step size that is too small can significantly deteriorate its convergence rate. PPM was formally introduced and popularized by the seminal paper of Rockafellar [1976] to solve variational inequality problems. In practice, the stochastic variant SPPM is more frequently used. It is known that the proximal operator applied to a proper, closed and convex function can be viewed as the projection onto some level set of the same function, depending on the value of $\gamma$. In particular, if we let each $f_i$ be the indicator function of a nonempty closed convex set $X_i$, then $\mathrm{prox}_{\gamma f_i}(\cdot)$ becomes the projection $\Pi_{X_i}(\cdot)$ onto the set $X_i$. In this case, FedProx in (2) becomes the parallel projection method for the convex feasibility problem [Censor et al., 2001, 2012, Combettes, 1997a, Necoara et al., 2019], if we additionally assume that the intersection $X := \cap_{i=1}^n X_i$ is nonempty.

A well known fact about the parallel projection method is that its empirical efficiency can often be improved by extrapolation [Combettes, 1997a, Necoara et al., 2019]. This involves moving further along the line that connects the last iterate $x_k$ and the average projection point, resulting in the iteration

$$x_{k+1} = x_k + \alpha_k \left( \frac{1}{n} \sum_{i=1}^{n} \Pi_{X_i}(x_k) - x_k \right), \qquad (3)$$

where $\alpha_k \geq 1$ defines the extrapolation level. Despite the various heuristic rules proposed over the years for setting $\alpha_k$ [Bauschke et al., 2006, Censor et al., 2001, Combettes, 1997b], which have demonstrated satisfactory practical performance, it was only recently that the theoretical foundation explaining the success of extrapolation techniques for solving convex feasibility problems was unveiled by Necoara et al. [2019], where the authors considered a randomized version of (3) named the Randomized Projection Method (RPM). The practical success of extrapolation has spurred numerous extensions of existing algorithms. Notably, Jhunjhunwala et al. [2023] combined FedAvg with extrapolation, resulting in FedExP, leveraging insights from an effective heuristic rule [Combettes, 1997b] for setting $\alpha_k$ as follows:

$$\alpha_k = \frac{\frac{1}{n}\sum_{i=1}^{n} \left\| x_k - \Pi_{X_i}(x_k) \right\|^2}{\left\| \frac{1}{n}\sum_{i=1}^{n} \left( x_k - \Pi_{X_i}(x_k) \right) \right\|^2}. \qquad (4)$$

However, the authors did not consider the case of a constant extrapolation parameter, nor did they disclose the relationship between the extrapolation parameter and the stepsize of SGD. The extrapolation parameter can be viewed as a server-side stepsize in the context of federated learning; its effectiveness was discussed by Malinovsky et al. [2023]. In the field of fixed point methods, extrapolation is also known as over-relaxation [Richardson, 1911]. It is a technique used to effectively accelerate the convergence of fixed point methods, including gradient algorithms and proximal splitting algorithms [Iutzeler and Hendrickx, 2019, Condat et al., 2023].

1.1 Contributions

Our paper contributes in the following ways; for the notation used, please refer to Appendix A.

Based on the insights gained from the convex feasibility problem, we extend FedProx to its extrapolated counterpart FedExProx for both convex and strongly¹ convex interpolation problems (see Table 1). By optimally setting the constant extrapolation parameter, we obtain iteration complexity $O\!\left(\frac{L_\gamma(1+\gamma L_{\max})\|x_0 - x_\star\|^2}{\epsilon}\right)$ in the convex case and $O\!\left(\frac{L_\gamma(1+\gamma L_{\max})}{\mu}\log\frac{1}{\epsilon}\right)$ in the strongly convex case,² when all the clients participate in the training (full participation). We reveal the dependence of the optimal extrapolation parameter on smoothness, indicating that simply averaging the iterates from local training on the server is suboptimal.
Instead, extrapolation should be applied to achieve faster convergence. Specifically, compared to FedProx with the same step size $\gamma$, our method is always at least $\frac{1}{4}\left(2 + \frac{1}{\gamma L_{\max}} + \gamma L_{\max}\right)$ times better in terms of iteration complexity; see Remark 5.

Our method, FedExProx, improves upon the worst-case iteration complexity $O\!\left(\frac{L_{\max}}{\epsilon}\right)$ of FedExP [Jhunjhunwala et al., 2023] to $O\!\left(\frac{L_\gamma(1+\gamma L_{\max})}{\epsilon}\right)$ (see Table 2). The improvement can amount to acceleration by a factor of up to $n$; see Remark 6. Furthermore, we extend FedExProx to the client partial participation setting, showing the dependence of the optimal extrapolation parameter on $\tau$, the number of clients participating in the training, and the benefits of a larger $\tau$. In particular, we show that compared to the single client setting, with complexity $O\!\left(\frac{L_{\max}}{\epsilon}\right)$, the full participation version enjoys a speed-up of up to a factor of $n$; see Remark 7.

Our theory uncovers the relationship between the extrapolation parameter and the step size in typical gradient-type methods, leveraging the power of the Moreau envelope. We also recover the RPM of Necoara et al. [2019] as a special case in our analysis (see Remark 12), and show that the heuristic outlined in (4) is in fact a step size based on gradient diversity [Horváth et al., 2022, Yin et al., 2018] for the Moreau envelopes of the client functions.

Building on the insights from Horváth et al. [2022], we propose two adaptive rules for determining the extrapolation parameter: one based on gradient diversity (FedExProx-GraDS), and one based on the stochastic Polyak step size (FedExProx-StoPS) [Horváth et al., 2022, Loizou et al., 2021]. The proposed methods eliminate reliance on the unknown smoothness constant and exhibit semi-adaptivity, meaning the algorithm converges with any local step size $\gamma$, and by selecting a sufficiently large $\gamma$, we ensure that we lose at most a factor of 2 in iteration complexity.

We validate our theory with numerical experiments. Numerical evidence suggests that FedExProx achieves a speed-up of 2x or higher in terms of iteration complexity compared to FedProx, and improved performance compared to FedExP. The framework and the plots are included in the Appendix.

1.2 Related work

Stochastic gradient descent. SGD [Robbins and Monro, 1951, Ghadimi and Lan, 2013, Gower et al., 2019, Gorbunov et al., 2020] stands as a cornerstone algorithm utilized across the fields of machine learning. In its simplest form, the algorithm is written as $x_{k+1} = x_k - \eta\, g(x_k)$, where $\eta > 0$ is a scalar step size and $g(x_k)$ represents a stochastic estimator of the true gradient $\nabla f(x_k)$. We recover GD when $g(x_k) = \nabla f(x_k)$. The evolution of SGD has been marked by significant advancements since its introduction by Robbins and Monro [1951], leading to various adaptations like stochastic batch gradient descent [Nemirovski et al., 2009] and compressed gradient descent [Alistarh et al., 2017, Khirirat et al., 2018]. Gower et al. [2019] presented a framework for analyzing SGD with arbitrary sampling strategies in the convex setting based on expected smoothness, which was later extended by Gorbunov et al. [2020] to the case of local SGD. While many methods have been crafted to leverage the stochastic nature of $g(x_k)$, substantial research efforts are also dedicated to finding a better step size.

¹ Strongly convex: $f$ is $\mu$-strongly convex.
² As we later see in Theorem 1, here $L_{\max} = \max_{i \in [n]} L_i$, where each $L_i$ is the smoothness constant of $f_i$, and $L_\gamma$ is the smoothness constant of $M^\gamma = \frac{1}{n}\sum_{i=1}^{n} M^\gamma_{f_i}$.
Table 1: General comparison of FedExP, RPM(a) and FedExProx in terms of conditions and convergence. Each entry of the table indicates whether the method has the corresponding feature or not, with a separate sign where a feature is not applicable to the method. The compared features are: does not require the interpolation regime; does not require convexity(b); acceleration in the strongly convex setting(c); does not require smoothness(d); allows for partial participation of clients(e); works with a constant extrapolation parameter; smoothness and partial participation influence extrapolation; semi-adaptivity(f).

(a) RPM refers to the randomized projection method of Necoara et al. [2019]. Our method includes it as a special case; see Remark 12.
(b) Convexity: each local objective $f_i$ is convex; in RPM it is the indicator function of the convex set $X_i$.
(c) The strong convexity pertains to $f$; for RPM, it indicates that the linear regularity condition is satisfied.
(d) Smoothness: $f_i$ is $L_i$-smooth. Our algorithm also applies in the non-smooth case; see Appendix F.2.
(e) Jhunjhunwala et al. [2023] provide no convergence guarantee for the client partial participation setting.
(f) The concept of semi-adaptivity is explained in Remark 9.

Table 2: Comparison of convergence of FedExP, FedProx, FedExProx, FedExProx-GraDS and FedExProx-StoPS. The local step size of FedExP is set to the largest possible value $1/(6tL)$ in the full batch case, where $t$ is the number of local iterations of GD performed. We assume the assumptions of Theorem 1 also hold here. The notation is introduced in Theorem 1 and Theorem 2. The convergence for our methods is described for arbitrary $\gamma > 0$. We use $K$ to denote the total number of iterations. For FedExProx, the optimal constant extrapolation parameter is used. The $O(\cdot)$ notation is hidden for all complexities in this table.

| Method (Full Participation) | General Case | Best Case | Worst Case |
|---|---|---|---|
| FedExP | $6L_{\max}\big/\sum_{k=0}^{K-1}\alpha_{k,P}$ (a) | $6L_{\max}\big/\sum_{k=0}^{K-1}\alpha_{k,P}$ | $6L_{\max}/K$ |
| FedProx | $(1+\gamma L_{\max})\big/\big(\gamma(2-\gamma L_\gamma)K\big)$ (b) | $L_{\max}\big/\big(n\gamma L_\gamma(2-\gamma L_\gamma)K\big)$ | $L_{\max}\big/\big(\gamma L_\gamma(2-\gamma L_\gamma)K\big)$ |
| FedExProx (New) | $L_\gamma(1+\gamma L_{\max})/K$ (c) | $L_{\max}/(nK)$ | $L_{\max}/K$ |
| FedExProx-GraDS (New) | $(1+\gamma L_{\max})\big/\big(\gamma\sum_{k=0}^{K-1}\alpha_{k,G}\big)$ (d) | $(1+\gamma L_{\max})\big/\big(\gamma\sum_{k=0}^{K-1}\alpha_{k,G}\big)$ | $(1+\gamma L_{\max})/(\gamma K)$ |
| FedExProx-StoPS (New) | $(1+\gamma L_{\max})\big/\big(\gamma\sum_{k=0}^{K-1}\alpha_{k,S}\big)$ (e) | $(1+\gamma L_{\max})\big/\big(\gamma\sum_{k=0}^{K-1}\alpha_{k,S}\big)$ | $(1+\gamma L_{\max})/(\gamma K)$ |

(a) The $\alpha_{k,P}$ here is determined according to the theory of Jhunjhunwala et al. [2023].
(b) Notice that we always have $\gamma L_\gamma < 1$, so the complexity of FedProx is strictly worse than that of FedExProx.
(c) We have $L_\gamma(1+\gamma L_{\max}) \leq L_{\max}$; see Remark 6.
(d) We leave out a factor of $(1+\gamma L_{\max})/(2+\gamma L_{\max})$, which is a constant between $1/2$ and $1$.
(e) See Remark 11 for a lower bound on $\alpha_{k,S}$, using which we can rewrite the rate as $O\!\left(L_\gamma(1+\gamma L_{\max})/K\right)$.

An illustration of this is the coordinate-wise adaptive step size Adagrad [Duchi et al., 2011]. Another approach involves employing a matrix step size, as demonstrated by Safaryan et al. [2021], Li et al. [2023, 2024]. Our analysis builds on the theory of SGD, mainly adapted from Gower et al. [2019], with additional consideration of the upper bound on the step size.

Stochastic proximal point method. PPM was first introduced by Rockafellar [1976] to address problems of variational inequalities. Its transition to the stochastic case, motivated by the need to efficiently solve large scale optimization problems, results in SPPM. It is often assumed that the proximity operator can be computed efficiently for the algorithm to be practical.
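For example, when $f(z) = \frac{1}{2}z^\top A z - b^\top z$ is a convex quadratic, the proximal subproblem is itself a strongly convex quadratic and the prox is available in closed form. A small sketch of this standard fact (an illustration of ours, not from the paper):

```python
import numpy as np

def prox_quadratic(A, b, x, gamma):
    """Closed-form prox of f(z) = 0.5 z^T A z - b^T z:
    the optimality condition A z - b + (z - x) / gamma = 0
    rearranges to (I + gamma A) z = x + gamma b."""
    d = A.shape[0]
    return np.linalg.solve(np.eye(d) + gamma * A, x + gamma * b)
```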
Over the years, SPPM has been the subject of extensive research, as documented by Bertsekas [2011], Bianchi [2016], Patrascu and Necoara [2018]. Unlike traditional gradient-based methods, SPPM is more robust to inaccuracies in learning rate specification, as demonstrated by Ryu and Boyd [2014]. Asi and Duchi [2019] studied APROX, which includes SPPM as the special case using the full proximal model; APROX was later extended to the minibatch case by Asi et al. [2020]. However, this extension was based on model averaging rather than iterate averaging. The convergence rate of SPPM has been analyzed in various contexts by Khaled and Jin [2022], Ryu and Boyd [2014], Yuan and Li [2022], revealing that its performance does not surpass that of SGD in non-convex regimes.

Projection onto convex sets. The projection method originated from efforts to solve systems of linear equations or linear inequalities [Kaczmarz, 1937, Von Neumann, 1949, Motzkin and Schoenberg, 1954]. Subsequently, it was generalized to address the convex feasibility problem [Combettes, 1997b]. Typically, the method involves projecting onto a set $X_i$, where $i$ is determined through sampling or other strategies. A particularly relevant method for our paper is the parallel projection method, in which individual projections onto the sets are performed in parallel, and their results are averaged in order to produce the next iterate. It is well-established experimentally that the parallel projection method can be accelerated through extrapolation, with numerous successful heuristics having been proposed to adaptively set the extrapolation parameter [Bauschke et al., 2006, Pierra, 1984]. However, only recently was a theory proposed by Necoara et al. [2019] to explain this phenomenon. Necoara et al. [2019] introduced stochastic reformulations of the convex feasibility problem and revealed how the optimal extrapolation parameter depends on the smoothness of the setting and the size of the minibatch. A better result under a linear regularity condition, which is connected to strong convexity, was also obtained. However, the explanation provided by Necoara et al. [2019] was not fully satisfactory, as it failed to clarify why adaptive rules based on gradient diversity are effective.

Moreau envelope. The concept of the Moreau envelope, also known as Moreau-Yosida regularization, was first introduced by Moreau [1965] as a mathematical tool for handling non-smooth functions. A particularly relevant property of the Moreau envelope is that executing proximal minimization algorithms on the original objective is equivalent to applying gradient methods to its Moreau envelope [Ryu and Boyd, 2014]. Based on this observation, Davis and Drusvyatskiy [2019] conducted an analysis of several methods, including SPPM for weakly convex and Lipschitz functions. The properties of the Moreau envelope and its applications have been thoroughly investigated in many works, including Jourani et al. [2014], Planiden and Wang [2016, 2019]. Beyond its role in proximal minimization algorithms, the Moreau envelope has been utilized in the contexts of personalized federated learning [T. Dinh et al., 2020] and meta-learning [Mishchenko et al., 2023].

Adaptive step size. One of the most crucial hyperparameters in training machine learning models with gradient-based methods is the step size. For GD and SGD, determining the step size often depends on the smoothness parameter, which is typically unknown, posing challenges in practical step size selection.
There has been growing interest in adaptive step sizes, leading to the development of numerous adaptive methods that enable real-time computation of the step size. Examples include Adagrad [Duchi et al., 2011], RMSProp [Hinton et al.], and Adam [Kingma and Ba, 2015]. Recently, several studies have attempted to extend the Polyak step size beyond deterministic settings, leading to the development of the stochastic Polyak step size [Richtárik and Takáč, 2020, Horváth et al., 2022, Loizou et al., 2021, Orvieto et al., 2022]. Gradient diversity, first introduced by Yin et al. [2018], was subsequently analyzed theoretically by Horváth et al. [2022].

2 Preliminaries

We now introduce the definitions and assumptions that are used throughout the paper.

Definition 1 (Proximity operator). The proximity operator of an extended-real-valued function $\phi : \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}$ with step size $\gamma > 0$ is defined as

$$\mathrm{prox}_{\gamma\phi}(x) := \arg\min_{z \in \mathbb{R}^d} \left\{ \phi(z) + \frac{1}{2\gamma}\|z - x\|^2 \right\}.$$

It is known that for a proper, closed and convex function $\phi$, the minimizer of $\phi(z) + \frac{1}{2\gamma}\|z - x\|^2$ exists and is unique. Throughout this paper, we assume that the proximal operators are evaluated exactly, with no approximation or inexactness.

Algorithm 1: Extrapolated SPPM (FedExProx) with partial client participation
1: Parameters: extrapolation parameters $\alpha_k > 0$; step size for the proximity operator $\gamma > 0$; starting point $x_0 \in \mathbb{R}^d$; number of clients $n$; total number of iterations $K$; number of clients participating in the training $\tau$ (for simplicity, we use $\tau$-nice sampling as an example)
2: for $k = 0, 1, 2, \dots, K-1$ do
3: &nbsp;&nbsp;The server samples $S_k \subseteq \{1, 2, \dots, n\}$ uniformly from all subsets of cardinality $\tau$
4: &nbsp;&nbsp;The server computes
$$x_{k+1} = x_k + \alpha_k \left( \frac{1}{\tau} \sum_{i \in S_k} \mathrm{prox}_{\gamma f_i}(x_k) - x_k \right) \qquad (7)$$

Definition 2 (Moreau envelope). The Moreau envelope of an extended-real-valued function $\phi : \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}$ with step size $\gamma > 0$ is defined as

$$M^\gamma_\phi(x) := \min_{z \in \mathbb{R}^d} \left\{ \phi(z) + \frac{1}{2\gamma}\|z - x\|^2 \right\}.$$

The following assumptions are used in our analysis. We use the notation $[n]$ for the set $\{1, \dots, n\}$.

Assumption 1 (Differentiability). The function $f_i$ in (1) is differentiable for all $i \in [n]$.

Assumption 2 (Interpolation regime). There exists $x_\star \in \mathbb{R}^d$ such that $\nabla f_i(x_\star) = 0$ for all $i \in [n]$.

Note that Assumption 2 implies that each $f_i$ and $f$ are lower bounded. In this paper, we focus on cases where the interpolation regime holds. This assumption often holds in modern deep learning models, which are overparameterized in the sense that the number of parameters greatly exceeds the number of data points, as justified by Arora et al. [2019], Montanari and Zhong [2022]. Our motivation for this assumption partly arises from the convex feasibility problem [Combettes, 1997a, Necoara et al., 2019], wherein the intersection $X$ is presumed nonempty. This is equivalent to assuming that the interpolation regime holds when $f_i$ is the indicator function of the nonempty closed convex set $X_i$. Further motivations for this assumption, derived from the proof, will be discussed later.

Assumption 3 (Convexity). The function $f_i : \mathbb{R}^d \to \mathbb{R}$ is convex for all $i \in [n]$. This means that for each $f_i$,

$$0 \leq f_i(x) - f_i(y) - \langle \nabla f_i(y), x - y \rangle, \quad \forall x, y \in \mathbb{R}^d. \qquad (5)$$

Assumption 4 (Smoothness). The function $f_i : \mathbb{R}^d \to \mathbb{R}$ is $L_i$-smooth, $L_i > 0$, for all $i \in [n]$. This means that for each $f_i$,

$$f_i(x) - f_i(y) - \langle \nabla f_i(y), x - y \rangle \leq \frac{L_i}{2}\|x - y\|^2, \quad \forall x, y \in \mathbb{R}^d. \qquad (6)$$

We will use $L_{\max}$ to denote $\max_{i \in [n]} L_i$. It is important to note that the smoothness assumption here is not necessary to obtain a convergence result; see Appendix F.2 for the details. We introduce this assumption to highlight how the optimal extrapolation parameter depends on smoothness if it is present.
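The following is a minimal sketch of Algorithm 1 in code, assuming callables `prox_fis[i](x, gamma)` that return the exact client prox points; the function names and interface are illustrative, not from the paper.

```python
import numpy as np

def fedexprox(prox_fis, x0, gamma, alpha, K, tau, rng=np.random.default_rng(0)):
    """FedExProx (Algorithm 1) with tau-nice client sampling.

    prox_fis : list of callables, prox_fis[i](x, gamma) = prox_{gamma f_i}(x)
    alpha    : extrapolation parameter (alpha = 1 recovers FedProx)
    tau      : number of clients participating per round
    """
    n, x = len(prox_fis), x0.copy()
    for _ in range(K):
        S = rng.choice(n, size=tau, replace=False)          # tau-nice sampling
        avg_prox = np.mean([prox_fis[i](x, gamma) for i in S], axis=0)
        x = x + alpha * (avg_prox - x)                      # extrapolated step (7)
    return x
```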
The following strong convexity assumption, if adopted, enables us to achieve better results.

Assumption 5 (Strong convexity). The function $f$ is $\mu$-strongly convex, $\mu > 0$. That is,

$$f(x) - f(y) - \langle \nabla f(y), x - y \rangle \geq \frac{\mu}{2}\|x - y\|^2, \quad \forall x, y \in \mathbb{R}^d.$$

We first present our algorithm FedExProx as Algorithm 1. In the subsequent sections, we first present the theory in the stochastic setting for FedExProx with a fixed extrapolation parameter in Section 3. We then proceed to adaptive versions of our algorithm, which eliminate the dependence on the unknown smoothness constant, in Section 4.

3 Constant extrapolation

In order to demonstrate the convergence result of our algorithm in the stochastic setting, we use $\tau$-nice sampling as the way of selecting clients for partial participation. This means that in each iteration, the server samples a set $S_k \subseteq \{1, 2, \dots, n\}$ uniformly at random from all subsets of size $\tau$. We want to emphasize that the sampling strategy here is merely an example; it is possible to use other client sampling strategies.

Theorem 1. Suppose Assumption 1 (Differentiability), Assumption 2 (Interpolation regime), Assumption 3 (Convexity) and Assumption 4 (Smoothness) hold. If we use a fixed extrapolation parameter $\alpha_k = \alpha \in \left(0, \frac{2}{\gamma L_{\gamma,\tau}}\right)$ and any step size $0 < \gamma < +\infty$, then the output of Algorithm 1 satisfies

$$\mathbb{E}\left[f(\bar x_K)\right] - \inf f \leq C(\gamma, \tau, \alpha)\, \frac{\|x_0 - x_\star\|^2}{K},$$

where $K$ is the number of iterations, $\bar x_K$ is sampled uniformly at random from the first $K$ iterates $\{x_0, x_1, \dots, x_{K-1}\}$, and $C(\gamma, \tau, \alpha)$ is defined as

$$C(\gamma, \tau, \alpha) := \frac{1 + \gamma L_{\max}}{\alpha\gamma\,(2 - \alpha\gamma L_{\gamma,\tau})}, \qquad L_{\gamma,\tau} := \frac{n-\tau}{\tau(n-1)} \cdot \frac{L_{\max}}{1+\gamma L_{\max}} + \frac{n(\tau-1)}{\tau(n-1)}\, L_\gamma,$$

where $L_{\max} = \max_i L_i$ and $L_\gamma$ is the smoothness constant of $M^\gamma(x) := \frac{1}{n}\sum_{i=1}^{n} M^\gamma_{f_i}(x)$. If we fix $\gamma$ and $\tau$, the optimal constant extrapolation parameter is given by $\alpha_{\gamma,\tau} := \frac{1}{\gamma L_{\gamma,\tau}} > 1$, which results in the following convergence guarantee:

$$\mathbb{E}\left[f(\bar x_K)\right] - \inf f \leq C(\gamma, \tau, \alpha_{\gamma,\tau})\, \frac{\|x_0 - x_\star\|^2}{K} = \frac{L_{\gamma,\tau}\,(1+\gamma L_{\max})\,\|x_0 - x_\star\|^2}{K}.$$

The proof of this theorem relies on a reformulation of the update rule in (7). Using the identity $\nabla M^\gamma_{f_i}(x) = \frac{1}{\gamma}\left(x - \mathrm{prox}_{\gamma f_i}(x)\right)$ given in Lemma 2, which holds for any $x \in \mathbb{R}^d$, the update can be rewritten as

$$x_{k+1} = x_k - \alpha_k \gamma \left( \frac{1}{\tau} \sum_{i \in S_k} \nabla M^\gamma_{f_i}(x_k) \right). \qquad (8)$$

We can then apply our modified theory for SGD given in Theorem 3, which is adapted from Gower et al. [2019], to obtain function value suboptimality in terms of $M^\gamma(x)$. The results are then translated back into function value suboptimality in terms of $f$. Note that (8) unveils the connection between the step size of gradient-type methods and the extrapolation parameter in our case.

Remark 1. Theorem 1 provides a convergence guarantee for Algorithm 1 in the convex case. If, in addition, we assume Assumption 5 (Strong convexity) holds, the rate can be improved and we obtain linear convergence. See Corollary 1 for the details.

Remark 2. Theorem 1 indicates convergence for any $0 < \gamma < +\infty$. Indeed, as proved in Lemma 7, $C(\gamma, \tau, \alpha_{\gamma,\tau}) = L_{\gamma,\tau}(1+\gamma L_{\max}) \leq L_{\max}$ holds for any $0 < \gamma < +\infty$. In cases where there exists at least one $L_i < L_{\max}$, we have $C(\gamma, \tau, \alpha_{\gamma,\tau}) < L_{\max}$.

Remark 3. One may question the necessity of the interpolation regime assumption. This assumption is crucial to our analysis. Besides allowing us to revisit the convex feasibility problem setting, it also guarantees that $M^\gamma(x)$ has the same set of minimizers as $f(x)$, as illustrated by Lemma 8. It also allows us to improve the upper bound on the step size by a factor of 2 in the SGD theory, as demonstrated in Theorem 3 in the Appendix.
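As a quick numerical illustration of the reformulation (8) (our example, not from the paper), consider quadratics $f_i(z) = \frac{L_i}{2}\|z\|^2$, for which both the prox and the Moreau envelope are available in closed form:

```python
import numpy as np

# f_i(z) = (L_i / 2) ||z||^2, so prox_{gamma f_i}(x) = x / (1 + gamma L_i)
# and M^gamma_{f_i}(x) = L_i / (2 (1 + gamma L_i)) ||x||^2.
L, gamma = 3.0, 0.5
x = np.array([1.0, -2.0])

prox = x / (1 + gamma * L)
grad_moreau = (L / (1 + gamma * L)) * x          # nabla M^gamma_{f_i}(x)

# Identity of Lemma 2: nabla M^gamma_{f_i}(x) = (x - prox) / gamma
assert np.allclose(grad_moreau, (x - prox) / gamma)

# Hence one FedExProx step (7) equals a gradient step (8) on the Moreau envelope:
alpha = 1.7
step_prox = x + alpha * (prox - x)
step_env  = x - alpha * gamma * grad_moreau
assert np.allclose(step_prox, step_env)
```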
Remark 4. From the reformulation presented in (8), we see that the best extrapolation parameter is obtained when $\alpha_k\gamma$ is the best step size for SGD running on the global objective $M^\gamma(x)$. Since the best step size is affected by the smoothness and the minibatch size, so is the best extrapolation parameter.

We can also compare our algorithm with FedProx in the convex overparameterized regime.

Remark 5. Our algorithm includes FedProx as a special case when $\alpha = 1$. To recover its result, we simply plug in $\alpha = 1$; the resulting condition number is $C(\gamma, \tau, 1) = \frac{1+\gamma L_{\max}}{\gamma(2 - \gamma L_{\gamma,\tau})}$. Compared to FedProx, Algorithm 1 with the same $\gamma > 0$ demonstrates superior performance, with the acceleration factor being quantified by

$$\frac{C(\gamma, \tau, 1)}{C(\gamma, \tau, \alpha_{\gamma,\tau})} \geq \frac{2 + \frac{1}{\gamma L_{\max}} + \gamma L_{\max}}{4} \geq 1.$$

See Lemma 14 for the proof. This suggests that the approach of the server averaging all iterates following local computation is suboptimal.

In the following paragraphs, we study some special cases.

Full participation case. For the full participation case ($\tau = n$), using the definitions from Theorem 1,

$$\alpha_{\gamma,n} = \frac{1}{\gamma L_\gamma} > 1, \qquad L_{\gamma,n} = L_\gamma, \qquad C(\gamma, n, \alpha_{\gamma,n}) = L_\gamma(1+\gamma L_{\max}) \leq L_{\max}. \qquad (9)$$

In this case, we can compare our method with FedExP in the convex overparameterized setting.

Remark 6. Assume the conditions of Theorem 1 hold. The worst-case iteration complexity of FedExP is $O\!\left(\frac{L_{\max}}{\epsilon}\right)$, while for Algorithm 1 it is $O\!\left(\frac{C(\gamma,n,\alpha_{\gamma,n})}{\epsilon}\right)$. As suggested by Lemma 7, Algorithm 1 has a better iteration complexity ($C(\gamma, n, \alpha_{\gamma,n}) < L_{\max}$) whenever $L_i \neq L_{\max}$ for some $i \in [n]$, and the acceleration can reach a factor of up to $n$, as suggested by Example 1. In general, the worst-case speed-up is quantified by

$$1 \leq \frac{L_{\max}}{C(\gamma, n, \alpha_{\gamma,n})} \leq n, \qquad \text{since} \qquad \frac{L_{\max}}{n(1+\gamma L_{\max})} \leq L_\gamma \leq \frac{L_{\max}}{1+\gamma L_{\max}}.$$

Single client case. For the single client case ($\tau = 1$), using the definitions from Theorem 1,

$$\alpha_{\gamma,1} = 1 + \frac{1}{\gamma L_{\max}} > 1, \qquad L_{\gamma,1} = \frac{L_{\max}}{1+\gamma L_{\max}}, \qquad C(\gamma, 1, \alpha_{\gamma,1}) = L_{\max}.$$

Remark 7. Comparing full and partial client participation, the following relations hold for any $\tau \in [n]$:

$$C(\gamma, n, \alpha_{\gamma,n}) \leq C(\gamma, \tau, \alpha_{\gamma,\tau}) \leq C(\gamma, 1, \alpha_{\gamma,1}) \qquad \text{and} \qquad \alpha_{\gamma,1} \leq \alpha_{\gamma,\tau} \leq \alpha_{\gamma,n}.$$

Since the iteration complexity of FedExProx is $O\!\left(\frac{C(\gamma,\tau,\alpha_{\gamma,\tau})}{\epsilon}\right)$, the above inequalities tell us that a larger client minibatch size $\tau$ leads to a larger extrapolation and a better iteration complexity. Specifically, Lemma 7 suggests the improvement over the single client case can be as much as a factor of $n$ ($C(\gamma, n, \alpha_{\gamma,n}) = \frac{1}{n} C(\gamma, 1, \alpha_{\gamma,1})$), as suggested by Example 1.

4 Adaptive extrapolation

Observe that in Theorem 1, in order to determine the optimal extrapolation, we require knowledge of $L_{\gamma,\tau}$, which is typically unknown. Although the theory suggests that simply averaging the iterates may result in suboptimal performance, in practice this implication is less actionable. To address this issue, we introduce two variants of FedExProx, based on gradient diversity and the stochastic Polyak step size, given their relation to the extrapolation parameter in our case.

Theorem 2. Suppose Assumption 1 (Differentiability), Assumption 2 (Interpolation regime), Assumption 3 (Convexity) and Assumption 4 (Smoothness) hold.

(i) (FedExProx-GraDS): If we use $\alpha_k = \alpha_{k,G}$, where

$$\alpha_{k,G} := \frac{\frac{1}{n}\sum_{i=1}^{n}\left\|x_k - \mathrm{prox}_{\gamma f_i}(x_k)\right\|^2}{\left\|\frac{1}{n}\sum_{i=1}^{n}\left(x_k - \mathrm{prox}_{\gamma f_i}(x_k)\right)\right\|^2} \geq 1, \qquad (10)$$

then the iterates of Algorithm 1 with $\tau = n$ satisfy

$$\mathbb{E}\left[f(\bar x_K)\right] - \inf f \leq \frac{1+\gamma L_{\max}}{2+\gamma L_{\max}} \cdot \frac{(1+\gamma L_{\max})\,\|x_0 - x_\star\|^2}{\gamma \sum_{k=0}^{K-1}\alpha_{k,G}},$$

where $\bar x_K$ is chosen randomly from the first $K$ iterates $\{x_0, x_1, \dots, x_{K-1}\}$ with probabilities $p_k = \alpha_{k,G}\big/\sum_{k=0}^{K-1}\alpha_{k,G}$.
(ii) (FedExProx-StoPS): If we use $\alpha_k = \alpha_{k,S}$, where

$$\alpha_{k,S} := \frac{\frac{1}{n}\sum_{i=1}^{n}\left(M^\gamma_{f_i}(x_k) - \inf M^\gamma_{f_i}\right)}{\gamma\left\|\frac{1}{n}\sum_{i=1}^{n}\nabla M^\gamma_{f_i}(x_k)\right\|^2} \geq \frac{1}{2\gamma L_\gamma}, \qquad (11)$$

then the iterates of Algorithm 1 with $\tau = n$ satisfy

$$\mathbb{E}\left[f(\bar x_K)\right] - \inf f \leq \frac{(1+\gamma L_{\max})\,\|x_0 - x_\star\|^2}{\gamma \sum_{k=0}^{K-1}\alpha_{k,S}}, \qquad (12)$$

where $\bar x_K$ is chosen randomly from the first $K$ iterates $\{x_0, x_1, \dots, x_{K-1}\}$ with probabilities $p_k = \alpha_{k,S}\big/\sum_{k=0}^{K-1}\alpha_{k,S}$.

Theorem 2 describes the convergence in the full participation setting. However, we can also extend it to the stochastic setting by implementing a stochastic version of these adaptive step size rules for gradient-based methods [Horváth et al., 2022, Loizou et al., 2021]. See Theorem 5 in the Appendix for the details.

Remark 8. In fact, the adaptive rule based on gradient diversity can be improved by using $\frac{L_{\max}}{1+\gamma L_{\max}}$ instead of $\frac{1}{\gamma}$ as the maximum of the local smoothness constants of the Moreau envelopes, resulting in the extrapolation

$$\alpha_k = \alpha^*_{k,G} := \frac{1+\gamma L_{\max}}{\gamma L_{\max}} \cdot \frac{\frac{1}{n}\sum_{i=1}^{n}\left\|x_k - \mathrm{prox}_{\gamma f_i}(x_k)\right\|^2}{\left\|\frac{1}{n}\sum_{i=1}^{n}\left(x_k - \mathrm{prox}_{\gamma f_i}(x_k)\right)\right\|^2}. \qquad (13)$$

One can obtain a slightly better convergence guarantee than in the FedExProx-GraDS case of Theorem 2; see Corollary 2 in the Appendix. However, this requires knowledge of $L_{\max}$ in order to compute $\frac{1+\gamma L_{\max}}{\gamma L_{\max}}$.

Remark 9. Note that, compared to classical gradient-based methods, FedExProx-GraDS benefits from semi-adaptivity. This refers to the fact that the algorithm converges for any choice of $\gamma > 0$. Although a smaller $\gamma$ hinders convergence, setting it to at least $\frac{1}{L_{\max}}$ limits the worsening of the convergence to a factor of 2.

Remark 10. Compared to FedExProx with the optimal constant extrapolation parameter, we gain semi-adaptivity here by using the gradient diversity based extrapolation. However, this results in losing the favorable dependence of the convergence on $L_\gamma$, and instead establishes a dependence on $L_{\max}$.

Remark 11. For FedExProx-StoPS, as suggested by Lemma 20, the convergence depends on the favorable smoothness constant $L_\gamma$, rather than on $L_{\max}$. However, this comes at the price of having to know the minimum of each individual Moreau envelope.

For a detailed discussion of the adaptive variants of FedExProx, we refer the reader to Appendix F.5. Since one of our starting points is the RPM of Necoara et al. [2019] for solving the convex feasibility problem with non-smooth local objectives, we have also adapted our method to the non-smooth case, as detailed in Theorem 4 in the Appendix. We also provide a discussion of our method in the non-interpolated setting and in the non-convex setting in Appendix F. Finally, we support our findings with experiments; see Figure 1 for a simple experiment confirming that FedExProx indeed has a better iteration complexity than FedProx. For more details on the experiments, we refer the reader to Appendix I.

Notice that in practice, each local proximity operator can be solved using different oracles. Clients may use GD or SGD to solve the local problem to a certain accuracy. The complexity of this subroutine depends on the local step size. If $\gamma$ is large, the local problem becomes harder to solve because we aim to minimize the local objective itself. Conversely, if it is small, the problem is easier since we do not stray far from the current iterate. As the choice of subroutine affects local computation complexity, comparing it directly with FedExP becomes complicated. Therefore, we compare the iteration complexity (number of communication rounds) of the two algorithms, assuming efficient local computations are carried out by the clients.
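Before turning to the experiments, here is a minimal sketch (our illustration, not the authors' implementation) of how the server could evaluate the two adaptive rules (10) and (11) from the prox points returned by the clients; for StoPS, the per-client envelope gaps $M^\gamma_{f_i}(x_k) - \inf M^\gamma_{f_i}$ are assumed to be communicated by the clients, since, as Remark 11 notes, the envelope minima must be known.

```python
import numpy as np

def grads_extrapolation(x, proxes):
    """alpha_{k,G} in (10): gradient diversity of the prox displacements."""
    d = [x - p for p in proxes]
    return np.mean([np.linalg.norm(di) ** 2 for di in d]) \
        / np.linalg.norm(np.mean(d, axis=0)) ** 2

def stops_extrapolation(x, proxes, envelope_gaps, gamma):
    """alpha_{k,S} in (11); envelope_gaps[i] = M_{f_i}(x) - inf M_{f_i},
    assumed to be reported by the clients."""
    d = [x - p for p in proxes]                 # = gamma * nabla M_{f_i}(x)
    grad_norm_sq = np.linalg.norm(np.mean(d, axis=0) / gamma) ** 2
    return np.mean(envelope_gaps) / (gamma * grad_norm_sq)
```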
[Figure 1: three panels plotting $f(x_k) - f(x_\star)$ against iterations (0 to 10000), for $\gamma = 0.001$ ($\alpha_{\gamma,n} = 2.23763$), $\gamma = 0.1$ ($\alpha_{\gamma,n} = 2.01799$) and $\gamma = 10$ ($\alpha_{\gamma,n} = 2.01576$), each comparing FedProx and FedExProx.]

Figure 1: Comparison of FedExProx and FedProx in terms of iteration complexity in the full participation setting. The notation $\gamma$ here denotes the local step size of the proximity operator, and $\alpha_{\gamma,n}$ is the corresponding optimal extrapolation parameter computed in (9) for the full participation case. In all cases, our proposed algorithm outperforms FedProx, suggesting that the practice of simply averaging the iterates is suboptimal.

5 Conclusion

5.1 Limitations

Our analysis of FedExProx serves as an initial step in adding extrapolation to FedProx, which currently relies on the suboptimal practice of the server merely averaging the iterates. While we discuss the behavior of our algorithm in non-interpolated and non-convex scenarios, our analysis only validates the effectiveness of extrapolation under the interpolation regime assumption.

5.2 Future Work

As we have just mentioned, extending our method and analysis beyond the interpolation and convex regimes is intriguing. In this case, new techniques may be needed for variance reduction. It is also interesting to investigate whether extrapolation can be applied together with client-specific personalization.

Acknowledgements

The research reported in this publication was supported by funding from King Abdullah University of Science and Technology (KAUST): i) KAUST Baseline Research Scheme, ii) Center of Excellence for Generative AI, under award number 5940, iii) SDAIA-KAUST Center of Excellence in Artificial Intelligence and Data Science. The work was done during Kirill Acharya's internship at KAUST. Kirill Acharya was also affiliated with MIPT, ISP RAS, Russia. The work of Kirill Acharya was also supported by Grant App. No. 2 to Agreement No. 075-03-2024-214.

References

D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic. QSGD: Communication-efficient SGD via gradient quantization and encoding. Advances in Neural Information Processing Systems, 30, 2017.

S. Arora, S. Du, W. Hu, Z. Li, and R. Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. In International Conference on Machine Learning, pages 322–332. PMLR, 2019.

H. Asi and J. C. Duchi. Stochastic (approximate) proximal point methods: Convergence, optimality, and adaptivity. SIAM Journal on Optimization, 29(3):2257–2290, 2019.

H. Asi, K. Chadha, G. Cheng, and J. C. Duchi. Minibatch stochastic approximate proximal point methods. Advances in Neural Information Processing Systems, 33:21958–21968, 2020.

H. H. Bauschke, P. L. Combettes, and S. G. Kruk. Extrapolation algorithm for affine-convex feasibility problems. Numerical Algorithms, 41:239–274, 2006.

A. Beck. First-order methods in optimization. SIAM, 2017.

D. P. Bertsekas. Incremental proximal methods for large scale convex optimization. Mathematical Programming, 129(2):163–195, 2011.

P. Bianchi. Ergodic convergence of a stochastic proximal point algorithm. SIAM Journal on Optimization, 26(4):2235–2260, 2016.

A. Böhm and S. J. Wright. Variable smoothing for weakly convex composite functions. Journal of Optimization Theory and Applications, 188:628–649, 2021.

S. Bubeck et al. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231–357, 2015.
Y. Censor, T. Elfving, and G. Herman. Averaging strings of sequential iterations for convex feasibility problems. In Studies in Computational Mathematics, volume 8, pages 101–113. Elsevier, 2001.

Y. Censor, W. Chen, P. L. Combettes, R. Davidi, and G. T. Herman. On the effectiveness of projection methods for convex feasibility problems with linear inequality constraints. Computational Optimization and Applications, 51:1065–1088, 2012.

P. L. Combettes. Convex set theoretic image recovery by extrapolated iterations of parallel subgradient projections. IEEE Transactions on Image Processing, 6(4):493–506, 1997a.

P. L. Combettes. Hilbertian convex feasibility problem: Convergence of projection methods. Applied Mathematics and Optimization, 35(3):311–330, 1997b.

L. Condat, D. Kitahara, A. Contreras, and A. Hirabayashi. Proximal splitting algorithms for convex optimization: A tour of recent advances, with new twists. SIAM Review, 65(2):375–435, 2023.

D. Davis and D. Drusvyatskiy. Stochastic model-based minimization of weakly convex functions. SIAM Journal on Optimization, 29(1):207–239, 2019.

J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(7), 2011.

S. Ghadimi and G. Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.

E. Gorbunov, F. Hanzely, and P. Richtárik. A unified theory of SGD: Variance reduction, sampling, quantization and coordinate descent. In International Conference on Artificial Intelligence and Statistics, pages 680–690. PMLR, 2020.

R. M. Gower, N. Loizou, X. Qian, A. Sailanbayev, E. Shulgin, and P. Richtárik. SGD: General analysis and improved rates. In International Conference on Machine Learning, pages 5200–5209. PMLR, 2019.

G. Hinton, N. Srivastava, and K. Swersky. Neural networks for machine learning, lecture 6a: Overview of mini-batch gradient descent.

S. Horváth, K. Mishchenko, and P. Richtárik. Adaptive learning rates for faster stochastic gradient methods. arXiv preprint arXiv:2208.05287, 2022.

F. Iutzeler and J. M. Hendrickx. A generic online acceleration scheme for optimization algorithms via relaxation and inertia. Optimization Methods and Software, 34(2):383–405, 2019.

D. Jhunjhunwala, S. Wang, and G. Joshi. FedExP: Speeding up federated averaging via extrapolation. In International Conference on Learning Representations, 2023.

A. Jourani, L. Thibault, and D. Zagrodny. Differential properties of the Moreau envelope. Journal of Functional Analysis, 266(3):1185–1237, 2014.

S. Kaczmarz. Approximate solution of systems of linear equations. International Journal of Control, 57(6):1269–1271, 1937.

A. Khaled and C. Jin. Faster federated optimization under second-order similarity. In The Eleventh International Conference on Learning Representations, 2022.

A. Khaled and P. Richtárik. Better theory for SGD in the nonconvex world. Transactions on Machine Learning Research, 2023.

S. Khirirat, H. R. Feyzmahdavian, and M. Johansson. Distributed learning with compressed gradients. arXiv preprint arXiv:1806.06573, 2018.

D. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, San Diego, CA, USA, 2015.

J. Konečný, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 8, 2016.
H. Li, A. Karagulyan, and P. Richtárik. Marina meets matrix stepsizes: Variance reduced distributed non-convex optimization. arXiv preprint arXiv:2310.04614, 2023.

H. Li, A. Karagulyan, and P. Richtárik. Det-CGD: Compressed gradient descent with matrix stepsizes for non-convex optimization. In International Conference on Learning Representations, 2024.

T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith. Federated optimization in heterogeneous networks. Proceedings of Machine Learning and Systems, 2:429–450, 2020.

N. Loizou, S. Vaswani, I. H. Laradji, and S. Lacoste-Julien. Stochastic Polyak step-size for SGD: An adaptive learning rate for fast convergence. In International Conference on Artificial Intelligence and Statistics, pages 1306–1314. PMLR, 2021.

G. Malinovsky, K. Mishchenko, and P. Richtárik. Server-side stepsizes and sampling without replacement provably help in federated optimization. In Proceedings of the 4th International Workshop on Distributed Machine Learning, DistributedML '23, pages 85–104, 2023.

O. L. Mangasarian and M. V. Solodov. Backpropagation convergence via deterministic nonmonotone perturbed minimization. Advances in Neural Information Processing Systems, 6, 1993.

B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pages 1273–1282. PMLR, 2017.

K. Mishchenko, S. Hanzely, and P. Richtárik. Convergence of first-order algorithms for meta-learning with Moreau envelopes. arXiv preprint arXiv:2301.06806, 2023.

A. Montanari and Y. Zhong. The interpolation phase transition in neural networks: Memorization and generalization under lazy training. The Annals of Statistics, 50(5):2816–2847, 2022.

J.-J. Moreau. Proximité et dualité dans un espace Hilbertien. Bulletin de la Société Mathématique de France, 93:273–299, 1965.

T. S. Motzkin and I. J. Schoenberg. The relaxation method for linear inequalities. Canadian Journal of Mathematics, 6:393–404, 1954.

E. Moulines and F. Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. Advances in Neural Information Processing Systems, 24, 2011.

I. Necoara, P. Richtárik, and A. Patrascu. Randomized projection methods for convex feasibility: Conditioning and convergence rates. SIAM Journal on Optimization, 29(4):2814–2852, 2019.

A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.

A. Orvieto, S. Lacoste-Julien, and N. Loizou. Dynamics of SGD with stochastic Polyak stepsizes: Truly adaptive variants and convergence to exact solution. Advances in Neural Information Processing Systems, 35:26943–26954, 2022.

N. Parikh, S. Boyd, et al. Proximal algorithms. Foundations and Trends in Optimization, 1(3):127–239, 2014.

A. Patrascu and I. Necoara. Nonasymptotic convergence of stochastic proximal point methods for constrained convex optimization. Journal of Machine Learning Research, 18(198):1–42, 2018.

G. Pierra. Decomposition through formalization in a product space. Mathematical Programming, 28:96–115, 1984.

C. Planiden and X. Wang. Strongly convex functions, Moreau envelopes, and the generic nature of convex functions with strong minimizers. SIAM Journal on Optimization, 26(2):1341–1364, 2016.

C. Planiden and X. Wang. Proximal mappings and Moreau envelopes of single-variable convex piecewise cubic functions and multivariable gauge functions. In Nonsmooth Optimization and Its Applications, pages 89–130, 2019.
L. F. Richardson. The approximate arithmetical solution by finite differences of physical problems involving differential equations, with an application to the stresses in a masonry dam. R. Soc. London Phil. Trans. A, 210:307–357, 1911.

P. Richtárik and M. Takáč. Stochastic reformulations of linear systems: algorithms and convergence theory. SIAM Journal on Matrix Analysis and Applications, 41(2):487–524, 2020.

H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.

R. T. Rockafellar. Monotone operators and the proximal point algorithm. SIAM Journal on Control and Optimization, 14(5):877–898, 1976.

E. K. Ryu and S. Boyd. Stochastic proximal iteration: a non-asymptotic improvement upon stochastic gradient descent. Author website, early draft, 2014.

A. Sadiev, G. Malinovsky, E. Gorbunov, I. Sokolov, A. Khaled, K. Burlachenko, and P. Richtárik. Federated optimization algorithms with random reshuffling and gradient compression. arXiv preprint arXiv:2206.07021, 2022.

M. Safaryan, F. Hanzely, and P. Richtárik. Smoothness matrices beat smoothness constants: Better communication compression techniques for distributed optimization. Advances in Neural Information Processing Systems, 34:25688–25702, 2021.

C. T. Dinh, N. Tran, and J. Nguyen. Personalized federated learning with Moreau envelopes. Advances in Neural Information Processing Systems, 33:21394–21405, 2020.

J. Von Neumann. On rings of operators. Reduction theory. Annals of Mathematics, pages 401–485, 1949.

D. Yin, A. Pananjady, M. Lam, D. Papailiopoulos, K. Ramchandran, and P. Bartlett. Gradient diversity: a key ingredient for scalable distributed learning. In International Conference on Artificial Intelligence and Statistics, pages 1998–2007. PMLR, 2018.

Y. Yu, X. Zheng, M. Marchetti-Bowick, and E. Xing. Minimizing nonconvex non-separable functions. In Artificial Intelligence and Statistics, pages 1107–1115. PMLR, 2015.

X. Yuan and P. Li. On convergence of FedProx: Local dissimilarity invariant bounds, non-smoothness and beyond. Advances in Neural Information Processing Systems, 35:10752–10765, 2022.

A Notations
B Basic Facts
C Properties of Moreau envelope
D Technical lemmas
E Theory of SGD
F Additional analysis on FedExProx
  F.1 FedExProx in the strongly convex case
  F.2 FedExProx in the non-smooth case
  F.3 Discussion on the non-interpolation case
  F.4 Discussion on the non-convex case
  F.5 Additional notes on adaptive variants
  F.6 Extension of adaptive variants into client partial participation (PP) setting
G Missing proofs of theorems and corollaries
  G.1 Proof of Theorem 1
  G.2 Proof of Theorem 2
  G.3 Proof of Theorem 3
  G.4 Proof of Theorem 4
  G.5 Proof of Theorem 5
  G.6 Proof of Corollary 1
  G.7 Proof of Corollary 2
H Missing proofs of lemmas
  H.1 Proof of Lemma 1
  H.2 Proof of Lemma 2
  H.3 Proof of Lemma 3
  H.4 Proof of Lemma 4
  H.5 Proof of Lemma 5
  H.6 Proof of Lemma 6
  H.7 Proof of Lemma 7
  H.8 Proof of Lemma 8
  H.9 Proof of Lemma 9
  H.10 Proof of Lemma 10
  H.11 Proof of Lemma 11
  H.12 Proof of Lemma 12
  H.13 Proof of Lemma 13
  H.14 Proof of Lemma 14
  H.15 Proof of Lemma 15
  H.16 Proof of Lemma 16
  H.17 Proof of Lemma 17
  H.18 Proof of Lemma 18
  H.19 Proof of Lemma 19
  H.20 Proof of Lemma 20
I Experiments
  I.1 Experiment settings
  I.2 Large dimension regime
    I.2.1 Comparison of FedExProx and FedProx
    I.2.2 Comparison of FedExProx with different local step sizes
    I.2.3 Comparison of FedExProx and its adaptive variants

A Notations

Throughout the paper, we use the notation $\|\cdot\|$ to denote the standard Euclidean norm on $\mathbb{R}^d$ and $\langle\cdot,\cdot\rangle$ to denote the standard Euclidean inner product. Given a differentiable function $f : \mathbb{R}^d \to \mathbb{R}$, its gradient is denoted by $\nabla f(x)$. For a convex function $f : \mathbb{R}^d \to \mathbb{R}$, we use $\partial f(x)$ to denote its subdifferential at $x$. We use the notation $D_f(x, y)$ to denote the Bregman divergence associated with a function $f : \mathbb{R}^d \to \mathbb{R}$ between $x$ and $y$. The notation $\inf f$ is used to denote the minimum of a function $f : \mathbb{R}^d \to \mathbb{R}$. We use $\mathrm{prox}_{\gamma f}(x)$ to denote the proximity operator of a function $f : \mathbb{R}^d \to \mathbb{R}$ with $\gamma > 0$ at $x \in \mathbb{R}^d$, and $M^\gamma_f(x)$ to denote the corresponding Moreau envelope. The notation $\square$ is used for the infimal convolution of two proper functions. We denote the average of the Moreau envelopes of the local objectives $f_i$ by $M^\gamma : \mathbb{R}^d \to \mathbb{R}$; specifically, we define $M^\gamma(x) = \frac{1}{n}\sum_{i=1}^{n} M^\gamma_{f_i}(x)$. Note that $M^\gamma(x)$ has an implicit dependence on $\gamma$; its smoothness constant is denoted by $L_\gamma$.
We say an extended real-valued function $f : \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}$ is proper if there exists $x \in \mathbb{R}^d$ such that $f(x) < +\infty$. We say an extended real-valued function $f : \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}$ is closed if its epigraph is a closed set. The following Table 3 summarizes the commonly used notations and quantities appearing in this paper.

Table 3: Summary of frequently used notations and quantities in this paper.

| Notation | Explanation |
|---|---|
| $n$ | The total number of clients. |
| $d$ | The dimension of the model. |
| $x$ | The model, which belongs to $\mathbb{R}^d$. |
| $K$ | The total number of iterations. |
| $x_k$ | The model at the $k$-th iteration. |
| $\alpha_k$ | The extrapolation parameter at iteration $k$. |
| $f_i(x)$ | Each local objective function. |
| $\gamma$ | The step size in the proximity operator. |
| $f(x)$ | The global objective $f$. |
| $\mathrm{prox}_{\gamma f_i}(x)$ | The proximity operator associated with $f_i$ and $\gamma > 0$ at $x \in \mathbb{R}^d$. |
| $M^\gamma_{f_i}(x)$ | The Moreau envelope associated with $f_i$ and $\gamma > 0$ at $x \in \mathbb{R}^d$. |
| $M^\gamma(x)$ | The average of the $M^\gamma_{f_i}(x)$. |
| $L_i$ | The smoothness constant of $f_i$. |
| $L$ | The smoothness constant of $f$. |
| $\mu$ | The strong convexity constant of $f$. |
| $L_i/(1+\gamma L_i)$ | The smoothness constant of $M^\gamma_{f_i}$. |
| $L_{\max}$ | The maximum of all $L_i$, $i \in [n]$. |
| $L_{\max}/(1+\gamma L_{\max})$ | The maximum of the smoothness constants of the $M^\gamma_{f_i}$, $i \in [n]$. |
| $L_\gamma$ | The smoothness constant of $M^\gamma$. |
| $L_{\gamma,\tau}$ | The interpolation between $L_\gamma$ and $L_{\max}/(1+\gamma L_{\max})$ induced by $\tau$-nice sampling. |
| $\alpha_{\gamma,\tau}$ | The optimal extrapolation parameter of FedExProx under $\tau$-nice sampling. |
| $C(\gamma, \tau, \alpha)$ | The convergence rate of FedExProx with $\tau$-nice sampling in the convex case. |
| $\alpha_{k,G}$ | The gradient diversity extrapolation at iteration $k$, defined in Theorem 2. |
| $\alpha_{k,S}$ | The stochastic Polyak extrapolation at iteration $k$, defined in Theorem 2. |
| $\alpha^*_{k,G}$ | The improved gradient diversity based extrapolation used in Corollary 2. |
| $D_f(x, y)$ | The Bregman divergence associated with $f$ between two points $x, y \in \mathbb{R}^d$. |
| $S_k$ | The set of indices the server sampled at the $k$-th iteration. |
| $\alpha_{\tau,k,G}$ | The gradient diversity based extrapolation at iteration $k$ for FedExProx-GraDS-PP. |
| $\alpha_{\tau,k,S}$ | The stochastic Polyak based extrapolation at iteration $k$ for FedExProx-StoPS-PP. |

B Basic Facts

Fact 1 (First prox theorem). [Beck, 2017, Theorem 6.3] Let $f : \mathbb{R}^d \to \mathbb{R}$ be a proper, closed and convex function. Then $\mathrm{prox}_f(x)$ is a singleton for any $x \in \mathbb{R}^d$.

Fact 2 (Second prox theorem). [Beck, 2017, Theorem 6.39] Let $f : \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}$ be a proper, closed and convex function. Then for any $x, u \in \mathbb{R}^d$, the following three claims are equivalent: (i) $u = \mathrm{prox}_f(x)$; (ii) $x - u \in \partial f(u)$; (iii) $\langle x - u, y - u \rangle \leq f(y) - f(u)$ for any $y \in \mathbb{R}^d$.

Fact 3 (Bregman divergence). The Bregman divergence associated with a function $f$ between $x, y \in \mathbb{R}^d$ is defined as

$$D_f(x, y) := f(x) - f(y) - \langle \nabla f(y), x - y \rangle. \qquad (14)$$

If $f$ is convex, then for any $x, y \in \mathbb{R}^d$,

$$D_f(x, y) \geq 0. \qquad (15)$$

If $f$ is convex, $L$-smooth and differentiable, the following inequalities hold for any $x, y \in \mathbb{R}^d$:

$$\frac{1}{L}\left\|\nabla f(x) - \nabla f(y)\right\|^2 \leq D_f(x, y) + D_f(y, x) \leq L\|x - y\|^2, \qquad \frac{1}{L}\left\|\nabla f(x) - \nabla f(y)\right\|^2 \leq 2 D_f(x, y) \leq L\|x - y\|^2. \qquad (16)$$

Fact 4 (Increasing function). Let $f(x) = \frac{x}{1+\gamma x}$, where $\gamma > 0$. Then $f(x)$ is monotone increasing for $x > 0$.

C Properties of Moreau envelope

In this section, we explore the properties of the Moreau envelopes of the individual functions $f_i$, and of the global objective $M^\gamma = \frac{1}{n}\sum_{i=1}^{n} M^\gamma_{f_i}$. Before that, we present the definition of infimal convolution.

Definition 3 (Infimal convolution). The infimal convolution of two proper functions $f, g : \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}$ is defined via the formula

$$(f \,\square\, g)(x) = \min_{z \in \mathbb{R}^d} \left\{ f(z) + g(x - z) \right\}.$$

One key observation is that $M^\gamma_f$ can be viewed as the infimal convolution of the proper, closed and convex function $f$ and the real-valued convex function $\frac{1}{2\gamma}\|\cdot\|^2$. This observation enables us to infer the convexity and smoothness of the Moreau envelope from the properties of the original function.
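A classical one-dimensional illustration (a standard fact, included here for concreteness, not taken from the paper): for $f = |\cdot|$, the prox is soft-thresholding and the Moreau envelope is the Huber function.

```python
import numpy as np

def prox_abs(x, gamma):
    """prox of f = |.| : soft-thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)

def moreau_abs(x, gamma):
    """Moreau envelope of |.| : the Huber function."""
    return np.where(np.abs(x) <= gamma,
                    x ** 2 / (2 * gamma),
                    np.abs(x) - gamma / 2)

x, gamma = np.linspace(-3, 3, 7), 1.0
# The envelope is a smooth (1/gamma-smooth) global lower bound on |x| (cf. Lemma 6)
assert np.all(moreau_abs(x, gamma) <= np.abs(x) + 1e-12)
```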
First, we present two lemmas about basic properties of the Moreau envelope.

Lemma 1 (Real-valuedness). Let $f : \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}$ be a proper, closed and convex function. Then its Moreau envelope $M^\gamma_f$ for any $\gamma > 0$ is a real-valued function. In particular, the following identity holds for $x \in \mathbb{R}^d$, according to the definition of the Moreau envelope:

$$M^\gamma_f(x) = f\left(\mathrm{prox}_{\gamma f}(x)\right) + \frac{1}{2\gamma}\left\|x - \mathrm{prox}_{\gamma f}(x)\right\|^2.$$

Lemma 2 (Differentiability of Moreau envelope). [Beck, 2017, Theorem 6.60] Let $f : \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}$ be a proper, closed and convex function. Then its Moreau envelope $M^\gamma_f$ for any $\gamma > 0$ is $\frac{1}{\gamma}$-smooth, and for any $x \in \mathbb{R}^d$ we have

$$\nabla M^\gamma_f(x) = \frac{1}{\gamma}\left(x - \mathrm{prox}_{\gamma f}(x)\right).$$

We then focus on the relation between an individual $f_i$ and $M^\gamma_{f_i}$. The following lemma shows that the convexity of $f_i$ guarantees the convexity of $M^\gamma_{f_i}$.

Lemma 3 (Convexity of Moreau envelope). [Beck, 2017, Theorem 6.55] Let $f : \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}$ be a proper and convex function. Then $M^\gamma_f$ is a convex function.

It is also true that the smoothness of $f_i$ implies the smoothness of $M^\gamma_{f_i}$.

Lemma 4 (Smoothness of Moreau envelope). Let $f : \mathbb{R}^d \to \mathbb{R}$ be a convex and $L$-smooth function. Then $M^\gamma_f$ is $\frac{L}{1+\gamma L}$-smooth.

One notable fact is that $f_i$ and $M^\gamma_{f_i}$ have the same set of minimizers.

Lemma 5 (Minimizer equivalence). Let $f : \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}$ be a proper, closed and convex function. Then for any $\gamma > 0$, $f$ and $M^\gamma_f$ have the same set of minimizers.

In addition, $M^\gamma_f$ is a global lower bound on $f$.

Lemma 6 (Individual lower bound). Let $f : \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}$ be a proper, closed and convex function. Then the Moreau envelope satisfies $M^\gamma_f(x) \leq f(x)$ for all $x \in \mathbb{R}^d$.

Next, we focus on the global objective $M^\gamma(x)$. The following lemma bounds its smoothness constant from both above and below.

Lemma 7 (Global convexity and smoothness). Let each $f_i$ be proper, closed, convex and $L_i$-smooth. Then $M^\gamma$ is convex and $L_\gamma$-smooth with

$$\frac{1}{n}\max_{i \in [n]} \frac{L_i}{1+\gamma L_i} \leq L_\gamma \leq \frac{1}{n}\sum_{i=1}^{n} \frac{L_i}{1+\gamma L_i}.$$

As a result of the above inequalities, we have the following inequality for the condition number defined in Theorem 1, which holds for any $\tau \in [n]$:

$$L_\gamma(1+\gamma L_{\max}) = C(\gamma, n, \alpha_{\gamma,n}) \leq C(\gamma, \tau, \alpha_{\gamma,\tau}) \leq C(\gamma, 1, \alpha_{\gamma,1}) = L_{\max}.$$

When there exists at least one $L_i < L_{\max}$, we have $C(\gamma, n, \alpha_{\gamma,n}) < C(\gamma, \tau, \alpha_{\gamma,\tau}) < L_{\max} = C(\gamma, 1, \alpha_{\gamma,1})$. Even if $L_i = L_{\max}$ holds for all $i \in [n]$, there are cases (see Example 1 in the proof) where $C(\gamma, n, \alpha_{\gamma,n}) = \frac{1}{n} C(\gamma, 1, \alpha_{\gamma,1}) = \frac{L_{\max}}{n}$.

A key observation in this case is the generalization of Lemma 5 to the finite-sum setting under the interpolation regime.

Lemma 8 (Minimizer equivalence). Let every $f_i : \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}$ be proper, closed and convex. Then $f(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x)$ has the same set of minimizers and the same minimum as $M^\gamma(x) = \frac{1}{n}\sum_{i=1}^{n} M^\gamma_{f_i}(x)$, provided we are in the interpolation regime and $0 < \gamma < +\infty$.

The following lemma generalizes Lemma 6 to the finite-sum setting.

Lemma 9 (Global lower bound). Let each $f_i : \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}$ be proper, closed and convex. Then the following inequality holds for any $x \in \mathbb{R}^d$ and $\gamma > 0$:

$$M^\gamma(x) \leq M^\gamma_f(x) \leq f(x).$$

In addition, if we assume we are in the interpolation regime, then $M^\gamma$, $M^\gamma_f$ and $f$ have the same set of minimizers, and for any $x_\star$ in this set of minimizers the following identity holds:

$$M^\gamma(x_\star) = M^\gamma_f(x_\star) = f(x_\star).$$

Besides the global lower bound provided above, there is also a relation between the function value suboptimalities of $M^\gamma$ and $f$.

Lemma 10 (Suboptimality bound). Suppose Assumption 1 (Differentiability), Assumption 2 (Interpolation regime), Assumption 3 (Convexity) and Assumption 4 (Smoothness) hold. For any minimizer $x_\star$ of $M^\gamma(x)$ and all $x \in \mathbb{R}^d$, the following inequality holds for each client objective:

$$M^\gamma_{f_i}(x) - M^\gamma_{f_i}(x_\star) \geq \frac{1}{1+\gamma L_i}\left(f_i(x) - f_i(x_\star)\right). \qquad (17)$$

Furthermore, this implies

$$M^\gamma(x) - M^\gamma(x_\star) \geq \frac{1}{1+\gamma L_{\max}}\left(f(x) - f(x_\star)\right). \qquad (18)$$

A direct consequence of the above function suboptimality bound is the star strong convexity of $M^\gamma$ derived from the strong convexity of $f$.

Lemma 11 (Star strong convexity). Assume Assumption 1 (Differentiability), Assumption 2 (Interpolation regime), Assumption 3 (Convexity), Assumption 4 (Smoothness) and Assumption 5 (Strong convexity) hold. Then the convex function $M^\gamma(x)$ satisfies

$$M^\gamma(x) - M^\gamma(x_\star) \geq \frac{\mu}{2(1+\gamma L_{\max})}\|x - x_\star\|^2$$

for any $x \in \mathbb{R}^d$ and any minimizer $x_\star$ of $M^\gamma(x)$. The star strong convexity property of $M^\gamma$ allows us to improve the sublinear convergence in the convex regime into linear convergence.

D Technical lemmas

Lemma 12. Let $f : \mathbb{R}^d \to \mathbb{R}$ be a proper, closed and convex function. Then $x$ is a minimizer of $f$ if and only if $x = \mathrm{prox}_{\gamma f}(x)$.

Lemma 13. Assume we are working with the finite-sum problem $f = \frac{1}{n}\sum_{i=1}^{n} f_i$, where each $f_i$ is convex and $L_i$-smooth, and $f$ is convex and $L$-smooth. Then the smoothness constant $L$ satisfies

$$\frac{L_{\max}}{n} \leq L \leq \frac{1}{n}\sum_{i=1}^{n} L_i,$$

where both bounds are attainable.

Lemma 14. Assume that all the conditions of Theorem 1 hold. Then the condition number $C(\gamma, \tau, 1)$ of FedProx and the condition number $C(\gamma, \tau, \alpha_{\gamma,\tau})$ of the optimal constant extrapolation parameter $\alpha = \frac{1}{\gamma L_{\gamma,\tau}}$ satisfy

$$\frac{C(\gamma, \tau, 1)}{C(\gamma, \tau, \alpha_{\gamma,\tau})} \geq \frac{2 + \frac{1}{\gamma L_{\max}} + \gamma L_{\max}}{4}, \qquad \forall \tau \in [n].$$

Lemma 15. Assume that all the conditions of Theorem 1 hold. Then the following inequalities hold:

$$C(\gamma, n, \alpha_{\gamma,n}) \leq C(\gamma, \tau, \alpha_{\gamma,\tau}) \leq C(\gamma, 1, \alpha_{\gamma,1}) \quad \text{and} \quad \alpha_{\gamma,1} \leq \alpha_{\gamma,\tau} \leq \alpha_{\gamma,n}, \qquad \forall \tau \in [n].$$

E Theory of SGD

In order to prove our main theorem, we partly rely on the theory of SGD. The following theorem on the convergence of SGD with $\tau$-nice sampling is adapted from Gower et al. [2019]. We introduce modifications to the proof technique and tailor the theorem specifically to the interpolation regime. In this context, the upper bound on the step size is increased by a factor of 2. We first formulate the algorithm for completeness.

Algorithm 2: SGD with $\tau$-nice sampling
1: Parameters: learning rate $\eta > 0$; starting point $x_0 \in \mathbb{R}^d$; minibatch size $\tau \in \{1, 2, \dots, n\}$
2: for $k = 0, 1, 2, \dots$ do
3: &nbsp;&nbsp;The server samples $S_k \subseteq \{1, 2, \dots, n\}$ uniformly from all subsets of cardinality $\tau$
4: &nbsp;&nbsp;The server performs one gradient step $x_{k+1} = x_k - \eta \cdot \frac{1}{\tau}\sum_{i \in S_k} \nabla f_i(x_k)$

Theorem 3. Assume Assumption 1 (Differentiability), Assumption 2 (Interpolation regime), Assumption 3 (Convexity) and Assumption 4 (Smoothness) hold. Additionally, assume $f$ is $L$-smooth, where $L \leq \frac{1}{n}\sum_{i=1}^{n} L_i$.³ If we run SGD with $\tau$-nice sampling using a step size $\eta$ satisfying $0 < \eta < \frac{2}{A_\tau}$, where

$$A_\tau := \frac{n-\tau}{\tau(n-1)} L_{\max} + \frac{n(\tau-1)}{\tau(n-1)} L, \qquad L_{\max} := \max_i L_i,$$

then the iterates of Algorithm 2 satisfy

$$\mathbb{E}\left[f(\bar x_K)\right] - \inf f \leq \frac{\|x_0 - x_\star\|^2}{\eta(2 - \eta A_\tau)K},$$

where $K$ is the total number of iterations and $\bar x_K$ is chosen uniformly at random from the first $K$ iterates $\{x_0, x_1, \dots, x_{K-1}\}$. If, additionally, we assume the following property holds, which we will refer to as star strong convexity,

$$f(x) - \inf f \geq \frac{\mu}{2}\|x - x_\star\|^2 \quad \text{for all } x \in \mathbb{R}^d \text{ and any minimizer } x_\star,$$

then the iterates of Algorithm 2 satisfy

$$\mathbb{E}\left[\|x_K - x_\star\|^2\right] \leq \left(1 - \eta(2 - \eta A_\tau)\frac{\mu}{2}\right)^K \|x_0 - x_\star\|^2.$$

³ This is justified by Lemma 13.

F Additional analysis on FedExProx

In this section, we provide some additional details on the analysis of FedExProx and its adaptive variants.

F.1 FedExProx in the strongly convex case

The following corollary summarizes the convergence guarantee in the strongly convex case.

Corollary 1. Suppose the assumptions of Theorem 1 hold, and assume in addition that Assumption 5 (Strong convexity) holds. Then we achieve linear convergence for the final iterate of Algorithm 1, which satisfies

$$\mathbb{E}\left[\|x_K - x_\star\|^2\right] \leq \left(1 - \alpha\gamma\left(2 - \alpha\gamma L_{\gamma,\tau}\right)\frac{\mu}{2(1+\gamma L_{\max})}\right)^K \|x_0 - x_\star\|^2,$$

where the definition of $L_{\gamma,\tau}$ is given in Theorem 1. Fixing the choice of $\gamma$ and $\tau$, the optimal extrapolation parameter that minimizes the convergence rate is given by $\alpha_{\gamma,\tau} = \frac{1}{\gamma L_{\gamma,\tau}} > 1$, which results in the following convergence in the strongly convex case:

$$\mathbb{E}\left[\|x_K - x_\star\|^2\right] \leq \left(1 - \frac{\mu}{2 L_{\gamma,\tau}(1+\gamma L_{\max})}\right)^K \|x_0 - x_\star\|^2.$$

As one can observe, by additionally assuming $\mu$-strong convexity of the original function $f$, we improve the sublinear convergence of the convex case into linear convergence.

F.2 FedExProx in the non-smooth case

Our analysis also adapts to the non-smooth case. This is based on the observation that even if we only assume Assumption 1 (Differentiability), Assumption 2 (Interpolation regime) and Assumption 3 (Convexity) hold, with no additional assumption of smoothness, each $M^\gamma_{f_i}$ is still $\frac{1}{\gamma}$-smooth by Lemma 2. Thus, the theory of SGD in the convex smooth case still applies. However, there are some differences from the smooth case. For simplicity, we focus on the convex non-smooth case with a constant extrapolation parameter; the results in the strongly convex regime and with adaptive extrapolation can be obtained similarly to the proofs of Theorem 1 and Theorem 2.

Theorem 4. Assume Assumption 1 (Differentiability), Assumption 2 (Interpolation regime) and Assumption 3 (Convexity) hold. Choose a constant extrapolation parameter $\alpha_k = \alpha$ satisfying $0 < \alpha < \frac{2}{\gamma L_{\gamma,\tau}}$, where $L_\gamma$ is the smoothness constant of $M^\gamma(x) = \frac{1}{n}\sum_{i=1}^{n} M^\gamma_{f_i}(x)$ and $L_{\gamma,\tau}$ is given by

$$L_{\gamma,\tau} = \frac{n-\tau}{\tau(n-1)} \cdot \frac{1}{\gamma} + \frac{n(\tau-1)}{\tau(n-1)}\, L_\gamma.$$

Then the iterates of Algorithm 1 satisfy

$$\gamma M^\gamma(\bar x_K) - \inf \gamma M^\gamma \leq \frac{\|x_0 - x_\star\|^2}{\alpha\left(2 - \alpha\gamma L_{\gamma,\tau}\right) K},$$

where $\bar x_K$ is chosen uniformly from the first $K$ iterates $\{x_0, x_1, \dots, x_{K-1}\}$. It is easy to see that the best $\alpha$ is given by $\alpha = \frac{1}{\gamma L_{\gamma,\tau}} \geq 1$, for which the corresponding convergence guarantee is

$$\gamma M^\gamma(\bar x_K) - \inf \gamma M^\gamma \leq \left(\frac{n-\tau}{\tau(n-1)} + \frac{n(\tau-1)}{\tau(n-1)}\gamma L_\gamma\right) \frac{\|x_0 - x_\star\|^2}{K}.$$

Remark 12. Notice that in this case we recover the convergence result of RPM presented in Necoara et al. [2019] in the convex case. Indeed, if each $f_i(x) = I_{X_i}(x)$, then we have

$$\mathrm{prox}_{\gamma f_i}(x) = \Pi_{X_i}(x), \qquad \gamma M^\gamma_{f_i}(x) = \frac{1}{2}\left\|x - \Pi_{X_i}(x)\right\|^2, \qquad \gamma M^\gamma(x) = \frac{1}{2n}\sum_{i=1}^{n}\left\|x - \Pi_{X_i}(x)\right\|^2, \qquad \forall x \in \mathbb{R}^d.$$

Since we are in the interpolation regime, $\inf \gamma M^\gamma = 0$, and the convergence result becomes

$$\mathbb{E}\left[\frac{1}{2n}\sum_{i=1}^{n}\left\|\bar x_K - \Pi_{X_i}(\bar x_K)\right\|^2\right] \leq \left(\frac{n-\tau}{\tau(n-1)} + \frac{n(\tau-1)}{\tau(n-1)}\gamma L_\gamma\right)\frac{\|x_0 - x_\star\|^2}{K}.$$

Notice that here $\gamma L_\gamma \leq 1$ is the smoothness constant associated with each distance function $\frac{1}{2}\|x - \Pi_{X_i}(x)\|^2$. The difference in the coefficients on the left-hand side compared with the original results presented in Necoara et al. [2019] arises from the different sampling strategies employed.

A key difference in the non-smooth setting is that extrapolation may not be beneficial in some cases, as illustrated by the following remark.

Remark 13. In the non-smooth case, it is possible that $\gamma L_\gamma = 1$, in which case the optimal $\alpha = 1$ and extrapolation generates no benefit. However, as mentioned by Necoara et al. [2019], there are many examples where $\gamma L_\gamma < 1$ and extrapolation indeed accelerates the algorithm. This differs from the smooth case, where extrapolation always helps.

Remark 14. Since we do not assume smoothness, Lemma 10 no longer applies. Therefore, the convergence result is stated in terms of the function value suboptimality of the Moreau envelope instead of the original objective $f$, which is used in the smooth case.

Using a similar approach, it is also possible to obtain a convergence guarantee for FedExProx in the strongly convex non-smooth regime, assuming in addition that $M^\gamma(x)$ is $\mu_\gamma$-strongly convex; there we recover the convergence result of RPM in Necoara et al. [2019] in the case where the smooth and linear regularity conditions are both satisfied. The following Table 4 confirms that our analysis of FedExProx recovers the theory of RPM as a special case.

Table 4: Comparison of the iteration complexity of RPM from Necoara et al. [2019] obtained using our theory and the original theory. In both cases, the optimal extrapolation parameter is used. The $O(\cdot)$ notation is hidden. $\varepsilon$ is the error level reached by the function value suboptimality in the convex case, and by the squared distance to the solution in the strongly convex case.

| Setting | Original Theory | Our Theory |
|---|---|---|
| Convex + smooth case(1) | $\gamma L_{\gamma,\tau}\|x_0 - x_\star\|^2 / \varepsilon$ | $\gamma L_{\gamma,\tau}\|x_0 - x_\star\|^2 / \varepsilon$ |
| Strongly convex + smooth case(2) | $\frac{L_{\gamma,\tau}}{\mu_\gamma}\log\frac{\|x_0 - x_\star\|^2}{\varepsilon}$ | $\frac{L_{\gamma,\tau}}{\mu_\gamma}\log\frac{\|x_0 - x_\star\|^2}{\varepsilon}$ |

(1) The smoothness here does not refer to each $f_i$ being $L_i$-smooth, but to $\gamma M^\gamma$ being $\gamma L_\gamma$-smooth. This corresponds to the smooth regularity condition presented in Necoara et al. [2019].
(2) Here the strongly convex setting means that the linear regularity condition in Necoara et al. [2019] is satisfied. In our theory, it refers to $M^\gamma(x)$ being $\mu_\gamma$-strongly convex with $\mu_\gamma < L_\gamma$.

F.3 Discussion on the non-interpolation case

For the non-interpolation regime, we assume that Assumption 1 (Differentiability), Assumption 3 (Convexity) and Assumption 4 (Smoothness) hold. The differences are as follows.

(i) Although $f_i$ and $M^\gamma_{f_i}$ have the same set of minimizers, $f$ and $M^\gamma$ do not necessarily have the same set of minimizers. This leads FedExProx to converge to a minimizer $x_{\star,\gamma}$ of $M^\gamma(x)$ instead of a minimizer $x_\star$ of $f$. As a result, we only converge to a neighborhood of $x_\star$, depending on the specific setting.

(ii) Since we are not in the interpolation regime, the upper bound on the step size of SGD with sampling is reduced by a factor of 2. Thus, the optimal extrapolation parameter in the non-interpolated case is also halved: $\alpha' = \frac{1}{2}\alpha^\star$. As a result, it is possible that $\alpha' \leq 1$. The same phenomenon is also observed in FedExP of Jhunjhunwala et al. [2023], where their heuristic for determining the extrapolation parameter adaptively is likewise reduced by a factor of 2 in non-overparameterized cases.

Observe that all of the above results, in both the smooth/non-smooth and interpolated/non-interpolated cases, suggest that the practice of the server simply averaging the iterates it obtains from local training is suboptimal.

F.4 Discussion on the non-convex case

In the non-convex case, we assume Assumption 1 (Differentiability) holds, and we need the following additional assumptions on $f : \mathbb{R}^d \to \mathbb{R}$ and $f_i : \mathbb{R}^d \to \mathbb{R}$:

Assumption 6 (Lower boundedness). The function $f_i$ is lower bounded by $\inf f_i$.

Assumption 7 (Weak convexity). The function $f_i$ is $\rho$-weakly convex, $\rho > 0$; this means that $f_i + \frac{\rho}{2}\|\cdot\|^2$ is convex.

We have the following lemma under the above assumptions.

Lemma 16. [Böhm and Wright, 2021, Lemma 3.1] Let $f$ be a proper, closed, $\rho$-weakly convex function and let $\gamma < \frac{1}{\rho}$. Then the Moreau envelope $M^\gamma_f$ is continuously differentiable on $\mathbb{R}^d$ with

$$\nabla M^\gamma_f(x) = \frac{1}{\gamma}\left(x - \mathrm{prox}_{\gamma f}(x)\right).$$

In addition, the Moreau envelope is $\max\left\{\frac{1}{\gamma}, \frac{\rho}{1-\gamma\rho}\right\}$-smooth. We will thereby denote this smoothness constant by $L_{\gamma,\rho}$.
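As a numerical sanity check of Lemma 16 (our illustration; the brute-force grid search stands in for an exact prox oracle), take the 1-weakly convex function $f = \sin$ and any $\gamma < 1$:

```python
import numpy as np

f = np.sin                       # sin is 1-weakly convex: sin(x) + x**2 / 2 is convex
gamma, x0, h = 0.5, 1.3, 1e-5    # gamma < 1 / rho = 1, as Lemma 16 requires

z = np.linspace(x0 - 5, x0 + 5, 2_000_001)   # brute-force grid for the prox
def moreau(y):
    return np.min(f(z) + (z - y) ** 2 / (2 * gamma))
def prox(y):
    return z[np.argmin(f(z) + (z - y) ** 2 / (2 * gamma))]

# Lemma 16: the envelope is differentiable with
# nabla M^gamma_f(x) = (x - prox_{gamma f}(x)) / gamma
num_grad = (moreau(x0 + h) - moreau(x0 - h)) / (2 * h)   # finite differences
assert abs(num_grad - (x0 - prox(x0)) / gamma) < 1e-3
```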
Suppose Assumption 1 (Differentiability), 2 (Interpolation Regime), 3 (Convexity) and 4 (Smoothness) hold, for any minimizer x of M γ (x), all x Rd, the following inequality holds for each client objective, M γ fi (x) M γ fi (x ) 1 1 + γLi (fi(x) fi(x )) . (17) Furthermore, this suggests M γ (x) M γ (x ) 1 1 + γLmax (fi(x) fi(x )) . (18) A direct consequence of the above function suboptimality bound is the star strong convexity of M γ from the strong convexity of f. Lemma 11. (Star strong convexity) Assume Assumption 1 (Differentiability), Assumption 2 (Interpolation Regime), Assumption 3 (Convexity), Assumption 4 (Smoothness) and Assumption 5 (Strong convexity) hold, then the convex function M γ (x) satisfies the following inequality, M γ (x) M γ (x ) µ 1 + γLmax 1 for any x Rd and a minimizer x of M γ (x). The star strong convexity property of M γ allows us to improve the sublinear convergence in the convex regime into linear convergence. D Technical lemmas Lemma 12. Let f : Rd 7 R be a proper, closed and convex function. Then x is a minimizer of f if and only if x = proxγf (x). Lemma 13. Assume we are working with the finite-sum problem f = 1 n Pn i=1 fi, where each fi is convex and Li-smooth, f is convex and L-smooth. Then the smoothness of L satisfies where both bounds are attainable. Lemma 14. Assume that all the conditions mentioned in Theorem 1 hold, then the condition number C(γ, τ, 1) of Fed Prox and the condition number C (γ, τ, αγ,τ) of the optimal constant extrapolation parameter α = 1 γLγ,τ satisfy the following inequality, C(γ, τ, 1) C (γ, τ, αγ,τ) 2 + 1 γLmax + γLmax 4 τ [n]. Lemma 15. Assume that all the conditions mentioned in Theorem 1 hold, then the following inequalities hold, C (γ, n, αγ,n) C (γ, τ, αγ,τ) C (γ, 1, αγ,1) , τ [n], and αγ,1 αγ,τ αγ,n, τ [n]. E Theory of SGD In order to prove our main theorem, we partly rely on the theory of SGD. The following theorem on the convergence of SGD with τ-nice sampling is adapted from Gower et al. [2019]. We introduce modifications to the proof technique and tailor the theorem specifically to the interpolation regime. In this context, the upper bound on the step size is increased by a factor of 2. We first formulate the algorithm as follows for completeness. Algorithm 2 SGD with τ-nice sampling 1: Parameters: learning rate η > 0, starting point x0 Rd, minibatch size τ {1, 2, . . . , n} 2: for k = 0, 1, 2, . . . do 3: The server samples Sk {1, 2, . . . , n} uniformly from all subsets of cardinality τ 4: The server performs one gradient step xk+1 = xk η 1 ξi Sk fξi(xk). Theorem 3. Assume Assumption 1 (Differentiability), 2 (Interpolation regime), 3 (Convexity), 4 (Smoothness) hold. Additionally, assume f is L-smooth where L 1 n Pn i=1 Li.3 If we are running SGD with τ-nice sampling using step size η that satisfies 0 < η < 2 Aτ , where Aτ := n τ τ(n 1)Lmax + n(τ 1) τ(n 1)L, and Lmax := max i Li, then the iterates of Algorithm 2 satisfy E [f( x K)] inf f 1 η(2 ηAτ) x0 x 2 where K is the total number of iterations, x K is chosen uniformly at random from the first K iterates {x0, x1, . . . , x K 1}. If, additionally, we assume the following property (which we will refer to as star strong convexity ) holds, then the iterates of Algorithm 2 satisfy E h x K x 2i 1 η(2 ηAτ) µ F Additional analysis on Fed Ex Prox In this section, we provide some additional details on the analysis of Fed Ex Prox and its adaptive variants. 
F.1 Fed Ex Prox in the strongly convex case The following corollary summarizes the convergence guarantee in the strongly convex case. Corollary 1. Suppose the assumptions in Theorem 1 hold, and assume in addition that Assumption 5 (Strong Convexity) holds, then we achieve linear convergence for the final iterate of Algorithm 1, which satisfies E h x K x 2i 1 αγ(2 αγLγ,τ) µ 2 (1 + γLmax) where the definition of Lγ,τ is given in Theorem 1. Fixing the choice of γ and τ, the optimal extrapolation parameter that minimizes the convergence rate is given by αγ,τ = 1 γLγ,τ > 1, which results in the following convergence in the strongly convex case: E h x K x 2i 1 µ 2Lγ,τ (1 + γLmax) 3This is justified by Lemma 13. As one can observe, by additionally assuming µ strong convexity of the original function f, we improve the sublinear convergence in the convex case into linear convergence. F.2 Fed Ex Prox in the non-smooth case Our analysis also adapts to the non-smooth cases. This is based on the observation that even if we only assume Assumption 1 (differentiability), Assumption 2 (Interpolation Regime) and Assumption 3 (Convexity) hold and do not have additional assumptions on smoothness, still each M γ fi is 1 γ -smooth because of Lemma 2. Thus, the theory of SGD in the convex smooth case still applies. However, there are some differences from the smooth case. For the sake of simplicity, we will mainly focus on the convex non-smooth case with a constant extrapolation parameter, the results in the strongly convex regime and with adaptive extrapolation can be obtained similarly as in the proof of Theorem 1 and Theorem 2. Theorem 4. Assume Assumption 1 (Differentiability), 2 (Interpolation Regime) and 3 (Convexity) hold. If we choose a constant extrapolation parameter αk = α satisfying 0 < α < 2 γLγ,τ , where Lγ is the smoothness constant of M γ (x) = 1 n Pn i=1 M γ fi (x), Lγ,τ is given by Lγ,τ = n τ τ(n 1) 1 Then the iterates of Algorithm 1 satisfy γM γ ( x K) inf γM γ 1 α (2 αγLγ,τ) x0 x 2 where x K is chosen uniformly from the first K iterates {x0, x1, . . . , x K 1}. It is easy to see that the best α is given by α = 1 γLγ,τ 1, where the corresponding convergence is given by γM γ ( x K) inf γM γ n τ τ(n 1) + n(τ 1) Remark 12. Notice that in this case we recover the convergence result of RPM presented in Necoara et al. [2019] in the convex case. Indeed, if each fi(x) = IXi (x), then we have proxγfi (x) = ΠXi (x) , x Rd, γM γ fi (x) = 1 2 x ΠXi (x) 2 , and γM γ (x) = 1 i=1 x ΠXi (x) 2 . Since we are in the interpolation regime, inf γM γ = 0, and the convergence result becomes i=1 x K ΠXi (x K) 2 n τ τ(n 1) + n(τ 1) Notice that here γLγ 1 is the smoothness constant associated with each distance function 1 2 x ΠXi (x) 2. The difference in the coefficients on the left-hand side from the original results presented in Necoara et al. [2019] results from different sampling strategies employed. A key difference in the non-smooth setting is that extrapolation in some cases may not be beneficiary, as illustrated by the following remark. Remark 13. In the non-smooth case, it is possible that γLγ = 1, where the optimal α = 1, in this case, extrapolation will not generate any benefits. However, as it is mentioned by Necoara et al. [2019], there are many examples where γLγ < 1 and extrapolation indeed accelerates the algorithm. This is different from the smooth case, where extrapolation always helps. Remark 14. Since we do not assume smoothness, Lemma 10 no longer applies. 
Therefore, the convergence result is stated in terms of the function value suboptimality of Moreau envelope instead of the original objective f which is used in the smooth case. Using a similar approach, it is also possible to obtain a convergence guarantee for Fed Ex Prox in the strongly convex non-smooth regime, assuming in addition that M γ (x) is µγ-strongly convex, where we recover the convergence result of RPM in Necoara et al. [2019] in cases where the smooth and linear regularity conditions are both satisfied. The following Table 4 confirms that our analysis of Fed Ex Prox recovers the theory of RPM as a special case. Table 4: Comparison of iteration complexity of RPM from Necoara et al. [2019] obtained using our theory and the original theory. In both cases, the optimal extrapolation parameter is used. The notation O( ) is hidden. ε is the error level reached by function value suboptimality for convex case, squared distance to the solution for strongly convex case. Setting Original Theory Our Theory Convex + smooth case(1) γLγ,τ x0 x 2 ε γLγ,τ x0 x 2 ε Strongly convex + smooth case(2) Lγ,τ µγ log x0 x 2 µγ log x0 x 2 (1) The smoothness here does not refer to each fi being Li-smooth, but γM γ being γLγ-smooth. This corresponds to the smooth regularity condition presented in Necoara et al. [2019]. (2) Here the strongly convex setting meaning that the linear regularity condition in Necoara et al. [2019] is satisfied. In our theory, it refers to M γ (x) being µγ-strongly convex with µγ < Lγ. F.3 Discussion on the non-interpolation case For the non-interpolation regime cases, we assume that Assumption 1 (Differentiability), Assumption 3 (Convexity) and Assumption 4 (Smoothness) hold. The differences are listed as follows (i) Although fi and M γ fi have the same set of minimizers, f and M γ does not necessarily have the same set of minimizers. This will lead to the convergence of Fed Ex Prox to the minimizer x ,γ of M γ (x) instead of x of f. As a result, we will only converge to a neighborhood of the x depending on the specific setting. (ii) Since we are not in the interpolation regime, the upper bound on the step size of SGD with sampling is reduced by a factor of 2. Thus, the optimal extrapolation parameter α in the non-interpolated cases is also halved, α = 1 2α . As a result, it is possible that α 1. The same phenomenon is also observed in Fed Ex P of Jhunjhunwala et al. [2023], where their heuristic in determining the extrapolation parameter adaptively is also reduced by a factor of 2 in non overparameterized cases. Observe that all of the above results in both smooth/non-smooth, interpolated/non-interpolated cases suggests that the practice of server simply averaging the iterates it obtained from local training is suboptimal. F.4 Discussion on the non-convex case In the non-convex case, we assume Assumption 1 (Differentiability) holds, and we need the following additional assumptions on f : Rd 7 R and fi : Rd 7 R: Assumption 6 (Lower boundedness). Function fi is lower bounded by inf fi. Assumption 7 (Weak convexity). Function fi is ρ > 0 weakly convex, this means that fi + ρ 2 2 is convex. We have the following lemma under the above assumptions: Lemma 16. [Böhm and Wright, 2021, Lemma 3.1] Let f be a proper, closed, ρ-weakly convex function and let γ < 1 ρ. Then the Moreau envelope M γ f is continuously differentiable on Rd with M γ f (x) = 1 γ x proxγf (x) . In addition, the Moreau envelope is max n 1 γ , ρ 1 γρ o -smooth. We will thereby denote the smoothness constant as Lγ,ρ. 
Indeed, if the stepsize γ in this case is chosen properly such that 1 γ > ρ, then it is straight forward to see the function within the proximity operator proxγfi given by fi + 1 γ 2 is strongly convex. Thus the proximity operator still results in a singleton. Lemma 16 allows us to again reformulate the original algorithm using the gradient of Moreau envelope. The only difference from the convex regime is that the Moreau envelope M γ fi is not necessarily convex. The following lemmas illustrate the connection between M γ fi and fi: Lemma 17. [Yu et al., 2015, Proposition 7] Let γ > 0, f be a closed, proper function that is lower bounded. Then M γ f f, inf M γ f = inf f, arg minx M γ f (x) = arg minx f(x) x : x proxγf (x) . Lemma 18. Let f : Rd 7 R be ρ-weakly convex with ρ > 0 and differentiable. If we take 0 < γ < 1 ρ, then M γ fi has the same set of stationary points as fi. For the sake of simplicity, we will consider only the full participation case with a constant extrapolation parameter αk = α. The following lemma describes the convergence of GD in the non-convex case, which is adapted from the theory of Khaled and Richtárik [2023]. Lemma 19. Assume function f is L-smooth and lower bounded. If we are running GD with a constant stepsize η satisfying 0 < η < 1 L. Then for any K 1, the iterates xk of GD satisfy min 0 k K 1 E h f(xk) 2i 2 (f(x0) inf f) Now we directly apply Lemma 19 in our case, 1. Since each M γ fi is Lγ,ρ-smooth, M γ is Lγ-smooth with Lγ Lγ,ρ, which result in the following bound on the extrapolation parameter 0 < α < 1 γLγ . Notice that in this case we have the following estimation of γLγ, 1 γLγ 1 γLγ,ρ = min 1, 1 γρ This suggests that extrapolation may not be much beneficiary in the non-convex case. 2. The following convergence guarantee can be obtained. min 0 k K 1 E h M γ(xk) 2i 2 (M γ (x0) inf M γ) Notice that by Lemma 17, we know that M γ fi (x0) fi (x0). We also have inf M γ 1 n Pn i=1 inf M γ fi = 1 n Pn i=1 inf fi since inf M γ fi = inf fi is true for each client by Lemma 17. Thus, we have M γ (x0) inf M γ f(x0) inf f + inf f 1 i=1 inf fi. We can relax the above convergence guarantee and obtain min 0 k K 1 E h M γ(xk) 2i 2 (f(x0) inf f) αγK + 2 inf f 1 n Pn i=1 inf fi The above convergence guarantee indicates that the algorithm converges to some stationary points of M γ (x) in the non-convex case. 3. In the non-convex case, we did not assume anything similar to the interpolation regime in the convex case. As a result, we did not know the relation between the set of stationary points of M γ (x) and f(x), denoted as Y and Y, respectively. However, if we assume, in addition, that each stationary point y Y of M γ is also a stationary point of each M γ fi, then y is also a stationary point of fi according to Lemma 18. Thus, f (y ) = 1 n Pn i=1 fi (y ) = 0, which indicates y Y. As a result, we have Y Y. This means that under this additional assumption, the algorithm converges to a stationary point of f. F.5 Additional notes on adaptive variants Notes on gradient diversity variant. In general, the gradient diversity step size ηk used in SGD to solve the finite sum minimization problem can be written as ηk := 1 Lmax 1 n Pn i=1 fi(xk) 2 1 n Pn i=1 fi(xk) 2 , where Lmax is the maximum of local smoothness constants. In our case, since each local Moreau envelope is Li 1+γLi -smooth and 1 γ -smooth4, we can use both Lmax 1+γLmax (here in Corollary 2, if we know Lmax) and 1 γ (in original Theorem 2, if we do not know Lmax) as the maximum of local smoothness. 
We present the convergence result of Algorithm 1 with the following rule given in (13), α k,G = 1 + γLmax 1 n Pn i=1 xk proxγfi (xk) 2 1 n Pn i=1 xk proxγfi (xk) 2 . Corollary 2. Suppose all the assumptions mentioned in Theorem 2 hold, if we are using (13) to determine α k,G in each iteration for Algorithm 1 with τ = n, then the iterates satisfy E [f( x K)] f inf 1 PK 1 k=0 α k,G . where x K is chosen randomly from the first K iterates {x0, x1, ..., x K 1} with probabilities pk = α k,G/PK 1 k=0 α k,G. Notice that compared to the case of Fed Ex Prox-Gra DS in Theorem 2, the convergence rate given in Corollary 2 is indeed better. This can be seen by comparing them directly, for Fed Ex Prox-Gra DS, we have E [f( x K)] inf f 1 + γLmax 2 + γLmax 1 PK 1 k=0 αk,G , and for Algorithm 1 with α k,G given in (13), we have E [f( x K)] f inf 1 PK 1 k=0 α k,G = γLmax 1 + γLmax 1 PK 1 k=0 αk,G . Since γLmax 1 + γLmax 1 + γLmax 2 + γLmax , γ > 0, the convergence of Algorithm 1 in the full participation case with (13) given in Corollary 2 is indeed better than Fed Ex Prox-Gra DS. However, this adaptive rule is only practical when we have the knowledge of local smoothness. 4Note that Li 1+γLi < 1 γ for any γ > 0. Notes on stochastic Polyak variant. In this paragraph, we further elaborate on the convergence of Fed Ex Prox-Sto PS. We start by providing a lower bound on the adaptive extrapolation parameter. Lemma 20. Suppose that all assumptions mentioned in Theorem 2 hold, then the following inequalities hold for any x Rd and x that is a minimizer of f, 1 n Pn i=1 M γ fi (x) M γ fi (x ) n Pn i=1 M γ fi (x) 2 1 2γLγ . Using the above lower bound, we can further write the convergence of Fed Ex Prox-Sto PS as E f( x K) inf f 2Lγ (1 + 2γLmax) x0 x 2 Observe that we recover the favorable dependence of convergence on the smoothness of M γ. However, this comes at the price of having to know each M γ fi (x ) or, equivalently in the interpolation regime, knowing M γ (x ). F.6 Extension of adaptive variants into client partial participation (PP) setting In this subsection, we extend the adaptive variants of Fed Ex Prox into the stochastic setting. We will refer to them as Fed Ex Prox-Gra DS-PP and, Fed Ex Prox-Sto PS-PP respectively. Specifically, we consider that the server chooses the client using the τ-nice sampling strategy we have introduced before in Algorithm 1. The following theorem summarizes the convergence guarantee of Fed Ex Prox-Gra DS-PP and Fed Ex Prox-Sto PS-PP in the convex case. Its extension to the strongly convex case where we additionally assume Assumption 5 (Strong convexity) is straight forward. Theorem 5. Suppose Assumption 1 (Differentiability), Assumption 2 (Interpolation regime), Assumption 3 (Convexity) and Assumption 4 (Smoothness) hold. Assume we are running Fed Ex Prox with τ-nice client sampling. (i) (Fed Ex Prox-Gra DS-PP): If we are using αk = ατ,k,G(xk, Sk), where ατ,k,G(xk, Sk) = 1 τ P i Sk xk proxγfi (xk) 2 1 τ P i Sk xk proxγfi (xk) 2 . (19) Then the iterates of Algorithm 1 satisfy E [f( x K)] inf f 1 + γLmax inf ατ,k,G K , (20) where K is the total number of iteration, x K is samples uniformly at random from the first K iterates {x0, x1, . . . , x K 1}, inf ατ,k,G is defined as inf ατ,k,G := inf x Rd,S [n],|S|=τ ατ,k,G (x, S) . satisfying ατ,k,G(xk, Sk) inf ατ,k,G 1. (ii) (Fed Ex Prox-Sto PS-PP): If we are using αk = ατ,k,S(xk, Sk), where ατ,k,S(xk, Sk) = 1 τ Pτ i=1 M γ fi (xk) inf M γ fi τ Pτ i=1 M γ fi (xk) 2 . 
(21) Then the iterates of Algorithm 1 satisfy E [f( x K)] inf f 1 inf ατ,k,S K , (22) where K is the total number of iteration, x K is sampled uniformly at random from the first K iterates {x0, x1, . . . , x K 1}, inf ατ,k,G is defined as inf ατ,k,S := inf x Rd,S [n],|S|=τ ατ,k,S (x, S) . ατ,k,S(xk, Sk) inf ατ,k,S 1 1 + 1 γLmax Remark 15. For Fed Ex Prox-Gra DS-PP, different from the full participation setting, the denominator of the sublinear term on the right-hand side of (20) is replaced by K inf ατ,k,G. (i) In the single client case (τ = 1), we have α1,k,G = inf α1,k,G = 1. (ii) In the partial participation case (1 < τ < n), it is possible that inf ατ,k,G > 1, resulting in acceleration compared to single client case. (iii) For the full participation case (τ = n), we have αk,G = αn,k,G, k=0 αk,G K inf αn,k,G, thus the convergence guarantee here is a relaxed version of that presented in Theorem 2. A similar discussion also applies to Fed Ex Prox-Sto PS-PP in the client partial participation setting. Remark 16. For Fed Ex Prox-Sto PS-PP, different from the full participation setting, the denominator of the sublinear term on the right-hand side of (22) is replaced by K inf ατ,k,S. (i) In the single client case (τ = 1), we have α1,k,S inf α1,k,G = 1 1 + 1 γLmax (ii) In the partial participation case (1 < τ < n), it is possible that inf ατ,k,S > 1 1 + 1 γLmax resulting in acceleration compared to single client case. (iii) For the full participation case (τ = n), we have αk,S = αn,k,S, k=0 αk,S K inf αn,k,S, thus the convergence guarantee here is a relaxed version of that presented in Theorem 2. The following Table 5 summarizes the convergence of new algorithms and their variants appeared in our paper. G Missing proofs of theorems and corollaries G.1 Proof of Theorem 1 The proof of this theorem can be divided into three parts. Table 5: Summary of convergence of new algorithms appeared in our paper in the convex setting. The O ( ) notation is hidden for all complexities in this table. For convergence in the full client participation case, results of Theorem 1 and Theorem 2 are used where the relevant notations are defined. For convergence in the partial participation, the results of Theorem 5 are used. Method Full Participation Partial Participation Single Client Fed Ex Prox Lγ(1+γLmax)/K Lγ,τ (1+γLmax)/K Lmax/K Fed Ex Prox-Gra DS (1+γLmax)/γ PK 1 k=0 αk,G (1+γLmax)/(γK inf ατ,k,G) (1+γLmax)/(γK) Fed Ex Prox-Sto PS (1+γLmax)/γ PK 1 k=0 αk,S (1+γLmax)/(γK inf ατ,k,S) (1+γLmax)/(γK inf α1,k,S) Step 1: Reformulate the algorithm using Moreau envelope. We know from Lemma 2 that for any x Rd. M γ fi (x) = 1 γ x proxγfi (x) . Using the above identity, we can rewrite the update rule given in (7) in the following form, xk+1 = xk αkγ 1 i Sk M γ fi (xk) . (23) The above reformulation suggests that running Fed Ex Prox with τ-nice sampling strategy is equivalent to running SGD with τ-nice sampling to the global objective M γ (x) = 1 n Pn i=1 M γ fi (x) with step size αkγ. Now, it seems natural to apply the theory of SGD adapted in Theorem 3. However, before proceeding, we list the properties we know about the global objective M γ and each local objective M γ fi. 1. Each M γ fi (x) is convex. This is a consequence of a direct application of Lemma 3 to each fi. Since M γ is the average of convex functions M γ fi, we conclude that M γ (x) is also convex. 2. Each M γ fi (x) is Li 1+γLi -smooth, where Li is the smoothness constant of fi. This is proved by applying Lemma 4 to each fi. 
Drawing on Lemma 13 for justification, it is reasonable to assume M γ (x) is Lγ-smooth with Lγ 1 n Pn i=1 Li 1+γLi -smooth. 3. Each M γ fi (x) has the same set of minimizers and minimum as fi. This result arises from applying Lemma 5 to each function fi. 4. Furthermore, if Assumption 2 (Interpolation Regime) holds, M γ (x) and f(x) have the same set of minimizers and minimum. This is demonstrated in Lemma 8. Step 2: Applying the theory of gradient type methods. Notice that here M γ fi (x) is Li 1+γLi -smooth and convex, M γ (x) is convex and Lγ-smooth. Furthermore, due to the assumption of interpolation regime, M γ (x) and f(x) have the same set of minimizers. Applying the theory of SGD with τ-nice sampling in this case, where Aτ = Lγ,τ = n τ τ(n 1) max i [n] Notice that using Fact 4, we know that Fact 4 = Lmax 1 + γLmax , thus Lγ can be simplified and written as Lγ,τ = n τ τ(n 1) Lmax 1 + γLmax + n(τ 1) where Lmax = maxi Li. We obtain the following result given that 0 < αγ < 2 Lγ,τ in the convex setting, E [M γ ( x K)] M γ (x ) Theorem 3 1 αγ(2 αγLγ,τ) x0 x 2 where x K is sampled uniformly at random from the first K iterates {x0, x1, . . . , x K 1}. However, the convergence mentioned pertains to M γ (x). Given our objective is to solve (1), it is necessary to reinterpret this outcome in terms of f. Step 3: Translate the result into function values of f. This step is only needed in the convex setting. We use the lower bound in Lemma 10, M γ ( x K) M γ (x ) (18) 1 1 + γLmax (f( x K) f(x )) , to obtain the following result E [f( x K)] f(x ) 1 + γLmax αγ (2 αγLγ,τ) x0 x 2 Observe that we have C (γ, τ, α) = 1 + γLmax αγ (2 αγLγ,τ), and its numerator does not depend on α. If we fix the choice of γ and τ, then the denominator is maximized when αγLγ,τ = 1. This yields the optimal constant extrapolation parameter αγ,τ = 1 γLγ,τ and the following convergence corresponding to it E [f( x K)] f(x ) Lγ,τ (1 + γLmax) x0 x 2 Finally, notice that γLγ Lemma 13 γLi 1 + γLi < 1, for any γ > 0. This suggests that, γLγ,τ = n τ τ(n 1) γLmax 1 + γLmax + n(τ 1) < n τ τ(n 1) + n(τ 1) τ(n 1) = 1, which in turn tells us αγ,τ = 1 γLγ,τ > 1. This concludes the proof. G.2 Proof of Theorem 2 We start with the following decomposition, xk+1 x 2 = xk αkγ M γ (xk) x 2 = xk x 2 2αkγ M γ (xk) , xk x + α2 kγ2 M γ (x) 2 . (24) Case 1: Fed Ex Prox-Gra DS For gradient diversity based αk, we have αk = αk,G = 1 n Pn i=1 γ M γ fi (xk) 2 γ M γ (xk) 2 = 1 n Pn i=1 M γ fi (xk) 2 M γ (xk) 2 . For the last term of (24), α2 k,Gγ2 M γ (xk) 2 = αk,Gγ2 1 M γ fi (xk) 2 M γ fi (xk) M γ fi (x ) 2 (16) αk,Gγ2 1 DM γ fi (xk, x ) + DM γ fi (x , xk) , where the last inequality follows from the Li 1+γLi -smoothness of M γ fi given in Lemma 4. We further obtain using Fact 4 that α2 k,Gγ2 M γ (xk) 2 Fact 4 αk,Gγ2 Lmax 1 + γLmax (DM γ (xk, x ) + DM γ (x , xk)) = αk,Gγ γLmax 1 + γLmax (DM γ (xk, x ) + DM γ (x , xk)) . (25) For the second term of (24), we have 2αk,Gγ M γ (xk) , xk x = 2αk,Gγ M γ (xk) M γ (x ) , x xk = 2αk,Gγ (DM γ (xk, x ) + DM γ (x , xk)) . (26) Plugging (26) and (25) into (24), we have xk+1 x 2 xk x 2 αk,Gγ 2 γLmax 1 + γLmax (DM γ (xk, x ) + DM γ (x , xk)) . Notice that we know that DM γ (xk, x ) (14) = M γ (xk) M γ (x ) , DM γ (x , xk) (15) 0. As a result, we obtain xk+1 x 2 xk x 2 αk,Gγ 2 γLmax 1 + γLmax (M γ (xk) M γ (x )) . Summing up the above recursion for k = 0, 1, ..., K 1, we notice that many of them will telescope and M γ (x ) = inf M γ due to interpolation regime as it is proved by Lemma 8. 
Thus, we obtain γ 2 γLmax 1 + γLmax k=0 αk,G (M γ (xk) inf M γ) x0 x 2 . Denote pk = αk,G/PK 1 k=0 αk,G for k = 0, 1, ..., K 1. If we pick x K randomly according to probabilities pk from the first K iterates {x0, x1, . . . , x K 1}, then we can further write the above recursion as E M γ x K inf M γ 1 + γLmax 2 + γLmax 1 PK 1 k=0 αk,G . Utilizing Lemma 10, we further obtain, E [f( x K)] inf f 1 + γLmax 2 + γLmax 1 PK 1 k=0 αk,G . The above inequality indicates convergence. Indeed, by convexity of standard Euclidean norm, we have n Pn i=1 xk proxγfi (xk) 2 1 n Pn i=1 xk proxγfi (xk) 2 = 1. This tells us that K 1 X k=0 αk,G K. Case 2: Fed Ex Prox-Sto PS For stochastic Polyak step size based αk,S, since we are in the interpolation regime, by Lemma 9, we have M γ (x ) = inf M γ = 1 i=1 inf M γ fi. As a result, αk = αk,S = 1 n Pn i=1 M γ fi (xk) inf M γ fi n Pn i=1 M γ fi (xk) 2 = M γ (xk) M γ (x ) γ M γ (xk) 2 . We have for the last term of (24), α2 k,Sγ2 M γ (xk) 2 = αk,Sγ (M γ (xk) M γ (x )) . (27) For the second term of (24), we have 2αk,Sγ M γ (xk) , xk x = 2αk,Sγ M γ (xk) , x xk (5) 2αk,Sγ (M γ (x ) M γ (xk)) = 2αk,Sγ (M γ (xk) M γ (x )) , (28) where the inequality is due to the convexity of M γ. Plugging (28) and (27) into (24), we obtain xk+1 x 2 xk x 2 αk,Sγ (M γ (xk) M γ (x )) . Summing up the above recursion for k = 0, 1, ..., K 1, we notice that many of them will telescope. Thus, we obtain k=0 αk,S (M γ (xk) inf M γ) x0 x 2 . Denote pk = αk,S/PK 1 k=0 αk,S for k = 0, 1, ..., K 1. If we sample x K randomly according to probabilities pk from the first K iterates {x0, x1, . . . , x K 1}, we can further write the above recursion as E M γ x K inf M γ 1 PK 1 k=0 αk,S . Utilizing the local bound in Lemma 10, we further obtain, E f( x K) inf f (17) 1 PK 1 k=0 αk,S . (29) Notice that the above inequality indeed indicates convergence, since M γ (xk) M γ (x ) γ M γ (xk) 2 1 2γLγ , where the inequality follows from Lemma 20. The above upper bounds allow us to further write the convergence in (29) as E f( x K) inf f 2Lγ (1 + 2γLmax) x0 x 2 This concludes the proof. G.3 Proof of Theorem 3 We start from the decomposition, xk+1 x 2 = xk x 2 2η i Sk fi(xk) i Sk fi(xk) where Sk is the set sampled at iteration k. Taking expectation conditioned on xk, we have ESk h xk+1 x 2i = xk x 2 2η xk x , f(xk) f(x ) + η2ESk i Sk fi(xk) We can write the second inner product term as xk x , f(xk) f(x ) (14) = Df (xk, x ) + Df (x , xk) , (30) where Df (xk, x ) denotes the Bregman divergence associated with f between xk and x . For the last squared norm term, we first define the indicator random variable χk,i as χk,i = 1, when i Sk, 0, when i / Sk. Since we are in the interpolation regime, we have i Sk fi(xk) i=1 χk,i ( fi(xk) fi(x )) Denote ak,i = fi(xk) fi(x ), i=1 χk,i ( fi(xk) fi(x )) i=1 χk,iak,i i=1 χ2 k,i ak,i 2 + X 1 i =j n χk,iχj,k ak,i, ak,j i=1 ESk i χ2 k,i ak,i 2 + X 1 i =j n ESk i [χk,iχj,k] ak,i, ak,j i=1 ak,i 2 + τ 1 nτ(n 1) = n τ τ(n 1) 1 i=1 ak,i 2 + n(τ 1) For the first term above in (31), due to the smoothness and convexity of each fi, we have i=1 ak,i 2 = 1 i=1 fi(xk) fi(x ) 2 i=1 Li (Dfi (x , xk) + Dfi (xk, x )) i=1 (Dfi (x , xk) + Dfi (xk, x )) = Lmax (Df (x , xk) + Df (xk, x )) , where the first inequality is obtained as a result of Fact 3. For the second term, we have due to the smoothness and convexity of f, 1 n i=1 ( fi(xk) fi(x )) = f(xk) f(x ) 2 L (Df (x , xk) + Df (xk, x )) , where the inequality is obtained using Fact 3. 
Combining the above two inequalities and plugging them into (31), we obtain i Sk fi(xk) τ(n 1) Lmax + n(τ 1) τ(n 1) L (Df (x , xk) + Df (xk, x )) . Notice that we already defined Aτ as Aτ = n τ τ(n 1) Lmax + n(τ 1) Combining (30) and (32), we have ESk h xk+1 x 2i xk x 2 η(2 ηAτ) (Df (x , xk) + Df (xk, x )) . If we require 0 < η < 2 Aτ , we have η(2 ηAτ) 0. Convex regime. It remains to notice that Df (xk, x ) + Df (x , xk) Df (xk, x ) = f(xk) f(x ) 0, and we have ESk h xk+1 x 2i xk x 2 η(2 ηAτ) (f(xk) f(x )) . Taking expectation again and using tower property, we get E h xk+1 x 2i E h xk x 2i η(2 ηAτ) (E [f(xk)] inf f) . Unrolling this recurrence, we get E [f( x K)] inf f 1 η(2 ηAτ) x0 x 2 where K is the total number of iterations, x K is selected uniformly at random from the first K iterates {x0, x1, . . . , x K 1}. Star strongly convex regime. Due to star strong convexity of f, we further lower bound the Bregman divergence Df (xk, x ) = f(xk) f(x ) µ and we have ESk h xk+1 x 2i 1 η(2 ηAτ) µ Taking expectation again, using tower property we get E h xk+1 x 2i 1 η(2 ηAτ) µ E h xk x 2i . Unrolling the recurrence, we get E h x K x 2i 1 η(2 ηAτ) µ This concludes the proof. G.4 Proof of Theorem 4 Since each fi is proper, closed and convex, by Lemma 2, we know that each M γ fi is 1 γ -smooth. Therefore, it is reasonable to assume that M γ = 1 n Pn i=1 M γ fi is Lγ-smooth, with Lγ 1 γ . Applying Theorem 3 in this case, we obtain, M γ ( x K) inf M γ Theorem 3 1 αγ (2 αγLγ,τ) x0 x 2 where x K is chosen uniformly at random from the first K iterates {x0, x1, . . . , x K 1}, and Lγ,τ = n τ τ(n 1) 1 Multiplying both sides by γ, we obtain γM γ ( x K) inf γM γ 1 α (2 αγLγ,τ) x0 x 2 It is easy to see that the coefficient on the right-hand side is minimized when α = 1 γLγ,τ , and the convergence is given by γM γ ( x K) inf γM γ n τ τ(n 1) + n(τ 1) Notice that Lγ 1 γ . As a result, α = 1 γLγ 1. G.5 Proof of Theorem 5 Case of Fed Ex Prox-Gra DS-PP. We start with the following identity xk+1 x 2 = xk x 2 2ατ,k,G γ i Sk M γ fi (xk) , xk x + α2 τ,k,G γ2 i Sk M γ fi (xk) For the last term, we have α2 τ,k,G γ2 i Sk M γ fi (xk) = ατ,k,G γ2 1 M γ fi (xk) 2 = ατ,k,G γ2 1 M γ fi (xk) M γ fi (x ) 2 , where the last step is due to the assumption that we are in the interpolation regime. Using Fact 3, we can further upper bound the above expression, α2 τ,k,G γ2 i Sk M γ fi (xk) ατ,k,G γ2 1 DM γ fi (xk, x ) + DM γ fi (x , xk) ατ,k,G γ γLmax 1 + γLmax 1 DM γ fi (xk, x ) + DM γ fi (x , xk) , (34) where the last inequality is due to Fact 4. Now we look at the second term in Equation (33). i Sk M γ fi (xk) , xk x = 2ατ,k,G γ M γ fi (xk) M γ fi (x ) , xk x = 2ατ,k,G γ 1 DM γ fi (xk, x ) + DM γ fi (x , xk) . (35) Plugging (34) and (35) into (33), we obtain, xk x 2 ατ,k,G γ 2 γLmax 1 + γLmax DM γ fi (xk, x ) + DM γ fi (x , xk) xk x 2 ατ,k,G γ 2 + γLmax M γ fi (xk) M γ fi (x ) , (36) where the last inequality is due to DM γ fi (xk, x ) (14) = M γ fi (xk) M γ fi (x ) , and DM γ fi (x , xk) (15) 0. Now we want to lower bound ατ,k,G, notice that it can be viewed as a function of the iterate x and the sampled set S. Therefore, we use the notation inf ατ,k,G = inf x Rd,S [n],|S|=τ ατ,k,G (x, S) . As a result, we have ατ,k,G inf ατ,k,G 1, where the second inequality comes from the convexity of standard Euclidean norm. Plugging this lower bound into (36), we obtain xk x 2 inf ατ,k,G γ 2 + γLmax M γ fi (xk) M γ fi (x ) . 
Taking expectation conditioned on xk, we have ESk h xk+1 x 2i xk x 2 inf ατ,k,G γ 2 + γLmax M γ fi (xk) M γ fi (x ) = xk x 2 inf ατ,k,G γ 2 + γLmax (M γ (xk) inf M) , where the last identity is due to the fact that we are in the interpolation regime. Using Lemma 10, we have ESk h xk+1 x 2i xk x 2 inf ατ,k,G γ 2 + γLmax 1 1 + γLmax (f(xk) inf f) . Taking expectation again and using tower property, we obtain E h xk+1 x 2i E h xk x 2i inf ατ,k,G γ 2 + γLmax 1 1 + γLmax E [f(xk) inf f] . Following the same step as Theorem 1, we can unroll the above recurrence and obtain E [f( x K)] inf f 1 + γLmax inf ατ,k,G K , where K is the total number of iterations, x K is sampled uniformly at random from the first K-iterates {x0, x1, . . . , x K 1}. Case of Fed Ex Prox-Sto PS-PP. We start with the following identity xk+1 x 2 = xk x 2 2ατ,k,S γ i Sk M γ fi (xk) , xk x + α2 τ,k,S γ2 i Sk M γ fi (xk) For the last term of Equation (37), we have α2 τ,k,S γ2 i Sk M γ fi (xk) = ατ,k,S γ 1 M γ fi (xk) inf M γ fi = ατ,k,S γ 1 DM γ fi (xk, x ) . (38) While for the second term we have i Sk M γ fi (xk) , xk x = 2ατ,k,S γ 1 DM γ fi (xk, x ) + DM γ fi (x , xk) (15) 2ατ,k,S γ 1 i Sk DM γ fi (xk, x ) . (39) Plugging (38) and (39) into (37), we obtain xk+1 x 2 xk x 2 ατ,k,S γ 1 M γ fi (xk) inf M γ fi Now we want to lower bound ατ,k,S, notice that it can be viewed as a function of the iterate x and the sampled set S. Therefore, we use the notation inf ατ,k,S = inf x Rd,S [n],|S|=τ ατ,k,S (x, S) . As a result, we have ατ,k,S inf ατ,k,S. Notice that since each M γ fi is Li 1+γLi -smooth, we conclude that the function 1 τ P i Sk M γ fi is at least Lmax 1+γLmax -smooth5. Using the smoothness of the mentioned function and Fact 3, a lower bound on inf ατ,k,S is obvious, inf αk,τ,S 1 2 Lmax 1+γLmax γ = 1 1 + 1 γLmax This means that we have ατ,k,S inf ατ,k,S 1 1 + 1 γLmax Using the above lower bound in (40), we have xk+1 x 2 xk x 2 inf ατ,k,S γ 1 M γ fi (xk) inf M γ fi Taking expectation conditioned on xk, and noticing that we are in the interpolation regime, we obtain ESk h xk+1 x 2i xk x 2 inf ατ,k,S γ (M γ (xk) inf M) . 5Same as M γ (x), its smoothness constant can be much better. Using Lemma 10, we have ESk h xk+1 x 2i Lemma 10 xk x 2 inf ατ,k,S γ 1 + γLmax (f(xk) inf f) . Now, following the exact same steps as in the previous case of Fed Ex Prox-Gra DS, we result in E [f( x K)] inf f 1 inf ατ,k,S K , where K is the total number of iterations, x K is sampled uniformly at random from the first K-iterates {x0, x1, . . . , x K 1}. G.6 Proof of Corollary 1 If additionally we assume f is µ-strongly convex, then from Lemma 11, we know it indicates the following star strong convexity of M γ holds, M γ (x) M γ (x ) µ 1 + γLmax 1 Thus, we apply Theorem 3 with τ-nice sampling in the star strong convexity case, and obtain the following result: E h x K x 2i Theorem 3 1 αγ(2 αγLγ,τ) µ 2 (1 + γLmax) Since the convergence here is stated in terms of squared distance to the minimizer, we do not need further transformation. Notice that the convergence rate in this case, 1 αγ(2 αγLγ,τ) µ 2 (1 + γLmax), is also minimized when α = αγ,τ = 1 γLγ,τ . In case of α = αγ,τ, the convergence is given by E h x K x 2i 1 µ 2Lγ,τ (1 + γLmax) This concludes the proof. G.7 Proof of Corollary 2 Similar to the proof of Theorem 2, we start with the following identity xk+1 x 2 = xk α k,Gγ M γ (xk) x 2 = xk x 2 α k,Gγ M γ (xk) , xk x + α k,G 2 γ2 M γ (x) 2 . (41) The extrapolation parameter can be rewritten as α k,G = 1 + γLmax 1 n Pn i=1 M γ fi (xk) 2 M γ (xk) 2 . 
We have for the last term of (41), α k,G 2 γ2 M γ (xk) 2 = α k,Gγ γ + 1 Lmax M γ fi (xk) 2 = α k,Gγ γ + 1 Lmax M γ fi (xk) Mγfix 2 α k,Gγ γ + 1 Lmax DM γ fi (xk, x ) + DM γ fi (x , xk) , where the last inequality follows from the Li 1+γLi -smoothness of M γ fi. Utilizing the monotonicity of x 1+γx, for x > 0, we further obtain α k,G 2 γ2 M γ (xk) 2 α k,Gγ γ + 1 Lmax Lmax 1 + γLmax 1 DM γ fi (xk, x ) + DM γ fi (x , xk) = α k,Gγ γ + 1 Lmax Lmax 1 + γLmax (DM γ (xk, x ) + DM γ (x , xk)) = α k,Gγ (DM γ (xk, x ) + DM γ (x , xk)) . (42) For the second term of (41), we have 2α k,Gγ M γ (xk) , xk x = 2α k,Gγ M γ (xk) , x xk = 2α k,Gγ M γ (xk) M γ (x ) , x xk = 2α k,Gγ (DM γ (xk, x ) + DM γ (x , xk)) . (43) Plugging (43) and (42) into (41), we obtain xk+1 x 2 xk x 2 α k,Gγ (DM γ (xk, x ) + DM γ (x , xk)) . Notice that we know that DM γ (xk, x ) (14) = M γ (xk) M γ (x ) , DM γ (x , xk) (15) 0. As a result, we have xk+1 x 2 xk x 2 α k,Gγ (M γ (xk) M γ (x )) . Summing up the above recursion for k = 0, 1, ..., K 1, we notice that many of them telescope, we obtain k=0 α k,G (M γ (xk) inf M γ) x0 x 2 . Denote pk = α k,G/PK 1 k=0 α k,G for k = 0, 1, ..., K 1. If we sample x K randomly according to probabilities pk from the first K iterates {x0, x1, . . . , x K 1}, we can further write the above recursion as E [M γ ( x K)] inf M γ 1 PK 1 k=0 α k,G . Utilizing the local bound in Lemma 10, we further obtain, E f( x K) f inf 1 PK 1 k=0 α k,G . This concludes the proof. H Missing proofs of lemmas H.1 Proof of Lemma 1 Notice that since f is proper, closed and convex, by Fact 1, proxγf (x) is a singleton. We use the notation z(x) = proxγf (x). Using the definition of proxγf (x), we see that M γ f (x) = f(z(x)) + 1 2γ z(x) x 2 = f proxγf (x) + 1 proxγf (x) x 2 . Now, assume M γ f (x) = + . We have for any z Rd, + = M γ f (x) = f (z(x)) + 1 2γ z(x) x 2 f(z) + 1 which means that z is also optimal, which contradicts the uniqueness z(x) = proxγf (x). This indicates that M γ f (x) < + , thus, it is real-valued, which concludes the proof. H.2 Proof of Lemma 2 Let f be the convex conjugate of f, using Corollary 6.56 in the book by Beck [2017], we have M γ f = f + γ 2 2. We know that the convex conjugate of a proper, closed and convex function is also proper closed and convex. As a result, f + γ 2 2 is γ-strongly convex. This indicates that M γ f is γ-strongly convex, which implies M γ f is 1 γ -smooth. Notice that we have proxγf (x) = arg min z Rd by the definition of proximity operator. Using Theorem 5.30 from Beck [2017], we have M γ f (x) = 1 γ x proxγf (x) . This completes the proof. H.3 Proof of Lemma 3 To prove this lemma, we use Theorem 2.19 in the book by Beck [2017]. From the key observation that M γ f is the infimal convolution of the proper, convex function f and the real-valued convex function 1 2γ 2, we deduce that M γ f is convex. This completes the proof. H.4 Proof of Lemma 4 Let f be the convex conjugate of f. From Corollary 6.56 in the book by Beck [2017], it holds that M γ f = f + γ 2 2. Since f is L-smooth, we deduce that f is 1 L-strongly convex, and thus M γ f is 1 L + γ-strongly convex. This suggests that M γ f is 1+γL L -strongly convex, which in turn implies M γ f is L 1+γL-smooth. This completes the proof. H.5 Proof of Lemma 5 Notice that since M γ f is convex and differentiable, the condition M γ f (x) = 0 gives its set of minimizers. This optimality condition can be written exactly as x = proxγf (x) according to Lemma 2. 
Using Lemma 12, we know this condition also gives the set of minimizers of f, which suggests that f and M γ f have the same set of minimizers. Pick any x Rd that is a minimizer of f, using Lemma 1, we have inf M γ f = M γ f (x ) = f proxγf (x ) + 1 x proxγf (x ) 2 = f(x ) = inf f. This completes the proof. H.6 Proof of Lemma 6 For any x Rd, we have M γ f (x) = min z Rd This completes the proof. H.7 Proof of Lemma 7 From Lemma 3 and Lemma 4, we immediately obtain that each M γ fi is convex and Li 1+γLi -smooth. This immediately suggests that M = 1 n Pn i=1 M γ fi is convex and Lγ-smooth with Li 1 + γLi . Then by Lemma 13, we have 1 n2 Combing the above two inequalities, we have Li 1 + γLi Lγ 1 Li 1 + γLi . We then look at the condition number defined in Theorem 1. It is easy to verify that C (γ, n, αγ,n) = Lγ (1 + γLmax) and, C (γ, 1, αγ,1) = Lmax. As a result, C (γ, n, αγ,n) = Lγ (1 + γLmax) i=1 Li 1 + γLmax Lmax = C (γ, n, 1) , Notice that we can write C (γ, τ, αγ,τ) as an interpolation between C (γ, n, αγ,n) and C (γ, 1, αγ,1), therefore Lγ (1 + γLmax) C (γ, n, αγ,n) C (γ, τ, αγ,τ) C (γ, 1, αγ,1) = Lmax. In cases where there exists at least one Li < Lmax, we have i=1 Li 1 + γLmax 1 + γLi < Lmax. which is true for all 0 < γ < + . Thus, C (γ, n, αγ,n) < C (γ, τ, αγ,τ) < Lmax = C (γ, 1, αγ,1). Now we give an example that when all Li = Lmax, still C (γ, n, αγ,n) = 1 n C (γ, 1, αγ,1) = 1 Example 1. Consider the setting where fi : Rd 7 R is defined as fi(x) = θ 2x2 i for some θ > 0. Here xi denotes the i-th coordinate of the vector x Rd, f : Rd 7 R is given by f(x) = θ 2n x 2. It is easy to show that for each fi is a convex, θ-smooth function and the smoothness constant θ cannot be improved since θ 2 x 2 θ For f(x) = θ 2n x 2, apparently, it is θ n-smooth and convex. We have the following formula for the Moreau envelope of fi(x), M γ fi (x) = 1 2 θ 1 + γθ x2 i . As expected, each one of them is convex and θ 1+γθ-smooth. For M γ (x), it is given by M γ (x) = 1 i=1 M γ fi (x) = 1 2 θ n(1 + γθ) x 2 , thus, we know it is convex and Lγ = θ n(1+γθ)-smooth. In this case Lmax Lγ (1 + γLmax) = θ θ n(1+γθ) (1 + γθ) = n, which is Lγ (1 + γLmax) = C (γ, n, αγ,n) = 1 n C (γ, 1, αγ,1) = 1 H.8 Proof of Lemma 8 By Lemma 5, we know that fi and M γ fi have the same set of minimizers and minimum. Denote the set of minimizers as Xi, since we are in the interpolation regime, we know that the set of minimizers of f is given by, Now we prove that every x in X is a minimizer of M = 1 n Pn i=1 M γ fi. This is true since x X minimizes each fi, thus M γ fi at the same time. The minimum is given by i=1 inf M γ fi = 1 i=1 inf fi = inf f. We then prove that every x / X is not a minimizer of f. If x / X, there exists at least one set Xj such that x / Xj. Thus M γ fj (x) > inf M γ fj. This indicates that M γ (x) > inf M, which means, x / X is not a minimizer of M. H.9 Proof of Lemma 9 From Lemma 6, it is clear that M γ f is a global lower bound of f satisfying M γ f (x) f(x) for any x Rd and γ > 0. Notice that the definition of M γ indicates that M γ (x) = 1 i=1 M γ fi (x) i=1 min zi Rd i=1 fi(z) + 1 = M γ f (x) , holds for any x Rd and γ > 0. Combining the above result, we have M γ (x) M γ f (x) f(x) for any x Rd and γ > 0. Notice that in Lemma 8, we have already shown that M γ and f have the same set of minimizers and minimum in the interpolation regime. A direct application of Lemma 5 indicates that M γ f and f have the same set of minimizers and minimum. 
Therefore, combining the above statement, we know that M γ, M γ f and f have the same set of minimizers and minimum. Thus, for any x belongs to the set of minimizers, we have M γ (x ) = M γ f (x ) = f(x ). This completes the proof. H.10 Proof of Lemma 10 We start from noticing that according to Lemma 1, the following identity is true for Moreau envelope, M γ fi (x) = fi(proxγfi (x)) + 1 x proxγfi (x) 2 . (44) For the second squared norm term, we have the following inequality due to the smoothness of each fi and the fact that fi proxγfi (x) = 1 γ x proxγfi (x) , x proxγfi (x) 2 = x proxγfi (x) , x proxγfi (x) = γ fi(proxγfi (x)), x proxγfi (x) γ fi(x) fi proxγfi (x) γLi x proxγfi (x) 2 , which leads to the following lower bound: x proxγfi (x) 2 1 fi(x) fi proxγfi (x) . Plug in the above inequality into (44) and notice that inf M = 1 n Pn i=1 inf M γ fi = 1 n Pn i=1 inf fi, we obtain the following lower bound on M γ fi (x), M γ fi (x) inf M γ fi fi proxγfi (x) + 1 2 + γLi fi(x) fi proxγfi (x) inf fi = 1 2 + γLi (fi(x) inf fi) + 1 1 2 + γLi fi(proxγfi (x)) inf fi . Now let us look at the term fi proxγfi (x) inf f. Using again Li-smoothness of fi, we have fi(x) fi(proxγfi (x)) fi(proxγfi (x)), x proxγfi (x) Li x proxγfi (x) 2 . Notice that x proxγfi (x) = γ fi(proxγfi (x)). As a result, we have, fi(x) γ fi(proxγfi (x)) 2 Liγ2 fi(proxγfi (x)) 2 fi(proxγfi (x)), fi(x) inf fi γ + γ2Li fi(proxγfi (x)) 2 fi proxγfi (x) inf fi. Using the interpolation regime assumption, we have fi proxγfi (x) 2 = fi proxγfi (x) fi(x ) 2 2Li Dfi proxγfi (x) , x = 2Li fi(proxγfi (x)) inf fi , where the inequality is obtained using Fact 3. As a result, we obtain the following bound, fi proxγfi (x) inf fi 1 1 + γLi(2 + γLi) (fi(x) inf fi) = 1 (1 + γLi)2 (fi(x) inf fi) . Plug the above lower bound into (45), we obtain M γ fi (x) inf M γ fi 1 1 + γLi (fi(x) inf fi) , (46) Notice that we have M γ (x) = 1 n Pn i=1 M γ fi (x). Since we are in the interpolation regime, from Lemma 9, we know that inf M γ = M γ (x ) = 1 i=1 M γ fi (x ) = 1 i=1 inf M γ fi, inf f = f(x ) = 1 i=1 fi(x ) = 1 i=1 inf fi. We average (46) for each i [n] and obtain M γ (x) inf M γ 1 1 1 + γLi (fi(x) inf fi) 1 1 + γLmax 1 i=1 (fi(x) inf fi) = 1 1 + γLmax (f(x) inf f) . This concludes the proof. H.11 Proof of Lemma 11 We start with picking any point x Rd, since we are in the interpolation regime, according to Lemma 9, we have M γ (x ) = f(x ). Applying Lemma 10, we get M γ (x) M γ (x ) 1 1 + γLmax (f(x) f(x )) . (47) We know that from the µ-strong convexity of f, we have for any x Rd, f(x) f(x ) f(x ), x x µ Notice that since f(x ) = 0, we have f(x) f(x ) µ 2 x x 2 . (48) Combining the above two inequalities (47) and (48), we have M γ (x) M γ (x ) µ 1 + γLmax 1 This concludes the proof. H.12 Proof of Lemma 12 Notice that x Rd is a minimizer of f if and only if 0 f(x). This inclusion holds if and only if 0 (γf(x)), which can be rewritten as x x (γf(x)). By the equivalence of (i) and (ii) in Fact 2, the above condition is the same as x = proxγf (x). H.13 Proof of Lemma 13 Since each fi is Li-smooth, the following function is convex for every i [n], 2 x 2 fi (x) . Thus, 1 n Pn i=1 Li 2 x 2 1 is also a convex function, which indicates f(x) is also 1 n Pn i=1 Li-smooth. This means that i=1 Li. (49) Now notice that the L-smoothness of f is equivalent to the following function being convex Pick any j [n], we have i=1 fi(x) + X 1 i =j n fi(x) = n L 2 x 2 fj(x). 
Since all functions are convex and the sum of convex functions is convex, n L 2 x 2 fj(x), is convex, which indicates that fj(x) is also n L-smooth. As a result, for every j [n], we have n L Lj. Summing up the inequality for every j [n], we have j=1 Lj L. (50) Combining (49) and (50), we have In order to demonstrate that both bounds are tight in the above inequality, we consider cases where they are identities. (i): Consider the case that each function fi(x) = 1 2 Li x 2, it is easy to see that f(x) = 1 2 1 n Pn i=1 Li x 2. In this case L = 1 n Pn i=1 Li, the upper bound is an identity. (ii): Consider the case that each function fi(x) = 1 2 θ x2 i , where θ > 0 is a constant, xi is the i-th coordinate of x Rd. In this case f(x) = 1 n x 2. It is easy to verify that in this case Li = θ, L = 1 nθ. Thus 1 n2 Pn i=1 Li = L, the lower bound is an identity. This concludes the proof. H.14 Proof of Lemma 14 From the definition of C(γ, τ, 1) and C (γ, τ, αγ,τ), we know that C(γ, τ, 1) C (γ, τ, αγ,τ) = 1 γLγ,τ (2 γLγ,τ). Now let t = γLγ,τ, we have the following bound on t according to the definition of Lγ,τ given in Theorem 1. = n τ τ(n 1) γLmax 1 + γLmax + n(τ 1) τ(n 1) γLγ. Notice that in Lemma 7, we have shown that Li 1 + γLi , and due to Fact 4, we have Lmax 1 + γLmax . As a result, t n τ τ(n 1) γLmax 1 + γLmax + n(τ 1) τ(n 1) γLmax 1 + γLmax = γLmax 1 + γLmax < 1. It is easy to show that g(t) = 1 t(2 t) is monotone decreasing when t [0, 1], thus C(γ, τ, 1) C (γ, τ, αγ,τ) 1 γLmax 1+γLmax 1 γLmax 1+γLmax = 2 + 1 γLmax + γLmax where the last inequality is due to the AM-GM inequality. This concludes the proof. H.15 Proof of Lemma 15 As suggested by Lemma 7, we have C (γ, n, αγ,n) C (γ, τ, αγ,τ) C (γ, 1, αγ,1) , τ [n]. Notice that αγ,τ is given by αγ,τ = 1 γLγ,τ , and we know that Lγ,τ = n τ τ(1 n) Lmax 1 + γLmax + n(τ 1) From Lemma 7 and Fact 4, we know that Lmax 1 + γLmax . Consequently, Lγ,τ decreases as τ increases. Therefore, αγ,τ increases with the increase of τ, as illustrated by the following inequality αγ,1 αγ,τ αγ,n, τ [n]. This concludes the proof. H.16 Proof of Lemma 16 We refer the readers to the proof of Lemma 3.1 of Böhm and Wright [2021]. H.17 Proof of Lemma 17 We refer the readers to the proof of Proposition 7 of Yu et al. [2015]. H.18 Proof of Lemma 18 Observe that since 0 < γ < 1 ρ, we do have f + 1 γ 2 being strongly convex. This indicates that proxγf is always a singleton and therefore M γ f is differentiable, as suggested by Lemma 16. Notice that x is stationary point of M γ f if and only if M γ f (x) = 0. This is equivalent to 1 γ x proxγf (x) = 0, which is x = proxγf (x). In addition, x = proxγf (x) is equivalent to γ (x x) = 0, which is f(x) = 0. Combining the above statements, we have f(x) = 0 if and only if M γ f (x) = 0. This suggests that the two functions have the same set of stationary points. H.19 Proof of Lemma 19 Apply Theorem 1 of Khaled and Richtárik [2023], notice that in this case GD satisfy the expected smoothness assumption given in Assumption 2 of Khaled and Richtárik [2023] with A = 0, B = 1 and C = 0, we obtain that when the step size η satisfies 0 < η < 1 LB = 1 where L is the smoothness constant of f, the iterates of GD satisfy min 0 k K 1 E h f(xk) 2i 2 (f(x0) inf f) This completes the proof. H.20 Proof of Lemma 20 Notice that we are in the interpolation regime, by Lemma 8, we know that f and M γ have the same set of minimizers and minimum. As a result, M γ (x ) = 1 i=1 M γ fi (x ) Lemma 8 = f(x ). 
(51) From the above inequality, we obtain that 1 n Pn i=1 M γ fi (x) M γ fi (x ) n Pn i=1 M γ fi (x) 2 (51) = M γ (x) M γ (x ) γ M γ (x) 2 . Then by the smoothness of M γ and Fact 3, we have M γ (x) M γ (x ) γ M γ (x) 2 1 2Lγ M γ (x) M γ (x ) 2 γ M γ (x) 2 Thus, by combining the above inequalities, we have 1 n Pn i=1 M γ fi (x) M γ fi (x ) n Pn i=1 M γ fi (x) 2 1 2γLγ . Notice that from the definition of αk,S for Fed Ex Prox-Sto PS, we have 1 n Pn i=1 M γ fi (xk) M γ fi (x ) n Pn i=1 M γ fi (xk) 2 1 2γLγ . Therefore, using the above lower bound, it is straight forward to further relax (12) to E f( x K) inf f 2Lγ (1 + 2γLmax) x0 x 2 This concludes the proof. I Experiments In this section, we describe the settings and results of numerical experiments to demonstrate the effectiveness of our method. I.1 Experiment settings We consider the overparameterized linear regression problem in the finite sum setting where d is the dimension of the problem, n is the total number of clients, each function fi has the following form 2 Aix bi 2 , where Ai Rni d, bi Rni. Here ni is the number of samples on each client. It is easy to see that for each function fi, we have fi(x) = A i Aix A i bi, and 2fi(x) = A i Ai Od. Thus, it follows that A i Aix A i bi , and 2f(x) = 1 i=1 A i Ai Od. The problem is therefore convex. Notice that one implicit assumption for the class of proximal point methods in practice is that the proximity operator can be computed efficiently. In the setting of linear regression, we have the following closed form formula for the proximity operator proxγfi, which holds for any x Rd, proxγfi (x) = A i Ai + 1 1 A i bi + 1 Observe that in the linear regression problem, since we know the closed form expression of each fi and f, we know the corresponding smoothness constant Li = λmax A i Ai . Notice that from Lemma 1, we have M γ fi (x) = fi proxγ (fi) + 1 x proxγ (fi) (x) 2 . Since we know proxγ (fi) in closed form using (52), we also know each local Moreau envelope in closed form, and thus the same for M γ = 1 n Pn i=1 M γ fi. As a result, we can deduce Lγ for M γ. In our experiments, we pick d Pn i=1 ni so that we are in the interpolation regime. Each Ai is generated randomly from a uniform distribution between [0, 1), and the corresponding vector bi is also generated from the same uniform distribution. In order to find a minimizer x , we run gradient descent for sufficient amount of iterations. All the codes for the experiments are written in Python 3.11 with Num Py and Sci Py package. The code was run on a machine with AMD Ryzen 9 5900HX Radeon Graphics @ 3.3 GHz and 8 cores 16 threads. For experiment in the small dimension regime, each algorithm considers here only takes seconds to finish. For larger experiments, depending on the specific implementation, the algorithms typically take a few minutes to half an hour to finish. For Fed Prox, Fed Ex P and our method Fed Ex Prox in the full participation case, the algorithm for a specific dataset is deterministic, while in case where client sampling is taken into account, the randomness of the algorithms comes from the specific sampling strategy used. Our code is publicly available at the following link: https://anonymous.4open.science/r/Fed Ex Prox-F262/ I.2 Large dimension regime In this section we provide the numerical experiments in the large dimension regime, where ni = 20 for each i [n], n = 30, d = 900. 
I.2.1 Comparison of Fed Ex Prox and Fed Prox In this section, we compare the performance of Fed Prox with our method Fed Ex Prox in the full participation case and in the client partial participation case, demonstrating that the extrapolated counterpart outperforms Fed Prox in iteration complexity. Notice that here we are only concerned with iteration complexity, since the amount of computations is almost the same for the two algorithms. The only difference is that for Fed Ex Prox, instead of simply averaging the iterates obtained from each client, the server performs extrapolation. From Figure 2, it is easy to see that our proposed algorithm Fed Ex Prox outperforms Fed Prox, which provides numerical evidence for our theoretical findings. Notably, in order to achieve the small level of function value sub-optimality, Fed Ex Prox typically requires only half the number of iterations needed by Fed Prox, which indicates a factor of 2 speed up in terms of iteration complexity. Another observation is that, αγ,n is decreasing as γ increases, which suggests that when local step sizes are small, the practice of simply averaging the iterates is far from optimal. We also compare the performance of the two algorithms in the client partial participation setting. As one can observe from Figure 3, Fed Ex Prox still outperforms Fed Prox in the client partial participation setting, which further corroborates our theoretical findings. Observe that αγ,τ here increases as τ becomes larger, which coincides with our predictions in Remark 7. 0 2000 4000 6000 8000 10000 Iterations f (xk) f (x ) γ = 0.0001, αγ,n = 4.23199 Fed Prox Fed Ex Prox 0 2000 4000 6000 8000 10000 Iterations f (xk) f (x ) γ = 0.001, αγ,n = 2.23763 Fed Prox Fed Ex Prox 0 2000 4000 6000 8000 10000 Iterations f (xk) f (x ) γ = 0.01, αγ,n = 2.03808 Fed Prox Fed Ex Prox 0 2000 4000 6000 8000 10000 Iterations f (xk) f (x ) γ = 0.1, αγ,n = 2.01799 Fed Prox Fed Ex Prox 0 2000 4000 6000 8000 10000 Iterations f (xk) f (x ) γ = 1, αγ,n = 2.01596 Fed Prox Fed Ex Prox 0 2000 4000 6000 8000 10000 Iterations f (xk) f (x ) γ = 10, αγ,n = 2.01576 Fed Prox Fed Ex Prox Figure 2: Comparison of convergence of Fed Ex Prox and Fed Prox in terms of iteration complexity in the full participation setting. For this experiment γ is picked from the set {0.0001, 0.001, 0.01, 0.1, 1, 10}, the αγ,n indicates the optimal constant extrapolation parameter as defined in Theorem 1. For each choice of γ, the two algorithms are run for K = 10000 iterations, respectively. 0 2000 4000 6000 8000 10000 Iterations f (xk) f (x ) γ = 0.0001, αγ,10 = 3.22519, τ = 10 Fed Prox Fed Ex Prox 0 2000 4000 6000 8000 10000 Iterations f (xk) f (x ) γ = 0.0001, αγ,15 = 3.22859, τ = 15 Fed Prox Fed Ex Prox 0 2000 4000 6000 8000 10000 Iterations f (xk) f (x ) γ = 0.0001, αγ,20 = 3.23028, τ = 20 Fed Prox Fed Ex Prox 0 2000 4000 6000 8000 10000 Iterations f (xk) f (x ) γ = 0.001, αγ,10 = 1.23594, τ = 10 Fed Prox Fed Ex Prox 0 2000 4000 6000 8000 10000 Iterations f (xk) f (x ) γ = 0.001, αγ,15 = 1.23679, τ = 15 Fed Prox Fed Ex Prox 0 2000 4000 6000 8000 10000 Iterations f (xk) f (x ) γ = 0.001, αγ,20 = 1.23721, τ = 20 Fed Prox Fed Ex Prox Figure 3: Comparison of convergence of Fed Ex Prox and Fed Prox in terms of iteration complexity in the client partial participation setting. For this experiment γ is picked from the set {0.0001, 0.001}, the client minibatch size τ is chosen from {10, 15, 20} and the αγ,n indicates the optimal constant extrapolation parameter as defined in Theorem 1. 
For each choice of γ and τ, the two algorithm are run for K = 10000 iterations, respectively. 0 2000 4000 6000 8000 10000 Iterations f (xk) f (x ) Fed Ex P, t = 1 Fed Ex Prox, γ = 0.0001 Fed Ex Prox, γ = 0.001 Fed Ex Prox, γ = 0.01 Fed Ex Prox, γ = 0.1 Fed Ex Prox, γ = 1 Fed Ex Prox, γ = 10 Fed Ex Prox, γ = 100 0 2000 4000 6000 8000 10000 Iterations f (xk) f (x ) Fed Ex P, t = 5 Fed Ex Prox, γ = 0.0001 Fed Ex Prox, γ = 0.001 Fed Ex Prox, γ = 0.01 Fed Ex Prox, γ = 0.1 Fed Ex Prox, γ = 1 Fed Ex Prox, γ = 10 Fed Ex Prox, γ = 100 0 2000 4000 6000 8000 10000 Iterations f (xk) f (x ) Fed Ex P, t = 10 Fed Ex Prox, γ = 0.0001 Fed Ex Prox, γ = 0.001 Fed Ex Prox, γ = 0.01 Fed Ex Prox, γ = 0.1 Fed Ex Prox, γ = 1 Fed Ex Prox, γ = 10 Fed Ex Prox, γ = 100 Figure 4: Comparison in terms of iteration complexity for Fed Ex Prox with different step sizes γ chosen from {0.0001, 0.001, 0.01, 1, 10, 100} in the full participation setting. In the figure, we use Fed Ex P with different iterations of local training t {1, 5, 10} as a benchmark in the three sub-figures. The local step size for Fed Ex P is set to be the largest possible value 1 6t Lmax , where Lmax = maxi Li. 0 2000 4000 6000 8000 10000 Iterations f (xk) f (x ) Fed Ex Prox, γ = 0.0001 Fed Ex Prox, γ = 0.0005 Fed Ex Prox, γ = 0.001 Fed Ex Prox, γ = 0.01 Fed Ex Prox, γ = 1 Fed Ex Prox, γ = 10 0 2000 4000 6000 8000 10000 Iterations f (xk) f (x ) Fed Ex Prox, γ = 0.0001 Fed Ex Prox, γ = 0.0005 Fed Ex Prox, γ = 0.001 Fed Ex Prox, γ = 0.01 Fed Ex Prox, γ = 1 Fed Ex Prox, γ = 10 0 2000 4000 6000 8000 10000 Iterations f (xk) f (x ) Fed Ex Prox, γ = 0.0001 Fed Ex Prox, γ = 0.0005 Fed Ex Prox, γ = 0.001 Fed Ex Prox, γ = 0.01 Fed Ex Prox, γ = 1 Fed Ex Prox, γ = 10 Figure 5: Comparison in terms of iteration complexity for Fed Ex Prox with different step sizes γ chosen from {0.0001, 0.0005, 0.01, 1, 10} in the client partial participation case. Different client minibatch sizes are used, the minibatch size τ is chosen from {5, 10, 20}. I.2.2 Comparison of Fed Ex Prox with different local step size In this section, we compare the performance in terms of iteration complexity for Fed Ex Prox with different local step sizes. We also include Fed Ex P as a reference. The local step size of Fed Ex P is chosen to be 1 6t Lmax , where t is the number of gradient descent iterations performed by each client for local training, Lmax = maxi Li, where Li is the smoothness constant of fi. As one can observe from Figure 4, for our proposed method Fed Ex Prox, the larger γ is, the faster it will converge. However, as γ becomes larger, the improvement in iteration complexity becomes trivial at some point. Note that for different γ, the complexities required to compute the proximity operator locally varies and often larger γ requires more computation than smaller γ. Compared to Fed Ex P with the best local step size 1 6t Lmax , Fed Ex Prox with a large enough γ is better in terms of iteration complexity. In the case where the computation of proximity operator is efficient, our method has a better computation complexity as well. Notice that small γ leads to slow down of our method, and we do not claim that the iteration complexity of Fed Ex Prox is always better than Fed Ex P. However, it is provable that Fed Ex Prox indeed has a better worst case iteration complexity. 
We want to emphasize a key difference between Fed Ex P and our method is that we do not have any constraints on the local step size γ, and our method converges for arbitrary local step size γ > 0, while for Fed Ex P, a misspecified step size could lead to divergence. We also compare Fed Ex Prox with different step sizes in the client sampling case, see Figure 5. However, since there is no explicit convergence guarantee for Fed Ex P in this case, we did not include Fed Ex P in the plot. In the client partial participation case, the same behavior of how our proposed algorithm Fed Ex Prox changes according to different local step sizes γ is observed. A small γ leads to slow convergence of the algorithm, while for large γ, the convergence is improved. However, at some point, the improvement becomes trivial. 0 2000 4000 6000 8000 10000 Iterations f (xk) f (x ) Fed Ex Prox Fed Ex Prox-Gra DS Fed Ex Prox-Sto PS 0 2000 4000 6000 8000 10000 Iterations f (xk) f (x ) Fed Ex Prox Fed Ex Prox-Gra DS Fed Ex Prox-Sto PS 0 2000 4000 6000 8000 10000 Iterations f (xk) f (x ) Fed Ex Prox Fed Ex Prox-Gra DS Fed Ex Prox-Sto PS 0 2000 4000 6000 8000 10000 Iterations f (xk) f (x ) Fed Ex Prox Fed Ex Prox-Gra DS Fed Ex Prox-Sto PS 0 2000 4000 6000 8000 10000 Iterations f (xk) f (x ) Fed Ex Prox Fed Ex Prox-Gra DS Fed Ex Prox-Sto PS 0 2000 4000 6000 8000 10000 Iterations f (xk) f (x ) Fed Ex Prox Fed Ex Prox-Gra DS Fed Ex Prox-Sto PS Figure 6: Comparison of Fed Ex Prox, Fed Ex Prox-Gra DS and Fed Ex Prox-Sto PS in terms of iteration complexity with different step sizes γ chosen from {0.0005, 0.0005, 0.05, 0.5, 1, 5} in the full participation setting. I.2.3 Comparison of Fed Ex Prox and its adaptive variants In this section, we compare Fed Ex Prox and its two adaptive variants Fed Ex Prox-Gra DS and Fed Ex Prox-Sto PS. We first focus on the full participation case. Note that in this case, the all the algorithms are deterministic. For Fed Ex Prox-Gra DS, as it is suggested by Theorem 2, the extrapolation parameter is given by αk = αk,G := 1 n Pn i=1 xk proxγfi (xk) 2 1 n Pn i=1 xk proxγfi (xk) 2 . The server can use the local iterates it received from each client to compute αk,G directly. If, in addition, we know Lmax, we can implement a version that has a better theoretical guarantee, αk,G := 1 + γLmax 1 n Pn i=1 xk proxγfi (xk) 2 1 n Pn i=1 xk proxγfi (xk) 2 . For Fed Ex Prox-Sto PS, we have αk = αk,S = 1 n Pn i=1 M γ fi (xk) inf M γ fi n Pn i=1 M γ fi (xk) 2 . In order to implement αk,S, the server requires each client to send the function value of its Moreau envelope at the current iterate to it, and we need to know each inf M γ fi which, according to Lemma 5, is the same as inf fi. From Figure 6, we can observe that in all cases when γ is sufficiently large, Fed Ex Prox-Sto PS is the best among the three algorithms considered, and Fed Ex Prox-Gra DS outperforms Fed Ex Prox, this provides numerical evidence for the effectiveness of our proposed algorithms. In the cases when γ is small, the convergence of Fed Ex Prox-Gra DS seems to be better than the other two algorithms. We also plot the difference of extrapolation parameter used by the algorithms in each iteration. From Figure 7, observe that when γ is small, αk,G is often much larger than αk,S, resulting in better convergence of Fed Ex Prox-Gra DS as observed in the first two plots of Figure 6. 
Figure 7: Comparison of the extrapolation parameter $\alpha_k$ used by Fed Ex Prox, Fed Ex Prox-Gra DS and Fed Ex Prox-Sto PS in each iteration with different step sizes γ chosen from {0.0005, 0.005, 0.05, 0.5, 1, 5} in the full participation setting.

When γ becomes larger, $\alpha_{k,G}$ and $\alpha_{k,S}$ become comparable, and their performance is also comparable, with Fed Ex Prox-Sto PS slightly better than Fed Ex Prox-Gra DS. We also conduct the experiment where we take client partial participation into account. We can observe from Figure 8 that in all cases, the two adaptive variants Fed Ex Prox-Gra DS-PP and Fed Ex Prox-Sto PS-PP outperform Fed Ex Prox in iteration complexity, and between the two adaptive variants, Fed Ex Prox-Gra DS is the better one almost all the time. Moreover, Fed Ex Prox-Gra DS seems to be more stable than Fed Ex Prox-Sto PS, especially when γ is small.

Figure 8: Comparison of Fed Ex Prox, Fed Ex Prox-Gra DS and Fed Ex Prox-Sto PS in terms of iteration complexity with different step sizes γ in the client partial participation (PP) setting. The client minibatch size τ is chosen from {5, 10, 20}; for each minibatch size, a step size γ ∈ {0.001, 0.005, 0.1, 0.5, 1, 5, 10, 50, 100, 500} is randomly selected (panels: γ = 0.01, 0.5, 10 with τ = 20; γ = 0.001, 0.01, 1 with τ = 10; γ = 0.05, 5, 100 with τ = 5).

Neur IPS Paper Checklist

1. Claims

Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?

Answer: [Yes]

Justification: The abstract and introduction accurately reflect the contributions made in this paper, which are mainly presented in Section 3, Section 4 and some parts of the Appendix.

Guidelines: The answer NA means that the abstract and introduction do not include the claims made in the paper. The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers.
The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings. It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.

2. Limitations

Question: Does the paper discuss the limitations of the work performed by the authors?

Answer: [Yes]

Justification: The limitations of the work are discussed in Section 5.1.

Guidelines: The answer NA means that the paper has no limitations, while the answer No means that the paper has limitations, but those are not discussed in the paper. The authors are encouraged to create a separate "Limitations" section in their paper. The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be. The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated. The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon. The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size. If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren't acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.

3. Theory Assumptions and Proofs

Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?

Answer: [Yes]

Justification: The full set of assumptions and a complete and correct proof are provided for every fact, lemma, theorem and corollary appearing in this paper.

Guidelines: The answer NA means that the paper does not include theoretical results. All the theorems, formulas, and proofs in the paper should be numbered and cross-referenced. All assumptions should be clearly stated or referenced in the statement of any theorems. The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition. Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material. Theorems and Lemmas that the proof relies upon should be properly referenced.
4. Experimental Result Reproducibility

Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?

Answer: [Yes]

Justification: The details of the experiments are included in the experiment section in Appendix I. The code is also provided via the corresponding link.

Guidelines: The answer NA means that the paper does not include experiments. If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not. If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable. Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general, releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed. While Neur IPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example: (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm. (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully. (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset). (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.

5. Open access to data and code

Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?

Answer: [Yes]

Justification: The details of the experiments are described in detail in Appendix I, and the code is given in the corresponding anonymous link.

Guidelines: The answer NA means that the paper does not include experiments requiring code. Please see the Neur IPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details. While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer.
Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark). The instructions should contain the exact command and environment needed to run to reproduce the results. See the Neur IPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details. The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc. The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why. At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable). Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.

6. Experimental Setting/Details

Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?

Answer: [Yes]

Justification: All the details of the experiments and a link to the anonymized repository are provided, which is enough to understand the experiments.

Guidelines: The answer NA means that the paper does not include experiments. The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them. The full details can be provided either with the code, in appendix, or as supplemental material.

7. Experiment Statistical Significance

Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?

Answer: [Yes]

Justification: The details are given in the experiment section; note that in the full participation case our proposed methods are deterministic for a given dataset, so no error bars are needed in this case.

Guidelines: The answer NA means that the paper does not include experiments. The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper. The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions). The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.). The assumptions made should be given (e.g., Normally distributed errors). It should be clear whether the error bar is the standard deviation or the standard error of the mean. It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar rather than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified. For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g., negative error rates).
If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.

8. Experiments Compute Resources

Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?

Answer: [Yes]

Justification: The computational resources needed for the experiments are described in the experiment section.

Guidelines: The answer NA means that the paper does not include experiments. The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage. The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute. The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn't make it into the paper).

9. Code Of Ethics

Question: Does the research conducted in the paper conform, in every respect, with the Neur IPS Code of Ethics (https://neurips.cc/public/EthicsGuidelines)?

Answer: [Yes]

Justification: The research conducted in this paper conforms with the Neur IPS Code of Ethics in every respect.

Guidelines: The answer NA means that the authors have not reviewed the Neur IPS Code of Ethics. If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics. The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).

10. Broader Impacts

Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?

Answer: [NA]

Justification: No potential societal impact is expected by the authors.

Guidelines: The answer NA means that there is no societal impact of the work performed. If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact. Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations. The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster. The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.
If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).

11. Safeguards

Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?

Answer: [NA]

Justification: In the authors' expectation, the paper poses no such risks.

Guidelines: The answer NA means that the paper poses no such risks. Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters. Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images. We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

12. Licenses for existing assets

Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?

Answer: [NA]

Justification: The paper does not use existing assets.

Guidelines: The answer NA means that the paper does not use existing assets. The authors should cite the original paper that produced the code package or dataset. The authors should state which version of the asset is used and, if possible, include a URL. The name of the license (e.g., CC-BY 4.0) should be included for each asset. For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided. If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset. For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided. If this information is not available online, the authors are encouraged to reach out to the asset's creators.

13. New Assets

Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?

Answer: [Yes]

Justification: The experiments and code for this paper are well documented. The details of the dataset used are described in the experiment section of the paper.

Guidelines: The answer NA means that the paper does not release new assets. Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc. The paper should discuss whether and how consent was obtained from people whose asset is used. At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.
14. Crowdsourcing and Research with Human Subjects

Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?

Answer: [NA]

Justification: The paper does not involve crowdsourcing.

Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper. According to the Neur IPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.

15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects

Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?

Answer: [NA]

Justification: The paper does not involve crowdsourcing.

Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper. We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the Neur IPS Code of Ethics and the guidelines for their institution. For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.