# Faster Federated Optimization under Second-Order Similarity

Published as a conference paper at ICLR 2023.

Ahmed Khaled (Princeton University) and Chi Jin (Princeton University)

ABSTRACT

Federated learning (FL) is a subfield of machine learning where multiple clients try to collaboratively learn a model over a network under communication constraints. We consider finite-sum federated optimization under a second-order function similarity condition and strong convexity, and propose two new algorithms: SVRP and Catalyzed SVRP. This second-order similarity condition has grown popular recently, and is satisfied in many applications including distributed statistical learning and differentially private empirical risk minimization. The first algorithm, SVRP, combines approximate stochastic proximal point evaluations, client sampling, and variance reduction. We show that SVRP is communication efficient and achieves superior performance to many existing algorithms when function similarity is high enough. Our second algorithm, Catalyzed SVRP, is a Catalyst-accelerated variant of SVRP that achieves even better performance and uniformly improves upon existing algorithms for federated optimization under second-order similarity and strong convexity. In the course of analyzing these algorithms, we provide a new analysis of the Stochastic Proximal Point Method (SPPM) that might be of independent interest. Our analysis of SPPM is simple, allows for approximate proximal point evaluations, does not require any smoothness assumptions, and shows a clear benefit in communication complexity over ordinary distributed stochastic gradient descent.

1 INTRODUCTION

Federated Learning (FL) is a subfield of machine learning where many clients (e.g. mobile phones or hospitals) collaboratively try to solve a learning task over a network without sharing their data. Federated Learning finds applications in many areas including healthcare, Internet of Things (IoT) devices, manufacturing, and natural language processing (Kairouz et al., 2019; Nguyen et al., 2021; Liu et al., 2021). One of the central problems of FL is federated or distributed optimization, which has been the subject of intensive ongoing research effort over the past few years (Wang et al., 2021). The standard formulation of federated optimization is to solve the minimization problem

$$\min_{x \in \mathbb{R}^d} \Big[ f(x) = \frac{1}{M} \sum_{m=1}^{M} f_m(x) \Big], \qquad (1)$$

where each function $f_m$ represents the empirical risk of model $x$ calculated using the data on the $m$-th client, out of a total of $M$ clients. Each client is connected to a central server tasked with coordinating the learning process. We shall assume that the loss on each client is $\mu$-strongly convex. Because the model dimensionality $d$ is often large in practice, the most popular methods for solving Problem (1) are first-order methods that only access gradients and do not require higher-order derivative information. Such methods include distributed (stochastic) gradient descent, FedAvg (also known as Local SGD) (Konečný et al., 2016), FedProx (also known as the Stochastic Proximal Point Method) (Li et al., 2020b), SCAFFOLD (Karimireddy et al., 2020b), and others. These algorithms typically follow the intermittent-communication framework (Woodworth et al., 2021): the optimization process is divided into several communication rounds. In each of these rounds, the server sends a model to the clients, they do some local work, and then send back updated models. The server aggregates these models and starts another round.
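To make the intermittent-communication pattern concrete, the following is a minimal sketch of one client-sampled communication round in the client-server model. The function names (`local_update`, `aggregate`) are illustrative placeholders for whatever local solver and server-side combination rule a given method prescribes; they are not part of any algorithm in this paper.

```python
import numpy as np

def run_round(server_model, clients, local_update, aggregate, num_sampled,
              rng=np.random.default_rng(0)):
    """One intermittent-communication round: broadcast, local work, aggregate."""
    # Server samples a subset of clients (client sampling / partial participation).
    sampled = rng.choice(len(clients), size=num_sampled, replace=False)
    # Each sampled client receives the model, does local work, and replies.
    updates = [local_update(clients[m], server_model.copy()) for m in sampled]
    # The server aggregates the returned models and starts another round.
    return aggregate(updates)
```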
Problem (1) is an example of the well-studied finite-sum minimization problem, for which we have tightly matching lower and upper bounds (Woodworth & Srebro, 2016). The chief quality that differentiates federated optimization from generic finite-sum minimization is that we mainly care about communication complexity rather than the number of gradient accesses. That is, we care about the number of times that each node communicates with the central server rather than the number of local gradients accessed on each machine. This is because the cost of communication is often much higher than the cost of local computation; as Kairouz et al. (2019) state: "It is now well-understood that communication can be a primary bottleneck for federated learning." One of the main sources of this bottleneck is that when all clients participate in the learning process, the cost of communication can be very high (Shahid et al., 2021). This can be alleviated in part by using client sampling (also known as partial participation): by sampling only a small number of clients for each round of communication, we can reduce the communication burden while retaining or even accelerating the training process (Chen et al., 2022). Our main focus in this paper is to develop methods for solving Problem (1) using client sampling and under the following assumption:

Assumption 1 (Second-order similarity). We assume that for all $x, y \in \mathbb{R}^d$ we have
$$\frac{1}{M} \sum_{m=1}^{M} \big\| \nabla f_m(x) - \nabla f(x) - [\nabla f_m(y) - \nabla f(y)] \big\|^2 \le \delta^2 \|x - y\|^2.$$

Assumption 1 is a slight generalization of the δ-relatedness assumption used by Arjevani & Shamir (2015) in the context of quadratic optimization and by Sun et al. (2022) for strongly convex optimization. It is also known as function similarity (Kovalev et al., 2022). Assumption 1 holds (with relatively small δ) in many practical settings, including statistical learning for quadratics (Shamir et al., 2014), generalized linear models (Hendrikx et al., 2020), and semi-supervised learning (Chayti & Karimireddy, 2022). We provide more details on the applications of second-order similarity in Appendix B. Under the full participation communication model where all clients participate in each iteration, several methods can solve Problem (1) under Assumption 1, including ones that tightly match existing lower bounds (Kovalev et al., 2022). In contrast, for the setting we consider (partial participation or client sampling), no lower bounds are known. The main question of our work is:

Can we design faster methods for federated optimization (Problem (1)) under second-order similarity (Assumption 1) using client sampling?

1.1 CONTRIBUTIONS

We answer the above question in the affirmative and show the utility of using client sampling in optimization under second-order similarity for strongly convex objectives. Our main contributions are as follows:

A new algorithm for federated optimization (SVRP, Algorithm 2). We develop a new algorithm, SVRP (Stochastic Variance-Reduced Proximal Point), that utilizes client sampling to improve upon the existing algorithms for solving Problem (1) under second-order similarity. SVRP has a better dependence on the number of clients M in its communication complexity than all existing algorithms (see Table 1), and achieves superior performance when the dissimilarity constant δ is small enough. SVRP trades off a higher computational complexity for less communication.
Catalyst-accelerated SVRP. By using Catalyst (Lin et al., 2015), we accelerate SVRP and obtain a new algorithm (Catalyzed SVRP) that improves the dependence on the effective condition number $\delta/\mu$. Catalyzed SVRP also has a better convergence rate (in number of communication steps, ignoring constants and logarithmic factors) than all existing accelerated algorithms for this problem under Assumption 1, reducing the joint dependence on the number of clients and the effective condition number to $\sqrt{\delta/\mu}\, M^{3/4}$ (see Table 1). While both SVRP and Catalyzed SVRP achieve a communication complexity that is better than algorithms designed for the standard finite-sum setting (like SVRG or SAGA), their computational complexity is a lot worse. This is because we trade off local computation for a reduced communication complexity. Additionally, both SVRP and Catalyzed SVRP are based upon a novel combination of variance-reduction techniques and the stochastic proximal point method (SPPM). SPPM is our starting point, and we provide a new analysis for it that might be of independent interest. Our analysis of SPPM is simple, allows for approximate evaluations of the proximal operator, and extends to include variance reduction. In Appendix H we also consider the more general constrained optimization problem and provide similar convergence rates in that setting.

| Method | Communication complexity | Non-quadratic? | Reference |
| --- | --- | --- | --- |
| DANE | $\tilde{O}\big(\frac{\delta^2}{\mu^2}\, M\big)$ | ✗ | (Shamir et al., 2014; Zhang & Lin, 2015) |
| SCAFFOLD | $\tilde{O}\big(\frac{\delta + L}{\mu}\, M\big)$ | ✗ | (Karimireddy et al., 2020b) |
| SONATA | $\tilde{O}\big(\frac{\delta}{\mu}\, M\big)$ | ✓ | (Sun et al., 2022) |
| Accelerated SONATA | $\tilde{O}\big(\sqrt{\frac{\delta}{\mu}}\, M\big)$ | ✓ | (Tian et al., 2022) |
| Accelerated Extragradient | $\tilde{O}\big(\sqrt{\frac{\delta}{\mu}}\, M\big)$ | ✓ | (Kovalev et al., 2022) |
| Lower bound (no sampling) | $\tilde{O}\big(\sqrt{\frac{\delta}{\mu}}\, M\big)$ | — | (Arjevani & Shamir, 2015) |
| SVRP | $\tilde{O}\big(\frac{\delta^2}{\mu^2} + M\big)$ | ✓ | Theorem 2 (NEW) |
| Catalyzed SVRP | $\tilde{O}\big(\sqrt{\frac{\delta}{\mu}}\, M^{3/4} + M\big)$ | ✓ | Theorem 3 (NEW) |

Table 1: Communication complexity of different methods for solving Problem (1) under µ-strong convexity, L-smoothness, and Assumption 1. $\tilde{O}(\cdot)$ ignores polylogarithmic factors and constants. We use the client sampling model, which counts exchanging a vector between the server and one of the clients as a single communication step.

2 RELATED WORK

Distributed optimization under Assumption 1. There is a long line of work analyzing distributed optimization under Assumption 1 and strong convexity: Shamir et al. (2014) first gave DANE, analyzed it for quadratics, and showed the benefits of using second-order similarity in the setting of statistical learning for quadratic objectives. Zhang & Lin (2015) developed the DiSCO algorithm that improved upon DANE for quadratics, and also analyzed it for self-concordant objectives. Arjevani & Shamir (2015) gave a lower bound that matched the rate given by DANE, though without allowing for client sampling. The theory of DANE was later improved in (Yuan & Li, 2019), allowing for local convergence for non-quadratic objectives. Another algorithm, SCAFFOLD (Karimireddy et al., 2020b), can be seen as a variant of DANE and is also analyzed for quadratics. In the context of decentralized optimization, Sun et al. (2022) gave SONATA and showed a rate similar to DANE's but for general strongly convex objectives; Tian et al. (2022) then improved the convergence rate of SONATA via acceleration. Finally, Kovalev et al. (2022) improved the convergence rate of accelerated SONATA even further by removing extra logarithmic factors. We give an overview of all related results in Table 1 and provide more thorough comparisons in the theory and algorithms section.
A parallel line of work considers Assumption 1 as a special case of relative strong convexity and smoothness and utilizes methods based on mirror descent. Hendrikx et al. (2020) take this view and consider an accelerated variant of mirror descent in the distributed setting, while Dragomir et al. (2021) also consider sampling and variance reduction. Because this setting is much more general, it is more challenging to prove tight convergence rates, and in the worst case the convergence rates are no better than the minimax rate under only smoothness and strong convexity. Another line of work considers federated optimization for Problem (1) under Assumption 1 but without convexity, as well as optimization in the scenario where the number of clients is very large or infinite. Examples of this include MIME (Karimireddy et al., 2020a), FedShuffle-MVR (Horváth et al., 2022), as well as AuxMom and AuxMVR (Chayti & Karimireddy, 2022). Our focus in this work is on the convex setting.

Stochastic Proximal Point Method. The stochastic proximal point method (SPPM) is our starting point in developing SVRP and its Catalyzed variant. SPPM is well-studied, and we only briefly review some results on its convergence: Bertsekas (2011) terms it the incremental proximal point method, and provides analysis showing nonasymptotic convergence around a solution under the assumption that each $f_m$ is Lipschitz. Ryu & Boyd (2014) provide convergence rates for the algorithm, and observe that, unlike stochastic gradient descent (SGD), it is stable to learning rate misspecification. Pătraşcu & Necoara (2017) analyze SPPM for constrained optimization with random projections. Asi & Duchi (2019) study a more general method (AProx) that includes SPPM as a special case, giving stability and convergence rates under convexity. Asi et al. (2020) and Chadha et al. (2022) further consider minibatching and convergence under interpolation for AProx. In the context of federated learning for non-convex optimization, SPPM is also known as FedProx (Li et al., 2020b) and has been analyzed in several settings, see e.g. (Yuan & Li, 2022). Unfortunately, in the non-convex setting the convergence rates achieved by FedProx/SPPM are no better than those of SGD. SPPM can be applied to more than just federated learning and has found applications in matrix and tensor completion (Bumin & Huang, 2021) and reinforcement learning (Asadi et al., 2021). It can also be implemented efficiently for various optimization problems (Shtoff, 2022).

Compression for communication efficiency. Client sampling is one way of achieving communication efficiency, but there are other ways to do so in federated learning, such as compressing the vectors exchanged between the server and the clients. Szlendak et al. (2022) and Beznosikov & Gasnikov (2022) consider federated optimization with compression under Assumption 1 and obtain better convergence rates using specially-crafted compression operators. Because these techniques are orthogonal, exploring combinations of client sampling and compression may be a promising avenue for future work.

3 PRELIMINARIES

We say that a differentiable function f is µ-strongly convex (for µ > 0) if for all $x, y \in \mathbb{R}^d$ we have
$$f(x) \ge f(y) + \langle \nabla f(y), x - y \rangle + \frac{\mu}{2}\|x - y\|^2.$$
We assume:

Assumption 2. All the functions $f_1, f_2, \ldots, f_M$ in Problem (1) are µ-strongly convex. We also assume Problem (1) has a solution $x^\star \in \mathbb{R}^d$.
The assumption that every $f_m$ is strongly convex is common in the analysis of federated learning algorithms (Karimireddy et al., 2020b; Mishchenko et al., 2022b) and is often realized when each $f_m$ represents a convex empirical loss with ℓ2-regularization. We assume that f has a minimizer $x^\star$, and by strong convexity this minimizer is unique. The proximal mapping associated with a function h and stepsize η > 0 is defined as
$$\mathrm{prox}_{\eta h}(x) = \arg\min_{y \in \mathbb{R}^d} \Big[ \eta h(y) + \frac{1}{2}\|y - x\|^2 \Big].$$
When h is convex, the minimization problem has a unique solution and hence the proximal operator is well-defined. We say that a point $y \in \mathbb{R}^d$ is a b-approximation of the proximal operator evaluated at x if $\|y - \mathrm{prox}_{\eta h}(x)\|^2 \le b$. When $h = f_m$ for some $m \in [M]$, computing the proximal operator is equivalent to solving a local optimization problem on node m.

4 ALGORITHMS AND THEORY

In this section we develop the main algorithms of our work for solving Problem (1) under Assumption 1, smoothness, and strong convexity. In the first subsection, we analyze the stochastic proximal point method and explore some of its desirable properties. Next, we augment the stochastic proximal point method with variance reduction and develop SVRP, a novel algorithm that improves upon existing algorithms using client sampling. Finally, we use the Catalyst acceleration framework (Lin et al., 2015) to improve the convergence rate of SVRP. We give more details on how the algorithms are applied in a client-server setting in Appendix I.

4.1 BASICS: STOCHASTIC PROXIMAL POINT METHOD

The starting point of our investigation is the stochastic proximal point method (SPPM) (Algorithm 1), because the stochastic proximal point algorithm can achieve rates of convergence that are smoothness-independent and which rely only on the strong convexity of the minimization objectives (Asi & Duchi, 2019).

Algorithm 1: Stochastic Proximal Point Method (SPPM)
Data: stepsize η, initialization $x_0$, number of steps K, proximal solution accuracy b.
1: for $k = 0, 1, 2, \ldots, K - 1$ do
2:   Sample $\xi_k \sim \mathcal{D}$.
3:   Update with a b-approximation of the stochastic proximal point operator: $x_{k+1} \approx \mathrm{prox}_{\eta f_{\xi_k}}(x_k)$.

This makes SPPM a much better starting point for developing new algorithms if our goal is to obtain convergence rates that depend on the dissimilarity constant δ instead of the (typically larger) smoothness constant L. The stochastic proximal point method can be applied to the general stochastic optimization problem, which has the following form:
$$\min_{x \in \mathbb{R}^d} \big[ f(x) = \mathbb{E}_{\xi \sim \mathcal{D}}[f_\xi(x)] \big], \qquad (2)$$
where each $f_\xi$ is µ-strongly convex and differentiable. Observe that Problem (1) is a special case of (2) where $\mathcal{D}$ has finite support. We assume that f has a (necessarily unique) minimizer $x^\star$. In this formulation, sampling a new $\xi \sim \mathcal{D}$ corresponds to sampling a node/client in federated optimization, and a proximal iteration then corresponds to a local optimization problem to be solved on node ξ. The next theorem characterizes the convergence of SPPM in this setting:

Theorem 1. Suppose that $f_\xi$ is almost surely µ-strongly convex, let $x^\star$ be the minimizer of f, and define $\sigma^2 = \mathbb{E}_{\xi \sim \mathcal{D}}\big[\|\nabla f_\xi(x^\star)\|^2\big]$. Suppose that for each k, $x_{k+1}$ is a b-approximation of the proximal step. Set $\eta = \frac{\mu\epsilon}{2\sigma^2}$ and $b \le \frac{\epsilon}{(1+\eta\mu)^2}$. Then $\mathbb{E}\big[\|x_K - x^\star\|^2\big] \le \epsilon$ after K iterations:
$$K = \Big(1 + \frac{2\sigma^2}{\mu^2\epsilon}\Big) \log\frac{4\|x_0 - x^\star\|^2}{\epsilon}. \qquad (3)$$

The full proofs of all the theorems are relegated to the supplementary material. Theorem 1 for SPPM essentially gives the same rate as (Asi & Duchi, 2019, Proposition 5.3), but with a different proof technique.
Our analysis relies on a straightforward application of the contractivity of the proximal operator, making it easier to extend to the variance-reduced case compared to the more involved analysis of Asi & Duchi (2019).

Comparison with SGD. The iteration complexity of SGD in the same setting is
$$K = O\Big(\Big(\frac{L}{\mu} + \frac{2\sigma^2}{\mu^2\epsilon}\Big) \log\frac{2\|x_0 - x^\star\|^2}{\epsilon}\Big). \qquad (4)$$
See (Needell et al., 2014; Gower et al., 2019) for a derivation of this iteration complexity and (Nguyen et al., 2019) for a matching lower bound (up to log factors and constants). Observe that while the dependence on the stochastic noise term is the same in both (3) and (4), the iteration complexity of SGD has an additional dependence on the condition number $\kappa = L/\mu$, whereas the iteration complexity of the stochastic proximal point method is entirely independent of the magnitude of the smoothness constant L. Thus, we can obtain a faster convergence rate than SGD if we have access to stochastic proximal operator evaluations. In federated optimization, a stochastic proximal operator evaluation can be done entirely with local work and with no communication, and thus is relatively cheap. Indeed, the iteration complexity of SPPM can even beat that of accelerated SGD, because acceleration only reduces the dependence on the condition number from $L/\mu$ to $\sqrt{L/\mu}$, whereas SPPM has no dependence on the condition number to begin with.

Communication vs. computation complexities. Every iteration of SPPM involves two communication steps: the server sends the current iterate $x_k$ to node ξ, and then node ξ sends $x_{k+1}$ back to the server. Thus the communication complexity of SPPM is the same as eq. (3) multiplied by two. Each node needs to solve the optimization problem $\min_{x \in \mathbb{R}^d} \big[ f_\xi(x) + \frac{1}{2\eta}\|x - x_k\|^2 \big]$ up to the accuracy b given in Theorem 1. If each $f_\xi$ is L-smooth and µ-strongly convex, this is an $(L + \eta^{-1})$-smooth and $(\mu + \eta^{-1})$-strongly convex minimization problem, and thus can be solved to the desired precision b in $O\big(\sqrt{\frac{\eta L + 1}{\eta\mu + 1}} \log\frac{1}{b}\big)$ local gradient accesses using accelerated gradient descent (Nesterov, 2018). When $\eta = \frac{\mu\epsilon}{4\sigma^2}$, this corresponds to a per-iteration computational complexity of $O\big(\sqrt{\frac{L\mu\epsilon + 4\sigma^2}{\mu^2\epsilon + 4\sigma^2}} \log\frac{1}{b}\big)$. Note that if $f_\xi$ itself represents the loss on a local dataset (as is common in federated learning), we may use methods tailored for stochastic or finite-sum problems such as Random Reshuffling (Mishchenko et al., 2020), Katyusha (Allen-Zhu, 2017), or (accelerated) SVRG. Thus we see that, compared to SGD, SPPM trades off a higher computational complexity for a lower communication complexity.

Related work. We compare against related convergence results for SPPM in Appendix E.
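As a concrete illustration, the following is a minimal sketch of Algorithm 1 for quadratic client losses $f_m(x) = \frac{1}{2}x^\top A_m x - b_m^\top x$, where the proximal step has a closed form and is therefore exact (b = 0). The quadratic setting is an assumption made here for illustration, not a restriction of the method.

```python
import numpy as np

def sppm(As, bs, x0, eta, K, rng=np.random.default_rng(0)):
    """Stochastic Proximal Point Method (Algorithm 1) for quadratic clients.

    Each client loss is f_m(x) = 0.5 * x^T A_m x - b_m^T x, so the proximal
    step prox_{eta f_m}(x) solves (eta * A_m + I) y = x + eta * b_m exactly.
    """
    x = x0.copy()
    M, d = len(As), x0.shape[0]
    for _ in range(K):
        m = rng.integers(M)                        # sample a client uniformly
        lhs = eta * As[m] + np.eye(d)              # eta * A_m + I
        x = np.linalg.solve(lhs, x + eta * bs[m])  # exact proximal step
    return x
```

For non-quadratic losses, the linear solve would be replaced by an inner solver (e.g. accelerated gradient descent) run until the b-approximation criterion of Theorem 1 is met.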
4.2 THE SVRP ALGORITHM

The rate of SPPM, while independent of the condition number $\kappa = L/\mu$, is sublinear. While a sublinear rate is optimal for the stochastic oracle (Foster et al., 2019; Woodworth & Srebro, 2021), it is suboptimal in the setting of smooth finite-sum minimization (Woodworth & Srebro, 2016). In this section, we develop a novel variance-reduced method, SVRP (Stochastic Variance-Reduced Proximal Point, Algorithm 2), that converges linearly and relies only on second-order similarity. Variance-reduced methods such as SVRG (Johnson & Zhang, 2013), SAGA (Defazio et al., 2014), or SARAH (Nguyen et al., 2017) improve the convergence rate of SGD for finite-sum problems by constructing gradient estimators whose variance vanishes over time. While SGD is used as the building block in most existing variance-reduced methods, in the preceding section we saw that the stochastic proximal point method is more communication-efficient; it stands to reason that variance-reduced variants of SPPM could also be more communication-efficient under second-order similarity. We apply variance reduction to SPPM and develop our algorithm in the next two steps.

Step (a): Adapting SVRG-style variance reduction to SPPM. We use SVRG-style variance reduction coupled with the stochastic proximal point method as a base. To see how, we start with SGD iterations: at each step k we sample node $m_k$ uniformly at random from [M] and update as $x_{k+1} = x_k - \eta \nabla f_{m_k}(x_k)$. The main problem with SGD is that the stochastic gradient estimator has a non-vanishing variance that slows down convergence. SVRG (Johnson & Zhang, 2013) modifies SGD by adding a correction term $g_k$ at each iteration:
$$g_k = \nabla f(w_k) - \nabla f_{m_k}(w_k), \qquad x_{k+1} = x_k - \eta\,[\nabla f_{m_k}(x_k) + g_k],$$
where $w_k$ is an anchor point that is periodically reset to the current iterate. The correction term $g_k$ reduces the variance in the gradient estimator and allows the algorithm to converge. We propose to do the same for the stochastic proximal point method, where we instead change the argument of the proximal operator:
$$g_k = \nabla f(w_k) - \nabla f_{m_k}(w_k), \qquad x_{k+1} = \mathrm{prox}_{\eta f_{m_k}}(x_k - \eta g_k).$$
We can expect the correction term $g_k$ to function similarly and allow SPPM to converge faster.

Step (b): Removing the loop. Rather than reset the anchor point $w_k$ to the current iterate at fixed intervals, SVRP is instead loopless: it uses a random coin flip to determine when to communicate and re-compute full gradients. Loopless variants of variance-reduced algorithms such as L-SVRG (Kovalev et al., 2020) and Loopless SARAH (Li et al., 2020a) enjoy the same convergence guarantees as their ordinary counterparts, but with superior empirical performance and simpler analysis. Combining the previous two steps gives us SVRP (Algorithm 2), and we give the convergence result next.

Algorithm 2: Stochastic Variance-Reduced Proximal Point (SVRP) Method
Data: stepsize η, initialization $x_0$, number of steps K, communication probability p, local solution accuracy b.
1: Initialize $w_0 = x_0$.
2: for $k = 0, 1, 2, \ldots, K - 1$ do
3:   Sample $m_k$ uniformly at random from [M].
4:   $g_k = \nabla f(w_k) - \nabla f_{m_k}(w_k)$.
5:   Compute a b-approximation of the stochastic proximal point operator associated with $f_{m_k}$: $x_{k+1} \approx \mathrm{prox}_{\eta f_{m_k}}(x_k - \eta g_k)$. (5)
6:   Sample $c_k \sim \mathrm{Bernoulli}(p)$ and update $w_{k+1} = x_{k+1}$ if $c_k = 1$, and $w_{k+1} = w_k$ if $c_k = 0$.

Theorem 2 (Convergence of SVRP). Suppose that Assumptions 1 and 2 hold, and that each $x_{k+1}$ is a b-approximation of the proximal step (5). Let $\tau = \min\big\{\frac{\eta\mu}{1+2\eta\mu}, \frac{p}{2}\big\}$. Set the parameters of Algorithm 2 as $\eta = \frac{\mu}{2\delta^2}$, $b \le \frac{\epsilon\,\tau(\eta\mu)^2}{2(1+\eta\mu)^3}$, and $p = \frac{1}{M}$. Then the final iterate $x_K$ satisfies $\mathbb{E}\big[\|x_K - x^\star\|^2\big] \le \epsilon$ provided that the total number of iterations K is larger than
$$T_{\mathrm{iter}} = \tilde{O}\Big(M + \frac{\delta^2}{\mu^2}\Big).$$
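To make the update concrete, the following sketch implements Algorithm 2 for the same quadratic clients as in the SPPM sketch above, so that the shifted proximal step (5) is exact (b = 0); here gradients of $f_m(x) = \frac{1}{2}x^\top A_m x - b_m^\top x$ are $A_m x - b_m$. This is an illustration under that assumed setting, not a full client-server implementation.

```python
import numpy as np

def svrp(As, bs, x0, eta, p, K, rng=np.random.default_rng(0)):
    """SVRP (Algorithm 2) for quadratic clients f_m(x) = 0.5 x^T A_m x - b_m^T x."""
    grad = lambda m, x: As[m] @ x - bs[m]
    full_grad = lambda x: sum(grad(m, x) for m in range(len(As))) / len(As)
    I = np.eye(x0.shape[0])
    x, w = x0.copy(), x0.copy()
    gw = full_grad(w)                  # f'(w_k) is recomputed only on anchor resets
    for _ in range(K):
        m = rng.integers(len(As))      # sample client m_k
        g = gw - grad(m, w)            # correction g_k = f'(w_k) - f'_{m_k}(w_k)
        # x_{k+1} = prox_{eta f_m}(x_k - eta g_k); exact for quadratics:
        x = np.linalg.solve(eta * As[m] + I, (x - eta * g) + eta * bs[m])
        if rng.random() < p:           # loopless anchor reset, c_k ~ Bernoulli(p)
            w, gw = x.copy(), full_grad(x)
    return x
```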
Communication complexity. We consider one communication to represent the server exchanging one vector with one client. At each step of SVRP, the server samples a client $m_k$ and sends it the current iterate $x_k$; the client then computes $g_k$ and $x_{k+1}$ locally, and sends $x_{k+1}$ back to the server. Then, with probability p, the server changes the anchor point $w_{k+1}$ to $x_{k+1}$ and sends $w_{k+1}$ to all the clients; each client m then computes $\nabla f_m(w_{k+1})$ and sends it back to the server, which averages the received gradients to get $\nabla f(w_{k+1})$. The server then sends $\nabla f(w_{k+1})$ back to all the clients. Thus the expected communication complexity is
$$\mathbb{E}[T_{\mathrm{comm}}] = (2 + 3pM)\, T_{\mathrm{iter}} = 5\, T_{\mathrm{iter}} = \tilde{O}\Big(M + \frac{\delta^2}{\mu^2}\Big).$$
Compared to the SVRG communication complexity $\tilde{O}\big(M + \frac{L}{\mu}\big)$ (Sebbouh et al., 2019), this replaces the $L/\mu$ dependence with a $\delta^2/\mu^2$ dependence, which is better when $\delta \le \sqrt{L\mu}$.

Comparison with existing results. The lower bound most closely related to our setting, given by Arjevani & Shamir (2015), only considers the setting of full participation (i.e. no client sampling) and corresponds to a communication complexity of $\tilde{O}\big(\sqrt{\frac{\delta}{\mu}}\, M\big)$. Our result improves upon this when $M > (\delta/\mu)^{3/2}$. Note that while our result attains a superior dependence on M, the dependence on the effective condition number $\delta/\mu$ is worse. In the next section, we shall improve this via acceleration. Note that the best existing results for optimization under second-order similarity, such as DiSCO (Zhang & Lin, 2015), Accelerated SONATA (Tian et al., 2022), and Extragradient sliding (Kovalev et al., 2022), match this lower bound (Kovalev et al., 2022, Table 1). Therefore, our result shows that significantly better convergence can be obtained under second-order similarity when using client sampling.

Computational complexity. Similar to the discussion of the computational complexity of SPPM in the previous section, here too we essentially need to solve, on each sampled device, an optimization problem involving an $(L + \eta^{-1})$-smooth and $(\mu + \eta^{-1})$-strongly convex function up to the accuracy b. This can be done using accelerated gradient descent in $O\big(\sqrt{\frac{\eta L + 1}{\eta\mu + 1}} \log\frac{1}{b}\big) = O\big(\sqrt{\frac{L\mu + 2\delta^2}{\mu^2 + 2\delta^2}} \log\frac{1}{b}\big)$ local gradient accesses. Compared to SGD or SVRG, we trade off a higher local computational complexity for a lower number of communication steps to convergence.

Similar methods. Point-SAGA (Defazio, 2016) uses the stochastic proximal point method coupled with SAGA-style variance reduction. Point-SAGA achieves optimal performance under smoothness and strong convexity, but it inherits the heavy memory requirements of SAGA, and its performance under Assumption 1 is unknown. Another similar algorithm is SCAFFOLD (Karimireddy et al., 2020b): SCAFFOLD uses a similar SVRG-style correction sequence, and it can be viewed as approximately solving a local unregularized minimization problem at each step. Unfortunately, its analysis under Assumption 1 only holds for quadratic objectives and without client sampling.

4.3 ACCELERATING SVRP VIA CATALYST

In this section we improve SVRP by augmenting it with acceleration. Acceleration is an effective way to improve the convergence rate of first-order gradient methods, capable of achieving better convergence rates in deterministic (Nesterov, 2018), finite-sum (Allen-Zhu, 2017), and stochastic (Jain et al., 2018) settings. Catalyst (Lin et al., 2015) is a generic framework for accelerating first-order methods that we utilize to accelerate SVRP. Catalyst (Algorithm 3) is essentially an accelerated proximal point method that relies on a solver A to solve the proximal point iterations. We use SVRP (Algorithm 2) as the solver A, and term the resulting algorithm Catalyzed SVRP. Catalyst gives us fast linear convergence provided that we solve certain regularized subproblems to high accuracy. To do that, we run SVRP (as the method A) with an appropriate set of parameters and for a fixed number of iterations. The details of our application of Catalyst are given in the supplementary material (Appendix G).
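Since Algorithm 3 itself is deferred to Appendix G, the following is a minimal sketch of the generic Catalyst outer loop of Lin et al. (2015) in the strongly convex case, with the standard fixed momentum coefficient. The solver interface `inner_solver` stands in for SVRP run on the regularized subproblem for a fixed budget; the interface and parameter choices are our simplifications, not the exact parameters of Theorem 3.

```python
import numpy as np

def catalyst(inner_solver, x0, mu, kappa, T):
    """Generic Catalyst outer loop (Lin et al., 2015), strongly convex case.

    At each step we approximately minimize the regularized subproblem
        h_t(x) = f(x) + (kappa / 2) * ||x - y_t||^2
    via `inner_solver(y, x_init)` (e.g. SVRP run for a fixed budget), then
    apply Nesterov-style extrapolation with a fixed momentum `beta`.
    """
    sqrt_q = np.sqrt(mu / (mu + kappa))
    beta = (1 - sqrt_q) / (1 + sqrt_q)   # fixed momentum in the SC case
    x_prev = x0.copy()
    y = x0.copy()
    for _ in range(T):
        x = inner_solver(y, x_prev)      # inexact proximal point step
        y = x + beta * (x - x_prev)      # extrapolation
        x_prev = x
    return x_prev
```

The complexity of Catalyzed SVRP is given next.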
Theorem 3 (Catalyzed SVRP convergence rate). For Catalyst (Algorithm 3) with SVRP (Algorithm 2) as the inner solver A, suppose that Assumptions 1 and 2 hold, and that f is L-smooth. Then for a specific choice of the algorithm parameters, the expected communication complexity $\mathbb{E}[T_{\mathrm{comm}}]$ to reach an accuracy ϵ, up to polylogarithmic factors and absolute constants, is
$$\mathbb{E}[T_{\mathrm{comm}}] = \tilde{O}\Big(\Big(M + \sqrt{\tfrac{\delta}{\mu}}\, M^{3/4}\Big) \log\tfrac{1}{\epsilon}\Big). \qquad (6)$$

Improvement over SVRP. Compared to ordinary SVRP, observe that since $\sqrt{\delta/\mu}\, M^{3/4} \le M + (\delta/\mu)^2$ (by Young's inequality), Theorem 3 gives a communication complexity that is always better than that of Theorem 2, up to logarithmic factors. In particular, the rate of Catalyzed SVRP given by (6) is strictly better than the rate of vanilla SVRP when $\delta/\mu \ge \sqrt{M}$. Note that unlike the communication complexity given by Theorem 2, the rate given by Theorem 3 requires smoothness, but the smoothness constant L only shows up inside a logarithmic factor, and hence does not appear in eq. (6) (as the $\tilde{O}$ notation hides polylogarithmic factors).

Comparison with the smooth setting. When each function is L-smooth and µ-strongly convex, accelerated variants of SVRG reach an ϵ-accurate solution in $\tilde{O}\big(M + \sqrt{\frac{L}{\mu}}\, M^{1/2}\big)$ communication steps and O(1) local work per node (Lin et al., 2015). The communication complexity Catalyzed SVRP achieves is better when $\delta \le \frac{L}{\sqrt{M}}$, at the cost of more local computation.

Improvement over prior work. The best existing algorithms for solving Problem (1) under Assumption 1 are Accelerated SONATA (Tian et al., 2022) and Extragradient sliding (Kovalev et al., 2022), both of which achieve a communication complexity of $\tilde{O}\big(\sqrt{\frac{\delta}{\mu}}\, M\big)$. The rate given by eq. (6) is better, achieving a smaller dependence on the number of nodes M. Arjevani & Shamir (2015) give a lower bound matching the rate $\tilde{O}\big(\sqrt{\frac{\delta}{\mu}}\, M\big)$: ignoring polylogarithmic factors, our result improves upon their lower bound through the use of client sampling. This improvement is possible because Arjevani & Shamir (2015) use an oracle model that does not allow for client sampling. Thus, Catalyzed SVRP improves upon all existing methods for optimization under Assumption 1, smoothness, and strong convexity when δ is sufficiently small.

Computational complexity. Catalyzed SVRP uses SVRP as an inner solver, and hence inherits its computational requirements: at each iteration k, we have to sample a node $m_k$ and evaluate the proximal operator associated with its local objective $f_{m_k}$. However, we set the solution accuracy to b = 0 when applying Catalyst to SVRP, i.e. we require exact stochastic proximal operator evaluations. This is only done for convenience of analysis, and we believe the analysis can be extended to the case of nonzero but small b. In practice, the proximal operator can be computed exactly in many cases, and otherwise we can compute it to machine accuracy using accelerated gradient descent.
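For instance, for the ℓ2-regularized least-squares client losses used in the experiments of the next section, the proximal operator reduces to a single ridge-type linear solve. A minimal sketch (the variable names are ours):

```python
import numpy as np

def prox_ridge(Z, y, v, eta, lam):
    """Exact prox of f(x) = (1/n) * ||Z x - y||^2 + (lam/2) * ||x||^2 at v.

    Setting the gradient of  eta * f(x) + 0.5 * ||x - v||^2  to zero gives
    [(2 eta / n) Z^T Z + (eta lam + 1) I] x = v + (2 eta / n) Z^T y.
    """
    n, d = Z.shape
    lhs = (2 * eta / n) * Z.T @ Z + (eta * lam + 1) * np.eye(d)
    rhs = v + (2 * eta / n) * Z.T @ y
    return np.linalg.solve(lhs, rhs)
```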
[Figure 1: six convergence plots; see caption.]

Figure 1: Top row: experiments on synthetic data; the left figure has M = 1000 clients, the middle M = 2000, and the right M = 3000. Bottom row: experiments on real data; the left has M = 20 clients, the middle M = 40, and the right M = 60. We plot the squared distance from the optimum for all methods (SVRG, Accelerated Extragradient, SVRP, and SCAFFOLD) on a logarithmic scale versus the number of communication steps, where one exchange of vectors between the server and a single client counts as a communication step.

5 EXPERIMENTS

We run linear regression with ℓ2 regularization, where the objective is
$$f(x) = \frac{1}{M} \sum_{m=1}^{M} f_m(x), \qquad f_m(x) = \frac{1}{n} \sum_{i=1}^{n} (z_{m,i}^\top x - y_{m,i})^2 + \frac{\lambda}{2}\|x\|^2,$$
where $z_{m,i} \in \mathbb{R}^d$ and $y_{m,i} \in \mathbb{R}$ represent the features and labels respectively, for $m \in [M]$ and $i \in [n]$. We do two sets of experiments: in the first set, we generate the data vectors $z_{m,i}$ synthetically and force the second-order similarity assumption to hold with δ small relative to L (L ≈ 3330 and δ ≈ 10), with regularization constant λ = 1. We vary the number of clients M over M ∈ {1000, 2000, 3000}. In the second set, we use the a9a dataset from LIBSVM (Chang & Lin, 2011); each client's data is constructed by sampling from the original training dataset with n = 2000 samples per client. We vary the number of clients M over M ∈ {20, 40, 60}, measure L ≈ 6.33 and δ ≈ 0.22, and set the regularization parameter to λ = 0.1. We simulate our results on a single machine, running each method for 10000 communication steps. Our results are given in Figure 1. We compare SVRP against SVRG, SCAFFOLD, and the Accelerated Extragradient algorithm, using the optimal theoretical stepsize for each algorithm. In all cases, SVRP achieves superior performance to the existing algorithms on both the real and synthetic data. This is more pronounced in the synthetic experiments, as there δ is much smaller than L and the number of clients M is very large.
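One simple way to generate synthetic clients with small δ but large L is to perturb a single shared design matrix: the client Hessians $(2/n)Z_m^\top Z_m + \lambda I$ then differ only through the small perturbation, so by the Hessian-similarity argument of Appendix B, δ stays small while L is set by the shared (possibly ill-conditioned) part. This construction is our own illustration of the idea and is not necessarily the exact generator used for Figure 1.

```python
import numpy as np

def make_clients(M, n, d, spread=0.01, rng=np.random.default_rng(0)):
    """Synthetic ridge-regression clients with small second-order similarity.

    All clients share one base design matrix plus a small perturbation, so
    differences between client Hessians are O(spread) in operator norm
    (which controls delta), while the smoothness constant L can be large.
    """
    base = rng.standard_normal((n, d)) * np.sqrt(d)  # shared part, sets L
    x_true = rng.standard_normal(d)
    clients = []
    for _ in range(M):
        Z = base + spread * rng.standard_normal((n, d))
        y = Z @ x_true + 0.1 * rng.standard_normal(n)
        clients.append((Z, y))
    return clients
```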
6 CONCLUSION

In this paper, we develop two algorithms that utilize client sampling in order to reduce the amount of communication necessary for federated optimization under the second-order similarity assumption, with the faster of the two algorithms reducing the best-known communication complexity from $\tilde{O}\big(\sqrt{\frac{\delta}{\mu}}\, M\big)$ (Kovalev et al., 2022) to $\tilde{O}\big(\sqrt{\frac{\delta}{\mu}}\, M^{3/4} + M\big)$. Both algorithms utilize variance reduction on top of the stochastic proximal point algorithm, for which we also provide a new, simple, and smoothness-free analysis. In all cases, the algorithms trade off more local work for a reduced communication complexity. An important direction for future research is to investigate whether this tradeoff is necessary, and to derive lower bounds under the partial participation communication model and second-order similarity.

REPRODUCIBILITY STATEMENT

All the theoretical results we derive are accompanied by full proofs in the appendices, and all the definitions and assumptions we make are clearly stated in the main text. We attach the code used to run the experiments as supplementary material to the paper.

REFERENCES

Zeyuan Allen-Zhu. Katyusha: the first direct acceleration of stochastic gradient methods. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing (STOC 2017), pp. 1200–1205. ACM, 2017. doi: 10.1145/3055399.3055448.

Yossi Arjevani and Ohad Shamir. Communication complexity of distributed convex learning and optimization. In Advances in Neural Information Processing Systems 28 (NeurIPS 2015), pp. 1756–1764, 2015.

Kavosh Asadi, Rasool Fakoor, Omer Gottesman, Taesup Kim, Michael L. Littman, and Alexander J. Smola. Proximal iteration for deep reinforcement learning. arXiv preprint arXiv:2112.05848, 2021.

Hilal Asi and John C. Duchi. Stochastic (approximate) proximal point methods: Convergence, optimality, and adaptivity. SIAM Journal on Optimization, 29(3):2257–2290, 2019. doi: 10.1137/18M1230323.

Hilal Asi, Karan N. Chadha, Gary Cheng, and John C. Duchi. Minibatch stochastic approximate proximal point methods. In Advances in Neural Information Processing Systems 33 (NeurIPS 2020), 2020.

Amir Beck. First-Order Methods in Optimization. MOS-SIAM Series on Optimization. SIAM, 2017. ISBN 9781611974997.

Dimitri P. Bertsekas. Incremental proximal methods for large scale convex optimization. Mathematical Programming, 129(2):163–195, 2011. doi: 10.1007/s10107-011-0472-0.

Aleksandr Beznosikov and Alexander Gasnikov. Compression and data similarity: Combination of two techniques for communication-efficient solving of distributed variational inequalities. arXiv preprint arXiv:2206.09446, 2022.

Aysegul Bumin and Kejun Huang. Efficient implementation of stochastic proximal point algorithm for matrix and tensor completion. In 2021 29th European Signal Processing Conference (EUSIPCO), pp. 1050–1054. IEEE, 2021.

Karan N. Chadha, Gary Cheng, and John C. Duchi. Accelerated, optimal and parallel: Some results on model-based stochastic optimization. In ICML, volume 162 of Proceedings of Machine Learning Research, pp. 2811–2827. PMLR, 2022.

Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011.

El Mahdi Chayti and Sai Praneeth Karimireddy. Optimization with access to auxiliary information. arXiv preprint arXiv:2206.00395, 2022.

Gong Chen and Marc Teboulle. Convergence analysis of a proximal-like minimization algorithm using Bregman functions. SIAM Journal on Optimization, 3(3):538–543, 1993. doi: 10.1137/0803026.

Wenlin Chen, Samuel Horváth, and Peter Richtárik. Optimal client sampling for federated learning. Transactions on Machine Learning Research, 2022. URL https://openreview.net/forum?id=8GvRCWKHIL.
Aaron Defazio. A simple practical accelerated method for finite sums. In Advances in Neural Information Processing Systems 29 (NeurIPS 2016), pp. 676–684, 2016.

Aaron Defazio, Francis R. Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems 27 (NeurIPS 2014), pp. 1646–1654, 2014.

Radu-Alexandru Dragomir, Mathieu Even, and Hadrien Hendrikx. Fast stochastic Bregman gradient methods: Sharp analysis and variance reduction. In Proceedings of the 38th International Conference on Machine Learning (ICML 2021), volume 139 of Proceedings of Machine Learning Research, pp. 2815–2825. PMLR, 2021.

J. J. Duistermaat. Multidimensional Real Analysis. Cambridge University Press, 2004. ISBN 9780521551144.

Dylan J. Foster, Ayush Sekhari, Ohad Shamir, Nathan Srebro, Karthik Sridharan, and Blake E. Woodworth. The complexity of making the gradient small in stochastic convex optimization. In Conference on Learning Theory (COLT 2019), volume 99 of Proceedings of Machine Learning Research, pp. 1319–1345. PMLR, 2019.

Avishek Ghosh, Jichan Chung, Dong Yin, and Kannan Ramchandran. An efficient framework for clustered federated learning. In Advances in Neural Information Processing Systems 33 (NeurIPS 2020), 2020.

Robert Mansel Gower, Nicolas Loizou, Xun Qian, Alibek Sailanbayev, Egor Shulgin, and Peter Richtárik. SGD: General analysis and improved rates. In Proceedings of the 36th International Conference on Machine Learning (ICML 2019), volume 97 of Proceedings of Machine Learning Research, pp. 5200–5209. PMLR, 2019.

Hadrien Hendrikx, Lin Xiao, Sébastien Bubeck, Francis R. Bach, and Laurent Massoulié. Statistically preconditioned accelerated gradient method for distributed optimization. In Proceedings of the 37th International Conference on Machine Learning (ICML 2020), volume 119 of Proceedings of Machine Learning Research, pp. 4203–4227. PMLR, 2020.
Samuel Horváth, Maziar Sanjabi, Lin Xiao, Peter Richtárik, and Michael Rabbat. FedShuffle: Recipes for better use of local work in federated learning. arXiv preprint arXiv:2204.13169, 2022.

Prateek Jain, Sham M. Kakade, Rahul Kidambi, Praneeth Netrapalli, and Aaron Sidford. Accelerating stochastic gradient descent for least squares regression. In Conference On Learning Theory (COLT 2018), volume 75 of Proceedings of Machine Learning Research, pp. 545–604. PMLR, 2018.

Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems 26 (NeurIPS 2013), pp. 315–323, 2013.

Peter Kairouz, H. Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, Rafael G. L. D'Oliveira, Hubert Eichner, Salim El Rouayheb, David Evans, Josh Gardner, Zachary Garrett, Adrià Gascón, Badih Ghazi, Phillip B. Gibbons, Marco Gruteser, Zaid Harchaoui, Chaoyang He, Lie He, Zhouyuan Huo, Ben Hutchinson, Justin Hsu, Martin Jaggi, Tara Javidi, Gauri Joshi, Mikhail Khodak, Jakub Konečný, Aleksandra Korolova, Farinaz Koushanfar, Sanmi Koyejo, Tancrède Lepoint, Yang Liu, Prateek Mittal, Mehryar Mohri, Richard Nock, Ayfer Özgür, Rasmus Pagh, Mariana Raykova, Hang Qi, Daniel Ramage, Ramesh Raskar, Dawn Song, Weikang Song, Sebastian U. Stich, Ziteng Sun, Ananda Theertha Suresh, Florian Tramèr, Praneeth Vepakomma, Jianyu Wang, Li Xiong, Zheng Xu, Qiang Yang, Felix X. Yu, Han Yu, and Sen Zhao. Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977, 2019.

Sai Praneeth Karimireddy, Martin Jaggi, Satyen Kale, Mehryar Mohri, Sashank J. Reddi, Sebastian U. Stich, and Ananda Theertha Suresh. Mime: Mimicking centralized stochastic algorithms in federated learning. arXiv preprint arXiv:2008.03606, 2020a.

Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank J. Reddi, Sebastian U. Stich, and Ananda Theertha Suresh. SCAFFOLD: Stochastic controlled averaging for federated learning. In Proceedings of the 37th International Conference on Machine Learning (ICML 2020), volume 119 of Proceedings of Machine Learning Research, pp. 5132–5143. PMLR, 2020b.

Junhyung Lyle Kim, Panos Toulis, and Anastasios Kyrillidis. Convergence and stability of the stochastic proximal point algorithm with momentum. In Proceedings of The 4th Annual Learning for Dynamics and Control Conference (L4DC 2022), volume 168 of Proceedings of Machine Learning Research, pp. 1034–1047. PMLR, 2022.
Jakub Konečný, H. Brendan McMahan, Felix X. Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. Federated learning: Strategies for improving communication efficiency. In NIPS Private Multi-Party Machine Learning Workshop, 2016.

Dmitry Kovalev, Samuel Horváth, and Peter Richtárik. Don't jump through hoops and remove those loops: SVRG and Katyusha are better without the outer loop. In Proceedings of the 31st International Conference on Algorithmic Learning Theory (ALT 2020), volume 117 of Proceedings of Machine Learning Research, pp. 451–467. PMLR, 2020.

Dmitry Kovalev, Aleksandr Beznosikov, Ekaterina Borodich, Alexander Gasnikov, and Gesualdo Scutari. Optimal gradient sliding and its application to distributed optimization under similarity. arXiv preprint arXiv:2205.15136, 2022.

Bingcong Li, Meng Ma, and Georgios B. Giannakis. On the convergence of SARAH and beyond. In The 23rd International Conference on Artificial Intelligence and Statistics (AISTATS 2020), volume 108 of Proceedings of Machine Learning Research, pp. 223–233. PMLR, 2020a.

Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. In Proceedings of Machine Learning and Systems 2020 (MLSys 2020), 2020b.

Hongzhou Lin, Julien Mairal, and Zaid Harchaoui. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems 28 (NeurIPS 2015), pp. 3384–3392, 2015.

Ming Liu, Stella Ho, Mengqi Wang, Longxiang Gao, Yuan Jin, and He Zhang. Federated learning meets natural language processing: a survey. arXiv preprint arXiv:2107.12603, 2021.

Konstantin Mishchenko, Ahmed Khaled, and Peter Richtárik. Random reshuffling: Simple analysis with vast improvements. In Advances in Neural Information Processing Systems 33 (NeurIPS 2020), 2020.
Konstantin Mishchenko, Ahmed Khaled, and Peter Richtárik. Proximal and federated random reshuffling. In ICML, volume 162 of Proceedings of Machine Learning Research, pp. 15718–15749. PMLR, 2022a.

Konstantin Mishchenko, Grigory Malinovsky, Sebastian Stich, and Peter Richtárik. ProxSkip: Yes! Local gradient steps provably lead to communication acceleration! Finally! In ICML, volume 162 of Proceedings of Machine Learning Research, pp. 15750–15769. PMLR, 2022b.

Deanna Needell, Rachel Ward, and Nathan Srebro. Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm. In Advances in Neural Information Processing Systems 27 (NeurIPS 2014), pp. 1017–1025, 2014.

Yurii Nesterov. Lectures on Convex Optimization. Springer, 2nd edition, 2018. ISBN 3319915770.

Dinh C. Nguyen, Quoc-Viet Pham, Pubudu N. Pathirana, Ming Ding, Aruna Seneviratne, Zihuai Lin, Octavia A. Dobre, and Won-Joo Hwang. Federated learning for smart healthcare: a survey. arXiv preprint arXiv:2111.08834, 2021.

Lam M. Nguyen, Jie Liu, Katya Scheinberg, and Martin Takáč. SARAH: A novel method for machine learning problems using stochastic recursive gradient. In Proceedings of the 34th International Conference on Machine Learning (ICML 2017), volume 70 of Proceedings of Machine Learning Research, pp. 2613–2621. PMLR, 2017.

Phuong Ha Nguyen, Lam Nguyen, and Marten van Dijk. Tight dimension independent lower bound on the expected convergence rate for diminishing step sizes in SGD. In Advances in Neural Information Processing Systems 32 (NeurIPS 2019), 2019.

Andrei Pătraşcu and Ion Necoara. Nonasymptotic convergence of stochastic proximal point algorithms for constrained convex optimization. arXiv preprint arXiv:1706.06297, 2017.

Ernest K. Ryu and Stephen Boyd. Stochastic proximal iteration: A non-asymptotic improvement upon stochastic gradient descent, 2014. URL https://stanford.edu/~boyd/papers/pdf/spi.pdf.

Felix Sattler, Klaus-Robert Müller, and Wojciech Samek. Clustered federated learning: Model-agnostic distributed multitask optimization under privacy constraints. IEEE Transactions on Neural Networks and Learning Systems, pp. 1–13, 2020. doi: 10.1109/TNNLS.2020.3015958.

Mark Schmidt, Nicolas Le Roux, and Francis R. Bach. Convergence rates of inexact proximal-gradient methods for convex optimization. In Advances in Neural Information Processing Systems 24 (NeurIPS 2011), pp. 1458–1466, 2011.
Othmane Sebbouh, Nidham Gazagnadou, Samy Jelassi, Francis R. Bach, and Robert M. Gower. Towards closing the gap between the theory and practice of SVRG. In Advances in Neural Information Processing Systems 32 (NeurIPS 2019), pp. 646–656, 2019.

Osama Shahid, Seyedamin Pouriyeh, Reza M. Parizi, Quan Z. Sheng, Gautam Srivastava, and Liang Zhao. Communication efficiency in federated learning: Achievements and challenges. arXiv preprint arXiv:2107.10996, 2021.

Ohad Shamir, Nathan Srebro, and Tong Zhang. Communication-efficient distributed optimization using an approximate Newton-type method. In Proceedings of the 31st International Conference on Machine Learning (ICML 2014), volume 32 of JMLR Workshop and Conference Proceedings, pp. 1000–1008, 2014.

Alex Shtoff. Efficient implementation of incremental proximal-point methods. arXiv preprint arXiv:2205.01457, 2022.

Ying Sun, Gesualdo Scutari, and Amir Daneshmand. Distributed optimization based on gradient tracking revisited: Enhancing convergence rate via surrogation. SIAM Journal on Optimization, 32(2):354–385, 2022. doi: 10.1137/19M1259973.
Girgis, Filip Hanzely, Andrew Hard, Chaoyang He, Samuel Horvath, Zhouyuan Huo, Alex Ingerman, Martin Jaggi, Tara Javidi, Peter Kairouz, Satyen Kale, Sai Praneeth Karimireddy, Jakub Konecny, Sanmi Koyejo, Tian Li, Luyang Liu, Mehryar Mohri, Hang Qi, Sashank J. Reddi, Peter Richtarik, Karan Singhal, Virginia Smith, Mahdi Soltanolkotabi, Weikang Song, Ananda Theertha Suresh, Sebastian U. Stich, Ameet Talwalkar, Hongyi Wang, Blake Woodworth, Shanshan Wu, Felix X. Yu, Honglin Yuan, Manzil Zaheer, Mi Zhang, Tong Zhang, Chunxiang Zheng, Chen Zhu, and Wennan Zhu. A field guide to federated optimization. ar Xiv preprint ar Xiv:2107.06917, abs/2107.06917, 2021. URL https://ar Xiv.org/abs/2107.06917. (Cited on page 1) Blake E Woodworth and Nathan Srebro. An even more optimal stochastic optimization algorithm: Minibatching and interpolation learning. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 7333 7345. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper/2021/file/ 3c63ec7be1b6c49e6c308397023fd8cd-Paper.pdf. (Cited on page 6) Blake E. Woodworth and Nati Srebro. Tight complexity bounds for optimizing composite objectives. In Daniel D. Lee, Masashi Sugiyama, Ulrike von Luxburg, Isabelle Guyon, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pp. 3639 3647, 2016. URL https://proceedings.neurips.cc/paper/2016/hash/ 645098b086d2f9e1e0e939c27f9f2d6f-Abstract.html. (Cited on pages 1 and 6) Blake E. Woodworth, Brian Bullins, Ohad Shamir, and Nathan Srebro. The min-max complexity of distributed stochastic convex optimization with intermittent communication. In Mikhail Belkin and Samory Kpotufe (eds.), Conference on Learning Theory, COLT 2021, 15-19 August 2021, Boulder, Colorado, USA, volume 134 of Proceedings of Machine Learning Research, pp. 4386 4437. PMLR, 2021. URL http://proceedings.mlr.press/v134/woodworth21a.html. (Cited on page 1) Xiao-Tong Yuan and Ping Li. On convergence of distributed approximate newton methods: Globalization, sharper bounds and beyond. ar Xiv preprint ar Xiv:1908.02246, abs/1908.02246, 2019. URL https://ar Xiv.org/abs/1908.02246. (Cited on page 3) Xiao-Tong Yuan and Ping Li. On convergence of Fed Prox: Local dissimilarity invariant bounds, non-smoothness and beyond. ar Xiv preprint ar Xiv:2206.05187, abs/2206.05187, 2022. URL https://ar Xiv.org/abs/2206.05187. (Cited on page 4) Yuchen Zhang and Xiao Lin. Di SCO: Distributed optimization for self-concordant empirical loss. In Francis R. Bach and David M. Blei (eds.), Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, volume 37 of JMLR Workshop and Conference Proceedings, pp. 362 370. JMLR.org, 2015. URL http://proceedings.mlr. press/v37/zhangb15.html. (Cited on pages 3, 7, and 17) Published as a conference paper at ICLR 2023 1 Introduction 1 1.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 2 Related work 3 3 Preliminaries 4 4 Algorithms and theory 4 4.1 Basics: stochastic proximal point method . . . . . . . . . . . . . . . . . . . . . . 4 4.2 The SVRP algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 4.3 Accelerating SVRP via Catalyst . . . . . . . . . . . . . . . . . . . . . . . . . . . 
A BASIC FACTS AND NOTATION

We shall make use of the following facts from linear algebra: for any a, b ∈ R^d and any ζ > 0,

2⟨a, b⟩ = ‖a‖² + ‖b‖² − ‖a − b‖²,  (7)

‖a‖² ≤ (1 + ζ)‖a − b‖² + (1 + ζ⁻¹)‖b‖².  (8)

B APPLICATIONS OF SECOND-ORDER SIMILARITY

In this section we give a few examples where Assumption 1 holds. These are known in the literature, and we collect them here for motivation.

Relation to smoothness. The standard assumption in analyzing federated optimization algorithms for Problem (1) is smoothness:

Definition 4. (Smoothness) We say that a differentiable function f is L-smooth if for all x, y ∈ R^d we have ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖.

Note that L-smoothness of each f_1, f_2, ..., f_M implies that Assumption 1 holds with δ = L: for any x, y ∈ R^d, the variance decomposition (applied to ∇f_m(x) − ∇f_m(y) with m uniform on [M]) gives

(1/M) Σ_{m=1}^M ‖∇f_m(x) − ∇f(x) − [∇f_m(y) − ∇f(y)]‖²
= (1/M) Σ_{m=1}^M ‖∇f_m(x) − ∇f_m(y)‖² − ‖∇f(x) − ∇f(y)‖²
≤ (1/M) Σ_{m=1}^M ‖∇f_m(x) − ∇f_m(y)‖²
≤ L²‖x − y‖².

Thus, under smoothness, Assumption 1 holds. While typically we are interested in the case where δ is much smaller than L, the fact that by default δ ≤ L means that we lose no generality by considering optimization under Assumption 1.

Relation to mean-squared smoothness. Observe that the preceding argument did not actually use that each f_m is L-smooth, only that (1/M) Σ_{m=1}^M ‖∇f_m(x) − ∇f_m(y)‖² ≤ L²‖x − y‖². This assumption is known as mean-squared smoothness in the literature, and the same argument shows that it, too, is a special case of second-order similarity.

Statistical Learning. Suppose that each function f_m corresponds to empirical risk minimization with data drawn according to a distribution D_m:

f_m(x) = (1/n) Σ_{i=1}^n ℓ(x, z_{m,i}),  (9)

where ℓ is L-smooth and convex in its first argument and z_{m,i} represents the i-th training point on node m. If all the training points are drawn i.i.d. from the same distribution (z_{m,i} ∼ D for all m ∈ [M] and i ∈ [n]) and the losses are quadratic, Shamir et al. (2014) show that, with high probability, Assumption 1 holds with δ = Õ(L/√n). This can be much smaller than L, especially when the number of data points per node n is large. Zhang & Lin (2015) show that a similar concentration holds for non-quadratic minimization, albeit with extra dependence on the data dimensionality d. Hendrikx et al. (2020) remove the dependence on the data dimensionality for generalized linear models under some additional assumptions. While clients do not ordinarily share the same data distribution in federated optimization, clustering clients together so that each group of clients has similar data is a common strategy (Sattler et al., 2020; Ghosh et al., 2020), and we can then apply algorithms designed for Assumption 1 to clusters of clients with similar data.
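For intuition, the following snippet (an illustration we add here, not part of the paper's analysis) estimates the similarity constant for synthetic quadratic clients f_m(x) = ‖A_m x − b_m‖²/(2n). For quadratics, δ is bounded by the Hessian-similarity constant max_m ‖H_m − H‖_op (see the Hessian formulation below), and this quantity shrinks roughly like 1/√n as the per-client sample size n grows; the dimensions and data model are arbitrary choices.

```python
import numpy as np

# Illustrative estimate of the similarity constant for M quadratic clients
# f_m(x) = ||A_m x - b_m||^2 / (2n), whose Hessians H_m = A_m^T A_m / n
# concentrate around the average H when all clients draw i.i.d. data.
rng = np.random.default_rng(0)
d, M = 20, 50

def dissimilarity(n):
    # For quadratics, max_m ||H_m - H||_op upper-bounds delta in Assumption 1.
    Hs = []
    for _ in range(M):
        A = rng.normal(size=(n, d))      # i.i.d. data rows, same law on every client
        Hs.append(A.T @ A / n)
    H = np.mean(Hs, axis=0)
    return max(np.linalg.norm(Hm - H, ord=2) for Hm in Hs)

for n in [50, 200, 800, 3200]:
    print(f"n = {n:5d}  estimated delta <= {dissimilarity(n):.3f}")
# The printed bound shrinks roughly like 1/sqrt(n), matching the O(L/sqrt(n)) scaling.
```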
Other examples. Karimireddy et al. (2020b) observe that Assumption 1 holds with δ = 0 when using objective perturbation as a differential privacy mechanism for empirical risk minimization, since objective perturbation relies on adding linear noise that does not affect the differences of gradients. Chayti & Karimireddy (2022) give more examples relevant to federated learning.

Hessian formulation. The way we wrote Assumption 1 in the main text is

(1/M) Σ_{m=1}^M ‖∇f_m(x) − ∇f(x) − [∇f_m(y) − ∇f(y)]‖² ≤ δ²‖x − y‖².

For twice-differentiable objectives, this is a consequence of the following inequality on the Hessians: for all x ∈ R^d and all m ∈ [M],

‖∇²f_m(x) − ∇²f(x)‖²_op ≤ δ².

This motivates the name second-order similarity. To see why this is the case, observe that by Taylor's theorem (Duistermaat, 2004, Theorem 2.8.3), Jensen's inequality, and the convexity of the squared norm, we have

‖∇f_m(x) − ∇f(x) − [∇f_m(y) − ∇f(y)]‖²
= ‖∫₀¹ [∇²f_m(θx + (1−θ)y) − ∇²f(θx + (1−θ)y)](x − y) dθ‖²
≤ ∫₀¹ ‖[∇²f_m(θx + (1−θ)y) − ∇²f(θx + (1−θ)y)](x − y)‖² dθ
≤ ∫₀¹ ‖∇²f_m(θx + (1−θ)y) − ∇²f(θx + (1−θ)y)‖²_op ‖x − y‖² dθ.

Averaging with respect to m gives

(1/M) Σ_{m=1}^M ‖∇f_m(x) − ∇f(x) − [∇f_m(y) − ∇f(y)]‖²
≤ ∫₀¹ (1/M) Σ_{m=1}^M ‖∇²f_m(θx + (1−θ)y) − ∇²f(θx + (1−θ)y)‖²_op ‖x − y‖² dθ
≤ ∫₀¹ δ²‖x − y‖² dθ = δ²‖x − y‖².

Therefore, if max_{m∈[M]} sup_{x∈R^d} ‖∇²f_m(x) − ∇²f(x)‖_op ≤ δ (a condition known as Hessian similarity), we necessarily have that Assumption 1 also holds.

C ALGORITHM-INDEPENDENT RESULTS

This section collects all facts and propositions that are algorithm-independent.

C.1 FACTS ABOUT THE PROXIMAL OPERATOR

In this section we derive two useful facts about the proximal operator. Both facts are relatively straightforward to derive.

Fact 1. Let h : R^d → R be a convex differentiable function and η > 0 be a stepsize. Then for all x ∈ R^d, prox_{ηh}(x + η∇h(x)) = x.

Proof. By definition, prox_{ηh}(z) = argmin_{y∈R^d} { h(y) + (1/2η)‖y − z‖² }. This is a strongly convex minimization problem for any η > 0, hence its (necessarily unique) minimizer satisfies the first-order optimality condition ∇h(y) + (1/η)(y − z) = 0. Now observe that plugging in y = x and z = x + η∇h(x) gives ∇h(x) + (1/η)(x − (x + η∇h(x))) = ∇h(x) − ∇h(x) = 0. It follows that x = prox_{ηh}(x + η∇h(x)).
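Both facts admit a quick numerical sanity check when h is a strongly convex quadratic, for which the prox has a closed form. The following sketch (illustrative only; the problem data are arbitrary) verifies the fixed-point identity of Fact 1 and the contraction factor of the upcoming Fact 2.

```python
import numpy as np

# Sanity check of Facts 1 and 2 for the strongly convex quadratic
# h(x) = 0.5 * x^T Q x + c^T x, whose prox solves (I + eta Q) y = z - eta c.
rng = np.random.default_rng(1)
d, eta = 5, 0.7
B = rng.normal(size=(d, d))
Q = B @ B.T + np.eye(d)            # Q >= I, so h is strongly convex
c = rng.normal(size=d)
mu = np.linalg.eigvalsh(Q).min()   # exact strong convexity constant of h

def prox(z):
    return np.linalg.solve(np.eye(d) + eta * Q, z - eta * c)

def grad(x):
    return Q @ x + c

# Fact 1: prox_{eta h}(x + eta * grad h(x)) = x.
x = rng.normal(size=d)
print(np.allclose(prox(x + eta * grad(x)), x))   # True

# Fact 2: ||prox(x) - prox(y)|| <= ||x - y|| / (1 + eta * mu).
y = rng.normal(size=d)
lhs = np.linalg.norm(prox(x) - prox(y))
rhs = np.linalg.norm(x - y) / (1 + eta * mu)
print(lhs <= rhs + 1e-12)                        # True
```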
Fact 2. (Tight contractivity of the proximal operator). If h is μ-strongly convex and differentiable, then for all η > 0 and for any x, y ∈ R^d we have

‖prox_{ηh}(x) − prox_{ηh}(y)‖² ≤ (1/(1+ημ)²) ‖x − y‖².

Proof. This lemma can be seen as a tighter version of (Mishchenko et al., 2022a, Lemma 5), though our proof technique is different. Note that p(x) = prox_{ηh}(x) satisfies η∇h(p(x)) + [p(x) − x] = 0, or equivalently p(x) = x − η∇h(p(x)). Using this we have

‖p(x) − p(y)‖² = ‖[x − η∇h(p(x))] − [y − η∇h(p(y))]‖²
= ‖[x − y] − η[∇h(p(x)) − ∇h(p(y))]‖²
= ‖x − y‖² + η²‖∇h(p(x)) − ∇h(p(y))‖² − 2η⟨x − y, ∇h(p(x)) − ∇h(p(y))⟩.  (10)

Now note that

⟨x − y, ∇h(p(x)) − ∇h(p(y))⟩ = ⟨p(x) + η∇h(p(x)) − [p(y) + η∇h(p(y))], ∇h(p(x)) − ∇h(p(y))⟩
= ⟨p(x) − p(y), ∇h(p(x)) − ∇h(p(y))⟩ + η‖∇h(p(x)) − ∇h(p(y))‖².  (11)

Combining eqs. (10) and (11) we get

‖p(x) − p(y)‖² = ‖x − y‖² + η²‖∇h(p(x)) − ∇h(p(y))‖² − 2η⟨p(x) − p(y), ∇h(p(x)) − ∇h(p(y))⟩ − 2η²‖∇h(p(x)) − ∇h(p(y))‖²
= ‖x − y‖² − η²‖∇h(p(x)) − ∇h(p(y))‖² − 2η⟨p(x) − p(y), ∇h(p(x)) − ∇h(p(y))⟩.  (12)

Let D_h(u, v) = h(u) − h(v) − ⟨∇h(v), u − v⟩ be the Bregman divergence associated with h at u, v. It is easy to show that ⟨u − v, ∇h(u) − ∇h(v)⟩ = D_h(u, v) + D_h(v, u); this is a special case of the three-point identity (Chen & Teboulle, 1993, Lemma 3.1). Using this with u = p(x) and v = p(y) and plugging back into (12) we get

‖p(x) − p(y)‖² = ‖x − y‖² − η²‖∇h(p(x)) − ∇h(p(y))‖² − 2η[D_h(p(x), p(y)) + D_h(p(y), p(x))].

Note that because h is strongly convex, we have D_h(p(y), p(x)) ≥ (μ/2)‖p(y) − p(x)‖² and D_h(p(x), p(y)) ≥ (μ/2)‖p(x) − p(y)‖², hence

‖p(x) − p(y)‖² ≤ ‖x − y‖² − η²‖∇h(p(x)) − ∇h(p(y))‖² − 2ημ‖p(x) − p(y)‖².  (13)

Strong convexity implies that for any two points u, v we have ‖∇h(u) − ∇h(v)‖² ≥ μ²‖u − v‖²; see (Nesterov, 2018, Theorem 2.1.10) for a proof. Using this in eq. (13) with u = p(x) and v = p(y) yields

‖p(x) − p(y)‖² ≤ ‖x − y‖² − η²μ²‖p(x) − p(y)‖² − 2ημ‖p(x) − p(y)‖².

Rearranging gives (1 + η²μ² + 2ημ)‖p(x) − p(y)‖² ≤ ‖x − y‖². It remains to notice that (1 + ημ)² = 1 + η²μ² + 2ημ.

C.2 A LEMMA FOR SOLVING RECURRENCES

Lemma 1. Suppose that we have a sequence of positive values r_0, r_1, ..., r_K satisfying, for some θ > 0 and some c > 0,

r_{k+1} ≤ (1/(1+θ)) [r_k + c].

Then the sequence satisfies

r_K ≤ (1/(1+θ))^K r_0 + c min{ K/(1+θ), 1/θ }.

Proof. We start from the recurrence to get

r_{k+1} ≤ (1/(1+θ)) r_k + c/(1+θ) ≤ (1/(1+θ)) [ (1/(1+θ)) r_{k−1} + c/(1+θ) ] + c/(1+θ) = (1/(1+θ))² r_{k−1} + c/(1+θ)² + c/(1+θ).

Continuing similarly we obtain

r_K ≤ (1/(1+θ))^K r_0 + (c/(1+θ)) Σ_{j=0}^{K−1} (1/(1+θ))^j.  (14)

We can now bound the latter sum in two ways. The first is to note that since 1+θ > 1 we have 1/(1+θ) < 1, and hence

Σ_{j=0}^{K−1} (1/(1+θ))^j ≤ Σ_{j=0}^{K−1} 1 = K.  (15)

The second way is to use the convergence of the geometric series:

Σ_{j=0}^{∞} (1/(1+θ))^j = 1/(1 − 1/(1+θ)) = (1+θ)/θ.  (16)

Using eqs. (15) and (16) in (14) gives

r_K ≤ (1/(1+θ))^K r_0 + (c/(1+θ)) min{ K, (1+θ)/θ } = (1/(1+θ))^K r_0 + c min{ K/(1+θ), 1/θ },

and this is the lemma's statement.

D PROOFS FOR SPPM (ALGORITHM 1)

Proof of Theorem 1. Using eq. (8) with ζ = 1/(ημ) and our assumption that the proximal operators are solved up to accuracy b, i.e. ‖x_{k+1} − prox_{ηf_{ξ_k}}(x_k)‖² ≤ b, we have

‖x_{k+1} − x*‖² = ‖x_{k+1} − prox_{ηf_{ξ_k}}(x_k) + prox_{ηf_{ξ_k}}(x_k) − x*‖²
≤ (1 + 1/(ημ)) ‖x_{k+1} − prox_{ηf_{ξ_k}}(x_k)‖² + (1 + ημ) ‖prox_{ηf_{ξ_k}}(x_k) − x*‖²
≤ (1 + 1/(ημ)) b + (1 + ημ) ‖prox_{ηf_{ξ_k}}(x_k) − x*‖².  (17)

For the second term in eq. (17), we have by Fact 1 that x* = prox_{ηf_{ξ_k}}(x* + η∇f_{ξ_k}(x*)); then using the contraction of the prox (Fact 2) we get

‖prox_{ηf_{ξ_k}}(x_k) − x*‖² = ‖prox_{ηf_{ξ_k}}(x_k) − prox_{ηf_{ξ_k}}(x* + η∇f_{ξ_k}(x*))‖² ≤ (1/(1+ημ)²) ‖x_k − (x* + η∇f_{ξ_k}(x*))‖².  (18)

Expanding out the square we have

‖prox_{ηf_{ξ_k}}(x_k) − x*‖² ≤ (1/(1+ημ)²) [ ‖x_k − x*‖² + η²‖∇f_{ξ_k}(x*)‖² − 2η⟨x_k − x*, ∇f_{ξ_k}(x*)⟩ ].

Denote expectation conditional on x_k by E_k[·]. Taking expectation conditional on x_k we get

E_k ‖prox_{ηf_{ξ_k}}(x_k) − x*‖² ≤ (1/(1+ημ)²) [ ‖x_k − x*‖² + η² E‖∇f_{ξ_k}(x*)‖² − 2η⟨x_k − x*, E_k[∇f_{ξ_k}(x*)]⟩ ] = (1/(1+ημ)²) [ ‖x_k − x*‖² + η²σ*² ],

where we used that E[∇f_{ξ_k}(x*)] = ∇f(x*) = 0 and the definition of σ*². Taking conditional expectation in eq. (17) and plugging in the last line gives

E_k ‖x_{k+1} − x*‖² ≤ ((1+ημ)/(ημ)) b + (1/(1+ημ)) [ ‖x_k − x*‖² + η²σ*² ] = (1/(1+ημ)) [ ‖x_k − x*‖² + η²σ*² + ((1+ημ)²/(ημ)) b ].

Taking unconditional expectation gives

E ‖x_{k+1} − x*‖² ≤ (1/(1+ημ)) E ‖x_k − x*‖² + (1/(1+ημ)) [ η²σ*² + ((1+ημ)²/(ημ)) b ].

We can then use Lemma 1 (with θ = ημ and the bound min{K/(1+θ), 1/θ} ≤ 1/θ) to get that at step K

E ‖x_K − x*‖² ≤ (1/(1+ημ))^K ‖x_0 − x*‖² + ησ*²/μ + ((1+ημ)²/(ημ)²) b.  (19)

This proves the first statement of the theorem. For the second statement, observe that when η = μϵ/(2σ*²) and b ≤ ϵ(ημ)²/(4(1+ημ)²) we have

ησ*²/μ + ((1+ημ)²/(ημ)²) b ≤ ϵ/2 + ϵ/4.  (20)

Moreover, using the inequality 1 − x ≤ exp(−x) for all x > 0 and our choice K = (1 + 1/(ημ)) log(4‖x_0 − x*‖²/ϵ), we have

(1/(1+ημ))^K ‖x_0 − x*‖² ≤ exp(−Kημ/(1+ημ)) ‖x_0 − x*‖² ≤ ϵ/4.  (21)

Plugging eqs. (20) and (21) into eq. (19) yields E‖x_K − x*‖² ≤ ϵ, and this is the second statement of the theorem.
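To make Theorem 1 concrete, here is a small simulation (added for illustration, not from the paper) of SPPM with exact proximal steps (b = 0) on scalar quadratics, for which the stochastic prox has a closed form. A single run should contract roughly linearly and then hover near the ησ*²/μ neighborhood that the theorem predicts; the problem data and stepsize are arbitrary.

```python
import numpy as np

# Illustrative SPPM run on scalar quadratics f_xi(x) = 0.5 * a_xi * (x - c_xi)^2,
# whose exact prox is prox_{eta f_xi}(x) = (x + eta * a_xi * c_xi) / (1 + eta * a_xi).
rng = np.random.default_rng(2)
n, eta, K = 10, 0.5, 200
a = rng.uniform(1.0, 2.0, size=n)          # per-component curvature (mu = a.min())
c = rng.normal(size=n)                     # per-component minimizers
x_star = np.sum(a * c) / np.sum(a)         # minimizer of f = (1/n) sum f_xi
mu = a.min()
sigma2 = np.mean((a * (x_star - c)) ** 2)  # sigma_*^2 = E ||grad f_xi(x*)||^2

x = 10.0
for k in range(K):
    i = rng.integers(n)
    x = (x + eta * a[i] * c[i]) / (1 + eta * a[i])   # exact stochastic prox step

# Theorem 1 with b = 0 bounds E|x_K - x*|^2 by (1+eta*mu)^{-K}|x_0-x*|^2 + eta*sigma2/mu;
# a single trajectory is noisy, but should land within a few multiples of the neighborhood.
print(f"|x_K - x*|^2 = {(x - x_star)**2:.4f},  predicted neighborhood = {eta*sigma2/mu:.4f}")
```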
E RELATED WORK FOR SPPM

Toulis et al. (2015) study SPPM under the assumption that each stochastic proximal operation has unbiased and bounded error relative to the full proximal operation, i.e. for ϵ_ξ(x) = (1/η)[prox_{ηf_ξ}(x) − prox_{ηf}(x)] we have

E[ϵ_ξ] = 0, and E‖ϵ_ξ‖² ≤ σ².  (22)

Under this assumption, Toulis et al. (2015) prove that the stochastic proximal point method can achieve a rate that is completely independent of the smoothness constant, and which matches the optimal rate for eq. (2) for (potentially) nonsmooth objectives. Kim et al. (2022) show a similar result under the same condition for a momentum variant of SPPM. It is not clear how to satisfy (22) in practice: the next example shows that even in the simple setting of quadratic minimization, the errors are not unbiased.

Example 1. Take f_1(x) = a‖x‖², f_2(x) = 2a‖x‖² with a > 0, and f = (f_1 + f_2)/2. Then for any nonzero x ∈ R^d we have, for ξ drawn uniformly at random from {1, 2},

E[prox_{ηf_ξ}(x) − prox_{ηf}(x)] = ( (1+3ηa)/(1+6ηa+8η²a²) − 1/(3ηa+1) ) x ≠ 0.

Thus the errors are not unbiased. Moreover, it can be shown that the variance scales with ‖x‖², and thus can be made arbitrarily large. In contrast, the convergence rate given by Theorem 1 does not require condition (22), and results in linear convergence for the functions of Example 1 (as both f_1 and f_2 share a minimizer at x = 0). We note that Ryu & Boyd (2014) also study the convergence of SPPM without a condition like (22), but their theory does not show convergence to an ϵ-approximate solution even if the stepsize is taken proportional to ϵ. Pătraşcu & Necoara (2017) also analyze the convergence of SPPM and distinguish two cases: if the stepsize is held constant (as in our Theorem 1), their theory shows only convergence to a neighborhood whose size cannot be made small by varying the stepsize; alternatively, by using a decreasing stepsize they can show an O(1/ϵ) iteration complexity, but this complexity requires that each f_ξ is smooth. As mentioned previously, Asi & Duchi (2019, Proposition 5.3) show the same O(1/ϵ) rate without requiring smoothness or a condition like (22), but using exact evaluations of the proximal operator. Theorem 1 gives the same iteration complexity without requiring smoothness or bounded variance, while allowing for approximate proximal point evaluations. Note that the improvement of allowing approximate evaluations over (Asi & Duchi, 2019) is not very significant, as the approximation has to be very tight. The main way we depart from (Asi & Duchi, 2019) is that we use the contractivity of the proximal operator as the main proof tool, and this extends more easily to the variance-reduced setting of SVRP.
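The bias exhibited in Example 1 is easy to check numerically. The following snippet (illustrative) evaluates the three closed-form proximal operators and prints the nonzero gap; note that the gap scales linearly with x.

```python
import numpy as np

# Numerical check of Example 1: for f1 = a x^2 and f2 = 2a x^2, the
# stochastic prox is biased, i.e. E[prox_{eta f_xi}(x)] != prox_{eta f}(x).
a, eta, x = 1.0, 0.5, 3.0

prox_f1 = x / (1 + 2 * eta * a)   # prox of f1 = a x^2
prox_f2 = x / (1 + 4 * eta * a)   # prox of f2 = 2a x^2
prox_f  = x / (1 + 3 * eta * a)   # prox of f = (f1 + f2)/2 = (3a/2) x^2

bias = 0.5 * (prox_f1 + prox_f2) - prox_f
print(f"E[prox_xi(x)] - prox_f(x) = {bias:.6f}")   # nonzero, proportional to x
```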
F PROOFS FOR SVRP (ALGORITHM 2)

Proof of Theorem 2. Let x̂_{k+1} = prox_{ηf_{m_k}}(x_k − ηg_k). Then by eq. (8) and our assumption that ‖x_{k+1} − x̂_{k+1}‖² ≤ b, we have for any a > 0

‖x_{k+1} − x*‖² = ‖x_{k+1} − x̂_{k+1} + x̂_{k+1} − x*‖² ≤ (1 + a⁻¹) ‖x_{k+1} − x̂_{k+1}‖² + (1 + a) ‖x̂_{k+1} − x*‖² ≤ (1 + a⁻¹) b + (1 + a) ‖x̂_{k+1} − x*‖².

Plugging in a = η²μ²/(1 + 2ημ) we get

‖x_{k+1} − x*‖² ≤ ((1+ημ)/(ημ))² b + ((1+ημ)²/(1+2ημ)) ‖x̂_{k+1} − x*‖².  (23)

For the second term in eq. (23), we have by Fact 1 that x* = prox_{ηf_{m_k}}(x* + η∇f_{m_k}(x*)); then using Fact 2 we get

‖x̂_{k+1} − x*‖² = ‖prox_{ηf_{m_k}}(x_k − ηg_k) − prox_{ηf_{m_k}}(x* + η∇f_{m_k}(x*))‖² ≤ (1/(1+ημ)²) ‖x_k − ηg_k − (x* + η∇f_{m_k}(x*))‖².

Expanding out the square we have

‖x̂_{k+1} − x*‖² ≤ (1/(1+ημ)²) [ ‖x_k − x*‖² + η²‖g_k + ∇f_{m_k}(x*)‖² − 2η⟨x_k − x*, g_k + ∇f_{m_k}(x*)⟩ ].

We denote by E_k[·] the expectation conditional on all information up to (and including) the iterate x_k; then

E_k ‖x̂_{k+1} − x*‖² ≤ (1/(1+ημ)²) [ ‖x_k − x*‖² + η² E_k‖g_k + ∇f_{m_k}(x*)‖² − 2η⟨x_k − x*, E_k[g_k + ∇f_{m_k}(x*)]⟩ ],  (24)

where in the last term the expectation went inside the inner product since the expectation is conditioned on knowledge of x_k, and the randomness in m_k is independent of x_k. Note that this expectation can be computed as

E_k[g_k + ∇f_{m_k}(x*)] = E_k[∇f(w_k) − ∇f_{m_k}(w_k) + ∇f_{m_k}(x*)] = ∇f(w_k) − ∇f(w_k) + ∇f(x*) = 0 + 0 = 0,

where we used that since x* minimizes f we must have ∇f(x*) = 0. Plugging this into (24) gives

E_k ‖x̂_{k+1} − x*‖² ≤ (1/(1+ημ)²) [ ‖x_k − x*‖² + η² E_k‖g_k + ∇f_{m_k}(x*)‖² ].  (25)

For the second term, we can add ∇f(x*) inside the norm (as it is zero) to get

E_k‖g_k + ∇f_{m_k}(x*)‖² = E_k‖g_k + ∇f_{m_k}(x*) − ∇f(x*)‖² = E_k‖∇f(w_k) − ∇f_{m_k}(w_k) + ∇f_{m_k}(x*) − ∇f(x*)‖² = (1/M) Σ_{m=1}^M ‖∇f(w_k) − ∇f_m(w_k) − [∇f(x*) − ∇f_m(x*)]‖².  (26)

Using Assumption 1 with eq. (26) we have E_k‖g_k + ∇f_{m_k}(x*)‖² ≤ δ²‖w_k − x*‖². Hence we can bound (25) as

E_k ‖x̂_{k+1} − x*‖² ≤ (1/(1+ημ)²) [ ‖x_k − x*‖² + η²δ²‖w_k − x*‖² ].  (27)

Taking conditional expectation in eq. (23) and plugging in the estimate of eq. (27) we get

E_k ‖x_{k+1} − x*‖² ≤ ((1+ημ)/(ημ))² b + (1/(1+2ημ)) [ ‖x_k − x*‖² + η²δ²‖w_k − x*‖² ].  (28)

Observe that by design we have

E_k ‖w_{k+1} − x*‖² = p ‖x_{k+1} − x*‖² + (1 − p) ‖w_k − x*‖².  (29)

Let α = ημ/p; then, using eqs. (28) and (29) together with 1 + αp = 1 + ημ, we have

E_k ‖x_{k+1} − x*‖² + α E_k ‖w_{k+1} − x*‖² = (1 + αp) E_k ‖x_{k+1} − x*‖² + α(1 − p) ‖w_k − x*‖²
≤ ((1+ημ)/(1+2ημ)) [ ‖x_k − x*‖² + η²δ²‖w_k − x*‖² ] + α(1 − p) ‖w_k − x*‖² + ((1+ημ)³/(ημ)²) b
= ((1+ημ)/(1+2ημ)) ‖x_k − x*‖² + α( 1 − p + (pηδ²/μ)·((1+ημ)/(1+2ημ)) ) ‖w_k − x*‖² + ((1+ημ)³/(ημ)²) b.  (30)

Note that by the condition on the stepsize we have ηδ²/μ ≤ 1/2, and (1+ημ)/(1+2ημ) ≤ 1. Using this in the second term of eq. (30) gives

E_k ‖x_{k+1} − x*‖² + α E_k ‖w_{k+1} − x*‖² ≤ ((1+ημ)/(1+2ημ)) ‖x_k − x*‖² + α(1 − p/2) ‖w_k − x*‖² + ((1+ημ)³/(ημ)²) b
≤ max{ (1+ημ)/(1+2ημ), 1 − p/2 } [ ‖x_k − x*‖² + α‖w_k − x*‖² ] + ((1+ημ)³/(ημ)²) b.

Define the Lyapunov function V_k = ‖x_k − x*‖² + (ημ/p)‖w_k − x*‖². Then the last display can simply be written as

E_k[V_{k+1}] ≤ max{ (1+ημ)/(1+2ημ), 1 − p/2 } V_k + ((1+ημ)³/(ημ)²) b.

Taking unconditional expectation and letting τ = min{ ημ/(1+2ημ), p/2 }, so that max{ (1+ημ)/(1+2ημ), 1 − p/2 } = 1 − τ, we get the simple recursion

E[V_{k+1}] ≤ (1 − τ) E[V_k] + ((1+ημ)³/(ημ)²) b.

Iterating this for k steps and using the formula for the sum of the geometric series gives, for any k ≤ K,

E[V_k] ≤ (1 − τ)^k E[V_0] + ((1+ημ)³/(ημ)²) b Σ_{j=0}^{∞} (1 − τ)^j = (1 − τ)^k E[V_0] + ((1+ημ)³/((ημ)²τ)) b.  (31)

Now note that

E‖x_k − x*‖² ≤ E[V_k].  (32)

And by initialization we have w_0 = x_0, hence

E[V_0] = ‖x_0 − x*‖² + (ημ/p)‖w_0 − x*‖² = (1 + ημ/p) ‖x_0 − x*‖².  (33)

Plugging eqs. (32) and (33) into eq. (31) gives, for any k ≤ K,

E‖x_k − x*‖² ≤ (1 + ημ/p)(1 − τ)^k ‖x_0 − x*‖² + ((1+ημ)³/((ημ)²τ)) b.  (34)

For the second statement of the theorem, observe that by the assumption b ≤ ϵτ(ημ)²/(2(1+ημ)³) we can bound the right-hand side of eq. (34) as

E‖x_k − x*‖² ≤ (1 + ημ/p)(1 − τ)^k ‖x_0 − x*‖² + ϵ/2.

Next, we use that 1 − x ≤ exp(−x) for all x ≥ 0, hence after K iterations of SVRP

E‖x_K − x*‖² ≤ (1 + ημ/p) exp(−τK) ‖x_0 − x*‖² + ϵ/2.

Thus if we run for

K ≥ (1/τ) log( (2‖x_0 − x*‖²/ϵ)(1 + ημ/p) )  (35)

iterations, we get E‖x_K − x*‖² ≤ ϵ/2 + ϵ/2 = ϵ. Choosing η = μ/(2δ²) and p = 1/M, eq. (35) reduces to

K = max{ 2δ²/μ² + 2, 2M } log( (2‖x_0 − x*‖²/ϵ)(1 + μ²M/(2δ²)) ).

This gives the second part of the theorem.
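For illustration, the following sketch (not part of the paper's development) runs SVRP (Algorithm 2) with exact proximal steps (b = 0) on M quadratic clients with similar Hessians, for which the stochastic prox is a single linear solve; the problem data, stepsize, and iteration budget are arbitrary choices.

```python
import numpy as np

# Illustrative SVRP run on M quadratic clients f_m(x) = 0.5 (x - c_m)^T A_m (x - c_m),
# where prox_{eta f_m}(z) = (I + eta A_m)^{-1} (z + eta A_m c_m) is exact (b = 0).
rng = np.random.default_rng(3)
d, M = 10, 20
A = [np.diag(rng.uniform(1.0, 1.5, size=d)) for _ in range(M)]   # similar Hessians (small delta)
c = [rng.normal(size=d) for _ in range(M)]

A_bar = np.mean(A, axis=0)
x_star = np.linalg.solve(A_bar, np.mean([A[m] @ c[m] for m in range(M)], axis=0))
grad = lambda m, x: A[m] @ (x - c[m])
full_grad = lambda x: np.mean([grad(m, x) for m in range(M)], axis=0)

eta, p, K = 1.0, 1.0 / M, 400    # eta plays the role of mu/(2 delta^2) in the theory
x = w = np.zeros(d)
gw = full_grad(w)                # cached full gradient at the anchor point w
for k in range(K):
    m = rng.integers(M)
    g = gw - grad(m, w)          # control variate g_k = grad f(w_k) - grad f_m(w_k)
    z = x - eta * g
    x = np.linalg.solve(np.eye(d) + eta * A[m], z + eta * A[m] @ c[m])
    if rng.random() < p:         # infrequent anchor update, probability p = 1/M
        w, gw = x, full_grad(x)

print(f"||x_K - x*|| = {np.linalg.norm(x - x_star):.2e}")   # should be very small
```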
G PROOFS FOR CATALYST+SVRP

The convergence rate of Catalyst is given by the following proposition:

Proposition 1. (Catalyst convergence rate). Run Catalyst (Algorithm 3) with smoothing parameter γ for a μ-strongly convex function f. Let q = μ/(μ+γ) and choose ρ < √q. Assume that at each timestep t = 1, 2, ..., T the point x_t from eq. (36) satisfies

E[ h_t(x_t) − min_{x∈R^d} h_t(x) ] ≤ ϵ_t := (2/9)(f(x_0) − f(x*))(1 − ρ)^t.  (37)

Then the iterates generated by Algorithm 3 satisfy

E[f(x_t) − f(x*)] ≤ (8/(√q − ρ)²)(1 − ρ)^{t+1}(f(x_0) − f(x*)).  (38)

Proof. This is (Lin et al., 2015, Theorem 3.1).

We see that in order to apply Catalyst, we have to solve the proximal subproblems of eq. (36) up to the accuracy given by eq. (37). However, the accuracy ϵ_t depends on the suboptimality gap f(x_0) − f(x*), which we do not have access to in general. The next proposition shows that this is not a problem for methods with linear convergence, and we can instead run the method for a fixed number of iterations:

Algorithm 3: Catalyst
Data: Initialization x_0, smoothing parameter γ, algorithm A, strong convexity constant μ.
1 Initialize y_0 = x_0.
2 Initialize q = μ/(μ+γ) and α_0 = √q.
3 for t = 1, 2, ..., T−1 do
4   Find x_t using A starting from the initialization x_{t−1}:
      x_t ≈ argmin_{x∈R^d} { h_t(x) := f(x) + (γ/2)‖x − y_{t−1}‖² }.  (36)
5   Update α_t ∈ (0, 1) as α_t² = (1 − α_t)α_{t−1}² + qα_t.
6   Compute y_t using an extrapolation step y_t = x_t + β_t(x_t − x_{t−1}) with β_t = α_{t−1}(1 − α_{t−1})/(α_{t−1}² + α_t).

Proposition 2. (Inner method complexity in Catalyst). Consider Algorithm 3 run with smoothing parameter γ on a μ-strongly convex function f, with q = μ/(μ+γ) and ρ < √q. Suppose that the method A generates iterates (z_s)_{s≥0} such that

E[ h_t(z_s) − min_{x∈R^d} h_t(x) ] ≤ A (1 − τ_{A,h})^s ( h_t(z_0) − min_{x∈R^d} h_t(x) ),  (39)

then the precision ϵ_t from eq. (37) is reached in expectation when the number of iterations s of A exceeds

T_A = (1/τ_{A,h}) log( A ( 2/(1−ρ) + 2592γ/(μ(1−ρ)²(√q−ρ)²) ) ).

Proof. This is (Lin et al., 2015, Proposition 3.2).

Thus applying Catalyst to accelerate SVRP reduces to verifying whether the condition in (39) holds for SVRP. The next proposition shows it does.

Proposition 3. Define h_t as in eq. (36) with γ + μ ≤ δ, and let the objective f : R^d → R be a finite sum (as in eq. (1)) where each f_m is μ-strongly convex, f is L-smooth, and Assumption 1 holds. Use SVRP (Algorithm 2) as the solver A to minimize h_t. Then the iterates z_1, z_2, ..., z_s generated by SVRP with stepsize η = (μ+γ)/(2δ²), solution accuracy b = 0, and communication probability p = 1/M satisfy

E[ h_t(z_s) − min_{x∈R^d} h_t(x) ] ≤ A (1 − τ)^s ( h_t(z_0) − min_{x∈R^d} h_t(x) ),

where A := ((L+γ)/(μ+γ))(1 + (γ+μ)²M/(2δ²)) and τ := min{ 1/(2δ²/(γ+μ)² + 2), 1/(2M) }.

Proof. The function h_t (from Algorithm 3) is defined in eq. (36) as h_t(x) = f(x) + (γ/2)‖x − y_{t−1}‖². By the finite-sum structure, we can write h_t as

h_t(x) = (1/M) Σ_{m=1}^M [ f_m(x) + (γ/2)‖x − y_{t−1}‖² ],  (40)

where each f_m is μ-strongly convex. For each m ∈ [M], define h_{t,m}(x) := f_m(x) + (γ/2)‖x − y_{t−1}‖². Note that h_{t,m} is (μ+γ)-strongly convex. Moreover, by direct computation we have for any x ∈ R^d that

∇h_{t,m}(x) − ∇h_t(x) = ∇f_m(x) + γ(x − y_{t−1}) − (∇f(x) + γ(x − y_{t−1})) = ∇f_m(x) − ∇f(x).

Using this combined with Assumption 1 we have

(1/M) Σ_{m=1}^M ‖∇h_{t,m}(x) − ∇h_t(x) − [∇h_{t,m}(y) − ∇h_t(y)]‖² = (1/M) Σ_{m=1}^M ‖∇f_m(x) − ∇f(x) − [∇f_m(y) − ∇f(y)]‖² ≤ δ²‖x − y‖².

It follows that problem (40) satisfies the conditions of Theorem 2 on the convergence of SVRP, and specializing that theorem (with strong convexity constant μ + γ) we get, for the iterates z_1, z_2, ..., z_s generated by SVRP with stepsize η = (μ+γ)/(2δ²), solution accuracy b = 0, and communication probability p = 1/M,

E‖z_s − z*‖² ≤ (1 + (μ+γ)²M/(2δ²))(1 − τ)^s ‖z_0 − z*‖²,  (41)

where τ = min{ 1/(2δ²/(μ+γ)² + 2), 1/(2M) } and z* = argmin_{x∈R^d} h_t(x). Note that because f is L-smooth and μ-strongly convex, h_t is (L+γ)-smooth and (μ+γ)-strongly convex, hence for any x ∈ R^d we have

h_t(x) − h_t(z*) ≥ ((μ+γ)/2)‖x − z*‖²,  (42)
h_t(x) − h_t(z*) ≤ ((L+γ)/2)‖x − z*‖².  (43)

Using (42) with x = z_0 and (43) with x = z_s and combining with (41), we obtain

E[ h_t(z_s) − h_t(z*) ] ≤ ((L+γ)/2) E‖z_s − z*‖² ≤ ((L+γ)/2)(1 + (μ+γ)²M/(2δ²))(1 − τ)^s ‖z_0 − z*‖² ≤ ((L+γ)/(μ+γ))(1 + (μ+γ)²M/(2δ²))(1 − τ)^s (h_t(z_0) − h_t(z*)),

where τ = min{ 1/(2δ²/(μ+γ)² + 2), 1/(2M) } and z* minimizes h_t.
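The outer loop of Algorithm 3 above is short enough to sketch directly. The following code is illustrative: `solve_inner` is an assumed placeholder for any inner solver satisfying (39) (such as SVRP), and `grad_f` is an assumed gradient oracle for f; neither name comes from the paper.

```python
import numpy as np

# Minimal sketch of the Catalyst outer loop (Algorithm 3), assuming an inner
# solver that approximately minimizes h_t(x) = f(x) + (gamma/2) ||x - y||^2.
def catalyst(grad_f, x0, mu, gamma, T, solve_inner):
    q = mu / (mu + gamma)
    alpha = np.sqrt(q)
    x_prev = y = x0.copy()
    for _ in range(T):
        # Step 4: warm-start the inner solver on h_t from the previous iterate.
        h_grad = lambda x: grad_f(x) + gamma * (x - y)
        x = solve_inner(h_grad, x_prev)
        # Step 5: alpha_t solves alpha_t^2 = (1 - alpha_t) alpha_{t-1}^2 + q alpha_t,
        # i.e. the positive root of alpha_t^2 + (a^2 - q) alpha_t - a^2 = 0.
        a2 = alpha ** 2
        alpha_new = 0.5 * ((q - a2) + np.sqrt((q - a2) ** 2 + 4 * a2))
        # Step 6: extrapolation y_t = x_t + beta_t (x_t - x_{t-1}).
        beta = alpha * (1 - alpha) / (a2 + alpha_new)
        y = x + beta * (x - x_prev)
        x_prev, alpha = x, alpha_new
    return x_prev
```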
G.1 PROOF OF THEOREM 3

Proof. The proof of this theorem is a straightforward combination of Propositions 1, 2, and 3. We set p = 1/M, b = 0, and shall set η and γ later. In Proposition 1, choose ρ = √q/2, so that (√q − ρ)² = q/4; then the convergence guarantee in eq. (38) becomes

E[f(x_t) − f(x*)] ≤ (32/q)(1 − √q/2)^{t+1}(f(x_0) − f(x*)) = (32(μ+γ)/μ)(1 − √q/2)^{t+1}(f(x_0) − f(x*)) ≤ (32(μ+γ)/μ) exp(−(√q/2)(t+1))(f(x_0) − f(x*)),

where in the last line we used 1 − x ≤ exp(−x). Thus in order to get an ϵ-approximate solution, the number of iterations of Catalyst should be

T_Catalyst = (2/√q) log( (32(μ+γ)/μ)·(f(x_0) − f(x*))/ϵ ) = 2√((μ+γ)/μ) log( (32(μ+γ)/μ)·(f(x_0) − f(x*))/ϵ ).

By Proposition 2, a method A satisfying eq. (39) needs T_A inner-loop iterations per Catalyst step, where

T_A = (1/τ_{A,h}) log( A ( 2/(1−ρ) + 2592γ/(μ(1−ρ)²(√q−ρ)²) ) ),

with τ_{A,h} and A as defined in eq. (39). By Proposition 3 we have that for SVRP with γ + μ ≤ δ, stepsize η = (μ+γ)/(2δ²), and communication probability p = 1/M, eq. (39) holds with

A = ((L+γ)/(μ+γ))(1 + (γ+μ)²M/(2δ²)), and τ = min{ 1/(2δ²/(γ+μ)² + 2), 1/(2M) }.

Thus the number of SVRP iterations per Catalyst step is

T_A = max{ 2δ²/(γ+μ)² + 2, 2M } log( A ( 2/(1−ρ) + 2592γ/(μ(1−ρ)²(√q−ρ)²) ) ),

and the total number of SVRP iterations is T_A · T_Catalyst. Let ι collect all the log factors, and use the looser estimate

max{ 2δ²/(γ+μ)² + 2, 2M } ≤ 4 max{ δ²/(γ+μ)², M },

which holds because γ + μ ≤ δ in all cases. Thus we get an ϵ-accurate solution once the total number of iterations equals or exceeds

T_total = 8 √((γ+μ)/μ) max{ δ²/(γ+μ)², M } ι.

We now have two cases: (a) if δ ≥ μ√M, then the choice γ = √(δ²/M) − μ gives γ + μ = δ/√M, and hence

T_total = 8 M^{3/4} √(δ/μ) ι.  (44)

Note that under this choice of γ we have γ + μ = δ/√M ≤ δ, and hence the precondition of Proposition 3 holds. (b) Otherwise, choosing γ = 0 yields

T_total = 8 M ι.  (45)

Here we have γ + μ = μ ≤ δ by assumption, and hence the precondition of Proposition 3 holds, and our usage of it is justified. Thus we reach an ϵ-accurate solution in both cases when the total number of iterations satisfies

T_total = 8 ι max{ M, M^{3/4}√(δ/μ) }.

Finally, it remains to notice that the expected number of communication steps (by the same reasoning as in Section 4.2) is E[T_comm] = (2 + 3pM) T_total = 5 T_total.

H EXTENSION TO THE CONSTRAINED SETTING

We consider the constrained problem defined as follows: let R be a convex constraint function with an easy-to-compute proximal operator (i.e. we can compute prox_{ηR}(·) easily); then the composite finite-sum minimization problem is

min_{x∈R^d} [ F(x) = f(x) + R(x) = (1/M) Σ_{m=1}^M f_m(x) + R(x) ].  (46)

The constrained optimization problem min_{x∈C} f(x) can be reduced to Problem (46) by letting R = δ_C be the indicator function of the closed convex set C, and the proximal operator associated with R in this case reduces to the projection onto C (Beck, 2017, Theorem 6.24). We shall use the following variant of SVRP to solve this problem:

Algorithm 4: SVRP for composite optimization
Data: Stepsize η, initialization x_0, number of steps K, communication probability p, local solution accuracy b.
1 Initialize w_0 = x_0.
2 for k = 0, 1, 2, ..., K−1 do
3   Sample m_k uniformly at random from [M].
4   Set g_k = ∇f(w_k) − ∇f_{m_k}(w_k).
5   Compute a b-approximation of the stochastic proximal point operator associated with f_{m_k}:
      x_{k+1} ≈ prox_{ηf_{m_k}+ηR}(x_k − ηg_k).  (47)
6   Sample c_k ∼ Bernoulli(p) and update w_{k+1} = x_{k+1} if c_k = 1, and w_{k+1} = w_k if c_k = 0.
The following theorem gives the convergence rate of Algorithm 4:

Theorem 5. (Convergence of SVRP in the composite setting). Suppose that Assumption 1 holds, that each f_m is μ-strongly convex, and let x* be the minimizer of Problem (46). Suppose that each x_{k+1} is a b-approximation of the proximal step (47). Let τ = min{ ημ/(1+2ημ), p/2 }. Set the parameters of Algorithm 4 as η = μ/(2δ²), b ≤ ϵτ(ημ)²/(2(1+ημ)³), and p = 1/M. Then the final iterate x_K satisfies E‖x_K − x*‖² ≤ ϵ provided that the total number of iterations K is larger than

T_iter = O( (M + δ²/μ²) log(1/ϵ) ).

We first discuss the computational and communication complexities incurred by Algorithm 4, and then give the proof of Theorem 5 afterwards.

Computational complexity. Note that this algorithm requires evaluating the proximal operator in eq. (47); this is equivalent to solving the local optimization problem

min_{x∈R^d} { f_{m_k}(x) + (1/2η)‖x − z_k‖² + R(x) }, for z_k = x_k − ηg_k.

When each f_{m_k} is μ-strongly convex (and L-smooth), this is a composite convex optimization problem where the smooth part is (L + 1/η)-smooth and (μ + 1/η)-strongly convex, and where the nonsmooth part R has an easy-to-compute proximal operator. This can be solved by accelerated proximal gradient descent (Schmidt et al., 2011, Proposition 4) to any desired accuracy b in

O( √((L + 1/η)/(μ + 1/η)) log(1/b) )

gradient and R-proximal operator accesses.

Communication complexity. The communication cost of Algorithm 4 is exactly the same as that of ordinary SVRP, as the iteration complexity of the method remains exactly the same. Thus the discussion in Section 4.2 applies here, and the communication cost is of order O( (M + δ²/μ²) log(1/ϵ) ) communications.

Catalyzed SVRP. Catalyst applies out of the box to composite optimization (Lin et al., 2015), provided that we are able to solve the composite subproblems. As Theorem 5 shows, Algorithm 4 can do that. Therefore we can show that Catalyzed SVRP has the communication complexity Õ( M + M^{3/4}√(δ/μ) ) in this setting as well. The proof of this rate is a straightforward extension of the proofs in Section G.

H.1 PROOFS FOR THE COMPOSITE SETTING

Fact 3. Let h be a convex function (not necessarily differentiable) and η > 0. Let x ∈ R^d, and suppose that g ∈ ∂h(x) (that is, g is a subgradient of h at x). Then we have prox_{ηh}(x + ηg) = x.

Proof. Solving the proximal step is equivalent to prox_{ηh}(z) = argmin_{y∈R^d} { h(y) + (1/2η)‖y − z‖² }. This is a strongly convex minimization problem for any η > 0, hence, by Fermat's optimality condition (Beck, 2017, Theorem 3.63), the (necessarily unique) minimizer of this problem satisfies

0 ∈ ∂[ h(y) + (1/2η)‖y − z‖² ].  (48)

Note that for any two proper convex functions f_1, f_2 on R^d, the subdifferential set of their sum ∂(f_1 + f_2)(x) is the sum of points in their respective subdifferential sets (Beck, 2017, Theorem 3.36), i.e. ∂(f_1 + f_2)(x) = { g_1 + g_2 | g_1 ∈ ∂f_1(x), g_2 ∈ ∂f_2(x) } = ∂f_1(x) + ∂f_2(x). Thus the optimality condition (48) reduces to 0 ∈ ∂h(y) + (1/η)(y − z). Now observe that plugging in y = x and z = x + ηg and using the fact that g ∈ ∂h(x), we have g + (1/η)(x − (x + ηg)) = g − g = 0. It follows that 0 ∈ ∂h(x) + (1/η)[x − (x + ηg)], and hence prox_{ηh}(x + ηg) = x.
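Fact 3 concerns proximal operators of possibly nonsmooth functions; two standard instances with simple closed forms are the projection arising from an indicator function (as in the reduction above) and, as an additional common example we add here, soft-thresholding for R = λ‖·‖₁ (the ℓ1 choice is our illustration, not one discussed in the paper).

```python
import numpy as np

# Illustrative closed-form proximal operators for two standard choices of R.
def prox_l1(z, eta, lam):
    # prox of R = lam * ||.||_1: componentwise soft-thresholding.
    return np.sign(z) * np.maximum(np.abs(z) - eta * lam, 0.0)

def proj_box(z, lo, hi):
    # prox of the indicator of the box [lo, hi]^d is the projection onto it.
    return np.clip(z, lo, hi)

z = np.array([1.5, -0.2, 0.7])
print(prox_l1(z, eta=1.0, lam=0.5))   # [ 1.  -0.   0.2]
print(proj_box(z, lo=-1.0, hi=1.0))   # [ 1.  -0.2  0.7]
```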
Fact 4. (Tight contractivity of the proximal operator). If h is μ-strongly convex, then for all η > 0 and for any x, y ∈ R^d we have

‖prox_{ηh}(x) − prox_{ηh}(y)‖² ≤ (1/(1+ημ)²) ‖x − y‖².

Proof. This is the nonsmooth generalization of Fact 2. Note that p(x) = prox_{ηh}(x) satisfies ηg_{px} + [p(x) − x] = 0 for some g_{px} ∈ ∂h(p(x)), or equivalently p(x) = x − ηg_{px}. Using this we have

‖p(x) − p(y)‖² = ‖[x − ηg_{px}] − [y − ηg_{py}]‖² = ‖[x − y] − η[g_{px} − g_{py}]‖² = ‖x − y‖² + η²‖g_{px} − g_{py}‖² − 2η⟨x − y, g_{px} − g_{py}⟩.  (49)

Now note that

⟨x − y, g_{px} − g_{py}⟩ = ⟨p(x) + ηg_{px} − [p(y) + ηg_{py}], g_{px} − g_{py}⟩ = ⟨p(x) − p(y), g_{px} − g_{py}⟩ + η‖g_{px} − g_{py}‖².  (50)

Combining eqs. (49) and (50) we get

‖p(x) − p(y)‖² = ‖x − y‖² + η²‖g_{px} − g_{py}‖² − 2η⟨p(x) − p(y), g_{px} − g_{py}⟩ − 2η²‖g_{px} − g_{py}‖² = ‖x − y‖² − η²‖g_{px} − g_{py}‖² − 2η⟨p(x) − p(y), g_{px} − g_{py}⟩.  (51)

Let D_h(u, v, g_v) = h(u) − h(v) − ⟨g_v, u − v⟩ be the Bregman divergence associated with h at u, v with subgradient g_v ∈ ∂h(v). It is easy to show that ⟨u − v, g_u − g_v⟩ = D_h(u, v, g_v) + D_h(v, u, g_u). Using this with u = p(x), v = p(y), g_u = g_{px}, g_v = g_{py} and plugging back into (51) we get

‖p(x) − p(y)‖² = ‖x − y‖² − η²‖g_{px} − g_{py}‖² − 2η[D_h(p(x), p(y), g_{py}) + D_h(p(y), p(x), g_{px})].

Note that because h is strongly convex, we have by (Beck, 2017, Theorem 5.24(ii)) that D_h(p(y), p(x), g_{px}) ≥ (μ/2)‖p(y) − p(x)‖² and D_h(p(x), p(y), g_{py}) ≥ (μ/2)‖p(x) − p(y)‖², hence

‖p(x) − p(y)‖² ≤ ‖x − y‖² − η²‖g_{px} − g_{py}‖² − 2ημ‖p(x) − p(y)‖².  (52)

Strong convexity implies that for any two points u, v and any g_u ∈ ∂h(u), g_v ∈ ∂h(v) we have ⟨g_u − g_v, u − v⟩ ≥ μ‖u − v‖²; see (Beck, 2017, Theorem 5.24(iii)) for a proof. Using Cauchy-Schwarz yields ‖g_u − g_v‖‖u − v‖ ≥ ⟨g_u − g_v, u − v⟩ ≥ μ‖u − v‖². Now if u = v, then trivially we have ‖g_u − g_v‖ ≥ μ‖u − v‖; otherwise, we divide both sides of the last inequality by ‖u − v‖ to get ‖g_u − g_v‖ ≥ μ‖u − v‖. Thus in both cases we get ‖g_u − g_v‖² ≥ μ²‖u − v‖². Using this in eq. (52) with u = p(x), v = p(y), g_u = g_{px} and g_v = g_{py} yields

‖p(x) − p(y)‖² ≤ ‖x − y‖² − η²μ²‖p(x) − p(y)‖² − 2ημ‖p(x) − p(y)‖².

Rearranging gives (1 + η²μ² + 2ημ)‖p(x) − p(y)‖² ≤ ‖x − y‖². It remains to notice that (1 + ημ)² = 1 + η²μ² + 2ημ.

Proof of Theorem 5. Let x̂_{k+1} = prox_{ηf_{m_k}+ηR}(x_k − ηg_k). Then by eq. (8) and our assumption that ‖x_{k+1} − x̂_{k+1}‖² ≤ b, we have for any a > 0

‖x_{k+1} − x*‖² = ‖x_{k+1} − x̂_{k+1} + x̂_{k+1} − x*‖² ≤ (1 + a⁻¹)‖x_{k+1} − x̂_{k+1}‖² + (1 + a)‖x̂_{k+1} − x*‖² ≤ (1 + a⁻¹) b + (1 + a)‖x̂_{k+1} − x*‖².

Plugging in a = η²μ²/(1 + 2ημ) we get

‖x_{k+1} − x*‖² ≤ ((1+ημ)/(ημ))² b + ((1+ημ)²/(1+2ημ)) ‖x̂_{k+1} − x*‖².  (53)

Now observe that by first-order optimality of x* we have 0 ∈ ∂F(x*) = ∇f(x*) + ∂R(x*). It follows that −∇f(x*) ∈ ∂R(x*), and thus we can apply Fact 3 (to the function f_{m_k} + R, whose subdifferential at x* contains ∇f_{m_k}(x*) − ∇f(x*)) to get that

x* = prox_{ηf_{m_k}+ηR}(x* + η∇f_{m_k}(x*) − η∇f(x*)).

Using this in the second term in eq. (53) followed by Fact 4, we get

‖x̂_{k+1} − x*‖² = ‖prox_{ηf_{m_k}+ηR}(x_k − ηg_k) − prox_{ηf_{m_k}+ηR}(x* + η∇f_{m_k}(x*) − η∇f(x*))‖² ≤ (1/(1+ημ)²) ‖x_k − ηg_k − (x* + η∇f_{m_k}(x*) − η∇f(x*))‖².

Expanding out the square we have

‖x̂_{k+1} − x*‖² ≤ (1/(1+ημ)²) [ ‖x_k − x*‖² + η²‖g_k + ∇f_{m_k}(x*) − ∇f(x*)‖² − 2η⟨x_k − x*, g_k + ∇f_{m_k}(x*) − ∇f(x*)⟩ ].

We denote by E_k[·] the expectation conditional on all information up to (and including) the iterate x_k; then

E_k ‖x̂_{k+1} − x*‖² ≤ (1/(1+ημ)²) [ ‖x_k − x*‖² + η² E_k‖g_k + ∇f_{m_k}(x*) − ∇f(x*)‖² − 2η⟨x_k − x*, E_k[g_k + ∇f_{m_k}(x*) − ∇f(x*)]⟩ ],  (54)

where in the last term the expectation went inside the inner product since the expectation is conditioned on knowledge of x_k, and the randomness in m_k is independent of x_k. Note that this expectation can be computed as

E_k[g_k + ∇f_{m_k}(x*) − ∇f(x*)] = E_k[∇f(w_k) − ∇f_{m_k}(w_k) + ∇f_{m_k}(x*) − ∇f(x*)] = ∇f(w_k) − ∇f(w_k) + ∇f(x*) − ∇f(x*) = 0 + 0 = 0.

Plugging this into (54) gives

E_k ‖x̂_{k+1} − x*‖² ≤ (1/(1+ημ)²) [ ‖x_k − x*‖² + η² E_k‖g_k + ∇f_{m_k}(x*) − ∇f(x*)‖² ].  (55)

For the second term, we have

E_k‖g_k + ∇f_{m_k}(x*) − ∇f(x*)‖² = E_k‖∇f(w_k) − ∇f_{m_k}(w_k) + ∇f_{m_k}(x*) − ∇f(x*)‖² = (1/M) Σ_{m=1}^M ‖∇f(w_k) − ∇f_m(w_k) − [∇f(x*) − ∇f_m(x*)]‖².  (56)
Using Assumption 1 with eq. (56) we have

E_k‖g_k + ∇f_{m_k}(x*) − ∇f(x*)‖² ≤ δ²‖w_k − x*‖².

Hence we can bound (55) as

E_k ‖x̂_{k+1} − x*‖² ≤ (1/(1+ημ)²) [ ‖x_k − x*‖² + η²δ²‖w_k − x*‖² ].  (57)

Taking conditional expectation in eq. (53) and plugging in the estimate of eq. (57) we get

E_k ‖x_{k+1} − x*‖² ≤ ((1+ημ)/(ημ))² b + (1/(1+2ημ)) [ ‖x_k − x*‖² + η²δ²‖w_k − x*‖² ].  (58)

Observe that by design we have

E_k ‖w_{k+1} − x*‖² = p ‖x_{k+1} − x*‖² + (1 − p) ‖w_k − x*‖².  (59)

Let α = ημ/p. Then using eqs. (58) and (59) together with 1 + αp = 1 + ημ, exactly as in the proof of Theorem 2, we obtain

E_k ‖x_{k+1} − x*‖² + α E_k ‖w_{k+1} − x*‖² ≤ ((1+ημ)/(1+2ημ)) ‖x_k − x*‖² + α( 1 − p + (pηδ²/μ)·((1+ημ)/(1+2ημ)) ) ‖w_k − x*‖² + ((1+ημ)³/(ημ)²) b.  (60)

Note that by the condition on the stepsize we have ηδ²/μ ≤ 1/2, and (1+ημ)/(1+2ημ) ≤ 1. Using this in the second term of eq. (60) gives

E_k ‖x_{k+1} − x*‖² + α E_k ‖w_{k+1} − x*‖² ≤ ((1+ημ)/(1+2ημ)) ‖x_k − x*‖² + α(1 − p/2) ‖w_k − x*‖² + ((1+ημ)³/(ημ)²) b
≤ max{ (1+ημ)/(1+2ημ), 1 − p/2 } [ ‖x_k − x*‖² + α‖w_k − x*‖² ] + ((1+ημ)³/(ημ)²) b.

Define the Lyapunov function V_k = ‖x_k − x*‖² + (ημ/p)‖w_k − x*‖². Then the last display can simply be written as

E_k[V_{k+1}] ≤ max{ (1+ημ)/(1+2ημ), 1 − p/2 } V_k + ((1+ημ)³/(ημ)²) b.

Taking unconditional expectation and letting τ = min{ ημ/(1+2ημ), p/2 }, so that max{ (1+ημ)/(1+2ημ), 1 − p/2 } = 1 − τ, we get the simple recursion

E[V_{k+1}] ≤ (1 − τ) E[V_k] + ((1+ημ)³/(ημ)²) b.

Iterating this for k steps and using the formula for the sum of the geometric series gives, for any k ≤ K,

E[V_k] ≤ (1 − τ)^k E[V_0] + ((1+ημ)³/((ημ)²τ)) b.  (61)

Now note that

E‖x_k − x*‖² ≤ E[V_k].  (62)

And by initialization we have w_0 = x_0, hence

E[V_0] = ‖x_0 − x*‖² + (ημ/p)‖w_0 − x*‖² = (1 + ημ/p) ‖x_0 − x*‖².  (63)

Plugging eqs. (62) and (63) into eq. (61) gives, for any k ≤ K,

E‖x_k − x*‖² ≤ (1 + ημ/p)(1 − τ)^k ‖x_0 − x*‖² + ((1+ημ)³/((ημ)²τ)) b.  (64)

For the second statement of the theorem, observe that by the assumption on b we can bound the right-hand side of eq. (64) as

E‖x_k − x*‖² ≤ (1 + ημ/p)(1 − τ)^k ‖x_0 − x*‖² + ϵ/2.

Next, we use that 1 − x ≤ exp(−x) for all x ≥ 0, hence after K iterations we have

E‖x_K − x*‖² ≤ (1 + ημ/p) exp(−τK) ‖x_0 − x*‖² + ϵ/2.

Thus if we run for

K ≥ (1/τ) log( (2‖x_0 − x*‖²/ϵ)(1 + ημ/p) )  (65)

iterations, we get E‖x_K − x*‖² ≤ ϵ/2 + ϵ/2 = ϵ. Choosing η = μ/(2δ²) and p = 1/M, eq. (65) reduces to

K = max{ 2δ²/μ² + 2, 2M } log( (2‖x_0 − x*‖²/ϵ)(1 + μ²M/(2δ²)) ).

This gives the second part of the theorem.

I CLIENT-SERVER FORMULATIONS OF THE ALGORITHMS

The algorithms we gave (Algorithm 1 and Algorithm 2) are written in notation that does not make clear the exact roles of the server and the clients in the process. Below, we rewrite the algorithms to make these roles clear, as Algorithm 5 and Algorithm 6. Both formulations are exactly equivalent; the rewriting is just for clarity. Both algorithms rely on evaluating the proximal operator approximately using local data: we give an example solution method using gradient descent in Algorithm 7. By standard convergence results for gradient descent, Algorithm 7 halts in at most O( ((L + 1/η)/(μ + 1/η)) log(1/b) ) iterations. As discussed in the main text, we can also use accelerated gradient descent or other algorithms; this example is just for illustration.
Algorithm 5: Stochastic Proximal Point Method (SPPM) (Client-Server formulation)
Data: Stepsize η, initialization x_0, number of steps K, proximal solution accuracy b.
1 Server communicates stepsize η and desired proximal solution accuracy b to all clients.
2 for k = 0, 1, 2, ..., K−1 do
3   Server samples a client m ∈ [M].
4   Server sends the model x_k to client m.
5   Client m evaluates the local proximal operator up to accuracy b on its local data (e.g. using Algorithm 7): x_{k+1} ≈ prox_{ηf_m}(x_k).
6   Client m sends the updated model x_{k+1} to the server.

Algorithm 6: Stochastic Variance-Reduced Proximal Point (SVRP) Method (Client-Server formulation)
Data: Stepsize η, initialization x_0, number of steps K, communication probability p, local solution accuracy b.
1 Initialize w_0 = x_0.
2 Server communicates stepsize η and desired proximal solution accuracy b to all clients.
3 Server communicates the iterate w_0 to all clients.
4 Every client m computes its local gradient ∇f_m(w_0) and sends it back to the server.
5 Server receives the local gradients and averages them: ∇f(w_0) = (1/M) Σ_{m=1}^M ∇f_m(w_0).
6 Server sends ∇f(w_0) to all clients.
7 for k = 0, 1, 2, ..., K−1 do
8   Server samples a client m_k uniformly at random from [M].
9   Server sends the model x_k to client m_k.
10  Client m_k computes its local gradient ∇f_{m_k}(w_k) with the cached model w_k and the cached full gradient ∇f(w_k), and sets g_k = ∇f(w_k) − ∇f_{m_k}(w_k).
11  Client m_k computes a b-approximation of the proximal point operator on its local data (e.g. using Algorithm 7): x_{k+1} ≈ prox_{ηf_{m_k}}(x_k − ηg_k).
12  Client m_k sends x_{k+1} to the server.
13  Server samples c_k ∼ Bernoulli(p).
14  if c_k = 1 then
15    Server notifies the clients of the update, sets w_{k+1} = x_{k+1}, and sends it to all clients.
16    Every client m caches the new model w_{k+1} in place of the old model w_k, then computes its local gradient ∇f_m(w_{k+1}) and sends it back to the server.
17    Server receives the local gradients and averages them: ∇f(w_{k+1}) = (1/M) Σ_{m=1}^M ∇f_m(w_{k+1}).
18    Server sends ∇f(w_{k+1}) to all clients to cache.
19  else
20    Clients keep their cached vector w_k as it is, setting w_{k+1} = w_k automatically.

Algorithm 7: Client m's approximate proximal point evaluation using gradient descent.
Data: Proximal operator argument z, proximal solution accuracy b, smoothness constant L, strong convexity constant μ.
1 (This algorithm evaluates prox_{ηf_m}(z) up to accuracy b using gradient descent.)
2 Set the local stepsize β = 1/(L + 1/η).
3 Set the model y_0 = 0.
4 for t = 0, 1, 2, ... do
5   Compute the gradient ∇_t = ∇f_m(y_t) + (1/η)(y_t − z).
6   Update the model y_{t+1} = y_t − β∇_t.
7   if the gradient norm is small, ‖∇_t‖² ≤ b(μ + 1/η)², then
8     Exit and return y_t, as it is a b-approximate solution: by strong convexity we have ‖y_t − prox_{ηf_m}(z)‖² ≤ ‖∇_t‖²/(μ + 1/η)² ≤ b.
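For completeness, here is a direct transcription of Algorithm 7 in Python, a sketch under the stated smoothness and strong convexity assumptions; the gradient oracle `grad_fm` and the test problem at the end are illustrative choices of ours.

```python
import numpy as np

# Sketch of Algorithm 7: evaluate prox_{eta f_m}(z) to accuracy b by running
# gradient descent on the subproblem y -> f_m(y) + ||y - z||^2 / (2 eta).
def approx_prox(grad_fm, z, eta, b, L, mu, y0=None):
    beta = 1.0 / (L + 1.0 / eta)          # stepsize for the (L + 1/eta)-smooth subproblem
    tol = b * (mu + 1.0 / eta) ** 2       # stop once ||grad||^2 <= b (mu + 1/eta)^2
    y = np.zeros_like(z) if y0 is None else y0.copy()
    while True:
        g = grad_fm(y) + (y - z) / eta    # gradient of the prox subproblem
        if g @ g <= tol:
            # By strong convexity, ||y - prox_{eta f_m}(z)||^2 <= ||g||^2/(mu + 1/eta)^2 <= b.
            return y
        y = y - beta * g

# Example: f_m(y) = 0.5 * mu * ||y||^2, whose exact prox is z / (1 + eta * mu).
mu, eta = 1.0, 0.5
z = np.array([2.0, -1.0])
y = approx_prox(lambda y: mu * y, z, eta, b=1e-12, L=mu, mu=mu)
print(y, z / (1 + eta * mu))   # the two should agree to about 1e-6
```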