# tighter_analysis_for_proxskip__7666b444.pdf

Tighter Analysis for Prox Skip

Zhengmian Hu 1 2 Heng Huang 1

In this paper, we provide a tighter analysis for Prox Skip, an algorithm that allows fewer proximal operator computations to solve composite optimization problems. We improve the existing decreasing speed of Lyapunov function from O(p2) to O(p), when p, the frequency of the proximal operators is small enough. Our theoretical analysis also reveals the drawbacks of using large step sizes for gradient descent in Prox Skip when the proximal operator part is the bottleneck. Our main motivation comes from the continuous limit in which the original analysis of Prox Skip fails to guarantee convergence when both the step size γ and frequency p tend to zero. We construct a counterexample to demonstrate why such counterintuitive behavior occurs for the original analysis and then propose a novel Lyapunov function variant to construct a tighter analysis, avoiding the problem of the old one. Such a new Lyapunov function can be directly extended to many other variants of Prox Skip. When applied to stochastic gradient setup, our analysis leads to an improved proximal operator complexity for SProx Skip from O( p

1/εµ2 log(1/ε)) to O( κ log(1/ε)).

1. Introduction

Composite optimization is a problem of the form

min x Rd f(x) + ψ(x),

where f : Rd R is a smooth function and ψ : Rd R {+ } is a proper, closed and convex function. In this paper, we focus on the case where f is strongly convex. This type of problems arises in a wide range of applications

1Department of Computer Science, University of Maryland, College Park, MD, USA. 2Department of Electrical and Computer Engineering, University of Pittsburgh, Pittsburgh, PA, USA.. Correspondence to: Zhengmian Hu <huzhengmian@gmail.com>, Heng Huang <henghuanghh@gmail.com>.

Proceedings of the 40 th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

in machine learning and statistics modeling (Cand es et al., 2011; Cand es & Recht, 2012; Lustig et al., 2007; Tibshirani, 2011; Bao et al., 2022).

Subgradient-based optimization algorithms are not generally used because of their slow convergence rates. Proximal gradient descent (PGD) (Combettes & Wajs, 2005; Passty, 1979; Nesterov, 2013) is a canonical method for solving composite optimization problems. It is based on the proximal operator proxψ(y) = arg min x Rd

2 x y 2 + ψ(x)

and has a gradient complexity of O(κ log(1/ε)). Accelerated proximal gradient descent (APGD) (Nesterov, 1998; 2013), an accelerated variant of deterministic gradient descent, has a better gradient complexity of O( κ log(1/ε)).

Most prior work focused on the scenario where the cost of the proximal operator is low and the gradient descent part is the bottleneck. In this case, typically the gradient complexity and proximal operator complexity are equal. On the other hand, in federated learning (FL) one needs to minimize the average of multiple different functions f(x) := 1 n Pn i=1 fi(x). It can be put in consensus form (Parikh et al., 2014): f (x1, . . . , xn) := 1 n Pn i=1 fi (xi),

ψ (x1, . . . , xn) :=

( 0, if x1 = = xn, + , otherwise as a spe-

cial case of composite optimization. This attracted attention because the proximal operator proxγψ (x1, . . . , xn) = ( x, . . . , x) Rnd amounts to a single global communication of the parameters from all clients for computing the average. In this context, people sought fewer proximal operator steps and hence fewer communications, rather than focusing on the gradient complexity of optimization.

Local training is a common approach to reduce communication and consists of multiple local gradient descent steps interspersed by a few proximal operator steps. The simplest local training approach, known as Local SGD/Fed Avg (Mangasarian & Solodov, 1993; Mc Donald et al., 2010; Mc Mahan et al., 2016; Zhang et al., 2016; Stich, 2018; Lin et al., 2018), is equivalent to a multi-step GD followed by a single proximal operator step in composite optimization and has been shown to suffer from client drift (Khaled et al., 2019; Karimireddy et al., 2020; Wu et al., 2022). To address this, various methods have been proposed, including Scaffold (Karimireddy et al., 2020), S-Local-GD (Gorbunov

Tighter Analysis for Prox Skip

et al., 2021), and Fed Lin (Mitra et al., 2021), that leverage control variates to obtain linear convergence. However, the theoretical understanding of such methods is still in its early stages, and the communication complexity remains at O(κ log(1/ε)), which is no better than that of PGD.

Recently, Prox Skip (Mishchenko et al., 2022) was proposed to address this issue by randomly applying the proximal operator step. With a tighter analysis, it has successfully achieved an acceleration over PGD even when Nesterov momentum is not used, with an optimal proximal operator complexity of O( κ log(1/ε)). In comparison with Accelerated Proximal Gradient Descent (APGD), it has a worse gradient complexity, but is much simpler in local training and is allowed to have a non-optimal gradient complexity when the bottleneck is only in communication. However, this work does not imply that local training has been thoroughly understood. Specifically, existing analysis is not tight in certain regimes and cannot extrapolate to continuous limits, and only can achieve sublinear convergence with respect to proximal operator complexity when a stochastic gradient is used.

In this paper, we improve upon the above and make contributions as follows:

We provide a tighter analysis of Prox Skip in Section 3, which improves the decreasing speed of Lyapunov function from O(p2) to O(p) when the proximal operator part is the bottleneck.

We reveal an effect of step size that is not present in the previous analysis, which suggests that large step size actually hinders convergence when the proximal operator part is the bottleneck, as illustrated in Section 3.1.

Our analysis embodies the continuous limit as a special case of Prox Skip, which demonstrates that gradient flow can also solve composite optimization collaboratively with a proximal operator and achieves O( κ log(1/ε)) proximal operator complexity, as discussed in Section 4.

We propose a new Lyapunov function and explain in Section 5 why the old one does not fit and what new proving techniques are needed.

In Section 6, we extend our analysis to a variety of Prox Skip variants, including SProx Skip, Prox Skip-VR, and Grad Skip+. This demonstrates the generality of our analysis methodology.

For SProx Skip, we further improve the proximal operator complexity from O( p

1/εµ2 log(1/ε)) to O( κ log(1/ε)).

We remark that this paper only includes basic numerical experimental results, which are separated in different sections, to verify and illustrate our theoretical analysis results. This

Algorithm 1 Prox Skip

1: stepsize γ > 0, probability p > 0, initial iterate x0 Rd, initial control variate h0 Rd, number of iterations T 1 2: for t = 0, 1, ..., T 1 do 3: ˆxt+1 = xt γ( f(xt) ht) 4: Flip a coin θt {0, 1} where Prob(θt = 1) = p 5: if θt = 1 then 6: xt+1 = prox γ

p ψ(ˆxt+1 γ

7: else 8: xt+1 = ˆxt+1 9: end if 10: ht+1 = ht + p

γ (xt+1 ˆxt+1) 11: end for

is for two reasons: firstly, our main contribution is to derive a tighter theoretical analysis and the algorithms are not new; secondly, Prox Skip and other methods studied in this paper are abstract frameworks and have multiple instantiations suitable for different scenarios. Interested readers should refer to the papers where these algorithms were proposed for evaluation and comparison with other algorithms.

2. Preliminary

We start by introducing Prox Skip, the algorithm we are primarily studying in this paper. Algorithm 1 presents the pseudocode of Prox Skip. Mishchenko et al. (2022) proposed Prox Skip and provides a convergence analysis, based on the following Lyapunov function

Ψold t = xt x 2 + γ2

p2 ht h 2. (1)

If this Lyapunov function converges to zero, then xt converges to x and ht converges to h = f(x ). Further, the original paper guarantees the convergence of Ψold t as follows

Theorem 2.1. Let f be µ-strongly convex with positive µ > 0 and L-smooth, and let 0 < γ 1

L and 0 < p 1. Then, the iterates of Prox Skip (Algorithm 1) satisfy

E[Ψold t+1] (1 ζold)Ψold t , ζold = min(γµ(2 γL), p2).

In other words,

( γµ(2 γL) if γ γcrit p2 if γ > γcrit.

γcrit is the root of γµ(2 γL) = p2 if it exists, otherwise positive infinity.

Remark 2.2. The bound in Theorem 2.1 is slightly tighter than original result ζ = min(γµ, p2) in (Mishchenko et al., 2022).

Tighter Analysis for Prox Skip

0.25 0.50 0.75 1.00 γ

(a) p = 0.3

0.25 0.50 0.75 1.00 γ

(b) p = 0.1

0.25 0.50 0.75 1.00 γ

(c) p = 0.01

Figure 1: Relationship between step size and decreasing speed of Lyapunov function when µ = 0.1 and L = 1

Notably, Theorem 2.1 suggests that there are two regimes: gradient descent is the bottleneck when γ is sufficiently small; proximal operator is the bottleneck when γ is sufficiently large or equivalently when p is sufficiently small. In our improved analysis later on, we use a critical value of γ, denoted by γcrit, to separate the two regimes. All our improvements are focused on the second regime, where proximal operator becomes the bottleneck, i.e. γ > γcrit.

3. New Analysis

In this section, we present a new analysis of Prox Skip. The major difference between our analysis and the existing analysis is a novel Lyapunov function. In particular,

Ψnew t = xt x 2 + γ2

p xt x , ht h . (2)

Compared with the old Lyapunov function, there is an additional inner product term. Note that the new Lyapunov function remains a quadratic form, and it is still positive definite when 1 < < 1. Thus, when Ψnew t converges, we still have xt converging to x and ht converging to h = f(x ).

Based on this new Lyapunov function, we derive a new convergence rate as follows.

Theorem 3.1. Let f be µ-strongly convex with positive µ > 0 and L-smooth with L > µ, and let 0 < γ 1

L and 0 < p 1. Moreover, if γ γcrit, let = 0, otherwise, let be the unique solution in (0, 1) of following equation:

p + 1 2 + Lµγ2

γµ(2 γL) p2

Then, the iterates of Prox Skip (Algorithm 1) satisfy

E[Ψnew t+1] (1 ζnew)Ψnew t ,

( γµ(2 γL) if γ γcrit p2 + (p p2) if γ > γcrit.

Our improvement depends on the construction of the new Lyapunov function which will be explained in detail in Section 5.

We remark that for the regime where proximal operator is the bottleneck, Theorem 3.1 always provides better result than Theorem 2.1, since the decreasing speed of Lyapunov function is improved from p2 to p2 + (p p2). This improvement can be more pronounced in certain special limits when = ω(p). Here ω(p) means it decreases strictly and asymptotically slower than p for small p. As an example, we discuss continuous limit in Section 4, where converges to a constant at the limit of p 0. Another example is SProx Skip, where the gradients are replaced by noisy stochastic gradients. The result for SProx Skip heavily depends on the convergence rate in the continuous limit, which will be discussed in Section 6.1.

3.1. Parameter Selection

In this subsection, we discuss how the step size γ used in gradient descent affects the convergence rate of Prox Skip. The traditional theory of convergence rate has led to a misconception that, within the step size range γ 1/L, larger step sizes always lead to faster convergence, as ζold is monotonically increasing with respect to γ. However, our new theory does not support this conjecture.

In our new theory, ζnew is not monotonic with respect to γ. As demonstrated in Figure 1, our new theory yields a much higher decreasing speed of Lyapunov function than the traditional theory when p is small enough, and for this rate to be achieved a proper step size must be chosen.

Tighter Analysis for Prox Skip

0 50 100 150 200 t

optimal γ γ = 1/L

(a) p = 0.1

0 500 1000 1500 2000 t

optimal γ γ = 1/L

(b) p = 0.01

Figure 2: Prox Skip experiment for comparing of different step sizes.

This raises a natural question: given that the frequency p of the proximal operator is fixed, how do we choose the optimal step size to maximize ζnew? To this end, we propose Algorithm 2 to solve this problem. We also provide theoretical guarantees for this algorithm in Theorem 3.2.

Algorithm 2 Step size selection for Prox Skip

1: probability p > 0, smoothness constant L, strongly convex constant µ.

1 κ then return end if

3: while γ doesn t converge do 4: Solve Equation (3) 5: γ 1

L p(1 (1 )(1 p))

(p+(1 p) ) 6: end while

Theorem 3.2. The iterations in Algorithm 2 always converge and γ will converge to argmaxγ ζnew.

Finally, we conduct an experiment to verify that the cost of large step sizes is not just an artifact of theory. We consider Nesterov s worst function in the world (Nesterov, 1998) in its strongly convex version as f(x) and an indicator function as ψ(x) (see Appendix A.5 for the details). As shown in Figure 1, for p = 0.1, there is about twofold gap between the convergence rate when using the largest step size and when using Algorithm 2 to select step size. For p = 0.01, this gap is around 20-fold. Figure 2 provides similar results for each step size by plotting 10 random samples of the numerical process.

Note that we are not claiming that Algorithm 2 can replace step size tuning in practice. One reason is that the algorithm is highly dependent on κ which is prone to under/overestimation and is only a global characterization of the local behavior. Another reason is that Theorem 3.1 is not tight, therefore the optimal in the sense of Theorem 3.1 is not really optimal for Prox Skip.

Algorithm 3 ODEProx

1: Horizon τ > 0, initial iterate x0 Rd, initial control variate h0 Rd, total time T 0 2: t 0 3: loop 4: Sample random variable ˆτ Exp( 1

τ ) 5: t min(T, t + ˆτ) 6: ˆxt xt 7: Solve dˆxs

ds = ( f(ˆxs) ht) for s [t, t ] 8: if t = T then 9: xt = ˆxt , ht = ht 10: break 11: end if 12: xt = proxτψ(ˆxt τht) 13: ht = ht + 1

τ (xt ˆxt ) 14: t t

15: end loop

4. Continuous Limit

In this section, we consider the situation in which τ = γ/p is fixed and both γ and p tend to 0. Under this setting, Prox Skip will evolve into a new algorithm, which we call ODEProx (Algorithm 3), just like Gradient Descent transforms into Gradient Flow in the limit of infinitesimal stepsize.

The Lyapunov function for ODEProx is as follows for t 0:

Ψt = xt x 2+τ 2 ht h 2 2 τ xt x , ht h

Theorem 3.1 implies following convergence guarantee.

Corollary 4.1. Let f be µ-strongly convex and L-Lipschitz smooth. Moreover, let be the unique solution in (0, 1) of following equation:

3 2 (µτ + 1) 2 + Lµτ 2 + 2µτ + 2 2µτ = 0. (4)

Tighter Analysis for Prox Skip

For T 0, the iterates of ODEProx (Algorithm 3) satisfy

Algorithm 2 transforms into Algorithm 4.

Algorithm 4 Horizon selection for ODEProx

1: Smoothness constant L, strongly convex constant µ. 2: τ 1

L 3: while τ doesn t converge do 4: Solve Equation (4) 5: τ 1

(1 )) 6: end while

The following theorem characterizes the proximal operator complexity of ODEProx with τ from Algorithm 4.

Corollary 4.2. Let f be µ-strongly convex and L-Lipschitz smooth. Under optimal choice of τ, we have

2 (κ 1) 2 2 + 2 (1 )2 = 0,

and when κ is large, we have = Θ(1/ κ). Furthermore, the expected oracle complexity of proximal operator in order to achieve E[ΨT ] ε is T/τ = Θ(1/ ) = Θ( κ).

The continuous limit has three implications.

First, it highlights the deficiency of the existing theory as we observe that the convergence rate of the existing theory (Theorem 2.1) reaches 0 in the continuous limit. This is because the decreasing speed of Lyapunov function of the existing theory is O p2 , while that of the new theory is O(p). We will explain in detail why there is such a difference between them in the next section.

Second, it implies that if we can solve a gradient flow with respect to f(x), we can also solve the corresponding composite optimization problem by incorporating the proximal operator, as demonstrated in Figure 3.

Finally, under this continuous limit, the asymptotic proximal operator complexity is identical to that with finite step sizes, which suggests that we always can reduce both γ and p simultaneously without worrying about degenerated convergence rates.

5. Main Idea of Proof

In this section, we will introduce the main idea behind the construction of our new Lyapunov function and illustrate the intuition through a simple example. Moreover, we will discuss the new proof techniques required for the new Lyapunov function.

0 10 20 30 40 50 t

Figure 3: Convergence of ODEProx.

5.1. Intuition

To better explain why the old theory is not tight, we construct a special example in which the old theory is unable to guarantee convergence. As we observe under continuous limit, the lower bound of the decreasing speed of the old theory approaches 0, which indicates that we can find a special point such that the expected decreasing speed of Ψold t from this point on is 0.

Our particular construction is as follows. Consider a onedimensional problem with ψ(x) = 0 and f(x) = x 2, with starting point x0 = 0, h0 = 1. The contour plot of the old Lyapunov function is shown in Figure 4.

In ODEProx, there are two operations involved, namely gradient flow and proximal operator. As can be seen from the figure, neither of them can directly lower the Lyapunov function, since for gradient flow, the moving direction is tangent to the contour lines, while for proximal operator, both start and end points are on the same contour line. Therefore in this example, the decrease speed of the old Lyapunov function is 0.

With this example, we can show how our new Lyapunov function bypasses the problem of the old Lyapunov function

Gradient Flow

Lyapunov function

Figure 4: Illustration of old Lyapunov function

Tighter Analysis for Prox Skip

in a very straightforward way. We add an inner product term to the Lyapunov function. The contour lines of the new Lyapunov function are no longer circles, but ellipses, as shown in Figure 5, which enables gradient flow to lower the Lyapunov function directly in this example.

Gradient Flow

Lyapunov function

Figure 5: Illustration of new Lyapunov function

Thus we have skirted around the issue of the old Lyapunov function.

5.2. Property of Proximal Operator

Our new Lyapunov function, although resolving the issue of the old Lyapunov function, also makes previous proof techniques unavailable. In previous proofs, a lemma based on firm nonexpansiveness is used to give an upper bound for the Lyapunov function after applying proximal operator.

Lemma 5.1 (Firm nonexpansiveness). For any t 0, if θt = 1

xt+1 x 2 + γ2

p2 ht+1 h 2

(ˆxt+1 x ) γ

Note that the left-hand side of Lemma 5.1 is consistent with Ψold t+1, thus this lemma is sufficient for proving convergence of the old Lyapunov function. However, this is not the case for the new Lyapunov function.

Therefore, we consider two independent lemmas instead, as shown in Lemmas 5.2 and 5.3.

Lemma 5.2 (Invariance). For any t 0,

(xt+1 x ) γ

p (ht+1 h ) = (ˆxt+1 x ) γ

Lemma 5.3 (Monotonicity). For any t 0, if θt = 1

xt+1 x , ht+1 h 0.

Note that Lemma 5.1 is a simple corollary of Lemmas 5.2 and 5.3, but not vice versa. Hence replacing the firm nonexpansiveness property with an invariance plus monotonicity gives us more freedom in our proof, allowing us to complete the proof of convergence for the new Lyapunov function. The full proof is written in Appendix A.3.

6. Extensions

In this section, we extend the ideas and techniques from the previous sections to other variants of Prox Skip. Specifically, we consider:

SProx Skip (Mishchenko et al., 2022), which substitutes the gradient with a stochastic gradient;

Prox Skip-VR (Malinovsky et al., 2022), which allows Prox Skip to be combined with variance reduction techniques;

Grad Skip+ (Maranjyan et al., 2022), which allows an additional unbiased compressor to be added to the gradient descent step and extends the random procedure of proximal operators to a general unbiased compressor.

6.1. SProx Skip

When only a stochastic gradient oracle is available or computing the full gradient is too costly, SProx Skip (Mishchenko et al., 2022) can be used as a stochastic gradient variant of Prox Skip. The only difference between the two algorithms lies in the substitution of f(xt) with gt(xt). We provide the pseudocode of SProx Skip in Algorithm 5 in the Appendix.

For stochastic gradients, we need to introduce the following assumptions:

Assumption 6.1 (Unbiasedness). For all t 0, gt(xt) is an unbiased estimator of the gradient f(xt). That is,

E[gt(xt) | xt] = f(xt)

Assumption 6.2 (Expected smoothness). There exist constants A 0 and C 0 such that for all t 0

E[ gt(xt) f(x ) 2 | xt] 2ADf(xt, x ) + C

Remark 6.3. The commonly used bounded variance assumption Var[gt(xt) | xt] σ2 implies Assumption 6.2 with A = L and C = σ2.

For SProx Skip, we define a Lyapunov function Ψold and Ψnew identical to Prox Skip s. We quote from (Mishchenko et al., 2022) for existing analysis results for SProx Skip:

Theorem 6.4. Under Assumptions 6.1 and 6.2, let 0 < γ 1/A and 0 < p 1. Then, the iterates of SProx Skip

Tighter Analysis for Prox Skip

Method Communication Complexity

Fed Avg (Li et al., 2020) O σ2 µNKε + G2K

Scaffold (Karimireddy et al., 2020) e O σ2 µSKε + 1

SProx Skip (Mishchenko et al., 2022) O p

1/εµ2 log(1/ε)

SProx Skip (Our) O( κ log(1/ε))

Table 1: Different stochastic local methods for strongly convex optimization.

(Algorithm 5) satisfy

E Ψold T (1 ζold)T Ψold 0 + γ2C

ζold = min(γµ(2 γA), p2).

γcrit is the root for two branches to be equal if it exists, otherwise positive infinity. Remark 6.5. Similar to Theorem 2.1, the bound in Theorem 6.4 is slightly improved compared to original result in (Mishchenko et al., 2022)..

By applying similar analysis techniques as Theorem 3.1, we can prove the following convergence result: Theorem 6.6. Under the same assumption as Theorem 6.4, if γ γcrit, let = 0. Otherwise, let be the unique solution in (0, 1) of following equation:

p + 1 2 + Lµγ2

γµ(2 γA) p2

Then, the iterates of SProx Skip (Algorithm 5) satisfy

E [Ψnew T ] (1 ζnew)T Ψnew 0 + γ2C

( γµ(2 γA) if γ γcrit p2 + (p p2) if γ > γcrit.

The new decreasing speed of Lyapunov function is ζnew = O(p), compared to ζold = O p2 , suggesting an asymptotically better result on proximal operator complexity.

We first review the asymptotic complexity of the old analysis: based on (Mishchenko et al., 2022), the stepsize is γ = min 1

2C , and the proximal operator frequency is p = γµ, and the proximal operator complexity is

2C εµ2 o log 2Ψ0

ε . Note that the choice p =

O γ is a direct consequence of ζold = O min(γ, p2) , in order to make sure ζold is not too small. The improvement of ζold directly allows us to use a smaller frequency of p, thus reducing the proximal operator complexity.

Our new theory suggests following asymptotic complexity:

0.0 2.5 5.0 7.5 10.0 12.5 15.0 Expected Number of Proximal Operater Calls

p=0.100 p=0.030 p=0.010 p=0.003 p=0.001

Figure 6: SProx Skip experiment.

Corollary 6.7. Under the same assumption as Theorem 6.4, we define ODE as the solution of Equation (4) and τ = γ

If ε is small, we set p = ODE

τ 2C ε and T = 1 p ODE log Ψnew 0 ε . Then we have E[Ψnew T ] = 2ε + O ε2 and the oracle complexity of proximal operator is p T = 1 ODE log Ψnew 0 ε . When κ is large, with good choice of τ obtained from Algorithm 4, the oracle complexity of proximal operator is p T = Θ( κ log(1/ε)).

Comparing the new and old theories, we see that the proximal operator complexity in the new theory only logarithmically depends on ε, which is much better than the old theory.

To verify our new convergence rate, we conducted an experiment by manually adding gradient noise: we fixed τ = γ/p to be the optimal value for ODEProx and tried different values of p and γ = τp. To reduce noise, we take the average of 1000 random runs of the algorithm. As shown in Figure 6, SProx Skip maintains linear convergence until it reaches a flat error, which can be further reduced by decreasing both γ and p.

6.2. Prox Skip-VR

In order to be compatible with various variance reduction techniques, Prox Skip-VR uses a stochastic gradient g(xt, yt, ξt) similar to SProx Skip instead of the full gradient f(xt). It satisfies the following assumption:

Tighter Analysis for Prox Skip

Assumption 6.8 (Unbiasedness). For all t 0, gt = g(xt, yt, ξt) is an unbiased estimator of the gradient f(xt). That is, E[gt | xt, yt] = f(xt)

Assumption 6.9 (Variance reduction). There exist constants A 0 and C 0 such that for all t 0

E[ gt f(x ) 2 | xt, yt] 2ADf(xt, x ) + Bσt + C

E[σt+1 | xt, yt] 2 ADf(xt, x ) + Bσt + C

The Lyapunov function for Prox Skip-VR is given as:

Ψold t = xt x 2 + γ2

p2 ht h 2 + γ2Wσt

and its convergence analysis is as follows:

Theorem 6.10. Under Assumptions 6.8 and 6.9 with B = 0 and a number W > B/(1 B), let 0 < γ 1/(A + W A) and 0 < p 1. Then, the iterates of Prox Skip-VR (Algorithm 6) satisfy

E Ψold T (1 g ζold)T Ψold 0 + γ2(C + W C)

ζold = min(γµ(2 γ(A + W A)), p2).

g ζold = min(ζold, 1 (B + W B)/W)

γcrit is the root for two branches of ζold to be equal if it exists, otherwise positive infinity.

Remark 6.11. In Theorem 6.10, we didn t discuss the condition of B = 0 as in (Malinovsky et al., 2022), because in that case, Assumption 6.9 for Prox Skip-VR degenerates into Assumption 6.2 for SProx Skip, therefore the Lyapunov function and analysis in Theorems 6.4 and 6.6 applies.

Our new Lyapunov function and convergence analysis are presented as follows:

Ψnew t = xt x 2 + γ2

p xt x , ht h + γ2Wσt

Theorem 6.12. Under the same assumption as Theorem 6.10, if γ γcrit, let = 0. Otherwise, let be the unique solution in (0, 1) of following equation:

p + 1 2 + Lµγ2

γµ(2 γ(A + W A)) p2

Then, the iterates of Prox Skip-VR (Algorithm 6) satisfy

E [Ψnew T ] (1 g ζnew)T Ψnew 0 + γ2(C + W C)

( γµ(2 γ(A + W A)) if γ γcrit p2 + (p p2) if γ > γcrit.

g ζnew = min(ζnew, 1 (B + W B)/W)

Prox Skip-VR encompasses a range of algorithms such as Prox Skip-HUB and Prox Skip-LSVRG as special cases (Malinovsky et al., 2022). We do not elaborate further on each of them individually since their convergence can be derived from the general description in Theorem 6.12.

6.3. Grad Skip+

As a general theoretical framework, Grad Skip+ is built on the foundation of unbiased compressors which satisfy the following condition, where I is identity matrix: Definition 6.13 (Unbiased Compressors). For any positive semidefinite matrix Ω 0, denote by Bd(Ω) the class of (possibly randomized) unbiased compression operators C : Rd Rd such that for all x Rd we have

E[C(x)] = x, E[ (I + Ω) 1C(x) 2] x 2 (I+Ω) 1.

The class Bd(Ω) is a generalization of commonly used class Bd(ω) of unbiased compressors with variance bound

E[ C(x) 2] (1 + ω) x 2

for some scalar ω 0.

The analysis of Grad Skip+ is further based on matrix smoothness, a generalization of Lipschitz-smoothness, defined as: Definition 6.14 (Matrix Smoothness). A differentiable function f : Rd R is called L-smooth with some symmetric and positive definite matrix L > 0 if Df(x, y) 1 2 x y 2 L, x, y Rd.

The old analysis has the Lyapunov function:

Ψold = xt x 2 + γ2(1 + ω2) ht h 2,

and its convergence is as follows: Theorem 6.15. Let f be µ-strongly convex with positive µ > 0 and L-smooth, Cω Bd(ω) and CΩ Bd(Ω) be the compression operators, and

Ω:= I + ω(ω + 2)Ω(I + Ω) 1.

Then, if the stepsize γ λ 1 max(L Ω), the iterates of Grad Skip+ (Algorithm 7) satisfy

E[Ψold t+1] (1 ζold)Ψold t ,

γµ(2 γλmax(L Ω)), λmin( Ω)

γcrit is the root for two branches to be equal if it exists, otherwise positive infinity.

Tighter Analysis for Prox Skip

Remark 6.16. Again, the bound in Theorem 6.15 is slightly improved compared to original result in (Maranjyan et al., 2022).

The new analysis with the Lyapunov function and convergence guarantee is presented below:

Ψnew t = xt x 2 + γ2(1 + ω)2 ht h 2

2 γ(1 + ω) xt x , ht h

Theorem 6.17. Under the same assumption as Theorem 6.15, if γ γcrit, let = 0 and Ψnew t is the same as Ψold t .

When γ > γcrit, let

Ω:= (1 + ω )I + ω(ω + 2 )Ω(I + Ω) 1

α = γ2λmax(L Ω)

ζ = λmin( Ω)

β = 2γ ( (ζ (ω + 1) 1) + 1) α

There always exists some (0, 1) such that

(βµ ζ)I (ωI + (Ω+ I) (ω + 1) (1 ζ))2

ω (Ω+ I) 0, (6)

and for any that satisfy above condtion, we have

E[Ψnew t+1] (1 ζnew)Ψnew t ,

ζnew = ζ > ζold.

Remark 6.18. The optimal choice of is the largest value such that inequality (6) holds. We didn t compute the optimal value as it involves an intricate optimization. However, in practice, the optimal can be easily determined with bisect search. It might also be easy to solve in special cases if L and Ωhave special structure.

Note that Theorem 6.17 and Theorem 3.1 for Prox Skip are slightly different in their formula since certain simplifications cannot be made for Grad Skip+ due to L being a matrix. Grad Skip+ encompasses a range of algorithms such as Grad Skip(Maranjyan et al., 2022), Prox Skip and Rand Prox-FB (Condat & Richt arik, 2022) as special cases.

7. Conclusion

In this paper, we proposed a new way to improve the theoretical analysis of Prox Skip and provided deeper understanding on Prox Skip technique. In the context of federated learning, fewer proximal operator steps mean lower communication complexity. Through our discussion, we provided a lower proximal operator complexity when the proximal operator is the bottleneck. Our analysis revealed that in local training methods like Prox Skip, the step size should not be too

large, especially when the frequency of proximal operators is relatively low. We for the first time demonstrated that SProx Skip, a local training method using stochastic gradients, can train to a certain accuracy with communication complexity logarithmically dependent on the accuracy. Finally, we showed that our proof technique can be directly extended to many other variants of Prox Skip.

Acknowledgement

This work was partially supported by NSF IIS 1838627, 1837956, 1956002, 2211492, CNS 2213701, CCF 2217003, DBI 2225775.

Bao, R., Wu, X., Xian, W., and Huang, H. Doubly sparse asynchronous learning for stochastic composite optimization. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI, pp. 1916 1922, 2022.

Cand es, E. and Recht, B. Exact matrix completion via convex optimization. Communications of the ACM, 55 (6):111 119, 2012.

Cand es, E. J., Li, X., Ma, Y., and Wright, J. Robust principal component analysis? Journal of the ACM (JACM), 58(3): 1 37, 2011.

Combettes, P. L. and Wajs, V. R. Signal recovery by proximal forward-backward splitting. Multiscale modeling & simulation, 4(4):1168 1200, 2005.

Condat, L. and Richt arik, P. Randprox: Primal-dual optimization algorithms with randomized proximal updates. In OPT 2022: Optimization for Machine Learning (Neur IPS 2022 Workshop), 2022.

Gorbunov, E., Hanzely, F., and Richt arik, P. Local sgd: Unified theory and new efficient methods. In International Conference on Artificial Intelligence and Statistics, pp. 3556 3564. PMLR, 2021.

Karimireddy, S. P., Kale, S., Mohri, M., Reddi, S., Stich, S., and Suresh, A. T. Scaffold: Stochastic controlled averaging for federated learning. In International Conference on Machine Learning, pp. 5132 5143. PMLR, 2020.

Khaled, A., Mishchenko, K., and Richt arik, P. First analysis of local gd on heterogeneous data. ar Xiv preprint ar Xiv:1909.04715, 2019.

Li, X., Huang, K., Yang, W., Wang, S., and Zhang, Z. On the convergence of fedavg on non-iid data. In International Conference on Learning Representations, 2020.

Tighter Analysis for Prox Skip

Lin, T., Stich, S. U., Patel, K. K., and Jaggi, M. Don t use large mini-batches, use local sgd. ar Xiv preprint ar Xiv:1808.07217, 2018.

Lustig, M., Donoho, D., and Pauly, J. M. Sparse mri: The application of compressed sensing for rapid mr imaging. Magnetic Resonance in Medicine: An Official Journal of the International Society for Magnetic Resonance in Medicine, 58(6):1182 1195, 2007.

Malinovsky, G., Yi, K., and Richt arik, P. Variance reduced proxskip: Algorithm, theory and application to federated learning. Neur IPS, 2022.

Mangasarian, O. L. and Solodov, M. V. Backpropagation convergence via deterministic nonmonotone perturbed minimization. Advances in Neural Information Processing Systems, 6, 1993.

Maranjyan, A., Safaryan, M., and Richt arik, P. Gradskip: Communication-accelerated local gradient methods with better computational complexity. ar Xiv preprint ar Xiv:2210.16402, 2022.

Mc Donald, R., Hall, K., and Mann, G. Distributed training strategies for the structured perceptron. In Human language technologies: The 2010 annual conference of the North American chapter of the association for computational linguistics, pp. 456 464, 2010.

Mc Mahan, H. B., Moore, E., Ramage, D., and y Arcas, B. A. Federated learning of deep networks using model averaging. ar Xiv preprint ar Xiv:1602.05629, 2, 2016.

Mishchenko, K., Malinovsky, G., Stich, S., and Richt arik, P. Proxskip: Yes! local gradient steps provably lead to communication acceleration! finally! In International Conference on Machine Learning, ICML 2022, volume 162 of Proceedings of Machine Learning Research, pp. 15750 15769. PMLR, 2022.

Mitra, A., Jaafar, R., Pappas, G. J., and Hassani, H. Linear convergence in federated learning: Tackling client heterogeneity and sparse gradients. Advances in Neural Information Processing Systems, 34:14606 14619, 2021.

Nesterov, Y. Introductory lectures on convex programming volume i: Basic course. Lecture notes, 3(4):5, 1998.

Nesterov, Y. Gradient methods for minimizing composite functions. Mathematical programming, 140(1):125 161, 2013.

Parikh, N., Boyd, S., et al. Proximal algorithms. Foundations and trends in Optimization, 1(3):127 239, 2014.

Passty, G. B. Ergodic convergence to a zero of the sum of monotone operators in hilbert space. Journal of Mathematical Analysis and Applications, 72(2):383 390, 1979.

Stich, S. U. Local sgd converges fast and communicates little. ar Xiv preprint ar Xiv:1805.09767, 2018.

Tibshirani, R. Regression shrinkage and selection via the lasso: a retrospective. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(3):273 282, 2011.

Wu, X., Huang, F., Hu, Z., and Huang, H. Faster adaptive federated learning. ar Xiv preprint ar Xiv:2212.00974, 2022.

Zhang, J., De Sa, C., Mitliagkas, I., and R e, C. Parallel sgd: When does averaging help? ar Xiv preprint ar Xiv:1606.07365, 2016.

Tighter Analysis for Prox Skip

A. Prox Skip

A.1. Useful Lemmas

We define τ := γ

Proof of Lemma 5.1. Define

P(x) := proxτψ(x),

Q(x) := x P(x).

Due to firm nonexpansiveness, we have for any x, y:

P(x) P(y) 2 + Q(x) Q(y) 2 x y 2.

x = ˆxt+1 τht,

Notice that P(y) = x , we have

xt+1 x 2 + ˆxt+1 xt+1 τht + τh 2 (ˆxt+1 x ) τ(ht h ) 2.

According to Line 10 in Algorithm 1, ˆxt+1 xt+1 τht = τht+1,

Therefore we have xt+1 x 2 + τ 2 ht+1 h 2 (ˆxt+1 x ) τ(ht h ) 2

Another proof of Lemma 5.1 based on Lemmas 5.2 and 5.3. Due to Lemma 5.2, we have

τ(ht h ) (ˆxt+1 x ) 2 = τ(ht+1 h ) (xt+1 x ) 2

=τ 2 ht+1 h 2 + xt+1 x 2 2τ ht+1 h , xt+1 x

τ 2 ht+1 h 2 + xt+1 x 2.

The last equation is due to Lemma 5.3.

Proof of Lemma 5.2. According to Line 10 in Algorithm 1,

τht+1 xt+1 = τht ˆxt+1,

τ(ht+1 h ) (xt+1 x ) = τ(ht h ) (ˆxt+1 x ).

Proof of Lemma 5.3. According to Line 6 in Algorithm 1,

(ˆxt+1 τht) xt+1 τ ψ(xt+1).

According to Line 10 in Algorithm 1, ht+1 ψ(xt+1).

Similarly, we have h = f(x ) ψ(x ).

Since ψ is closed convex function, its subgradient ψ is a monotone operator, therefore

( ht+1) ( h ), xt+1 x 0.

Tighter Analysis for Prox Skip

A.2. Old Analysis

Proof of Theorem 2.1.

Ψold t+1 = xt+1 x 2 + τ 2 ht+1 h 2

= (xt+1 x ) τ(ht+1 h ) 2 + 2τ xt+1 x , ht+1 h .

Step 1 (expand the proximal operator) If θt = 1, according to Lemmas 5.2 and 5.3,

Ψold t+1|θt=1 (ˆxt+1 x ) τ(ht h ) 2.

Ψold t+1|θt=0 = (ˆxt+1 x ) τ(ht h ) 2 + 2τ ˆxt+1 x , ht h .

Taking expectation gives

E[Ψold t+1] (ˆxt+1 x ) τ(ht h ) 2 + 2τ(1 p) ˆxt+1 x , ht h . (7)

Step 2 (expand the gradient descent) Expand ˆxt+1 according to Line 3 in Algorithm 1 gives

E[Ψold t+1] (1 p2)τ 2 ht h 2 + xt x 2

2γ xt x , f(xt) f(x ) + γ2 f(xt) f(x ) 2.

Step 3 (apply strongly convexity and smoothness) We have

xt x , f(xt) f(x ) 1

L f(xt) f(x ) 2,

xt x , f(xt) f(x ) µ xt x 2.

Apply the above two inequalities with additional multipliers α and β, we have

E[Ψold t+1] (1 p2)τ 2 ht h 2 + (1 βµ) xt x 2

L) f(xt) f(x ) 2

+ (α + β 2γ) xt x , f(xt) f(x ) .

α + β 2γ = 0,

which gives α = Lγ2 and β = γ(2 γL). Then we have

E[Ψold t+1] (1 p2)τ 2 ht h 2 + (1 γµ(2 γL)) xt x 2

(1 ζold)Ψold t ,

ζold = min(γµ(2 γL), p2).

A.3. New Analysis

Proof of Theorem 3.1. When γ γcrit, we have = 0, so the Lyapunov function and the convergence rate is same as Theorem 2.1. Therefore, we only need to focus on γ > γcrit case. The following analysis only assumes | | < 1, such that Lyapunov function is positive definite.

Ψnew t+1 = xt+1 x 2 + τ 2 ht+1 h 2 2 τ xt+1 x , ht+1 h

= (xt+1 x ) τ(ht+1 h ) 2 + 2(1 )τ xt+1 x , ht+1 h .

Tighter Analysis for Prox Skip

Step 1 (expand the proximal operator) If θt = 1, according to Lemmas 5.2 and 5.3,

Ψnew t+1|θt=1 (ˆxt+1 x ) τ(ht h ) 2.

Ψnew t+1|θt=0 = (ˆxt+1 x ) τ(ht h ) 2 + 2(1 )τ ˆxt+1 x , ht h .

Taking expectation gives

E[Ψnew t+1] (ˆxt+1 x ) τ(ht h ) 2 + 2(1 )τ(1 p) ˆxt+1 x , ht h . (8)

Step 2 (expand the gradient descent) Expand ˆxt+1 according to Line 3 in Algorithm 1 gives

E[Ψnew t+1] xt+1 x 2

+(2 γ2 2 γτ γ2 + τ 2) ht h 2

+γ2 f(xt) f(x ) 2

+2 γ(τ γ) f(xt) f(x ), ht h

2γ f(xt) f(x ), xt x

2 (τ γ) ht h , xt x .

Step 3 (apply strongly convexity and smoothness) We have

xt x , f(xt) f(x ) 1

L f(xt) f(x ) 2,

xt x , f(xt) f(x ) µ xt x 2.

, Additionally, we have c1(ht h ) c2(xt x ) c3( f(xt) f(x )) 2 0.

We also have xt x 2 + τ 2 ht h 2 2 τ xt x , ht h 2 = Ψnew t .

Apply the above three inequalities and one equality with additional multipliers α, β, 1 and (1 ζ), we have

E[Ψnew t+1] ζ + c2 2 βµ xt+1 x 2

+ 2 γ2 2 γτ γ2 + ζτ 2 + c2 1 ht h 2

+ γ2 + c2 3 α

f(xt) f(x ) 2

2 γ2 γτ + c1c3 f(xt) f(x ), ht h

+ (α + β 2γ + 2c2c3) f(xt) f(x ), xt x

+ 2 ( γ ζτ c1c2) ht h , xt x

+ (1 ζ)Ψnew t .

α + β 2γ + 2c2c3 = 0, (9)

γ2 + c2 3 α

L = 0, (10)

ζ + c2 2 βµ = 0, (11)

2 γ2 2 γτ γ2 + ζτ 2 + c2 1 = 0, (12)

γ2 γτ + c1c3 = 0, (13)

γ ζτ c1c2 = 0. (14)

Tighter Analysis for Prox Skip

Solving c1, c2, c3, α and β on Equations (10) to (14) gives

2 γ2 + 2 γτ + γ2 ζτ 2,

c2 = (γ ζτ)

c3 = γ (τ γ)

α = L(γ2 + c2 3),

β = ζ + c2 2 µ .

Applying above solution into Equation (9) gives

E1 :=Lγ4µ 2Lγ3µτ + Lγ2µτ 2 2γ3µ + 2γ2µζτ + 2γ2µτ + γ2 2γµζτ 2 2γζτ + ζ2τ 2

+ 2Lγ4µ + 2Lγ3µτ + 4γ3µ 4γ2µτ 2γ2ζ + 2γζτ

+ Lγ4µ Lγ2µζτ 2 2γ3µ + γ2ζ + 2γµζτ 2 ζ2τ 2

Step 4 (optimize free parameter ) E1(ζ, ) = 0 is an implicit function, and we wish the decreasing speed ζ to be optimized. Therefore, we require

which further implies Lγ2µ 2γµ + ζ γ2 γτ γ2 + ζτ 2 =0.

Two roots are

ζ = γµ(2 γL),

ζ = γ(γ + (τ γ)))

If we pick the first root, E1 = 0 implies Lγ2µ 2γµ + 1 Lµτ 2 2µτ + 1 = 0,

which is impossible because Lγ2µ 2γµ + 1 Lµτ 2 2µτ + 1 > (1 γµ)2(1 µτ)2 0.

Therefore, we proceed with the second root. In that case E1 = 0 implies

E3 := 3 2 (µτ + 1) 2 + Lµτ 2 + 2µτ + 2 µτ 2(2 γL) γ

τ γ = 0. (15)

The existence and uniqueness of the solution is discussed in Lemma A.1.

Lemma A.1. Equation (15) contains exactly one root in interval (0, 1) and doesn t have root in ( 1, 0].

Proof of Lemma A.1. We first prove that E3 is monotonically increasing in ( 1, 1).

d is a quadratic function for . Its minimum is achieved at = 2

Tighter Analysis for Prox Skip

3(1 + µτ) < 1, then the minimum is E3| = 2

3 (1+µτ) = Lµτ 2 4µ2τ 2

4. The first inequality follows L > µ and second follows 0 < µτ < 1

3(1 + µτ) 1, then inf ( 1,1) d E3

d | =1 = Lµτ 2 2µτ + 1 > µ2τ 2 2µτ + 1 0.

Due to monotonic property, there is at most one root in ( 1, 1).

E3| =0 = µτ 2(2 γL) γ

τ γ = γµ(2 γL) p2

E3| =1 = τ τ γ (1 2µτ + Lµτ 2) > τ τ γ (1 2µτ + µ2τ 2) 0.

Therefore there is exactly one root in (0, 1).

A.4. Parameter Selection

Proof of Theorem 3.2. If p q

1 κ, then we can always choose γ = 1

L to obtain best ζnew.

Otherwise, let i and γi be the result of Lines 4 and 5 in Algorithm 2 after (i + 1)-th iteration for i N. We also use γ 1 to denote the initial step size. Additionally, we use E3( , γ) to denote the left hand side of Equation (3).

Note that according to the discussion in Lemma A.1, E3( , γ) is a monotonically increasing function for (0, 1). Moreover, E3( , γ) is a quadratic and convex function for γ.

The iterate in Algorithm 2 can be rewritten as

i+1 = Solve E3( , γi) = 0,

γi+1 = argmin γ E3( i+1, γ).

Since i is obtained by finding a root, we have E3( i, γi 1) = 0. Since γi is obtained by minimization, we have E3( i, γi) 0. We also have E3( i+1, γi) = 0. According to the fact that E3( , γ) is monotonically increasing function for (0, 1), we have i+1 > i.

Since i < 1 all the time, we can establish the convergence of .

Next, the function E3( , ) is always strongly convex and converge uniformly, therefore γi = argminγ E3( i, γ) also converge.

Finally, we have converge to the maximum possible value, therefore ζnew = p2 + (p p2) will also converge to its maximum.

A.5. Toy Model Setup

The following definitions are used for producing Figures 2, 3 and 6.

f(x) = µ(κ 1)

i=1 (x(i) x(i+1))2 + x(d) #

( 0 if x(1) = 0 + otherwise ,

d = 10, κ = 10, µ = 0.1.

The global optimum is x = 0. The initial point is x0 = [1, 0, 0, . . . ] and h0 = 0.

When SProx Skip is used, we set gt(xt) = f(xt) + 0.1 e where e N(0, I).

Tighter Analysis for Prox Skip

A.6. Continuous Limit

Proof of Corollary 4.1. By substituting p = γ/τ, we have

3 2 (µτ + 1) 2 + Lµτ 2 + 2µτ + 2 µτ 2(2 γL) γ

Taking the limit of γ 0, we get the condition Equation (4).

Proof of Corollary 4.2. By substituting τ = 1

(1 )), we have

2 (κ 1) 2 2 + 2 (1 )2 = 0.

B. SProx Skip

B.1. Algorithm

Algorithm 5 SProx Skip

Stepsize γ > 0, probability p > 0, initial iterate x0 Rd, initial control variate h0 Rd, number of iterations T 1 for t = 0, 1, ..., T 1 do

ˆxt+1 = xt γ(gt(xt) ht) Take a gradient-type step adjusted via the control variate ht Flip a coin θt {0, 1} where Prob(θt = 1) = p Flip a coin that decides whether to skip the prox or not if θt = 1 then

xt+1 = prox γ

p ψ(ˆxt+1 γ

pht) Apply prox, but only very rarely! (with small probability p) else

xt+1 = ˆxt+1 Skip the prox! end if

ht+1 = ht + p

γ (xt+1 ˆxt+1) Update the control variate ht end for

B.2. Old Analysis

Proof of Theorem 6.4. Based on the same argument as step 1 in Appendix A.2, we have following middle result same as Equation (7): E[Ψold t+1] (ˆxt+1 x ) τ(ht h ) 2 + 2τ(1 p) ˆxt+1 x , ht h . (16)

Step 2 (expand the gradient descent) Expand ˆxt+1 according to Line 3 in Algorithm 5 gives

E[Ψold t+1] (1 p2)τ 2 ht h 2 + xt x 2

2γ xt x , E[gt(xt)] f(x ) + γ2E gt(xt) f(x ) 2

(1 p2)τ 2 ht h 2 + xt x 2

2γ xt x , f(xt) f(x ) + 2γ2ADf(xt, x ) + γ2C.

Step 3 (apply strongly convexity and smoothness) We have

xt x , f(xt) f(x ) Df(xt, x ) + µ

xt x , f(xt) f(x ) µ xt x 2.

Apply the above two inequalities with additional multipliers β1 and β2, we have

E[Ψold t+1] (1 p2)τ 2 ht h 2 + (1 (β1/2 + β2)µ) xt x 2

+ (β1 + β2 2γ) xt x , f(xt) f(x )

+ (2γ2A β1)Df(xt, x ) + γ2C.

Tighter Analysis for Prox Skip

2γ2A β1 = 0,

β1 + β2 2γ = 0,

which gives β1 = 2γ2A and β2 = 2γ(1 γA). Then we have

E[Ψold t+1] (1 p2)τ 2 ht h 2 + (1 γµ(2 γA)) xt x 2 + γ2C

(1 ζold)Ψold t + γ2C,

ζold = min(γµ(2 γA), p2).

Apply Gronwall s inequality, we have

E Ψold T (1 ζold)T Ψold 0 + γ2C

B.3. New Analysis

Proof of Theorem 6.6. Based on the same argument as step 1 in Appendix A.3, we have following middle result same as Equation (8):

E[Ψnew t+1] (ˆxt+1 x ) τ(ht h ) 2 + 2(1 )τ(1 p) ˆxt+1 x , ht h . (17)

Step 2 (expand the gradient descent) Expand ˆxt+1 according to Line 3 in Algorithm 5 gives

E[Ψnew t+1] γ2E gt(xt) f(x ) 2 2γ xt x , E[gt(xt)] f(x )

+ 2γ (τ γ) ht h , E[gt(xt)] f(x )

+ 2 (γ τ) ht h , xt x

+ (γ τ) (γ (2 1) τ) ht h 2 + xt x 2

2γ (τ γ) f(xt) f(x ), ht h

2γ f(xt) f(x ), xt x

+ (γ τ) (γ (2 1) τ) ht h 2

+ 2 (γ τ) ht h , xt x

+ (xt x )2 + 2Aγ2Df(x, x ) + Cγ2.

Step 3 (apply strongly convexity and smoothness) We have

xt x , f(xt) f(x ) 1

L f(xt) f(x ) 2,

xt x , f(xt) f(x ) Df(xt, x ) + µ

xt x , f(xt) f(x ) µ xt x 2.

Additionally, we have c1(ht h ) c2(xt x ) c3( f(xt) f(x )) 2 0.

We also have xt x 2 + τ 2 ht h 2 2 τ xt x , ht h 2 = Ψnew t .

Tighter Analysis for Prox Skip

Apply the above four inequalities and one equality with additional multipliers α, β1, β2, 1 and (1 ζ), we have

E[Ψnew t+1] ζ + c2 2 (β1/2 + β2)µ xt+1 x 2

+ 2 γ2 2 γτ γ2 + ζτ 2 + c2 1 ht h 2

f(xt) f(x ) 2

2 γ2 γτ + c1c3 f(xt) f(x ), ht h

+ (α + β1 + β2 2γ + 2c2c3) f(xt) f(x ), xt x

+ 2 ( γ ζτ c1c2) ht h , xt x

+ (2γ2A β1)Df(xt x )

+ (1 ζ)Ψnew t + γ2C.

α + β1 + β2 2γ + 2c2c3 = 0, (18)

2 γ2 2 γτ γ2 + ζτ 2 + c2 1 = 0, (19)

L = 0, (20)

ζ + c2 2 (β1/2 + β2)µ = 0, (21)

γ2 γτ + c1c3 = 0, (22)

γ ζτ c1c2 = 0. (23)

2γ2A β1 = 0, (24)

Solving c1, c2, c3, α, β1 and β2 on Equations (19) to (24) gives

2 γ2 + 2 γτ + γ2 ζτ 2,

c2 = (γ ζτ)

c3 = γ (τ γ)

β2 = Aγ2µ + ζ + c2 2 µ .

Applying above solution into Equation (18) gives

E1 :=Lγ4µ 2Lγ3µτ + Lγ2µτ 2 2γ3µ + 2γ2µζτ + 2γ2µτ + γ2

2γµζτ 2 2γζτ + ζ2τ 2

+ 2Aγ4µ + 2Aγ3µτ + 4γ3µ 4γ2µτ 2γ2ζ + 2γζτ

+ Aγ4µ Aγ2µζτ 2 2γ3µ + γ2ζ + 2γµζτ 2 ζ2τ 2

Step 4 (optimize free parameter ) E1(ζ, ) = 0 is an implicit function, and we wish the decreasing speed ζ to be optimized. Therefore, we require

Tighter Analysis for Prox Skip

which further implies

Aγ2µ 2γµ + ζ γ2 γτ γ2 + ζτ 2 = 0.

According to the analysis for special case in Appendix A.3, the first root γµ(2 γA) is discarded. And we proceed with second root ζ = γ(γ+ (τ γ)))

τ 2 . Then, E1 = 0 implies

3 2 (µτ + 1) 2 + Lµτ 2 + 2µτ + 2 µτ 2(2 γA) γ

τ γ = 0. (25)

Finally, we have

E[Ψnew t+1] (1 ζ)Ψnew t + γ2C.

Apply Gronwall s inequality, we have

E [Ψnew T ] (1 ζnew)T Ψnew 0 + γ2C

B.4. Asymptotic Convergence Rate

Proof of Corollary 6.7. Let E3 be the left hand side of optimality condition Equation (5) for SProx Skip, and EODE 3 be left hand side of optimality condition Equation (4) for ODEProx, we have E3 = EODE 3 + 1 2µτ+Aµτ 2

(1 p)2 p+O p2 = EODE 3 +O(p).

Let and ODE be the unique solution in (0, 1) for optimality condition E3 = 0 and EODE 3 = 0 separately and ζ = ζnew

for abbreviation, then we have = ODE + O(p) and ζ = ODEp + O p2 .

For each terms in upper bound of E[Ψnew T ] in Theorem 6.6, we have

(1 ζnew)T Ψnew 0 = h 1 p ODE + O p2 1 p ODE ilog Ψ0

e + O(p) log Ψ0

=ε(1 + O(p))

=ε + O ε2 ,

γ2C ζnew = τ 2p2C p ODE + O(p2) = τ 2C

ODE p + O p2 = ε + O ε2 .

Finally we have

E[Ψnew T ] 2ε + O ε2 . (26)

The oracle complexity of proximal operator is p T = 1 ODE log Ψnew 0 ε = Θ(1/ ODE). According the analysis of ODEProx, when κ is large, we have ODE = Θ(1/ κ).

Tighter Analysis for Prox Skip

C. Prox Skip-VR

C.1. Algorithm

Algorithm 6 Prox Skip-VR

1: Stepsize γ > 0, probability p > 0, initial iterate x0 Rd, initial control vector y0 Rd, initial control variate h0 Rd, number of iterations T 1 2: for t = 0, 1, ..., T 1 do 3: gt = g(xt, yt, ξt) 4: ˆxt+1 = xt γ(gt ht) Take a gradient-type step adjusted via the control variate ht 5: Construct new control vector yt+1 6: Flip a coin θt {0, 1} where Prob(θt = 1) = p Flip a coin that decides whether to skip the prox or not 7: if θt = 1 then Apply prox, but only very rarely! (with small probability p) 8: xt+1 = prox γ

p ψ(ˆxt+1 γ

9: else 10: xt+1 = ˆxt+1 Skip the prox! 11: end if 12: ht+1 = ht + p

γ (xt+1 ˆxt+1) Update the control variate ht 13: end for

C.2. Old Analysis

Proof of Theorem 6.10. Based on the same argument as step 1 in Appendix A.2, we have following middle result similar Equation (7):

E[Ψold t+1] (ˆxt+1 x ) τ(ht h ) 2 + 2τ(1 p) ˆxt+1 x , ht h

+ γ2WE[σt+1].

Step 2 (expand the gradient descent) Expand ˆxt+1 according to Line 4 in Algorithm 6 gives

E[Ψold t+1] (1 p2)τ 2 ht h 2 + xt x 2

2γ xt x , E[gt] f(x ) + γ2E gt f(x ) 2

+ γ2WE[σt+1]

(1 p2)τ 2 ht h 2 + xt x 2

2γ xt x , f(xt) f(x ) + 2γ2(A + W A)Df(xt, x )

+ γ2(C + W C) + γ2(B + W B)σt.

Step 3 (apply strongly convexity and smoothness) We have

xt x , f(xt) f(x ) Df(xt, x ) + µ

xt x , f(xt) f(x ) µ xt x 2.

Apply the above two inequalities with additional multipliers β1 and β2, we have

E[Ψold t+1] (1 p2)τ 2 ht h 2 + (1 (β1/2 + β2)µ) xt x 2

+ (β1 + β2 2γ) xt x , f(xt) f(x )

+ (2γ2(A + W A) β1)Df(xt, x ) + γ2(C + W C)

+ γ2(B + W B)σt.

Tighter Analysis for Prox Skip

2γ2A β1 = 0, β1 + β2 2γ = 0,

which gives β1 = 2γ2(A + W A) and β2 = 2γ(1 γ(A + W A)). Then we have

E[Ψold t+1] (1 p2)τ 2 ht h 2 + (1 γµ(2 γ(A + W A))) xt x 2

+ γ2(B + W B)σt + γ2(C + W C)

(1 g ζold)Ψold t + γ2(C + W C),

g ζold = min(γµ(2 γ(A + W A)), p2, 1 (B + W B)/W).

Apply Gronwall s inequality, we have

E Ψold T (1 g ζold)T Ψold 0 + γ2(C + W C)

C.3. New Analysis

Proof of Theorem 6.12. Based on the same argument as step 1 in Appendix A.3, we have following middle result similar to Equation (8):

E[Ψnew t+1] (ˆxt+1 x ) τ(ht h ) 2 + 2(1 )τ(1 p) ˆxt+1 x , ht h

+ γ2WE[σt+1].

Step 2 (expand the gradient descent) Expand ˆxt+1 according to Line 4 in Algorithm 6 gives

E[Ψnew t+1] γ2E gt(xt) f(x ) 2 2γ xt x , E[gt] f(x )

+ 2γ (τ γ) ht h , E[gt] f(x )

+ 2 (γ τ) ht h , xt x

+ (γ τ) (γ (2 1) τ) ht h 2 + xt x 2

+ γ2WE[σt+1]

2γ (τ γ) f(xt) f(x ), ht h

2γ f(xt) f(x ), xt x

+ (γ τ) (γ (2 1) τ) ht h 2

+ 2 (γ τ) ht h , xt x

+ (xt x )2 + 2(A + W A)γ2Df(x, x ) + γ2(C + W C)

+ γ2(B + W B)σt.

Step 3 (apply strongly convexity and smoothness) We have

xt x , f(xt) f(x ) 1

L f(xt) f(x ) 2,

xt x , f(xt) f(x ) Df(xt, x ) + µ

xt x , f(xt) f(x ) µ xt x 2.

Tighter Analysis for Prox Skip

Additionally, we have c1(ht h ) c2(xt x ) c3( f(xt) f(x )) 2 0.

We also have xt x 2 + τ 2 ht h 2 2 τ xt x , ht h 2 = Ψnew t .

Apply the above four inequalities and one equality with additional multipliers α, β1, β2, 1 and (1 ζ), we have

E[Ψnew t+1] ζ + c2 2 (β1/2 + β2)µ xt+1 x 2

+ 2 γ2 2 γτ γ2 + ζτ 2 + c2 1 ht h 2

f(xt) f(x ) 2

2 γ2 γτ + c1c3 f(xt) f(x ), ht h

+ (α + β1 + β2 2γ + 2c2c3) f(xt) f(x ), xt x

+ 2 ( γ ζτ c1c2) ht h , xt x

+ (2γ2(A + W A) β1)Df(xt x )

+ (1 ζ)Ψnew t + γ2(C + W C)

+ γ2(B + W B)σt.

α + β1 + β2 2γ + 2c2c3 = 0, (27)

2 γ2 2 γτ γ2 + ζτ 2 + c2 1 = 0, (28)

L = 0, (29)

ζ + c2 2 (β1/2 + β2)µ = 0, (30)

γ2 γτ + c1c3 = 0, (31)

γ ζτ c1c2 = 0. (32)

2γ2(A + W A) β1 = 0, (33)

Solving c1, c2, c3, α, β1 and β2 on Equations (28) to (33) gives

2 γ2 + 2 γτ + γ2 ζτ 2,

c2 = (γ ζτ)

c3 = γ (τ γ)

β1 = 2γ2(A + W A),

β2 = (A + W A)γ2µ + ζ + c2 2 µ .

Applying above solution into Equation (27) gives

E1 :=Lγ4µ 2Lγ3µτ + Lγ2µτ 2 2γ3µ + 2γ2µζτ + 2γ2µτ + γ2

2γµζτ 2 2γζτ + ζ2τ 2

+ 2A γ4µ + 2A γ3µτ + 4γ3µ 4γ2µτ 2γ2ζ + 2γζτ

+ A γ4µ A γ2µζτ 2 2γ3µ + γ2ζ + 2γµζτ 2 ζ2τ 2

For simplicity, in above formula, we use A = A + W A.

Tighter Analysis for Prox Skip

Step 4 (optimize free parameter ) E1(ζ, ) = 0 is an implicit function, and we wish the decreasing speed ζ to be optimized. Therefore, we require

which further implies A γ2µ 2γµ + ζ γ2 γτ γ2 + ζτ 2 = 0

According to the analysis for special case in Appendix A.3, the first root γµ(2 γA ) is discarded. And we proceed with second root ζ = γ(γ+ (τ γ)))

τ 2 . Then, E1 = 0 implies

3 2 (µτ + 1) 2 + Lµτ 2 + 2µτ + 2 µτ 2(2 γA ) γ

τ γ = 0. (34)

Finally, we have

E[Ψnew t+1] max((1 ζ), (B + W B)/W)Ψnew t + γ2(C + W C).

We define g ζnew = min(ζ, 1 (B + W B)/W).

Apply Gronwall s inequality, we have

E [Ψnew T ] (1 g ζnew)T Ψnew 0 + γ2(C + W C)

D. Grad Skip+

D.1. Algorithm

Algorithm 7 Grad Skip+

1: Parameters: stepsize γ > 0, compressors Cω Bd(ω) and CΩ Bd(Ω). 2: Input: initial iterate x0 Rd, initial control variate h0 Rd, number of iterations T 1. 3: for t = 0, 1, ..., T 1 do 4: ˆht+1 = f(xt) (I + Ω) 1CΩ( f(xt) ht) Update the shift ˆhi,t via shifted compression 5: ˆxt+1 = xt γ( f(xt) ˆht+1) Update the iterate ˆxi,t via shifted gradient step 6: ˆgt = 1 γ(1+ω)Cω ˆxt+1 proxγ(1+ω)ψ ˆxt+1 γ(1 + ω)ˆht+1 Estimate the proximal gradient 7: xt+1 = ˆxt+1 γˆgt Update the main iterate xi,t 8: ht+1 = ˆht+1 + 1 γ(1+ω) xt+1 ˆxt+1 Update the main shift hi,t 9: end for

D.2. Useful Lemmas

p := 1 1 + ω (0, 1],

τ := γ(1 + ω).

Lemma D.1. For any t 0,

τ(ht+1 h ) (xt+1 x ) 2 = τ(ˆht+1 h ) (ˆxt+1 x ) 2 .

Tighter Analysis for Prox Skip

Proof of Lemma D.1. According to Line 8 in Algorithm 7,

τht+1 xt+1 = τˆht+1 ˆxt+1,

τ(ht+1 h ) (xt+1 x ) = τ(ˆht+1 h ) (ˆxt+1 x ).

Lemma D.2. For any t 0, let ECω,t[ ] be the expectation over the randomness from unbiased compressor Cω at t-th step,

ECω,t xt+1 x , ht+1 h 1 1 1 + ω

ˆxt+1 x , ˆht+1 h .

Proof of Lemma D.2. We define

x+ t+1 = proxτψ ˆxt+1 τˆht+1 ,

h+ t+1 = ˆht+1 + 1

τ (x+ t+1 ˆxt+1).

According to Lemma 5.3, x+ t+1 x , h+ t+1 h 0.

Then we let s = 1 1+ωCω(ˆxt+1 x+ t+1), and we have

xt+1 = ˆxt+1 s,

ht+1 = ˆht+1 1

ECω,t[s] = 1 1 + ω (ˆxt+1 x+ t+1),

ECω,t s 2 1 1 + ω

ˆxt+1 x+ t+1 2.

ECω,t xt+1 x , ht+1 h

τ ˆxt+1 s x , τˆht+1 s h

τ ˆxt+1 ECω,t[s] x , τˆht+1 ECω,t[s] h

ECω,t s 2 ECω,t[s] 2

τ ˆxt+1 p(ˆxt+1 x+ t+1) x , τˆht+1 p(ˆxt+1 x+ t+1) h

τ p(1 p) ˆxt+1 x+ t+1 2

=(1 p) ˆxt+1 x , ˆht+1 h + p x+ t+1 x , h+ t+1 h

(1 p) ˆxt+1 x , ˆht+1 h .

D.3. Old Analysis

Proof of Theorem 2.1.

Ψold t+1 = xt+1 x 2 + τ 2 ht+1 h 2

= (xt+1 x ) τ(ht+1 h ) 2 + 2τ xt+1 x , ht+1 h .

Tighter Analysis for Prox Skip

Step 1 (expand the proximal operator) According to Lemmas D.1 and D.2,

ECω,t[Ψold t+1]

(ˆxt+1 x ) τ(ˆht+1 h ) 2 + 2τ(1 p) ˆxt+1 x , ˆht+1 h .

Step 2 (expand the gradient descent) Let

s = (I + Ω) 1CΩ( f(xt) ht),

ˆht+1 = f(xt) s,

ˆxt+1 = xt γs.

ECω,t[Ψold t+1]

(ˆxt+1 x ) τ(ˆht+1 h ) 2 + 2τ(1 p) ˆxt+1 x , ˆht+1 h

=τ 2 f(xt) f(x ) 2 2γ f(xt) f(x ), xt+1 x

+ xt+1 x 2 2(τ 2 γ2) f(xt) f(x ), s + (τ 2 γ2) s 2,

E[Ψold t+1]

τ 2 f(xt) f(x ) 2 2γ f(xt) f(x ), xt+1 x

+ xt+1 x 2 2(τ 2 γ2) f(xt) f(x ), E[s] + (τ 2 γ2)E s 2

(τ 2 γ2) f(xt) f(x ) 2 (I+Ω) 1 + τ 2 f(xt) f(x ) 2

2γ f(xt) f(x ), xt x + (τ 2 γ2) ht h 2 (I+Ω) 1 + xt x 2

= f(xt) f(x ), (τ 2I (τ 2 γ2)(I + Ω) 1)( f(xt) f(x ))

2γ f(xt) f(x ), xt x + (τ 2 γ2) ht h 2 (I+Ω) 1 + xt x 2

= f(xt) f(x ), (γ2 Ω)( f(xt) f(x ))

2γ f(xt) f(x ), xt x + (τ 2 γ2) ht h 2 (I+Ω) 1 + xt x 2.

The last step is due to the following equivalent formula of Ω:

Ω:= (1 + ω)2I ω(2 + ω)(I + Ω) 1.

Step 3 (apply strongly convexity and smoothness) We have

xt x , f(xt) f(x ) f(xt) f(x ) 2 L 1

xt x , f(xt) f(x ) µ xt x 2.

Apply the above two inequalities with additional multipliers α and β, we have

E[Ψold t+1] (1 p2)τ 2 ht h 2 (I+Ω) 1 + (1 βµ) xt x 2

+ f(xt) f(x ), (γ2 Ω αL 1)( f(xt) f(x ))

+ (α + β 2γ) xt x , f(xt) f(x ) .

Tighter Analysis for Prox Skip

λmax(γ2 Ω αL 1) 0,

The smallest α we that satisfy above condition is α = γ2λmax(L Ω). We also require α + β 2γ = 0, so we have

E[Ψold t+1] (1 p2)τ 2 ht h 2 (I+Ω) 1 + (1 γµ(2 γλmax(L Ω))) xt x 2

(1 p2)τ 2λmax((I + Ω) 1) ht h 2

+ (1 γµ(2 γλmax(L Ω))) xt x 2

(1 ζold)Ψold t

ζold = min γµ(2 γλmax(L Ω)), 1 1 p2

1 + λmin(Ω)

D.4. New Analysis

Proof of Theorem 3.1. We only need to focus on γ > γcrit case.

Ψnew t+1 = xt+1 x 2 + τ 2 ht+1 h 2 2 τ xt+1 x , ht+1 h

= (xt+1 x ) τ(ht+1 h ) 2 + 2(1 )τ xt+1 x , ht+1 h .

Step 1 (expand the proximal operator) According to Lemmas D.1 and D.2,

ECω,t[Ψnew t+1]

(ˆxt+1 x ) τ(ˆht+1 h ) 2 + 2(1 )τ(1 p) ˆxt+1 x , ˆht+1 h .

Step 2 (expand the gradient descent) Let

s = (I + Ω) 1Cω( f(xt) ht),

ˆht+1 = f(xt) s,

ˆxt+1 = xt γs,

ECω,t[Ψnew t+1]

(ˆxt+1 x ) τ(ˆht+1 h ) 2 + 2(1 )τ(1 p) ˆxt+1 x , ˆht+1 h

=τ 2 f(xt) f(x ) 2 2( (τ γ) + γ) f(xt) f(x ), xt x

2(τ γ)((1 )γ + τ) f(xt) f(x ), s

+ xt x 2 + 2 (τ γ) xt x , s + (τ γ)((1 2 )γ + τ) s 2,

Tighter Analysis for Prox Skip

E[Ψnew t+1]

τ 2 f(xt) f(x ) 2 2( (τ γ) + γ) f(xt) f(x ), xt x

2(τ γ)((1 )γ + τ) f(xt) f(x ), E[s]

+ xt x 2 + 2 (τ γ) xt x , E[s] + (τ γ)((1 2 )γ + τ)E s 2

f(xt) f(x ), (τ 2 (τ 2 γ2)(I + Ω) 1)( f(xt) f(x ))

+ 2 γ(τ γ) f(xt) f(x ), (I + Ω) 1(ht h )

+ f(xt) f(x ), ( 2γI 2 (τ γ)(I (I Ω) 1))(xt x )

+ (γ τ)(2 γ γ τ) ht h , (I + Ω) 1(ht h )

+ 2 (γ τ) ht h , (I + Ω) 1(xt x ) + xt x 2.

Step 3 (apply strongly convexity and smoothness) We have

xt x , f(xt) f(x ) f(xt) f(x ) 2 L 1,

xt x , f(xt) f(x ) µ xt x 2.

Additionally, for any positive semidefinite matrices c1, c2, c3, we have

c1(ht h ) c2(xt x ) c3( f(xt) f(x )) 2 0.

We also have xt x 2 + τ 2 ht h 2 2 τ xt x , ht h 2 = Ψnew t .

Apply the above three inequalities and one equality with additional multipliers α, β, 1 and (1 ζ), we have

E[Ψnew t+1]

τ 2 f(xt) f(x ) 2 2( (τ γ) + γ) f(xt) f(x ), xt x

2(τ γ)((1 )γ + τ) f(xt) f(x ), E[s]

+ xt x 2 + 2 (τ γ) xt x , E[s] + (τ γ)((1 2 )γ + τ)E s 2

f(xt) f(x ), (c2 3 + τ 2I (τ 2 γ2)(I + Ω) 1 L 1α)( f(xt) f(x ))

+ f(xt) f(x ), ( c3c1 + 2 γ(τ γ)(I + Ω) 1)(ht h )

+ f(xt) f(x ), ((α + β 2γ)I + 2c3c2 2 (τ γ)(I (I Ω) 1))(xt x )

+ ht h , ( τ 2(1 ζ)I + c2 1 + (γ τ)(2 γ γ τ)(I + Ω) 1)(ht h )

+ 2 ht h , ( τ(1 ζ)I c1c2 + (γ τ)(I + Ω) 1)(xt x )

+ xt x , ((1 βµ + ζ)I + c2 2)(xt x )

+ (1 ζ)Ψnew t .

c2 3 + τ 2I (τ 2 γ2)(I + Ω) 1 L 1α 0, (35)

τ 2(1 ζ)I + c2 1 + (γ τ)(2 γ γ τ)(I + Ω) 1 0, (36)

(1 βµ + ζ)I + c2 2 0, (37)

c3c1 + 2 γ(τ γ)(I + Ω) 1 = 0, (38)

(α + β 2γ)I + 2c3c2 2 (τ γ)(I (I Ω) 1) = 0, (39)

τ(1 ζ)I c1c2 + (γ τ)(I + Ω) 1 = 0. (40)

Tighter Analysis for Prox Skip

Then we have E[Ψnew t+1] (1 ζ)Ψnew t .

Inspired by an analysis on the special case where L and Ωare isotropic, we manually choose following parameters:

c1 = γ(τ γ)(I Ω) 1,

c2 = (τ(1 ζ) (τ γ)(I Ω) 1)c 1 1 ,

With above parameter, Equations (38) to (40) holds when α + β = 2γ 2 (γ ζτ). And Equations (35) to (37) becomes

γ2 Ω L 1α 0, (41)

τ 2 Ω 0, (42)

E1 := ( βµ + ζ)I + τ(1 ζ) (I Ω) 1(τ γ) 2

(I Ω) 1γ (τ γ) 0. (43)

α = γ2λmax(L Ω),

ζ = λmin( Ω)

β = 2γ ( (ζ (ω + 1) 1) + 1) α.

which satisfy Equations (41) and (42). Therefore we only need to make sure Equation (43) holds. It is guaranteed some feasible solution with (0, 1) exist, because at the left end point, we have

λmax(E1)| =0 = γµ(2 γλmax(L Ω)) + 1 1 p2

1 + λmin(Ω)

and due to γ > γcrit, we have E1| =0 0.