Non-convex Stochastic Composite Optimization with Polyak Momentum

Yuan Gao¹² Anton Rodomanov¹ Sebastian U. Stich¹

¹CISPA, Saarbrücken, Germany. ²Universität des Saarlandes. Correspondence to: Yuan Gao.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Abstract

The stochastic proximal gradient method is a powerful generalization of the widely used stochastic gradient descent (SGD) method and has found numerous applications in Machine Learning. However, this method is notoriously known to fail to converge in non-convex settings where the stochastic noise is significant (i.e., when only small or bounded batch sizes are used). In this paper, we focus on the stochastic proximal gradient method with Polyak momentum. We prove that this method attains an optimal convergence rate for non-convex composite optimization problems, regardless of batch size. Additionally, we rigorously analyze the variance reduction effect of Polyak momentum in the composite optimization setting, and we show that the method also converges when the proximal step can only be solved inexactly. Finally, we provide numerical experiments to validate our theoretical results.

1. Introduction

Stochastic gradient descent and its variants are the workhorse of modern machine learning. The stochastic proximal gradient method is a simple yet powerful extension of the vanilla stochastic gradient descent method, which aims to solve the following stochastic composite optimization problem:

$$\min_{x \in \mathbb{R}^d} \{ F(x) := f(x) + \psi(x) \}, \qquad (1)$$

where $f : \mathbb{R}^d \to \mathbb{R}$ is smooth and $\psi : \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}$ is a simple but not necessarily differentiable function. This paradigm is ubiquitous in machine learning and beyond, and it covers a wide range of generalizations of the vanilla optimization problem, including regularized machine learning problems (Liu et al., 2015), signal processing (Combettes & Pesquet, 2010), image processing (Luke, 2020), and many more. It also naturally covers constrained optimization problems via the indicator function of the constraint set, and some recent variants of such ideas have been applied to distributed and federated machine learning problems (Mishchenko et al., 2022).

The problem formulation (1) is often used in Machine Learning because the proximal term $\psi$ allows encoding prior domain knowledge. For instance, it can impose constraints on the variables, handle non-differentiability, induce sparsity, or take the form of a regularizer. Recently, there has been a surge of interest in preconditioning techniques of this kind in the Deep Learning context (Hendrikx et al., 2020; Woodworth et al., 2023). Given a loss objective function $\ell : \mathbb{R}^d \to \mathbb{R}$, an auxiliary function (or preconditioner) $\hat\ell$ is constructed, for instance, by approximating $\ell$ with a subset of the data samples or with synthetic samples. This results in the formulation:

$$\min_{x \in \mathbb{R}^d} \Bigl\{ \underbrace{\ell(x) - \hat\ell(x)}_{f(x)} + \underbrace{\hat\ell(x)}_{\psi(x)} \Bigr\}.$$

This is a special case of Problem (1), with $f := \ell - \hat\ell$ and $\psi := \hat\ell$. This formulation has the advantage that training can be accelerated if the proxy function $\hat\ell$ remains simple to optimize. These applications motivate the need to further our understanding of the stochastic composite optimization problem in the non-convex regime, especially where large batches are less preferred or even unavailable (Le Cun et al., 2012; Rieke et al., 2020).
In the convex and strongly convex cases, the complexity of solving Problem (1) is well understood (Ghadimi et al., 2016; Ghadimi & Lan, 2013a). On the other hand, the existing theory in the non-convex regime is unsatisfactory. In particular, when the gradient noise has variance $\sigma^2$, the squared gradient norm of $F$ produced by the vanilla algorithm is only shown to converge to $O(1/K) + \Omega(\sigma^2)$ after $K$ iterations. This implies that, to converge to an arbitrary error $\varepsilon$, one needs to take mega-batches of size $\Omega(\varepsilon^{-1})$ at each iteration of the algorithm to reduce the stochastic noise term; the total number of stochastic gradient oracle calls is then $\Omega(\varepsilon^{-2})$.

In both practice and theory, smaller-batch methods are often preferred over mega-batch methods. For example, mega-batch methods follow the full gradient method much more closely at each step than smaller-batch methods, and this behavior has been observed empirically to hurt generalization performance (Wilson & Martinez, 2003; Le Cun et al., 2012; Keskar et al., 2017) and has been shown theoretically to do so for certain Machine Learning tasks (Sekhari et al., 2021). Furthermore, in many practical settings, mega-batches are unavailable or intractable to sample, e.g., in medical tasks (Rieke et al., 2020), federated Reinforcement Learning (Khodadadian et al., 2022; Jin et al., 2022), and multi-agent Reinforcement Learning (Doan et al., 2019).

In this work, we revisit the stochastic composite optimization problem in the non-convex regime and show that the Polyak momentum technique addresses the aforementioned mega-batch issue while retaining the optimal convergence rate $O(\varepsilon^{-2})$. Polyak momentum is the de facto standard for training Deep Learning models in practice (Kingma & Ba, 2014), and understanding its theoretical efficacy in various settings is an active research direction (Cutkosky & Mehta, 2020; Fatkhullin et al., 2023).

1.1. Stochastic Proximal Gradient Method

To solve the stochastic composite optimization problem (1), we consider the stochastic proximal gradient method, which solves the following subproblem at each iteration (commonly known as the proximal step): find $x_{k+1}$ such that $\Omega_k(x_{k+1}) \le \Omega_k(x_k)$ and $\nabla\Omega_k(x_{k+1}) = 0$, where

$$\Omega_k(x) := \langle g_k, x \rangle + \psi(x) + \frac{M_k}{2}\|x - x_k\|^2.$$

Here, $g_k$ denotes a random vector computed at each iteration $k$ using the stochastic first-order oracle of $f$, and $M_k > 0$ is a regularization (or stepsize) parameter. $M_k$ can be a user-defined constant, or it can be a non-decreasing sequence. For the vanilla stochastic proximal gradient method, $g_k$ is a stochastic gradient of $f$ at $x_k$. When $\psi$ is convex, the proximal step is equivalent to minimizing $\Omega_k$:

$$x_{k+1} = \arg\min_{x \in \mathbb{R}^d} \Bigl\{ \langle g_k, x \rangle + \psi(x) + \frac{M_k}{2}\|x - x_k\|^2 \Bigr\}. \qquad (2)$$

We note that when $\psi \equiv 0$, this reduces to vanilla SGD with step size $1/M_k$. When $\psi \not\equiv 0$, it has long been known that, if error-dependent batch sizes are not allowed, such a method is only proved to converge to a neighborhood of a stationary point of $F$ up to the variance of the gradient noise (Ghadimi et al., 2016). In Section 4, we give a simple example to illustrate this phenomenon. Therefore, one needs $\varepsilon$-dependent mini-batches (or mega-batches) to converge to an $\varepsilon$-neighborhood of a stationary point of $F$.
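To make the proximal step (2) concrete, the following minimal sketch (our illustration, not code from the paper) instantiates it for the sparsity-inducing choice $\psi(x) = \lambda\|x\|_1$, whose proximal operator has the closed-form soft-thresholding solution; for $\lambda = 0$ the update collapses to plain SGD with step size $1/M_k$:

```python
import numpy as np

def soft_threshold(z, tau):
    """Closed-form prox of tau * ||.||_1: sign(z) * max(|z| - tau, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def prox_step(x_k, g_k, M, lam):
    """One step of (2) for psi(x) = lam * ||x||_1.

    Minimizes <g_k, x> + lam * ||x||_1 + (M/2) * ||x - x_k||^2,
    whose solution is soft-thresholding applied to a plain SGD step.
    """
    return soft_threshold(x_k - g_k / M, lam / M)

# With lam = 0 the prox is the identity, so the update is exactly
# vanilla SGD with step size 1/M.
x = np.array([1.0, -2.0, 0.5])
g = np.array([0.3, -0.1, 0.2])
assert np.allclose(prox_step(x, g, M=10.0, lam=0.0), x - g / 10.0)
```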
1.2. Our Contributions

Stochastic Polyak momentum has seen huge success in practice (Kingma & Ba, 2014). In this work, we study the effect of Polyak momentum in the non-convex regime and provide a theoretical analysis of momentum's stabilizing effect in the stochastic composite optimization setting, particularly in the small-batch regime. First, we establish a lower bound for the vanilla stochastic proximal gradient method, showing that it cannot converge to a stationary point beyond the variance of the gradient noise. We then study the effect of incorporating momentum into the stochastic proximal gradient method and prove its optimal convergence to a stationary point without any error-dependent mega-batch access. We also rigorously study the variance reduction effect of momentum in the stochastic composite optimization setting. We further extend our analysis to the case where the proximal steps are solved inexactly and obtain the same convergence guarantees, demonstrating the robustness of the momentum method. Finally, we conduct numerical experiments to corroborate our theoretical findings and demonstrate the algorithm's practicality.

2. Related Works

There is a huge body of work on stochastic and composite optimization. Here, we focus on the stochastic, non-convex regime and do not go into the details of works in the convex (e.g., Ghadimi & Lan, 2012; 2013b) or deterministic regime (see Nesterov, 2013). The first work to consider the stochastic composite optimization problem in the non-convex case appears to be (Ghadimi et al., 2016), which established convergence of the stochastic proximal gradient method to a neighborhood of a stationary point of $F$ up to the variance of the gradient noise. If mega-batches are allowed, they showed that the algorithm takes asymptotically $O(\varepsilon^{-2})$ stochastic gradient oracle calls to converge to an $\varepsilon$-neighborhood of a stationary point of $F$. This was later extended to incorporate the acceleration technique, but the convergence still requires mega-batches. In particular, it requires $k$ samples at the $k$-th iteration, while the total number of oracle calls is not improved (Ghadimi & Lan, 2013a). Even though these methods require mega-batches, their upper bounds on the total number of stochastic gradient oracle calls match the lower bound asymptotically (Arjevani et al., 2023).

In another research direction, to break the $O(\varepsilon^{-2})$ lower bound for stochastic non-convex optimization, there has been a long line of work on variance reduction techniques, including Prox-SpiderBoost (Wang et al., 2019) (a composite variant of Spider (Fang et al., 2018)), Hybrid-SGD (Tran-Dinh et al., 2022), and PStorm (Xu & Xu, 2022) (a composite variant of Storm (Cutkosky & Orabona, 2019)). These methods achieve an $O(\varepsilon^{-3/2})$ asymptotic complexity at the cost of two additional assumptions on the objective and problem structure:

- $f$ is $\tilde L$-average smooth: $\mathbb{E}[\|\nabla f(x,\xi) - \nabla f(y,\xi)\|^2] \le \tilde L^2 \|x - y\|^2$. Note that this assumption might not be satisfied by some simple smooth functions, and it is strictly stronger than the standard smoothness assumption.¹
- These variance reduction techniques require access to two stochastic gradients, $\nabla f(x_k, \xi_k)$ and $\nabla f(x_{k-1}, \xi_k)$, at iteration $k$, which might not always be available in practice (e.g., Doan et al., 2019; Chen et al., 2022).

¹ By Jensen's inequality, any $\tilde L$-average smooth function is also $\tilde L$-smooth (see also Assumption 3.3). There exist $L$-smooth functions that are not average smooth.

Convergence Criteria: The choice of first-order convergence criterion for non-convex optimization problems is an ongoing research topic. Our work mostly focuses on convergence in terms of $\|\nabla F(x_k)\|^2$, which we consider the most natural criterion.
Many early works study the convergence of problem (1) in terms of the proximal gradient mapping (Ghadimi et al., 2016; Ghadimi & Lan, 2013a; Wang et al., 2019; Tran-Dinh et al., 2022; Xu & Xu, 2022). More recently, several works propose to study convergence in terms of the Moreau envelope and discuss how the different criteria affect the corresponding convergence complexities (Davis & Drusvyatskiy, 2019; Zhang et al., 2020). We discuss the differences and connections between these definitions in Appendix G.

Stochastic Polyak Momentum: The idea of using Polyak momentum (a.k.a. heavy-ball momentum) was first proposed in (Polyak, 1964) for strongly convex quadratic objectives in the deterministic case. The first non-asymptotic analysis of SGD with Polyak momentum in the smooth non-convex regime is given in (Yu et al., 2019) and was later refined in (Liu et al., 2020). Some recent works also studied how Polyak momentum can be used to remove the dependence on large batches in various settings, e.g., for normalized SGD (Cutkosky & Mehta, 2020) and for communication-compressed optimization (Fatkhullin et al., 2023). The analysis of stochastic gradient methods with Polyak momentum remains an active research topic (Wang & Abernethy, 2020; Sebbouh et al., 2021; Li et al., 2022; Jelassi & Li, 2022).

3. Problem Formulation and Assumptions

In this work, we consider the composite optimization problem:

$$\min_{x \in \mathbb{R}^d} \{ F(x) := f(x) + \psi(x) \}.$$

In the main body of the paper, we study the convergence of the algorithms in terms of $\|\nabla F(x)\|^2$, where we assume that $\psi$ is differentiable. We discuss the case where $\psi$ is convex and non-differentiable in Appendix F. We do not consider the case where $\psi$ is both non-convex and non-differentiable, as this is an exotic scenario in the literature. Note that most existing works on stochastic composite optimization assume that $\psi$ is convex and analyze the convergence of their algorithms in terms of the proximal gradient mapping (Ghadimi et al., 2016; Ghadimi & Lan, 2013a; Wang et al., 2019; Tran-Dinh et al., 2022; Xu & Xu, 2022), which is closely related to the criterion we consider in this work. In Appendix G, we discuss the proximal gradient mapping and argue that our definition is more natural in the non-convex regime.

We first introduce the following assumption on the monotonicity of the proximal step:

Assumption 3.1. We assume that, at each iteration $k$ of the algorithms, the output $x_{k+1}$ of the proximal step satisfies $\Omega_k(x_{k+1}) \le \Omega_k(x_k)$.

This is a technical assumption that provides a more nuanced characterization of $\psi$ beyond convexity; all our theory also works under the simple convexity assumption (in fact, several constants can be improved under convexity). It is natural to assume that the proximal step finds a point no worse than the current iterate, even when $\psi$ is non-convex.

Next, we introduce the lower-boundedness assumption on the objective function $F$:

Assumption 3.2. We assume that there exists $F^* \in \mathbb{R}$ such that $F^* \le F(x)$ for all $x \in \mathbb{R}^d$.

Now, we introduce the smoothness assumption on $f$. Note that this is a weaker assumption than the average smoothness assumption used in variance-reduction-type methods (Wang et al., 2019; Tran-Dinh et al., 2022; Xu & Xu, 2022).

Assumption 3.3. We assume that the function $f$ has an $L$-Lipschitz gradient, i.e., for any $x, y \in \mathbb{R}^d$, we have $\|\nabla f(x) - \nabla f(y)\| \le L\|x - y\|$.
Next, we make an assumption on the noise of the gradient oracle of $f$.

Assumption 3.4. We assume that we have access to a gradient oracle $\nabla f(x, \xi)$ for $f$ such that, for all $x \in \mathbb{R}^d$, it holds that

$$\mathbb{E}[\nabla f(x,\xi)] = \nabla f(x), \qquad \mathbb{E}[\|\nabla f(x,\xi) - \nabla f(x)\|^2] \le \sigma^2. \qquad (3)$$

This is a standard assumption in the analysis of stochastic gradient methods (Nemirovskij & Yudin, 1983; Bubeck et al., 2015). Taking a mini-batch of size $b$ is equivalent to dividing the variance by $b$.

4. Lower Bound for the Vanilla Stochastic Proximal Gradient Method

In Section 1.1, we briefly discussed that the vanilla stochastic proximal gradient method cannot converge to a stationary point of $F$ beyond the variance of the gradient noise. We now give a simple example to illustrate this phenomenon and provide some intuition on why momentum resolves the issue.

Proposition 4.1. For any $K \ge 1$ and any (predefined) stepsize coefficients $\{M_k\}_{k=0}^{K-1}$ (possibly depending on the problem parameters $L$, $\sigma^2$ and $K$), there exists a problem instance of (1) with $f(x) := \frac{L}{2}\|x\|^2$ and $\psi(x) := \frac{a}{2}\|x\|^2$, $a := \max_{0 \le k \le K-1} M_k$, satisfying Assumptions 3.2 and 3.3, and a stochastic gradient oracle $\nabla f(x, \xi) := \nabla f(x) + \xi$, $\xi \sim \mathcal{N}(0, \sigma^2 I)$, satisfying Assumption 3.4, such that, for the sequence $\{x_k\}_{k=1}^K$ generated by method (2) started from any initial point $x_0$, it holds that $\mathbb{E}[\|\nabla F(x_k)\|^2] \ge \frac{1}{4}\sigma^2$ for all $1 \le k \le K$.

Proof. Clearly, the construction satisfies our assumptions. Fix an arbitrary $0 \le k \le K-1$. Setting the gradient of the auxiliary function in (2) to zero, we see that $g_k + a x_{k+1} + M_k(x_{k+1} - x_k) = 0$. Since $g_k = L x_k + \xi_k$, it follows that

$$x_{k+1} = \frac{M_k - L}{M_k + a} x_k - \frac{\xi_k}{M_k + a}.$$

Substituting $\nabla F(x_{k+1}) = (L + a) x_{k+1}$, we obtain

$$\mathbb{E}[\|\nabla F(x_{k+1})\|^2] = \frac{(L+a)^2}{(M_k+a)^2}\,\mathbb{E}[\|(M_k - L)x_k - \xi_k\|^2] = \frac{(L+a)^2}{(M_k+a)^2}\Bigl((M_k - L)^2\,\mathbb{E}[\|x_k\|^2] + \sigma^2\Bigr) \ge \frac{(L+a)^2 \sigma^2}{(M_k+a)^2}.$$

Since $a \ge M_k$, we have $\frac{L+a}{M_k+a} \ge \frac{1}{2}$, and the claim follows.

Note that Proposition 4.1 holds for any initial point $x_0$. Even if we start at the optimal point $x_0 = 0$, the very first step of the method will already incur an $O(\sigma^2)$ error, and no subsequent steps will be able to reduce it. It is important that, for the composite optimization problem (1), we do not make any significant assumptions on $\psi$. In particular, we do not assume that $\psi$ is smooth with a certain smoothness constant. Proposition 4.1 demonstrates that, without such extra assumptions, for any fixed choice of parameters for the stochastic proximal gradient method (2), i.e., the stepsize coefficients $M_k$ and the number of iterations $K$, there is always a bad function in our problem class (namely, $F(x) = f(x) + \psi(x)$ with $f(x) = \frac{L}{2}\|x\|^2$ and $\psi(x) = \frac{a}{2}\|x\|^2$ for a sufficiently large $a$) for which the method cannot reach any error $< \frac{1}{4}\sigma^2$ after $K$ steps. In other words, for any given target accuracy $\varepsilon < \frac{1}{4}\sigma^2$, it is impossible to find one specific choice of the parameters for the method allowing it to reach $\varepsilon$-error on any problem from our class.

The problem is that the variance of the gradient noise keeps the iterates away from the stationary point of $F$, and we need some mechanism that reduces this variance over time. One such mechanism is Polyak momentum, which we discuss next.
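The construction in Proposition 4.1 is easy to simulate. The following one-dimensional sketch (ours, not from the paper) runs the exact proximal step (2) with a fixed $M$ and the adversarial regularizer $a = M$, starting at the stationary point $x_0 = 0, $ and confirms that the mean of $\|\nabla F(x_k)\|^2$ stays above $\sigma^2/4$:

```python
import numpy as np

rng = np.random.default_rng(0)
L, sigma, K = 1.0, 5.0, 10_000
M = 100.0            # any fixed stepsize coefficient
a = M                # adversarial regularizer from Proposition 4.1 (a >= M)

x = 0.0              # start exactly at the stationary point x* = 0
grad_sq = []
for _ in range(K):
    xi = rng.normal(0.0, sigma)           # oracle noise, E[xi^2] = sigma^2
    x = ((M - L) * x - xi) / (M + a)      # exact proximal step (2)
    grad_sq.append(((L + a) * x) ** 2)    # ||grad F(x_{k+1})||^2

print(np.mean(grad_sq), 0.25 * sigma**2)  # mean stays above sigma^2 / 4
```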
5. The Algorithm and Analysis

In this section, we present the stochastic proximal gradient method with Polyak momentum and establish its convergence guarantees in the non-convex regime. The method is shown in Algorithm 1.

Algorithm 1 Proximal Gradient Method with Polyak Momentum
1: Input: $x_0$, $m_{-1}$ and $\{M_k\}_{k=0}^{\infty}$, $\{\gamma_k\}_{k=-1}^{\infty}$, $\{\delta_k\}_{k=0}^{\infty}$
2: for $k = 0, 1, 2, \ldots$ do
3:   Compute $g_k = \nabla f(x_k, \xi_k)$
4:   Update $m_k = (1 - \gamma_{k-1}) m_{k-1} + \gamma_{k-1} g_k$
5:   Compute an approximate stationary point $x_{k+1}$ of $\Omega_k(x) := \langle m_k, x \rangle + \psi(x) + \frac{M_k}{2}\|x - x_k\|^2$ such that $\Omega_k(x_{k+1}) \le \Omega_k(x_k)$ and $\|\nabla\Omega_k(x_{k+1})\|^2 \le \delta_k$
6: end for

In the algorithm, the stochastic gradient $g_k$ at each iteration $k$ is replaced by the momentum-aggregated gradient estimate

$$m_k = (1 - \gamma_{k-1}) m_{k-1} + \gamma_{k-1} g_k.$$

Our analysis will show that the distance between the momentum and the full gradient decreases as the number of iterations increases, which is the key to resolving the issue of the vanilla stochastic proximal gradient method. In this section, we assume that the proximal steps are solved exactly, i.e., we further make the following assumption:

Assumption 5.1. At each iteration $k$, we have $\delta_k = 0$, i.e., $\nabla\Omega_k(x_{k+1}) = 0$.

We relax Assumption 5.1 and discuss inexact proximal steps in Section 7.

5.1. Convergence Analysis

We discuss the convergence analysis here; missing proofs can be found in Appendix B. The convergence analysis of Algorithm 1 revolves around the following quantities:

$$F_k := \mathbb{E}[F(x_k) - F^*], \qquad \Delta_k := \mathbb{E}[\|m_k - \nabla f(x_k)\|^2], \qquad R_k := \mathbb{E}[\|x_{k+1} - x_k\|^2]. \qquad (4)$$

$F_k$ quantifies the distance between the current objective value and the lower bound. $\Delta_k$ bounds the distance between the current gradient estimate and the full gradient. $R_k$ bounds the distance between two consecutive iterates. We start by giving a descent lemma for $\Delta_k$, which is key to analyzing the variance reduction effect of momentum. Similar statements can be found in (Cutkosky & Mehta, 2020) and (Fatkhullin et al., 2023).

Lemma 5.2. Under Assumption 3.4, for any $k \ge 0$:

$$\Delta_{k+1} \le (1 - \gamma_k)\Delta_k + \frac{L^2}{\gamma_k} R_k + \gamma_k^2 \sigma^2.$$

The $(1 - \gamma_k)$ factor in the above lemma is the key to showing that $\Delta_k$ decreases over the course of optimization. Next, we discuss the per-iteration descent of $F_k$. Recall that in this section we work under Assumption 5.1, which is equivalent to the following stationarity condition:

$$\nabla\psi(x_{k+1}) + m_k + M_k(x_{k+1} - x_k) = 0. \qquad (5)$$

When $\psi$ is convex and non-differentiable, $\nabla\Omega_k(x_{k+1})$ is simply a subgradient of $\Omega_k$ at $x_{k+1}$ that equals zero, whose existence is guaranteed by the optimality condition. In the following lemmas and theorems, we can replace the gradients of $\Omega$ and $\psi$ with such choices of subgradients and obtain the same results for convex and non-differentiable $\psi$. In Section 7, we give an approximate stationarity assumption, which is more realistic in practice, and we show the same convergence guarantees. Now we give the following descent lemma on $F_k$:

Lemma 5.3. Under Assumptions 3.1, 3.3 and 5.1, for any $k \ge 0$, we have

$$F_{k+1} \le F_k - \frac{M_k - L}{4} R_k + \frac{\Delta_k}{M_k - L}.$$

Remark 5.4. The constants in Lemma 5.3 can be slightly improved if Assumption 3.1 is replaced by the convexity of $\psi$.

Proof. By Assumption 3.3, we have:

$$F(x_{k+1}) = f(x_{k+1}) + \psi(x_{k+1}) \le f(x_k) + \langle \nabla f(x_k), x_{k+1} - x_k \rangle + \frac{L}{2}\|x_{k+1} - x_k\|^2 + \psi(x_{k+1}) = f(x_k) + \langle m_k, x_{k+1} - x_k \rangle + \psi(x_{k+1}) + \frac{L}{2}\|x_{k+1} - x_k\|^2 + \langle \nabla f(x_k) - m_k, x_{k+1} - x_k \rangle.$$

Then we apply Assumption 3.1 and notice that $\Omega_k(x_k) = \psi(x_k) + \langle m_k, x_k \rangle$:

$$F(x_{k+1}) \le F(x_k) - \frac{M_k - L}{2}\|x_{k+1} - x_k\|^2 + \langle \nabla f(x_k) - m_k, x_{k+1} - x_k \rangle.$$

Now we apply Young's inequality and get

$$F(x_{k+1}) \le F(x_k) - \frac{M_k - L}{4}\|x_{k+1} - x_k\|^2 + \frac{\|m_k - \nabla f(x_k)\|^2}{M_k - L}.$$

We get the desired result by subtracting $F^*$ and taking expectations on both sides.
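To make the variance-reduction mechanism in Lemma 5.2 explicit, the following short side calculation (our addition, not part of the original analysis) unrolls the recursion with a constant parameter $\gamma_k \equiv \gamma$ and drops the iterate-movement term $\frac{L^2}{\gamma}R_k$, which the full proof controls via Lemma 5.3:

$$\Delta_k \;\le\; (1-\gamma)\,\Delta_{k-1} + \gamma^2\sigma^2 \;\le\; (1-\gamma)^k\,\Delta_0 \;+\; \gamma^2\sigma^2 \sum_{j=0}^{k-1}(1-\gamma)^j \;\le\; (1-\gamma)^k\,\Delta_0 \;+\; \gamma\,\sigma^2.$$

The momentum estimate thus forgets its initialization geometrically and settles at a noise floor of order $\gamma\sigma^2$ rather than $\sigma^2$; since $\gamma_k = \Theta(L/M_k)$ in the results below, taking $M_k$ large drives this floor toward zero, in contrast with the $\Omega(\sigma^2)$ floor of Proposition 4.1.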
Note that these bounds suffice if we aim to prove convergence in terms of the distance between the iterates, but they are not enough to guarantee convergence in terms of $\mathbb{E}[\|\nabla F(x_k)\|^2]$. Therefore, we relate the distance between iterates and the norm of the gradient in the following lemma:

Lemma 5.5. Under Assumptions 3.3 and 5.1, for any $k \ge 0$,

$$(M_k^2 + L^2) R_k \ge \frac{1}{3}\mathbb{E}[\|\nabla F(x_{k+1})\|^2] - \Delta_k.$$

Proof. We can split $\nabla F(x_{k+1})$ in the following way:

$$\nabla F(x_{k+1}) = \nabla f(x_{k+1}) + \nabla\psi(x_{k+1}) = \bigl(m_k + \nabla\psi(x_{k+1})\bigr) + \bigl(\nabla f(x_k) - m_k\bigr) + \bigl(\nabla f(x_{k+1}) - \nabla f(x_k)\bigr).$$

Hence

$$\|\nabla F(x_{k+1})\|^2 \le 3\|m_k + \nabla\psi(x_{k+1})\|^2 + 3\|m_k - \nabla f(x_k)\|^2 + 3\|\nabla f(x_{k+1}) - \nabla f(x_k)\|^2.$$

Applying the stationarity condition (5) to the first term and Assumption 3.3 to the third term, we get

$$\|\nabla F(x_{k+1})\|^2 \le 3(M_k^2 + L^2)\|x_{k+1} - x_k\|^2 + 3\|m_k - \nabla f(x_k)\|^2.$$

Rearranging and taking expectations, we get the claim.

Now, we can piece together all of the above lemmas and consider the following Lyapunov function:

$$\Phi_k := F_k + a\Delta_k, \qquad (6)$$

where $a$ is a constant to be determined later. Note that $a$ has no impact on the algorithm itself; it only appears in the analysis. Now we give the following lemma on $\Phi_k$:

Lemma 5.6. Let Assumptions 3.1, 3.3, 3.4 and 5.1 hold, and let $a := \frac{3}{8L}$, $\gamma_k := \frac{3L}{M_k - L}$ and $M_k > 4L$ for all $k \ge 0$. Then, for any $k \ge 0$,

$$\Phi_{k+1} \le \Phi_k - \frac{1}{48M_k}\mathbb{E}[\|\nabla F(x_{k+1})\|^2] + \frac{27L\sigma^2}{4M_k^2}. \qquad (7)$$

We have the following simple corollary for constant stepsize coefficients:

Corollary 5.7. Let Algorithm 1 be run for $K \ge 1$ iterations for solving problem (1) under Assumptions 3.3, 3.4 and 5.1, with constant coefficients $M_k = M = 4L + \frac{3^{3/2}}{2}\sqrt{\frac{KL\sigma^2}{\Phi_0}}$ and $\gamma_k = \frac{3L}{M - L}$ for all $0 \le k \le K-1$, where $\Phi_0 := F(x_0) - F^* + \frac{3}{8L}\mathbb{E}[\|m_0 - \nabla f(x_0)\|^2]$. Then,

$$\mathbb{E}[\|\nabla F(x_t)\|^2] \le \frac{192 L \Phi_0}{K} + 48 \cdot 3^{3/2}\sqrt{\frac{L\Phi_0\sigma^2}{K}},$$

where $t$ is chosen uniformly at random from $\{1, \ldots, K\}$.

It is also possible to use time-varying coefficients $M_k$, which do not require fixing the number of iterations in advance. However, in terms of the convergence rate, this incurs an extra logarithmic factor.

Corollary 5.8. Consider Algorithm 1 for solving problem (1) under Assumptions 3.3, 3.4 and 5.1 with coefficients $M_k := \max\bigl\{\sqrt{\frac{(k+1)L\sigma^2}{\Phi_0}}, 4L\bigr\}$ and $\gamma_k = \frac{3L}{M_k - L}$ for all $k \ge 0$, where $\Phi_0 := F(x_0) - F^* + \frac{3}{8L}\mathbb{E}[\|m_0 - \nabla f(x_0)\|^2]$. Then, for any $k \ge 1$, we have

$$\mathbb{E}[\|\nabla F(x_{t(k)})\|^2] \le 372\ln(ek)\max\Bigl\{\sqrt{\frac{L\Phi_0\sigma^2}{k}}, \frac{4L\Phi_0}{k}\Bigr\},$$

where $t(k)$ is chosen randomly from $\{1, \ldots, k\}$ with probabilities $\Pr(t(k) = i) \propto \frac{1}{M_{i-1}}$, $i = 1, \ldots, k$, and $e := \exp(1)$.

Let us point out that using a random iterate as the output of the algorithm is standard in the literature (see, e.g., (Rakhlin et al., 2012)) and can be efficiently implemented without fixing the number of iterations in advance. We discuss this more carefully in Appendix E.

5.2. Initialization and Convergence Guarantees

As we can see from Corollaries 5.7 and 5.8, the convergence rate of Algorithm 1 depends on $\Phi_0 := O(F_0 + \frac{1}{L}\Delta_0)$, where $\Delta_0 := \mathbb{E}[\|m_0 - \nabla f(x_0)\|^2] = \mathbb{E}[\|(1-\gamma_{-1})m_{-1} + \gamma_{-1}g_0 - \nabla f(x_0)\|^2]$ depends on $m_{-1}$. There are subtle differences in how $m_{-1}$ can be initialized between the non-composite and composite cases.

Non-Composite Case: When $\psi \equiv 0$, i.e., $F \equiv f$, we set the initial momentum $m_{-1} := -\frac{\gamma_{-1}}{1-\gamma_{-1}} g_0$, so that $m_0 = 0$ and

$$\Delta_0 = \mathbb{E}[\|m_0 - \nabla f(x_0)\|^2] = \|\nabla f(x_0)\|^2 \le 2LF_0,$$

where the last inequality follows from Assumption 3.3 and the fact that $F \equiv f$. Therefore, in the non-composite case, we can initialize the parameters such that $\Phi_0 = O(F_0)$.

Composite Case: The composite case is slightly trickier. We set $m_{-1} := g_0$ and get $m_0 = g_0$. Hence $\Delta_0 = \mathbb{E}[\|g_0 - \nabla f(x_0)\|^2] \le \sigma^2$. Therefore, $\Phi_0 = O(F_0 + \frac{\sigma^2}{L})$. When $\sigma^2 = O(LF_0)$, we get the same $\Phi_0 = O(F_0)$ as in the non-composite case.

Mini-Batch Initialization: In the case that $\sigma^2$ is much larger than $F_0$, if we have access to a constant-size (not depending on the target error) mini-batch initially, then we can set $g_0 = \frac{1}{b_0}\sum_{i=1}^{b_0}\nabla f(x_0, \xi_i)$ with $b_0 := \lceil\frac{\sigma^2}{LF_0}\rceil$, i.e., $g_0$ is a mini-batch stochastic gradient of size $b_0$. Then we have $\Phi_0 = O(F_0 + \frac{\sigma^2}{b_0 L}) = O(F_0)$, which is the same as in the non-composite case.
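The initialization options above are straightforward to implement. A minimal sketch (ours; `stoch_grad` is an assumed single-sample oracle for $\nabla f(\cdot, \xi)$ returning a NumPy array, and the function name is hypothetical):

```python
import numpy as np

def initialize_momentum(x0, stoch_grad, gamma_init, composite=True, b0=1):
    """Return (m_{-1}, m_0) for the initializations of Section 5.2.

    Requires 0 < gamma_init < 1. Choosing b0 of order sigma^2 / (L * F_0)
    recovers Phi_0 = O(F_0) even when sigma^2 >> L * F_0 (mini-batch option).
    """
    g0 = np.mean([stoch_grad(x0) for _ in range(b0)], axis=0)
    if composite:
        m_init = g0.copy()                              # yields m_0 = g_0
    else:
        # Yields m_0 = 0, hence Delta_0 = ||grad f(x_0)||^2 <= 2 L F_0.
        m_init = -gamma_init / (1.0 - gamma_init) * g0
    m0 = (1.0 - gamma_init) * m_init + gamma_init * g0
    return m_init, m0
```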
We have thus proved the following convergence guarantee for Algorithm 1.

Theorem 5.9. Consider Algorithm 1, as applied to solving problem (1) under Assumptions 3.3, 3.4 and 5.1, run for $K = O\bigl(\frac{L\Phi_0\sigma^2}{\varepsilon^2} + \frac{L\Phi_0}{\varepsilon}\bigr)$ iterations with constant coefficients $M_k = M = 4L + \frac{3^{3/2}}{2}\sqrt{\frac{KL\sigma^2}{\Phi_0}}$ and $\gamma_k = \frac{3L}{M-L}$ for all $0 \le k \le K-1$, where $\Phi_0 := F_0 + \frac{3}{8L}\mathbb{E}[\|m_0 - \nabla f(x_0)\|^2]$, $F_0 := F(x_0) - F^*$ and $\varepsilon > 0$ is a given target error. Then, for the point $x_t$ chosen uniformly at random from $\{x_1, \ldots, x_K\}$, it holds that $\mathbb{E}[\|\nabla F(x_t)\|^2] \le \varepsilon$. If $\psi \equiv 0$, we can initialize $m_{-1}$ in such a way that $K = O\bigl(\frac{LF_0\sigma^2}{\varepsilon^2}\bigr)$. Otherwise, we can initialize $m_{-1}$ in such a way that $K = O\bigl(\frac{LF_0\sigma^2 + \sigma^4}{\varepsilon^2}\bigr)$. Further, when an initial mini-batch of size $\lceil\frac{\sigma^2}{LF_0}\rceil$ is allowed, we can initialize $m_{-1}$ in such a way that $K = O\bigl(\frac{LF_0\sigma^2}{\varepsilon^2}\bigr)$.

When an initial mini-batch is not allowed, the convergence rate has an extra $O(\frac{\sigma^4}{\varepsilon^2})$ term that is not present in the non-composite case. One natural question is whether this is an artifact of our analysis or inherent to the problem and the algorithm. Note that the first step of Algorithm 1 coincides with the vanilla stochastic proximal gradient method. Consider the lower bound construction in Proposition 4.1: when starting at $x_0 = 0$, which is a stationary point with $F_0 = 0$, one such step already incurs an $O(\sigma^2)$ error. Therefore, in the composite case without the initial mini-batch, the convergence rate must have some term that depends only on $\sigma^2$. In other words, the extra terms seem unavoidable.

6. Variance Reduction Effect of Momentum

In Section 4, we demonstrated that the vanilla stochastic proximal gradient method cannot converge because the gradient noise variance keeps the iterates away from the stationary point of $F$. In this section, we show that the momentum term in Algorithm 1 reduces the variance of the gradient noise at the same rate as the gradient norm. This is the key to the convergence of Algorithm 1 without batches. This variance reduction effect has been known in practice and was implicitly used in the analyses of (Cutkosky & Mehta, 2020) and (Fatkhullin et al., 2023). We precisely characterize this effect in the composite optimization setting. Proofs are deferred to Appendix C. We begin by refining the result of Lemma 5.6:

Lemma 6.1. Let Assumptions 3.1, 3.3, 3.4 and 5.1 hold, and let $a := \frac{3\sqrt{2}}{8L}$, $\gamma_k := \frac{3\sqrt{2}L}{M_k - L}$ and $M_k > (1 + 3\sqrt{2})L$ for all $k \ge 0$. Then, for any $k \ge 0$,

$$\Phi_{k+1} \le \Phi_k - \frac{1}{48M_k}\mathbb{E}[\|\nabla F(x_{k+1})\|^2] - \frac{3}{2\sqrt{2}\,M_k}\Delta_k + \frac{27\sqrt{2}L}{2M_k^2}\sigma^2. \qquad (10)$$

With Lemma 6.1, we can now precisely quantify the variance reduction effect of the momentum method.

Theorem 6.2. Consider Algorithm 1, as applied to solving problem (1) under Assumptions 3.3, 3.4 and 5.1, run for $K = O\bigl(\frac{L\Phi_0\sigma^2}{\varepsilon^2}\bigr)$ iterations with constant coefficients $M_k = M = (1 + 3\sqrt{2})L + c\sqrt{\frac{KL\sigma^2}{\Phi_0}}$ for a suitable absolute constant $c$, and $\gamma_k = \frac{3\sqrt{2}L}{M - L}$ for all $0 \le k \le K-1$, where $\Phi_0 := F_0 + \frac{3\sqrt{2}}{8L}\mathbb{E}[\|m_0 - \nabla f(x_0)\|^2]$, $F_0 := F(x_0) - F^*$ and $\varepsilon > 0$ is a given target error. Then, for the point $x_t$ chosen uniformly at random from $\{x_1, \ldots, x_K\}$, it holds that $\mathbb{E}[\|m_t - \nabla f(x_t)\|^2] \le \varepsilon$.

In words, the squared distance between the momentum $m_k$ and the full gradient $\nabla f(x_k)$ decreases at the same rate as the squared norm of the gradient $\nabla F(x_k)$.
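As a quick numeric illustration of Theorem 6.2 (ours, not an experiment from the paper), the following sketch runs Algorithm 1 on the one-dimensional quadratic from Section 4 and tracks $\Delta_k = \|m_k - \nabla f(x_k)\|^2$; the tracking error settles at order $\gamma\sigma^2$, far below the raw oracle variance $\sigma^2$:

```python
import numpy as np

rng = np.random.default_rng(1)
L, a_reg, sigma = 1.0, 100.0, 5.0
M = 50.0
gamma = 3 * L / (M - L)                  # constant-stepsize choice from Lemma 5.6
K = 20_000

x, m = 5.0, 0.0
deltas = []
for _ in range(K):
    g = L * x + rng.normal(0.0, sigma)   # stochastic gradient of f(x) = L x^2 / 2
    m = (1 - gamma) * m + gamma * g      # Polyak momentum aggregation
    deltas.append((m - L * x) ** 2)      # Delta_k = |m_k - f'(x_k)|^2
    x = (M * x - m) / (M + a_reg)        # exact prox step for psi(x) = a x^2 / 2

print(np.mean(deltas[K // 2:]), gamma * sigma**2, sigma**2)
```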
7. Inexact Proximal Step

While many existing works (e.g., Ghadimi et al., 2016; Ghadimi & Lan, 2013a; Wang et al., 2019; Hendrikx et al., 2020; Tran-Dinh et al., 2022; Xu & Xu, 2022) rely on the assumption that the proximal step can be solved exactly, this might not always be practically possible (Barré et al., 2023). In this section, we briefly discuss extending our analysis to the case where the proximal step is solved inexactly, and we give an inexactness criterion similar to that of (Woodworth et al., 2023).

A crucial element in our previous analysis is the assumption that $\nabla\Omega_k(x_{k+1}) = 0$, i.e., Equation (5). Therefore, it is natural to define the inexactness criterion as an approximate stationarity condition on $\Omega_k$ at $x_{k+1}$, where $\delta_k$ is no longer required to be zero. In particular, we define the criterion as follows:

$$\mathbb{E}[\|\nabla\Omega_k(x_{k+1})\|^2] \le \frac{M_k^2}{16}\mathbb{E}[\|x_{k+1} - x_k\|^2] + S_k, \qquad (11)$$

where $S_k$ will be specified later. Again, in the non-differentiable case, $\nabla\Omega_k(x_{k+1})$ should be understood as a specific subgradient of $\Omega_k$ at $x_{k+1}$ for which Equation (11) holds. Now we state the convergence result (proofs are deferred to Appendix D):

Theorem 7.1. Consider Algorithm 1, as applied to solving problem (1) under Assumptions 3.1, 3.3 and 3.4, where at each iteration $k$ the approximate stationarity condition $\mathbb{E}[\|\nabla\Omega_k(x_{k+1})\|^2] \le \frac{M^2}{16}\mathbb{E}[\|x_{k+1} - x_k\|^2] + S_k$ holds. Run the method for $K = O\bigl(\frac{L\Phi_0\sigma^2}{\varepsilon^2} + \frac{L\Phi_0}{\varepsilon}\bigr)$ iterations with constant coefficients $M_k = M = 7L + 2\sqrt{2}\sqrt{\frac{KL\sigma^2}{\Phi_0}}$ and $\gamma_k = \sqrt{\frac{152}{17}}\,\frac{L}{M - L}$ for all $0 \le k \le K-1$, where $\Phi_0 := F_0 + \frac{1}{8L}\sqrt{\frac{152}{17}}\,\mathbb{E}[\|m_0 - \nabla f(x_0)\|^2]$, $F_0 := F(x_0) - F^*$ and $\varepsilon > 0$ is a given target error. Then, for the point $x_t$ chosen uniformly at random from $\{x_1, \ldots, x_K\}$, it holds that

$$\mathbb{E}[\|\nabla F(x_t)\|^2] \le \frac{\varepsilon}{2} + \frac{8}{K}\sum_{k=0}^{K-1} S_k.$$

In particular, if $S_k \le \frac{\varepsilon}{16}$ for all $0 \le k \le K-1$, then $\mathbb{E}[\|\nabla F(x_t)\|^2] \le \varepsilon$.

It remains to discuss how to minimize $\Omega_k$ such that the inexactness criterion (11) is satisfied. There is a line of research on minimizing the gradient norm with SGD variants, e.g., (Allen-Zhu, 2018), and one can also exploit structural information in $\psi$ if there is any. Here, we note that SGD suffices for our purpose and give the following proposition, which is a simple modification of Proposition 2.6 in (Woodworth et al., 2023):

Proposition 7.2. If $\psi$ is convex and $L_\psi$-smooth, and we have access to an unbiased gradient oracle for $\psi$ with variance at most $\sigma_\psi^2$, then after at most

$$T = O\left(\frac{L_\psi + M}{M}\ln\frac{L_\psi + M}{M} + \frac{(L_\psi + M)\sigma_\psi^2}{M S_k}\right)$$

iterations of SGD, the output $\hat x$ of SGD satisfies condition (11).

The proof of Proposition 7.2 is simple, and we refer interested readers to (Woodworth et al., 2023) for the discussion therein.
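As an illustration of how such an inner solver can be organized, the following sketch (ours; it assumes $\psi$ is smooth with a gradient oracle `psi_grad`, a hypothetical name) runs plain gradient descent on $\Omega_k$ and stops once an empirical surrogate of criterion (11) holds. Criterion (11) is stated in expectation, so checking it on sampled gradients is a heuristic rather than an exact implementation:

```python
import numpy as np

def inexact_prox_step(x_k, m_k, M, psi_grad, S_k, max_iters=1000, lr=None):
    """Minimize Omega_k(x) = <m_k, x> + psi(x) + (M/2) ||x - x_k||^2 by
    gradient descent until the stopping rule mirroring criterion (11):
        ||grad Omega_k(x)||^2 <= (M^2 / 16) ||x - x_k||^2 + S_k.
    """
    lr = lr if lr is not None else 1.0 / (2.0 * M)
    x = x_k.copy()
    for _ in range(max_iters):
        grad = m_k + psi_grad(x) + M * (x - x_k)  # grad of Omega_k at x
        d = x - x_k
        if np.dot(grad, grad) <= (M**2 / 16.0) * np.dot(d, d) + S_k:
            break
        x = x - lr * grad
    return x
```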
8. Experiments

In this section, we conduct numerical experiments to corroborate our theoretical findings and demonstrate the practical effectiveness of Algorithm 1.

Figure 1: Comparison of Algorithm 1 and the vanilla stochastic proximal gradient method on the synthetic quadratic problem. For the vanilla stochastic proximal methods, we also highlight the smoothed curves on top of the original curves, which oscillate much more. The left, middle, and right panels correspond to σ = 5, 25, 125, respectively. The vanilla stochastic proximal gradient method uses batch sizes 1, 16, and 64. The x-axis shows the number of gradient samples and is truncated at the first 10⁵ gradient samples.

8.1. Synthetic Quadratic Problem

We first consider a synthetic quadratic problem inspired by our lower bound construction in Proposition 4.1. We consider the following problem in $\mathbb{R}^d$: we set $f(x) = \frac{L}{2}\|x\|^2$ and $\psi(x) = \frac{a}{2}\|x\|^2$, where $a$ is sufficiently large. Note that the vanilla stochastic proximal gradient method and Algorithm 1 are oblivious to the parameter $a$. We add Gaussian noise to the gradients to control the noise level $\sigma^2$ of the stochastic oracle. We simulate batch sampling by dividing the variance by the batch size.

In Figure 1, we compare the performance of Algorithm 1 and the vanilla stochastic proximal gradient method. We set $d = 5$, $L = 1$ and $a = 10^4$. We run the vanilla method with batch sizes 1, 16, and 64, and we set σ = 5, 25, 125, respectively. The parameter $M$ is tuned by grid search over $\{10^0, 10^1, 10^2, 10^3, 10^4\}$ for all methods, and the momentum parameter $\gamma$ is tuned by grid search over $\{10^{-1}, 10^{-2}, 10^{-3}, 10^{-4}, 10^{-5}\}$. We set the maximum number of iterations to $10^4$ and the tolerance to 0.02. We see that, as $\sigma^2$ increases, Algorithm 1 still reaches the desired tolerance, while the vanilla method with batch size 1 fails to converge in all cases, and with batch sizes 16 and 64 it converges only when σ = 5. In particular, with batch size 1, the error of the vanilla method oscillates around 22 for $\sigma^2 = 5^2$, around 672 for $\sigma^2 = 25^2$, and around 14381 for $\sigma^2 = 125^2$. In other words, the error of the vanilla method is indeed proportional to $\sigma^2$, as predicted by our lower bound in Proposition 4.1. For Algorithm 1, $M$ is set to 10, 10, 100 and $\gamma$ to 0.01, 0.001, 0.0001 for σ = 5, 25, 125, respectively, which is consistent with our theoretical prediction that $M$ should increase while $\gamma$ should decrease as σ increases. We also point out that the momentum method exhibits a jump in the error over the first several iterations, consistent with our analysis that the first step of the momentum method also incurs an $O(\sigma^2)$ error.

8.2. Regularized Machine Learning Experiment

Now we consider the classical application of composite optimization: regularized machine learning (Liu et al., 2015). We use the $\ell_{\infty,1}$ regularizer on the weights of each layer. The proximal step for the $\ell_{\infty,1}$ regularizer is implemented in (Murray et al., 2019). We evaluate the performance of Algorithm 1 and the vanilla stochastic proximal gradient method on the Cifar-10 dataset (Krizhevsky et al., 2014) with Resnet-18 (He et al., 2016). The regularization parameter is set to 0.1, which was observed in (Murray et al., 2019) to achieve a balance between enforcing sparsity in the model and maintaining model performance. We use a batch size of 256 and run 300 epochs. We use the standard step size parameter $M = 10$ (corresponding to a learning rate of 0.1) for the experiment. We apply a multi-step learning rate scheduler at epochs 150 and 250, with a decay factor of 0.1. For Algorithm 1, the momentum parameter $\gamma$ is set to 0.1 by grid search.

Figure 2: Comparison of Algorithm 1 and the vanilla stochastic proximal gradient method for the $\ell_{\infty,1}$-regularized machine learning problem on the Cifar-10 dataset, with Resnet-18. The left and right panels show the training loss and test accuracy, respectively.

We summarize the performance of the two methods in Figure 2. We see that Algorithm 1 outperforms the vanilla method in terms of both training loss and test accuracy. In terms of test accuracy, Algorithm 1 displays a much smoother curve than that of the vanilla method.
In connection with our theoretical observation that the step size parameter $M$ should increase while the momentum parameter $\gamma$ should decrease as the noise level σ increases, we observe that grid searches with respect to the training loss lead to the choices $M$ = 100, 100, 10 and $\gamma$ = 0.1, 0.1, 0.1 for batch sizes 64, 128, and 256. It appears that the momentum parameter $\gamma$ is less sensitive to the batch size than the step size parameter $M$, and setting $\gamma = 0.1$ might be a good choice in practice. We note that while there is certainly a correlation between the batch size and the noise level $\sigma^2$, we do not have direct control over $\sigma^2$, in contrast to the synthetic problem.

9. Conclusion

In this work, we revisit the non-convex stochastic composite optimization problem and address its convergence issue in the small-batch regime. We show that the vanilla stochastic proximal method cannot converge to a stationary point beyond the variance of the gradient noise. We analyze the immensely successful Polyak momentum method in this context and establish its optimal convergence rate without any batch size requirement, demonstrating its superiority over the vanilla method. We conduct numerical experiments to corroborate our theoretical findings. In light of the past successes of proximal methods in ML and the recent emerging application scenarios for proximal methods in DL, our findings reinforce the robustness and potential of the Polyak momentum method.

Impact Statement

This paper presents work that aims to advance the fields of optimization and Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

Acknowledgments

We acknowledge partial funding from Helmholtz AI (project Opt4Bio).

References

Allen-Zhu, Z. How to make the gradients small stochastically: Even faster convex and nonconvex SGD. Advances in Neural Information Processing Systems, 31, 2018.

Arjevani, Y., Carmon, Y., Duchi, J. C., Foster, D. J., Srebro, N., and Woodworth, B. Lower bounds for non-convex stochastic optimization. Mathematical Programming, 199(1-2):165–214, 2023.

Barré, M., Taylor, A. B., and Bach, F. Principled analyses and design of first-order methods with inexact proximal operators. Mathematical Programming, 201(1):185–230, 2023.

Bubeck, S. et al. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231–357, 2015.

Chen, X., He, N., Hu, Y., and Ye, Z. Efficient algorithms for minimizing compositions of convex functions and random functions and its applications in network revenue management. arXiv preprint arXiv:2205.01774, 2022.

Combettes, P. L. and Pesquet, J.-C. Proximal splitting methods in signal processing, 2010.

Cutkosky, A. and Mehta, H. Momentum improves normalized SGD. In International Conference on Machine Learning, pp. 2260–2268. PMLR, 2020.

Cutkosky, A. and Orabona, F. Momentum-based variance reduction in non-convex SGD. Advances in Neural Information Processing Systems, 32, 2019.

Davis, D. and Drusvyatskiy, D. Stochastic model-based minimization of weakly convex functions. SIAM Journal on Optimization, 29(1):207–239, 2019.

Doan, T., Maguluri, S., and Romberg, J. Finite-time analysis of distributed TD(0) with linear function approximation on multi-agent reinforcement learning. In International Conference on Machine Learning, pp. 1626–1635. PMLR, 2019.
Fang, C., Li, C. J., Lin, Z., and Zhang, T. Spider: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. Advances in Neural Information Processing Systems, 31, 2018.

Fatkhullin, I., Tyurin, A., and Richtárik, P. Momentum provably improves error feedback! In Advances in Neural Information Processing Systems, 2023.

Ghadimi, S. and Lan, G. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization I: A generic algorithmic framework. SIAM Journal on Optimization, 22(4):1469–1492, 2012.

Ghadimi, S. and Lan, G. Accelerated gradient methods for nonconvex nonlinear and stochastic programming, 2013a.

Ghadimi, S. and Lan, G. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, II: Shrinking procedures and optimal algorithms. SIAM Journal on Optimization, 23(4):2061–2089, 2013b.

Ghadimi, S., Lan, G., and Zhang, H. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Mathematical Programming, 155(1):267–305, 2016.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

Hendrikx, H., Xiao, L., Bubeck, S., Bach, F., and Massoulie, L. Statistically preconditioned accelerated gradient method for distributed optimization. In International Conference on Machine Learning, pp. 4203–4227. PMLR, 2020.

Jelassi, S. and Li, Y. Towards understanding how momentum improves generalization in deep learning. In International Conference on Machine Learning, pp. 9965–10040. PMLR, 2022.

Jin, H., Peng, Y., Yang, W., Wang, S., and Zhang, Z. Federated reinforcement learning with environment heterogeneity. In International Conference on Artificial Intelligence and Statistics, pp. 18–37. PMLR, 2022.

Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., and Tang, P. T. P. On large-batch training for deep learning: Generalization gap and sharp minima. In Proceedings of the International Conference on Learning Representations, 2017.

Khodadadian, S., Sharma, P., Joshi, G., and Maguluri, S. T. Federated reinforcement learning: Linear speedup under Markovian sampling. In International Conference on Machine Learning, pp. 10997–11057. PMLR, 2022.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Krizhevsky, A., Nair, V., and Hinton, G. Cifar-10 dataset, 2014.

Le Cun, Y., Bottou, L., Orr, G., and Muller, K.-R. Efficient backprop. Neural Networks: Tricks of the Trade, Springer-Verlag, 2012.

Li, X., Liu, M., and Orabona, F. On the last iterate convergence of momentum methods. In International Conference on Algorithmic Learning Theory, pp. 699–717. PMLR, 2022.

Liu, B., Wang, M., Foroosh, H., Tappen, M., and Pensky, M. Sparse convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.

Liu, Y., Gao, Y., and Yin, W. An improved analysis of stochastic gradient descent with momentum. Advances in Neural Information Processing Systems, 33:18261–18271, 2020.

Luke, D. R. Proximal Methods for Image Processing, pp. 165–202. Springer International Publishing, Cham, 2020. ISBN 978-3-030-34413-9. doi: 10.1007/978-3-030-34413-9_6. URL https://doi.org/10.1007/978-3-030-34413-9_6.

Mishchenko, K., Malinovsky, G., Stich, S., and Richtárik, P. ProxSkip: Yes! Local gradient steps provably lead to communication acceleration! Finally! In International Conference on Machine Learning, pp. 15750–15769. PMLR, 2022.
Murray, K., Kinnison, J., Nguyen, T. Q., Scheirer, W., and Chiang, D. Auto-sizing the transformer network: Improving speed, efficiency, and performance for low-resource machine translation. In Proceedings of the Third Workshop on Neural Generation and Translation, 2019.

Nemirovskij, A. S. and Yudin, D. B. Problem Complexity and Method Efficiency in Optimization. Wiley-Interscience, 1983.

Nesterov, Y. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer Science & Business Media, 2013.

Polyak, B. T. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.

Rakhlin, A., Shamir, O., and Sridharan, K. Making gradient descent optimal for strongly convex stochastic optimization. In Proceedings of the 29th International Conference on Machine Learning, pp. 1571–1578, 2012.

Rieke, N., Hancox, J., Li, W., Milletari, F., Roth, H. R., Albarqouni, S., Bakas, S., Galtier, M. N., Landman, B. A., Maier-Hein, K., et al. The future of digital health with federated learning. NPJ Digital Medicine, 3(1):119, 2020.

Sebbouh, O., Gower, R. M., and Defazio, A. Almost sure convergence rates for stochastic gradient descent and stochastic heavy ball. In Conference on Learning Theory, pp. 3935–3971. PMLR, 2021.

Sekhari, A., Sridharan, K., and Kale, S. SGD: The role of implicit regularization, batch-size and multiple-epochs. Advances in Neural Information Processing Systems, 34:27422–27433, 2021.

Tran-Dinh, Q., Pham, N. H., Phan, D. T., and Nguyen, L. M. A hybrid stochastic optimization framework for composite nonconvex optimization. Mathematical Programming, 191(2):1005–1071, 2022.

Wang, J.-K. and Abernethy, J. Quickly finding a benign region via heavy ball momentum in non-convex optimization. arXiv preprint arXiv:2010.01449, 2020.

Wang, Z., Ji, K., Zhou, Y., Liang, Y., and Tarokh, V. SpiderBoost and momentum: Faster variance reduction algorithms. Advances in Neural Information Processing Systems, 32, 2019.

Wilson, R. and Martinez, T. The general inefficiency of batch training for gradient descent learning. Neural Networks, 2003.

Woodworth, B., Mishchenko, K., and Bach, F. Two losses are better than one: Faster optimization using a cheaper proxy, 2023.

Xu, Y. and Xu, Y. Momentum-based variance-reduced proximal stochastic gradient method for composite nonconvex stochastic optimization, 2022.

Yu, H., Jin, R., and Yang, S. On the linear speedup analysis of communication efficient momentum SGD for distributed non-convex optimization. In International Conference on Machine Learning, pp. 7184–7193. PMLR, 2019.

Zhang, J., Lin, H., Jegelka, S., Sra, S., and Jadbabaie, A. Complexity of finding stationary points of nonconvex nonsmooth functions. In III, H. D. and Singh, A. (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 11173–11182. PMLR, 13–18 Jul 2020. URL https://proceedings.mlr.press/v119/zhang20p.html.
Figure 3: Comparison of Algorithm 1 and the vanilla stochastic proximal gradient method for the statistical preconditioning technique on the Cifar-10 dataset. The left and right panels show the training loss and test accuracy, respectively.

A. Additional Experiments

In this section, we consider the recent statistical preconditioning (proxy training) technique of (Hendrikx et al., 2020; Woodworth et al., 2023), where for some objective $\ell$ we consider $f = \ell - \hat\ell$ and $\psi = \hat\ell$; here $\hat\ell$ is a statistical preconditioner defined on a sub-sample of the training dataset. We simulate this setup on the Cifar-10 dataset (Krizhevsky et al., 2014): $\ell$ is defined on all 50000 training images, and $\hat\ell$ is defined on a subset of 2560 training images. We follow the implementation of (Woodworth et al., 2023), with one difference: at each iteration $k$, (Woodworth et al., 2023) computes the full gradient $\nabla\hat\ell(x_k)$, while we only compute a stochastic gradient of batch size 128. In the experiment, we use a batch size of 512 for $\ell$ and a batch size of 128 for the SGD updates on $\Omega_k$. We perform a grid search on the parameters $M$ and $\gamma$. The SGD on $\Omega_k$ takes 20 iterations with step size 0.01, which was tuned in (Woodworth et al., 2023). The experiments demonstrate that SGD is effective and reliable for solving the proximal step with sufficient accuracy (see also the experiments in (Woodworth et al., 2023)). We see in Figure 3 that the momentum method outperforms the vanilla method, and that the convergence of the momentum method appears smoother.

B. Missing Proofs in Section 5

We start by giving the proof of Lemma 5.2.

Lemma 5.2. Under Assumption 3.4, for any $k \ge 0$:

$$\Delta_{k+1} \le (1 - \gamma_k)\Delta_k + \frac{L^2}{\gamma_k} R_k + \gamma_k^2\sigma^2.$$

Proof. Indeed,

$$\Delta_{k+1} = \mathbb{E}[\|m_{k+1} - \nabla f(x_{k+1})\|^2] = \mathbb{E}[\|(1-\gamma_k)(m_k - \nabla f(x_{k+1})) + \gamma_k(g_{k+1} - \nabla f(x_{k+1}))\|^2] = (1-\gamma_k)^2\,\mathbb{E}[\|m_k - \nabla f(x_{k+1})\|^2] + \gamma_k^2\,\mathbb{E}[\|g_{k+1} - \nabla f(x_{k+1})\|^2] \le (1-\gamma_k)^2\,\mathbb{E}[\|m_k - \nabla f(x_k) + \nabla f(x_k) - \nabla f(x_{k+1})\|^2] + \gamma_k^2\sigma^2.$$

For the second identity, we have used the fact that $g_{k+1}$ is unbiased. By Young's inequality, for any $\alpha > 0$, we have:

$$\Delta_{k+1} \le (1-\gamma_k)^2(1+\alpha)\Delta_k + (1-\gamma_k)^2(1+\alpha^{-1})\,\mathbb{E}[\|\nabla f(x_k) - \nabla f(x_{k+1})\|^2] + \gamma_k^2\sigma^2 \le (1-\gamma_k)^2(1+\alpha)\Delta_k + (1-\gamma_k)^2(1+\alpha^{-1})L^2 R_k + \gamma_k^2\sigma^2,$$

where the last inequality follows from the smoothness of $f$. Choosing now $\alpha := \frac{\gamma_k}{1-\gamma_k}$, we get the claimed inequality.

Now we give the proof of Lemma 5.6.

Lemma 5.6. Let Assumptions 3.1, 3.3, 3.4 and 5.1 hold, and let $a := \frac{3}{8L}$, $\gamma_k := \frac{3L}{M_k - L}$ and $M_k > 4L$ for all $k \ge 0$. Then, for any $k \ge 0$,

$$\Phi_{k+1} \le \Phi_k - \frac{1}{48M_k}\mathbb{E}[\|\nabla F(x_{k+1})\|^2] + \frac{27L\sigma^2}{4M_k^2}. \qquad (7)$$

Proof. We put together Lemmas 5.2 and 5.3 and get:

$$\Phi_{k+1} = F_{k+1} + a\Delta_{k+1} \le F_k - \frac{M_k - L}{4}R_k + \frac{\Delta_k}{M_k - L} + a\Bigl[(1-\gamma_k)\Delta_k + \frac{L^2}{\gamma_k}R_k + \gamma_k^2\sigma^2\Bigr] = F_k - H_k R_k + a\gamma_k^2\sigma^2 + \Bigl(1 - \gamma_k + \frac{1}{a(M_k - L)}\Bigr)a\Delta_k,$$

where $H_k := \frac{M_k - L}{4} - \frac{aL^2}{\gamma_k}$. If $H_k > 0$, then we can plug Lemma 5.5 in and get:

$$\Phi_{k+1} \le F_k - \frac{H_k}{M_k^2 + L^2}\Bigl(\frac{1}{3}\mathbb{E}[\|\nabla F(x_{k+1})\|^2] - \Delta_k\Bigr) + \Bigl(1 - \gamma_k + \frac{1}{a(M_k - L)}\Bigr)a\Delta_k + a\gamma_k^2\sigma^2 = F_k - \frac{H_k}{3(M_k^2 + L^2)}\mathbb{E}[\|\nabla F(x_{k+1})\|^2] + \Bigl(1 - \gamma_k + \frac{1}{a(M_k - L)} + \frac{H_k}{a(M_k^2 + L^2)}\Bigr)a\Delta_k + a\gamma_k^2\sigma^2.$$

Now let us choose $a = \frac{\gamma_k(M_k - L)}{8L^2}$, so that $H_k = \frac{M_k - L}{8} > 0$. Then,

$$\frac{1}{a(M_k - L)} + \frac{H_k}{a(M_k^2 + L^2)} = \frac{8L^2}{\gamma_k(M_k - L)^2} + \frac{L^2}{\gamma_k(M_k^2 + L^2)} = \frac{L^2}{\gamma_k}\Bigl(\frac{8}{(M_k - L)^2} + \frac{1}{M_k^2 + L^2}\Bigr).$$

Therefore we need $\frac{L^2}{\gamma_k}\bigl(\frac{8}{(M_k - L)^2} + \frac{1}{M_k^2 + L^2}\bigr) \le \gamma_k$ or, equivalently, $\gamma_k^2 \ge L^2\bigl(\frac{8}{(M_k - L)^2} + \frac{1}{M_k^2 + L^2}\bigr)$. Since $(M_k - L)^2 \le M_k^2 + L^2$, it suffices to set $\gamma_k^2 = \frac{9L^2}{(M_k - L)^2}$, i.e., $\gamma_k = \frac{3L}{M_k - L}$. Note that this requires $M_k > 4L$, as we want to keep $\gamma_k < 1$. For our value of $\gamma_k$, we get $a = \frac{3}{8L}$.
Putting everything together, we obtain

$$\Phi_{k+1} \le \Phi_k - \frac{M_k - L}{24(M_k^2 + L^2)}\mathbb{E}[\|\nabla F(x_{k+1})\|^2] + \frac{27L\sigma^2}{8(M_k - L)^2}.$$

Since $\frac{M_k - L}{M_k^2 + L^2} \ge \frac{1}{2(M_k - L)} \ge \frac{1}{2M_k}$ and $\frac{1}{(M_k - L)^2} \le \frac{2}{M_k^2}$ for $M_k > 4L$, we can further estimate

$$\Phi_{k+1} \le \Phi_k - \frac{1}{48M_k}\mathbb{E}[\|\nabla F(x_{k+1})\|^2] + \frac{27L\sigma^2}{4M_k^2}.$$

Corollary 5.7. Let Algorithm 1 be run for $K \ge 1$ iterations for solving problem (1) under Assumptions 3.3, 3.4 and 5.1, with constant coefficients $M_k = M = 4L + \frac{3^{3/2}}{2}\sqrt{\frac{KL\sigma^2}{\Phi_0}}$ and $\gamma_k = \frac{3L}{M - L}$ for all $0 \le k \le K-1$, where $\Phi_0 := F(x_0) - F^* + \frac{3}{8L}\mathbb{E}[\|m_0 - \nabla f(x_0)\|^2]$. Then,

$$\mathbb{E}[\|\nabla F(x_t)\|^2] \le \frac{192L\Phi_0}{K} + 48 \cdot 3^{3/2}\sqrt{\frac{L\Phi_0\sigma^2}{K}},$$

where $t$ is chosen uniformly at random from $\{1, \ldots, K\}$.

Proof. Setting $M_k := M$ constant, rearranging and summing Equation (7) from $0$ to $K-1$, we get

$$\frac{1}{48M}\sum_{k=0}^{K-1}\mathbb{E}[\|\nabla F(x_{k+1})\|^2] \le \Phi_0 + \frac{27KL\sigma^2}{4M^2}.$$

Therefore,

$$\frac{1}{K}\sum_{k=0}^{K-1}\mathbb{E}[\|\nabla F(x_{k+1})\|^2] \le \frac{48M\Phi_0}{K} + \frac{324L\sigma^2}{M}.$$

Recall that $M > 4L$; then for $M = 4L + \frac{3^{3/2}}{2}\sqrt{\frac{KL\sigma^2}{\Phi_0}}$, we have

$$\frac{1}{K}\sum_{k=0}^{K-1}\mathbb{E}[\|\nabla F(x_{k+1})\|^2] \le \frac{192L\Phi_0}{K} + 48 \cdot 3^{3/2}\sqrt{\frac{L\Phi_0\sigma^2}{K}}.$$

Therefore, after at most $O\bigl(\frac{L\Phi_0\sigma^2}{\varepsilon^2} + \frac{L\Phi_0}{\varepsilon}\bigr)$ iterations, we have $\frac{1}{K}\sum_{k=0}^{K-1}\mathbb{E}[\|\nabla F(x_{k+1})\|^2] \le \varepsilon$. If $t$ is chosen uniformly at random from $\{1, \ldots, K\}$, then $\mathbb{E}[\|\nabla F(x_t)\|^2] = \frac{1}{K}\sum_{k=0}^{K-1}\mathbb{E}[\|\nabla F(x_{k+1})\|^2]$.

We also discuss the case where $M_k$ is not constant.

Corollary 5.8. Consider Algorithm 1 for solving problem (1) under Assumptions 3.3, 3.4 and 5.1 with coefficients $M_k := \max\bigl\{\sqrt{\frac{(k+1)L\sigma^2}{\Phi_0}}, 4L\bigr\}$ and $\gamma_k = \frac{3L}{M_k - L}$ for all $k \ge 0$, where $\Phi_0 := F(x_0) - F^* + \frac{3}{8L}\mathbb{E}[\|m_0 - \nabla f(x_0)\|^2]$. Then, for any $k \ge 1$, we have

$$\mathbb{E}[\|\nabla F(x_{t(k)})\|^2] \le 372\ln(ek)\max\Bigl\{\sqrt{\frac{L\Phi_0\sigma^2}{k}}, \frac{4L\Phi_0}{k}\Bigr\},$$

where $t(k)$ is chosen randomly from $\{1, \ldots, k\}$ with probabilities $\Pr(t(k) = i) \propto \frac{1}{M_{i-1}}$, $i = 1, \ldots, k$, and $e := \exp(1)$.

Proof. Denote $G_{k+1}^2 := \mathbb{E}[\|\nabla F(x_{k+1})\|^2]$. According to Lemma 5.6, we have

$$\Phi_{k+1} \le \Phi_k - \frac{1}{48M_k}G_{k+1}^2 + \frac{27L\sigma^2}{4M_k^2}.$$

Rearranging and summing the above from $0$ to $K-1$, where $K \ge 1$ is arbitrary, we get:

$$\sum_{k=0}^{K-1}\frac{1}{M_k}G_{k+1}^2 \le 48\Phi_0 + 324L\sigma^2\sum_{k=0}^{K-1}\frac{1}{M_k^2}.$$

Denoting $A_K := \sum_{k=0}^{K-1}\frac{1}{M_k}$, we obtain

$$\frac{1}{A_K}\sum_{k=0}^{K-1}\frac{1}{M_k}G_{k+1}^2 \le \frac{48\Phi_0 + 324L\sigma^2\sum_{k=0}^{K-1}\frac{1}{M_k^2}}{A_K}.$$

For $M_k := \max\bigl\{\sqrt{\frac{(k+1)L\sigma^2}{\Phi_0}}, 4L\bigr\}$, we have $\frac{1}{M_k^2} \le \frac{\Phi_0}{(k+1)L\sigma^2}$, and therefore

$$48\Phi_0 + 324L\sigma^2\sum_{k=0}^{K-1}\frac{1}{M_k^2} \le 48\Phi_0 + 324L\sigma^2\sum_{k=1}^{K}\frac{\Phi_0}{kL\sigma^2} \le 48\Phi_0 + 324\Phi_0(\ln K + 1) \le 372\Phi_0(\ln K + 1).$$

Putting everything together, and using $A_K \ge \frac{K}{M_{K-1}}$, we get:

$$\frac{1}{A_K}\sum_{k=0}^{K-1}\frac{1}{M_k}G_{k+1}^2 \le \frac{372\Phi_0(\ln K + 1)}{A_K} \le 372(\ln K + 1)\max\Bigl\{\sqrt{\frac{L\Phi_0\sigma^2}{K}}, \frac{4L\Phi_0}{K}\Bigr\}.$$

If $t(k)$ is chosen from $\{1, \ldots, k\}$ with probabilities $\Pr(t(k) = i) = \frac{1/M_{i-1}}{A_k}$, then $\mathbb{E}[\|\nabla F(x_{t(k)})\|^2] = \frac{1}{A_k}\sum_{i=0}^{k-1}\frac{G_{i+1}^2}{M_i}$.

C. Missing Proofs in Section 6

Lemma C.1. Under Assumptions 3.1, 3.3, 3.4 and 5.1, for $\{x_k\}_{k>0}$ generated by Algorithm 1, if $\gamma_k := \frac{3\sqrt{2}L}{M_k - L}$, $a := \frac{3\sqrt{2}}{8L}$, and $M_k > (1 + 3\sqrt{2})L$, we have:

$$\Phi_{k+1} \le \Phi_k - \frac{1}{48M_k}\mathbb{E}[\|\nabla F(x_{k+1})\|^2] - \frac{3}{2\sqrt{2}\,M_k}\Delta_k + \frac{27\sqrt{2}L}{2M_k^2}\sigma^2.$$

Proof. We repeat the proof of Lemma 5.6 but now require that $\frac{L^2}{\gamma_k}\bigl(\frac{8}{(M_k - L)^2} + \frac{1}{M_k^2 + L^2}\bigr) \le \frac{\gamma_k}{2}$. To satisfy this inequality, it suffices to choose $\gamma_k = \frac{3\sqrt{2}L}{M_k - L}$, which requires that $M_k > (1 + 3\sqrt{2})L$. Now we have $a = \frac{3\sqrt{2}}{8L}$, and

$$\Phi_{k+1} \le \Phi_k - \frac{M_k - L}{24(M_k^2 + L^2)}\mathbb{E}[\|\nabla F(x_{k+1})\|^2] - \frac{9}{8(M_k - L)}\Delta_k + \frac{27\sqrt{2}L}{4(M_k - L)^2}\sigma^2.$$

Let $M_k = \tau_k L$ for some $\tau_k > 1 + 3\sqrt{2}$. Note that, for $\tau_k > 1 + 3\sqrt{2}$, we have $\frac{\tau_k - 1}{\tau_k^2 + 1} \ge \frac{1}{2\tau_k}$, $\frac{9}{8(\tau_k - 1)} \ge \frac{3}{2\sqrt{2}\,\tau_k}$ and $\frac{1}{(\tau_k - 1)^2} \le \frac{2}{\tau_k^2}$. Therefore, for $M_k > (1 + 3\sqrt{2})L$,

$$\Phi_{k+1} \le \Phi_k - \frac{1}{48M_k}\mathbb{E}[\|\nabla F(x_{k+1})\|^2] - \frac{3}{2\sqrt{2}\,M_k}\Delta_k + \frac{27\sqrt{2}L}{2M_k^2}\sigma^2.$$

D. Missing Proofs in Section 7

First, we notice that, under Assumptions 3.1 and 3.3, Lemma 5.3 still holds in the inexact case. Now we give an analogue of Lemma 5.5 for the inexact case:

Lemma D.1. Under Assumption 3.3, for $\{x_k\}_{k>0}$ generated by Algorithm 1, we have:

$$(M_k^2 + L^2)R_k \ge \frac{1}{4}\mathbb{E}[\|\nabla F(x_{k+1})\|^2] - \Delta_k - \mathbb{E}[\|\nabla\Omega_k(x_{k+1})\|^2].$$

Proof. Indeed,

$$\|\nabla F(x_{k+1})\|^2 = \|\nabla f(x_{k+1}) + \nabla\psi(x_{k+1})\|^2 = \|m_k + \nabla\psi(x_{k+1}) + (\nabla f(x_k) - m_k) + (\nabla f(x_{k+1}) - \nabla f(x_k))\|^2 = \|M_k(x_k - x_{k+1}) + \nabla\Omega_k(x_{k+1}) + (\nabla f(x_k) - m_k) + (\nabla f(x_{k+1}) - \nabla f(x_k))\|^2 \le 4M_k^2\|x_{k+1} - x_k\|^2 + 4\|\nabla\Omega_k(x_{k+1})\|^2 + 4\|m_k - \nabla f(x_k)\|^2 + 4\|\nabla f(x_{k+1}) - \nabla f(x_k)\|^2 \le 4(M_k^2 + L^2)\|x_{k+1} - x_k\|^2 + 4\|m_k - \nabla f(x_k)\|^2 + 4\|\nabla\Omega_k(x_{k+1})\|^2.$$

Taking expectations and rearranging, we obtain the claim.
Notice that Lemma 5.2 still holds. We continue to use the same Lyapunov function $\Phi_k := F_k + a\Delta_k$.

Lemma D.2. Under Assumptions 3.1, 3.3 and 3.4, for $\{x_k\}_{k>0}$ generated by Algorithm 1, let $\gamma_k := \sqrt{\frac{152}{17}}\,\frac{L}{M_k - L}$ and $a := \frac{1}{8L}\sqrt{\frac{152}{17}}$, and assume that condition (11) holds at each iteration. Then, for $M_k > 7L$, we have:

$$\Phi_{k+1} \le \Phi_k - \frac{1}{68M_k}\mathbb{E}[\|\nabla F(x_{k+1})\|^2] + \frac{8L\sigma^2}{M_k^2} + \frac{2S_k}{17M_k}. \qquad (12)$$

Proof. Denote $G_{k+1}^2 := \mathbb{E}[\|\nabla F(x_{k+1})\|^2]$. Plugging Equation (11) into Lemma D.1, we obtain

$$(M_k^2 + L^2)R_k \ge \frac{1}{4}G_{k+1}^2 - \Delta_k - \mathbb{E}[\|\nabla\Omega_k(x_{k+1})\|^2] \ge \frac{1}{4}G_{k+1}^2 - \Delta_k - \frac{M_k^2}{16}R_k - S_k.$$

Rearranging, we get:

$$\frac{17}{16}(M_k^2 + L^2)R_k \ge \frac{1}{4}G_{k+1}^2 - \Delta_k - S_k. \qquad (13)$$

Recall from the proof of Lemma 5.6 that

$$\Phi_{k+1} \le F_k - H_k R_k + \Bigl(1 - \gamma_k + \frac{1}{a(M_k - L)}\Bigr)a\Delta_k + a\gamma_k^2\sigma^2,$$

where $H_k := \frac{M_k - L}{4} - \frac{aL^2}{\gamma_k}$. If $H_k > 0$, then we can plug Equation (13) in and get:

$$\Phi_{k+1} \le F_k - \frac{16H_k}{17(M_k^2 + L^2)}\Bigl(\frac{1}{4}G_{k+1}^2 - \Delta_k - S_k\Bigr) + \Bigl(1 - \gamma_k + \frac{1}{a(M_k - L)}\Bigr)a\Delta_k + a\gamma_k^2\sigma^2 = F_k - \frac{4H_k}{17(M_k^2 + L^2)}G_{k+1}^2 + \Bigl(1 - \gamma_k + \frac{1}{a(M_k - L)} + \frac{16H_k}{17a(M_k^2 + L^2)}\Bigr)a\Delta_k + a\gamma_k^2\sigma^2 + \frac{16H_k S_k}{17(M_k^2 + L^2)}.$$

Now let $a = \frac{\gamma_k(M_k - L)}{8L^2}$, so that $H_k = \frac{M_k - L}{8}$. Then

$$\frac{1}{a(M_k - L)} + \frac{16H_k}{17a(M_k^2 + L^2)} = \frac{8L^2}{\gamma_k(M_k - L)^2} + \frac{16L^2}{17\gamma_k(M_k^2 + L^2)} = \frac{8L^2}{\gamma_k}\Bigl(\frac{1}{(M_k - L)^2} + \frac{2}{17(M_k^2 + L^2)}\Bigr).$$

Therefore we need $\frac{8L^2}{\gamma_k}\bigl(\frac{1}{(M_k - L)^2} + \frac{2}{17(M_k^2 + L^2)}\bigr) \le \gamma_k$ or, equivalently, $\gamma_k^2 \ge 8L^2\bigl(\frac{1}{(M_k - L)^2} + \frac{2}{17(M_k^2 + L^2)}\bigr)$. Since $(M_k - L)^2 \le M_k^2 + L^2$, it suffices to set $\gamma_k^2 = \frac{152}{17}\,\frac{L^2}{(M_k - L)^2}$, i.e., $\gamma_k = \sqrt{\frac{152}{17}}\,\frac{L}{M_k - L}$. Note that $\gamma_k < 1$ requires $M_k > \bigl(1 + \sqrt{152/17}\bigr)L$, which is ensured whenever $M_k > 4L$. For our choice of $\gamma_k$, we get $a = \frac{1}{8L}\sqrt{\frac{152}{17}}$. Putting everything together, we get

$$\Phi_{k+1} \le \Phi_k - \frac{M_k - L}{34(M_k^2 + L^2)}G_{k+1}^2 + \frac{4L}{(M_k - L)^2}\sigma^2 + \frac{2(M_k - L)S_k}{17(M_k^2 + L^2)}.$$

Let $M_k = \tau_k L$ for some $\tau_k > 7$. Note that $\frac{\tau_k - 1}{\tau_k^2 + 1} \ge \frac{1}{2\tau_k}$, $\frac{1}{(\tau_k - 1)^2} \le \frac{2}{\tau_k^2}$ and $\frac{\tau_k - 1}{\tau_k^2 + 1} \le \frac{1}{\tau_k}$. Therefore, for $M_k > 7L$, we get

$$\Phi_{k+1} \le \Phi_k - \frac{1}{68M_k}G_{k+1}^2 + \frac{8L}{M_k^2}\sigma^2 + \frac{2S_k}{17M_k}.$$

Setting $M_k = M$ constant, rearranging and summing Equation (12) over $k = 0, \ldots, K-1$, we get:

$$\frac{1}{68M}\sum_{k=0}^{K-1}\mathbb{E}[\|\nabla F(x_{k+1})\|^2] \le \Phi_0 + \frac{8KL\sigma^2}{M^2} + \frac{2}{17M}\sum_{k=0}^{K-1}S_k.$$

Therefore,

$$\frac{1}{K}\sum_{k=0}^{K-1}\mathbb{E}[\|\nabla F(x_{k+1})\|^2] \le \frac{68M\Phi_0}{K} + \frac{544L\sigma^2}{M} + \frac{8}{K}\sum_{k=0}^{K-1}S_k.$$

Then for $M = 7L + 2\sqrt{2}\sqrt{\frac{KL\sigma^2}{\Phi_0}}$, we have:

$$\frac{1}{K}\sum_{k=0}^{K-1}\mathbb{E}[\|\nabla F(x_{k+1})\|^2] \le O\Bigl(\sqrt{\frac{L\Phi_0\sigma^2}{K}} + \frac{L\Phi_0}{K}\Bigr) + \frac{8}{K}\sum_{k=0}^{K-1}S_k,$$

which yields Theorem 7.1.

E. Sampling from a Stream of Data

Following Theorem 5.9, we briefly mentioned that one can efficiently sample the desired output point. In this section, we explain how to perform such sampling at no extra computation or memory cost. This might have been discussed in the literature, but we still provide a detailed explanation for completeness and the reader's convenience.

Proposition E.1. Given a stream of points $\{x_k\}_{k=1}^{\infty}$ in $\mathbb{R}^d$ and positive scalars $\{h_k\}_{k=1}^{\infty}$, we can maintain, at each step $k \ge 1$, the random variable $x_{t(k)}$, where $t(k)$ is a random index from $\{1, \ldots, k\}$ chosen with probabilities $\Pr(t(k) = i) = \frac{h_i}{H_k}$, $i = 1, \ldots, k$, where $H_k := \sum_{i=1}^{k}h_i$. This requires only $O(d)$ memory and computation.

Proof. We maintain the variables $\hat x_k \in \mathbb{R}^d$ and $H_k \in \mathbb{R}$, which are both initialized to $0$ at step $k = 0$. Then, at each step $k \ge 1$, we update $H_k \leftarrow H_{k-1} + h_k$ and also, with probability $\frac{h_k}{H_k}$, we update $\hat x_k \leftarrow x_k$ (or, with probability $1 - \frac{h_k}{H_k}$, keep the old $\hat x_k = \hat x_{k-1}$). The memory and computation costs are $O(d)$. Note that, for any $1 \le i \le k$, the event $\hat x_k = x_i$ happens iff $\hat x$ was updated at step $i$ and then not updated at each step $j = i+1, \ldots, k$. Hence, for any $1 \le i \le k$, we have

$$\Pr(\hat x_k = x_i) = \frac{h_i}{H_i}\prod_{j=i+1}^{k}\Bigl(1 - \frac{h_j}{H_j}\Bigr) = \frac{h_i}{H_i}\prod_{j=i+1}^{k}\frac{H_{j-1}}{H_j} = \frac{h_i}{H_k},$$

where the last equality follows by telescoping.
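The procedure in the proof of Proposition E.1 is a weighted variant of reservoir sampling. A direct Python transcription (ours; the class name is ours) could look as follows:

```python
import random

class WeightedReservoir:
    """Maintain x_hat with Pr(x_hat = x_i) = h_i / H_k over a stream,
    using O(1) extra memory per step (Proposition E.1)."""

    def __init__(self):
        self.H = 0.0       # running sum of weights H_k
        self.x_hat = None  # current sample

    def update(self, x_k, h_k):
        self.H += h_k
        # Replace the stored point with probability h_k / H_k.
        if random.random() < h_k / self.H:
            self.x_hat = x_k

# Example with weights h_i proportional to 1 / M_{i-1} as in Corollary 5.8.
res = WeightedReservoir()
for x, h in [(10, 0.5), (20, 0.25), (30, 0.25)]:
    res.update(x, h)
print(res.x_hat)  # equals 10 w.p. 0.5, 20 w.p. 0.25, 30 w.p. 0.25
```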
F. Convergence in the Non-Differentiable Case

In this section, we briefly discuss the convergence of Algorithm 1 in the non-differentiable case, where we assume that $\psi$ is convex. We assume that each subproblem is solved exactly, i.e., $x_{k+1} = \arg\min_x \Omega_k(x)$. In particular, this implies that there exists a subgradient $g^{\psi}_{k+1} \in \partial\psi(x_{k+1})$ such that:

$$m_k + g^{\psi}_{k+1} + M(x_{k+1} - x_k) = 0.$$

Therefore, for convenience, we define $\nabla\psi(x_{k+1}) := g^{\psi}_{k+1}$ as the specific subgradient of $\psi$ at $x_{k+1}$ that we use. Similarly, we define $\nabla F(x_{k+1}) := \nabla f(x_{k+1}) + g^{\psi}_{k+1} = \nabla f(x_{k+1}) + \nabla\psi(x_{k+1})$ as the specific subgradient of $F$ at $x_{k+1}$ that we use. This way, all our previous proofs still hold, and we have the following:

Theorem F.1. Under Assumptions 3.1, 3.3 and 3.4, for $\{x_k\}_{k>0}$ generated by Algorithm 1 with constant $M_k = M$ and $\gamma := \frac{3L}{M - L}$, there exists $M = 4L + \frac{3^{3/2}}{2}\sqrt{\frac{KL\sigma^2}{\Phi_0}}$ such that $K = O\bigl(\frac{L\Phi_0\sigma^2}{\varepsilon^2}\bigr)$ iterations of Algorithm 1 suffice to achieve $\mathbb{E}[\|\nabla F(x_t)\|^2] \le \varepsilon$, where $t$ is chosen from $\{1, \ldots, K\}$ uniformly at random.

We remark that this can be reformulated in terms of the distance between $\partial F$ and $0$: we have $\mathbb{E}[\mathrm{dist}^2(0, \partial F(x_t))] \le \varepsilon$.

G. Convergence Criteria

In this section, we discuss the differences and connections between the convergence criteria used in the literature.

Proximal Gradient Mapping: The most popular convergence criterion in most of the earlier works on non-convex composite optimization is the proximal gradient mapping (Ghadimi et al., 2016; Ghadimi & Lan, 2013a; Wang et al., 2019; Hendrikx et al., 2020; Tran-Dinh et al., 2022; Xu & Xu, 2022). The proximal gradient mapping is defined in a very general context, where we consider the constrained composite optimization problem

$$\min_{x \in X} F(x) := f(x) + \psi(x),$$

with a 1-strongly convex mirror map $r : X \to \mathbb{R}$. For any vectors $g$ and $x$, and scalar $M > 0$, define

$$x^+ := \arg\min_{x' \in X}\{\langle g, x'\rangle + \psi(x') + M\beta_r(x', x)\},$$

where $\beta_r(x', x) = r(x') - r(x) - \langle\nabla r(x), x' - x\rangle$ is the Bregman divergence of $r$. Then the proximal gradient mapping is defined as

$$P_X(x, g, M) := M(x - x^+),$$

and, in the literature, convergence is studied in terms of $\|P_X(x, \nabla f(x), M)\|^2$. We point out that it is very easy to prove analogous versions of Lemmas 5.3 and 5.5 in terms of $\|P_X(x, \nabla f(x), M)\|^2$ (which implies that our results extend to the proximal gradient mapping case), but we omit the details here. Instead, we argue that the size of $\nabla F(x)$ is a more natural convergence criterion. Consider the simplest situation where $\psi$ is convex and differentiable, $X = \mathbb{R}^d$, and the mirror map is $r(x) = \frac{1}{2}\|x\|^2$ (i.e., the Euclidean geometry). Then we have the usual stationarity condition for $x^+$:

$$\nabla f(x) + \nabla\psi(x^+) + M(x^+ - x) = 0.$$

Therefore, we have:

$$P_X(x, \nabla f(x), M) = \nabla f(x) + \nabla\psi(x^+).$$

In other words, the proximal gradient mapping equals the gradient of $f$ at the current point plus the gradient of $\psi$ at the next point. In contrast, in our analysis, we directly consider the gradient of $F$ at the next point, which we believe is more natural and intuitive.

Moreau Envelope: A closely related convergence criterion, proposed to address the non-convergence issues of the proximal gradient mapping in earlier works, is the Moreau envelope (Davis & Drusvyatskiy, 2019). In the Euclidean geometry, for a parameter $\lambda > 0$, the Moreau envelope is defined as

$$F_\lambda(x) := \min_y\Bigl\{F(y) + \frac{1}{2\lambda}\|y - x\|^2\Bigr\},$$

and we write $\hat x = \arg\min_y\{F(y) + \frac{1}{2\lambda}\|y - x\|^2\}$. The convergence criterion is then defined in terms of

$$\nabla F_\lambda(x) := \lambda^{-1}(x - \hat x).$$

Equivalently, $\nabla F_\lambda(x) = \nabla F(\hat x)$ in the differentiable case. In other words, the convergence criterion based on the Moreau envelope uses a surrogate point $\hat x$ instead of the actual iterates of the algorithm.
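For concreteness, a short numeric illustration (ours, on an arbitrary smooth one-dimensional example) of the identity $\nabla F_\lambda(x) = \lambda^{-1}(x - \hat x) = \nabla F(\hat x)$, approximating $\hat x$ by grid search:

```python
import numpy as np

lam = 0.25
F = lambda y: 0.5 * y**2 + np.sin(y)       # a smooth (hence weakly convex) toy F
F_grad = lambda y: y + np.cos(y)

x = 1.3
ys = np.linspace(x - 5, x + 5, 200_001)    # dense grid search for the prox point
vals = F(ys) + (ys - x) ** 2 / (2 * lam)
y_hat = ys[np.argmin(vals)]                # y_hat approximates the minimizer

# grad F_lambda(x) = (x - y_hat) / lam, which also equals grad F(y_hat).
print((x - y_hat) / lam, F_grad(y_hat))
```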
Note that the convergence criteria based on the Moreau envelope and the proximal gradient mapping are within a constant factor of each other for a $\rho$-weakly convex function (Davis & Drusvyatskiy, 2019):

$$\frac{1}{4}\|\nabla F_{1/(2\rho)}(x)\| \le \|P(x, \nabla f(x), \rho)\| \le \frac{3}{2}\|\nabla F_{1/(2\rho)}(x)\|.$$
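To tie the two criteria together, the following small numeric check (ours, with made-up quadratic choices of $f$ and $\psi$) verifies the Euclidean identity $P_X(x, \nabla f(x), M) = \nabla f(x) + \nabla\psi(x^+)$ derived above, and contrasts it with $\nabla F(x^+)$, the quantity used in our analysis:

```python
import numpy as np

L_f, L_psi, M = 1.0, 3.0, 10.0
f_grad = lambda x: L_f * x          # f(x) = L_f * ||x||^2 / 2
psi_grad = lambda x: L_psi * x      # psi(x) = L_psi * ||x||^2 / 2

x = np.array([1.0, -2.0])
g = f_grad(x)
# x+ solves g + psi'(x+) + M (x+ - x) = 0 (Euclidean mirror map).
x_plus = (M * x - g) / (M + L_psi)

P = M * (x - x_plus)                          # proximal gradient mapping
assert np.allclose(P, g + psi_grad(x_plus))   # = grad f(x) + grad psi(x+)
print(P, f_grad(x_plus) + psi_grad(x_plus))   # vs. grad F(x+) at the next point
```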