# revisiting_convergence_shuffling_complexity_beyond_lipschitz_smoothness__ef3b568b.pdf

Revisiting Convergence: Shuffling Complexity Beyond Lipschitz Smoothness

Qi He 1 Peiran Yu 2 Ziyi Chen 1 Heng Huang 1

Shuffling-type gradient methods are favored in practice for their simplicity and rapid empirical performance. Despite extensive development of convergence guarantees under various assumptions in recent years, most require the Lipschitz smoothness condition, which is often not met in common machine learning models. We highlight this issue with specific counterexamples. To address this gap, we revisit the convergence rates of shuffling-type gradient methods without assuming Lipschitz smoothness. Using our stepsize strategy, the shuffling-type gradient algorithm not only converges under weaker assumptions but also match the current best-known convergence rates, thereby broadening its applicability. We prove the convergence rates for nonconvex, strongly convex, and non-strongly convex cases, each under both random reshuffling and arbitrary shuffling schemes, under a general bounded variance condition. Numerical experiments further validate the performance of our shuffling-type gradient algorithm, underscoring its practical efficacy.

1. Introduction

Gradient-based optimization has always been a critical area due to its extensive practical applications in machine learning, including reinforcement learning (Sutton and Barto, 2018), hyperparameter optimization (Feurer and Hutter, 2019), and large language models (Radford et al., 2018). While numerous gradient-based algorithms have been developed for convex functions (Nemirovskij and Yudin, 1983; Nesterov, 2013; d Aspremont et al., 2021), research on nonconvex functions has become particularly active in recent years, driven by advances in deep learning. Notably, with unbiased stochastic gradients and bounded variance, SGD

1Department of Computer Science, University of Maryland, College Park. 2Department of Computer Science and Engineering, University of Texas Arlington. Correspondence to: Qi He <qhe123@umd.edu>, Heng Huang <heng@umd.edu>.

Proceedings of the 42 nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

achieves an optimal complexity of O(ϵ 4) (Ghadimi and Lan, 2013), which matches the lower bound established by Arjevani et al. (2023).

In practice, however, random shuffling-type methods have demonstrated advantages over traditional SGD. These methods, which involve iterating over data in a permuted order rather than drawing i.i.d. samples, are widely used in deep learning frameworks due to their simplicity and computational efficiency. Empirical studies have shown that such methods often outperform standard SGD in terms of convergence speed and generalization ability (Bottou, 2009; 2012). Notably, shuffling-based training has been found to mitigate gradient variance, leading to smoother optimization trajectories compared to purely stochastic updates (Gürbüzbalaban et al., 2021; Hao Chen and Sra, 2019).

Theoretical analyses of shuffling-type methods have gained significant attention in recent years, aiming to explain their empirical success while tackling the unique challenges posed by the lack of independence between consecutive updates. Many early works focused on convex optimization, often relying on strong convexity assumptions to establish linear or sublinear convergence rates (Gürbüzbalaban et al., 2021; Hao Chen and Sra, 2019; Safran and Shamir, 2020). More recently, research efforts have extended beyond the convex setting (Mishchenko et al., 2020; Nguyen et al., 2021; Koloskova et al., 2023), suggesting that random reshuffling can exhibit faster convergence than SGD even in nonconvex settings under certain conditions, such as when gradient noise is structured or exhibits low variance.

Despite these advances, most theoretical analyses rely on the Lipschitz smoothness assumption, which imposes restrictive upper and lower bounds on the gradient variations. While this assumption holds in many standard optimization settings, it fails to capture a broad class of important machine learning models, including deep language models (Zhang et al., 2019), phase retrieval (Chen et al., 2023), and distributionally robust optimization (Chen et al., 2023). As a result, many practical scenarios remain outside the scope of existing theoretical guarantees. To address this gap, our work develops new techniques to analyze the convergence of shuffling-type gradient algorithms under relaxed smoothness assumptions, aiming to provide a more comprehensive theoretical foundation applicable to a wider range

Revisiting Convergence: Shuffling Complexity Beyond Lipschitz Smoothness

of machine learning models.

We consider the following finite sum minimization problem:

i=1 f(w; i)

where f( ; i) : Rd R is smooth and possibly nonconvex for i [n] := {1, . . . , n}. Problem (P) covers empirical loss minimization as a special case, therefore can be viewed as formulation for many machine learning models, such as logistic regression, reinforcement learning, and neural networks.

We summarize our main contributions as follows:

We proved the convergence of the shuffling-type gradient algorithm under non-uniform smoothness assumptions, where the Hessian norm is bounded by a subquadratic function ℓof the gradient norm. With specific stepsizes and a general bounded variance condition, we achieved a total complexity of O(n p+1

2 ϵ 3) gradient evaluations for the nonconvex case with random reshuffling, and O(n p 2 +1ϵ 3) for arbitrary scheme, where 0 p < 2 is the degree of function ℓ. These results match those with Lipschitz smoothness assumptions in Nguyen et al. (2021) when p = 0 and ℓ-smoothness degenerates to Lipschitz smoothness.

For the strongly convex case, we established a complexity of e O(n p+1

2 ) for random reshuffling. In the nonstrongly convex case, the complexity is O(n p+1

2 ) for random reshuffling. Without assuming bounded variance, we established complexity of e O(nϵ 1

2 ) for arbitrary scheme in strongly convex case, and O(nϵ 3

2 ) in non-strongly convex case.

We conducted numerical experiments to demonstrate that the shuffling-type gradient algorithm converges faster than SGD on two important non-Lipschitz smooth applications.

2. Preliminaries

2.1. Shuffling-Type Gradient Algorithm

In practice, the random shuffling method has demonstrated its superiority over SGD, as shown in Bottou (2009) and Bottou (2012). Specifically, Bottou (2009) shows that shufflingtype methods achieve a convergence rate of approximately O(1/T 2), where T is the iteration count. Beyond shufflingtype stochastic gradient methods, variants such as SVRG have been applied in various scenarios, including decentralized optimization, as discussed in Shamir (2016) and De and Goldstein (2016).

The analysis of shuffling-type methods has a long history. For convex cases, Gürbüzbalaban et al. (2021) demonstrated that when the objective function is a sum of quadratics or smooth functions with a Lipschitz Hessian, and with a diminishing stepsize, the average of the last update in each epoch of RGA converges strictly faster than SGD with probability one. Additionally, they showed that when the number of epochs T is sufficiently large, the Reshuffling Gradient Algorithm (RGA) asymptotically converges at a rate of O(1/T 2). Similarly, Nguyen et al. (2021) established a convergence rate of O(1/T 2) for strongly convex and globally L-smooth functions. Furthermore, with uniform sampling and a bounded variance assumption or convexity on each component function, they showed that the convergence rate can be improved to O(1/n T 2).

In contrast, there is not much research on nonconvex cases. For example, Nguyen et al. (2021) demonstrated a convergence rate of O(T 2/3); Koloskova et al. (2023) proved

a convergence rate of O 1 T + min nσ

for single shuffling gradient method; Wang et al. (2024) analyzed Adam algorithm with random reshuffling scheme under (L0, L1)-smoothness but they did not explicitly show the convergence rate.

2.2. Counterexamples

Although L-smoothness is empirically false in LSTM (Zhang et al., 2019) and Transformers (Crawshaw et al., 2022), we give some concrete counterexamples to demonstrate the popularity of non-Lipschitz functions in this section. First we give two machine learning examples, then we mention some common non-Lipschitz functions.

Example 1. The first example is distributionally robust optimization (DRO), which is a popular optimization framework for training robust models. DRO is introduced to deal with the distribution shift between training and test datasets. In (Levy et al., 2020), it is formulated equivalently as follows.

min w W,θ R L(w, θ) := Eξ P ψ ℓ(w; ξ) θ

where w and θ are the parameters to be optimized, ξ is a sample randomly drawn from data distribution P, ℓ(w; ξ) is the loss function, ψ is the conjugate function of the divergence function ψ we choose to measure the difference between distributions, and λ > 0 is the regularization coefficient. It is proved in Jin et al. (2021) that L(w, θ) is not always Lipschitz-smooth even if ℓ(w; ξ) is Lipschitz-smooth and the variance is bounded.

Example 2. The second example is the phase retrieval problem. Phase retrieval is a nonconvex problem in X-

Revisiting Convergence: Shuffling Complexity Beyond Lipschitz Smoothness

ray crystallography and diffraction imaging (Drenth, 2007; Miao et al., 1999). The goal is to recover the structure of a molecular object from intensity measurements. Let x Rd

be the true object and yr = |a r x|2 for r = 1, . . . , m, where ar Rd. The problem is to solve:

min z Rd f(z) := 1 2m

r=1 (yr |a r z|2)2. (2)

This objective function is a high-order polynomial in highdimensional space, thus it does not belong to the L-smooth function class.

Example 3. There are many common functions that are not Lipschitz smooth, including polynomial functions with order > 2, exponential functions, logarithmic functions and rational functions.

2.3. Relaxation of Lipschitz Smoothness

Because of the existence of these counterexamples, people have recently been investigating about smoothness assumptions that are more general than the traditional Lipschitz smoothness. In Zhang et al. (2019), (L0, L1)-smoothness was proposed as the first relaxed smoothness notion motivated by language modeling. It is defined as below:

Definition 2.1. ((L0, L1)-smoothness) A real-valued differentiable function f is (L0, L1)-smooth if there exist constants L0, L1 > 0 such that

2f(w) L0 + L1 f(w) .

Lipschitz smoothness can be viewed as a special case of (L0, L1) smoothness when L1 = 0. Under (L0, L1)- smoothness assumption, various convergence algorithms have been developed including clipped or normalized GD/SGD (Zhang et al., 2019), momentum accelerated clipped GD/SGD (Zhang et al., 2020), ADAM (Wang et al., 2022) and variance-reduced clipping (Reisizadeh et al., 2023) with optimal sample complexity on stochastic nonconvex optimization.

Other relaxed smoothness assumptions include asymmetric generalized smoothness motivated by distributionally robust optimization (Jin et al., 2021) and its extension to α-symmetric generalized smoothness (Chen et al., 2023) and ℓ-smoothness (Li et al., 2023a). In this paper, we use the definition of ℓ-smoothness as below:

Definition 2.2. (ℓ-smoothness) A real-valued differentiable function f is ℓ-smooth if there exists some nondecreasing continuous function ℓ: [0, + ) (0, + ) such that for any w dom(f) and constant C > 0,

B(w, C ℓ( f(w) +C)) dom(f); and for any w1, w2 B(w, C ℓ( f(w) +C)),

f(w1) f(w2) ℓ( f(w) + C) w1 w2 .

Another equivalent definition of ℓ-smoothness in Li et al. (2023a) is:

Definition 2.3. A real-valued differentiable function f is ℓ-smooth for some non-decreasing continuous function ℓ: [0, + ) (0, + ) if

2f(x) ℓ( f(x) )

almost everywhere.

In the proof of sections we focus on Definition 2.2, though Definition 2.3 provides a clearer perspective on the relationship between ℓ-smoothness and other smoothness notions. Both (L0, L1)-smoothness and traditional Lipschitz smoothness can be seen as special cases of ℓ-smoothness. It is straightforward to verify that the loss functions in phase retrieval and distributionally robust optimization (DRO) satisfy ℓ-smoothness. While extensive research has been conducted based on (L0, L1)-smoothness, comparatively less work has been done under the more general ℓ-smooth framework. Xian et al. established convergence results for generalized GDA/SGDA and GDmax/SGDmax in minimax optimization, while Zhang et al. (2024) analyzed MGDA for multi-objective optimization under ℓ-smoothness.

3. Algorithm

As demonstrated in our counterexamples, the Lipschitz smoothness assumption does not always hold in problem (P). In such non-Lipschitz scenarios, gradients can change drastically, posing a significant challenge for these algorithms. To address this issue, we propose a new stepsize strategy, detailed in Algorithm 1 and section 4, to improve performance under these challenging conditions. This strategy aims to choose the stepsize to accommodate the variance and instability in gradients, thereby enhancing the robustness of the optimization process.

In this algorithm, we start with an initial point w0. During each iteration t [T], either all the samples are shuffled, or we keep the order of the samples as in the last epoch. This reshuffling introduces variance in the order of samples, which can help mitigate issues related to gradient instability. For each step j [n], we use the gradient from a single sample with number π(t) j to update the weights w. The notation

π(t) j is used to denote the j-th element of the permutation π(t) for j [n]. Each outer loop through the data is counted as an epoch, and our convergence analysis focuses on the performance after the completion of each full epoch.

Revisiting Convergence: Shuffling Complexity Beyond Lipschitz Smoothness

Algorithm 1 Shuffling-type Gradient Algorithm

Initialization: Choose an initial point w0 dom(F). for t = 1, 2, , T do

Set w(t) 0 := wt 1; Generate permutation π(t) of [n]. Compute non-increasing stepsize ηt. for j = 1, , n do

Update w(t) j := w(t) j 1 ηt

n f(w(t) j 1; π(t) j ). end for Set wt := w(t) n . end for

There are multiple strategies to determine π(t):

If π(t) is a fixed permutation of [n], Algorithm 1 functions as an incremental gradient method. This method maintains a consistent order of samples, which can simplify the analysis and implementation.

If π(t) is shuffled only once in the first iteration and then used in every subsequent iteration, Algorithm 1 operates as a shuffle-once algorithm. This strategy introduces randomness at the beginning but maintains a fixed order thereafter, providing a balance between randomness and stability.

If π(t) is regenerated in every single iteration, Algorithm 1 becomes a random reshuffling algorithm. This approach maximizes the randomness in the sample order, potentially offering the most robustness against the erratic behavior of non-Lipschitz gradients by constantly changing the sample order.

Although the random reshuffling scheme is most used in practice, each of these strategies offers distinct advantages and can be selected based on the specific requirements and characteristics of the optimization problem at hand. For this reason, we will give convergence rates for random and arbitrary shuffling scheme.

4. Convergence Analysis

4.1. Main Results

In this section, we present the main results of our convergence analysis. Our findings indicate that, with proper stepsizes, it is possible to achieve the same convergence rate, up to a logarithm difference, as under the Lipschitz smoothness assumption. First, we introduce the following assumptions regarding problem (P). Assumption 4.1 is a standard assumption, and Assumption 4.2 requires all F and f( ; i) to be ℓ-smooth.

Assumption 4.1. dom(F) := {w Rd : F(w) < + } = and F := infw Rd F(w) > .

Assumption 4.2. F and f( ; i) are ℓ-smooth for some subquadratic function ℓ, i [n].

Here, we assume all functions share the same ℓfunction without loss of generality, as we can always choose the pointwise maximum of all their ℓfunctions. We define p to be the degree of the ℓfunction such that p = supp 0{p| limw ℓ(w)

wp > 0}. Since ℓis sub-quadratic, we have 0 p < 2.

Next, we introduce our assumption about the gradient variances.

Assumption 4.3. There exist two constants σ, A (0, + ) such that i [n],

i=1 f(w; i) F(w) 2 A F(w) 2 + σ2, (3)

a.s., w dom(F).

Since component gradients behave as gradient estimations of the full gradient, this assumption can be viewed as a generalization of the more common assumption E[| F(w; ξ) F(w)|] σ2, which is used in most ℓ-smooth work.

4.1.1. NONCONVEX CASE

Let us denote 1 := F(w(1) 0 ) F . Under Assumptions 4.1 to 4.3, we have the following result for random shuffling scheme. Proofs can be found in Appendix A.1.1 and A.1.2.

Theorem 4.4. Suppose Assumptions 4.1, 4.2 and 4.3 hold, Let { wt}T t=1 be generated by Algorithm 1 with random reshuffling scheme. For any 0 < δ < 1, we denote H := 4 1

δ , G := sup{u 0|u2 2ℓ(2u) H}, G := p

2(1 + n A)G +

2nσ, L := ℓ(2G ). For any ϵ > 0, choose ηt and T such that

t=1 η3 t n 1

L2σ2 , T 32 1

then with probability at least 1 δ, we have F(w(t) 0 ) G for every 1 t T

t=1 F(w(t) 0 ) 2 ϵ2.

Remark 4.5. By choosing ηt = η = O( 3q

2 ϵ), we can achieve a complexity of T = O( n p 1

outer iterations and O( n p+1

2 ϵ3 ) total number of gradient evaluations, ignoring constants, where p is the order of the ℓ function in Definition 2.2. As p goes to 0, ℓ-smoothness degenerates to the traditional Lipschitz smoothness, and

Revisiting Convergence: Shuffling Complexity Beyond Lipschitz Smoothness

our total number of gradient evaluations goes to O( n

ϵ3 ) once again, which matches the complexity in Corollary 1 of Nguyen et al. (2021). If ϵ 1/ n, one possible stepsize is η = nϵ 2L

Our result here has polynomial dependency on 1

δ, T = O(δ 3

2 p 2 p ). It is important to note that, in our setting, δ accounts for the probability that Lipschitz smoothness does not hold a consideration absent in standard Lipschitz smoothness settings. Therefore, the dependency here is not as good as in L-smoothness cases. In fact, a polynomial dependency on δ is typical in papers with similar smoothness assumptions, e.g. in Li et al. (2023a), Li et al. (2023b), Xian et al. and Zhang et al. (2024).

Next we consider arbitrary π(t) scheme in Algorithm 1.

Theorem 4.6. Suppose Assumptions 4.1, 4.2 and 4.3 hold. Let { wt}T t=1 be generated by Algorithm 1 with arbitrary scheme. Define H = 2 1, G := sup{u 0|u2 2ℓ(2u) H}, G := p

2(1 + n A)G +

2nσ, L := ℓ(2G ). For any ϵ > 0, choose ηt and T such that

2(3A + 2) ,

t=1 η3 t 2 1 3σ2L2 , T 8 1

then we have F(w(t) 0 ) G for every 1 t T and

t=1 F(w(t) 0 ) 2 ϵ2.

This theorem gives the convergence rate for arbitrary scheme

in Algorithm 1. By choosing ηt = η = O 3q

, we achieve a complexity of O n p 2 ϵ3 outer iter-

ations and O n p 2 +1

ϵ3 total gradient evaluations, ignoring constants. Without the randomness in π in every iteration, the complexity s dependency on n is increased by O( n). One possible stepsize is η = ϵ L

4.1.2. STRONGLY CONVEX CASE

For strongly convex case, we give results for both random reshuffling scheme and arbitrary scheme, with constant learning rate. Proof can be found in Appendix A.2.

Assumption 4.7. Function F in (P) is µ-strongly convex on dom(F).

Theorem 4.8. Suppose Assumptions 4.1, 4.2, 4.3 and 4.7 hold. Let { wt}T t=1 be generated by Algorithm 1 with random reshuffling scheme. For any 0 < δ < 1, we denote H := max{ 3σ2

δ }, G := sup{u 0|u2

2ℓ(2u) H}, G := p

2(1 + n A)G +

2nσ, L := ℓ(2G ).

For any ϵ > 0, if we choose ηt and T such that

ηt = η = 4 log( n T)

1 nδϵ, T log( n T)

2(3A + 2), Lσ r 8 nµδϵ, 3 s

then for any 0 < δ < 1, with probability at least 1 δ we have F(w(T +1) 0 ) F ϵ.

In Theorem 4.8, we can achieve a complexity of e O n p 1

2 outer iterations and e O n p+1

2 total gra-

dient evaluations with η = e O n 1 p

2 ϵ 1 2 , ignoring constants. This matches the result in Nguyen et al. (2021) with the same assumptions in the degenerate case of p = 0. The dependence on δ is T = O(δ 1

It is not hard to follow proof of Theorem 4.6 for arbitrary scheme in strongly convex case and achieve a complexity of e O n p 2 +1ϵ 1

2 total gradient evaluations. Here we give a slightly stronger result where we remove Assumption 4.3 to match the corresponding result in Lipschitz smooth case.

Theorem 4.9. Suppose Assumptions 4.1, 4.2 and 4.7 hold. Let { wt}T t=1 be generated by Algorithm 1 with arbitrary scheme. We denote S = {w|F(w) F(w(1) 0 )}, G = maxw{ f(w; i) |w S, i [n]}, L := ℓ(2G ). For any ϵ > 0, choose ηt and T such that

ηt = η = 6 log(T)

9(µ2 + L2)σ2 ,

T = e O(ϵ 1

2 ) 12L2 log(T)

where σ is the standard deviation at w . Then we have F(w(t) 0 ) G and

F(w(T +1) 0 ) F ϵ.

In Theorem 4.9, we achieve a complexity of e O ϵ 1/2

outer iterations and e O nϵ 1/2 total gradient evaluations

with η = e O n 1ϵ 1 2 , ignoring constants. It should be noted that the constant G here is implicitly determined by constant p and can potentially be large. Therefore, the complexity here cannot be directly compared with those in Theorem 4.6 or 4.8.

4.1.3. NON-STRONGLY CONVEX CASE

Next we consider the case where only non-strongly convexity are assumed. In the following theorem, we assume the

Revisiting Convergence: Shuffling Complexity Beyond Lipschitz Smoothness

optimal solution exists and denote one as w , the standard

deviation at w as σ := q

1 n Pn i=1 f(w ; i) 2 and the

average value of {w(t) 0 }T t=1 as w T = 1

T PT t=1 w(t) 0 . Proof for this section can be found in Appendix A.3. Assumption 4.10. Functions f( ; i) in (P) are convex ondom(F), for all i [n]. Theorem 4.11. Suppose Assumptions 4.1, 4.2, 4.3 and 4.10 hold. Let { wt}T t=1 be generated by Algorithm 1 with random reshuffling scheme. For any 0 < δ < 1, define H, G, G , L as in Theorem 4.4. For any ϵ > 0, choose ηt = η and T such that

n + 1 , 3 r

n 1 Tσ2L2 , 3 s

3n w(1) 0 w 2

T 4 w(1) 0 w 2

then with probability at least 1 δ, we have F(w(t) 0 ) G for every 1 t T and

F( w T ) F ϵ.

By choosing η = O

2 ϵ0.5 , we

achieve a complexity of O n p 1

outer iterations and

total number of gradient evaluations, ignoring

constants. The dependency on δ is T = O(δ 3

2 p 2 p ). If ϵ 1/n, one possible stepsize is η = nϵ 2L

Similarly, we can follow proof of Theorem 4.6 for arbitrary

scheme and achieve a complexity of O n p 2 +1

ϵ1.5 total gradient evaluations. Now we give a result without variance assumption 4.3. Theorem 4.12. Suppose Assumptions 4.1, 4.2 and 4.10 hold. Let { wt}T t=1 be generated by Algorithm 1 arbitrary scheme. Define S = {w|F(w) F(w(1) 0 )}, G = maxw{ f(w; i) |w S, i [n]} < , L = ℓ(2G ). For any ϵ > 0, choose ηt = η and T such that

3ϵ 2L, T = O(ϵ 1.5) w(1) 0 w 2

then we have F(w(t) 0 ) G for every 1 t T and

min t [T ] F(w T ) F ϵ.

By choosing η = O( 3q

1 T ), we have the complexity of

O( 1 ϵ1.5 ) outer iterations and O( n

ϵ1.5 ) total number of gradient evaluations, ignoring constants. One possible stepsize is η =

4.2. Proof Sketch and Technical Novelty

Broadly speaking, our approach involves two main goals: first, demonstrating that Lipschitz smoothness is maintained with high probability along the training trajectory { wt}, and second, showing that, conditioned on Lipschitz smoothness, the summation of gradient norms is bounded with high probability. Here we slightly abuse the term Lipschitz and Lipschitz smoothness to refer to the property between neighboring steps along the training trajectory.

For the first goal, in Lemma A.4, we prove by induction that when starting an iteration with a bounded gradient, the entire training trajectory during this iteration will have bounded gradients. Consequently, we only need to verify the Lipschitz smoothness condition at the start of each iteration. However, at this point, the two goals become intertwined. We need Lipschitz smoothness to bound the gradient differences, but we also need the gradient norm bounds to establish Lipschitz smoothness. Our solution is to address both issues simultaneously.

Assuming that, before a stopping time τ, Lipschitz smoothness holds, we bound the gradient norm up to that time in Lemma A.5. However, this process is nontrivial. Since we are examining behavior before a stopping time, every expectation is now conditioned on t < τ, rendering all previous estimations for shuffling gradient algorithms inapplicable. This presents a contradiction: we want to condition on t < τ when applying Lipschitz smoothness, but we do not want this condition when estimating other quantities. In Lemma A.5, we find a method to separately handle these two requirements, allowing us to achieve both goals simultaneously.

4.3. Limitations and Future works

Although we have proved upper bounds for the complexity of shuffling gradient algorithms, there are certain limitations in our work that we leave for future research:

First, as is common with many optimization algorithms, it is challenging to verify that the bounds presented are indeed the lower bounds. Future work could explore improving these results, for instance, by reducing the dependency on δ to a logarithmic factor, or by proving that the current bounds are, in fact, tight lower bounds.

Second, although we showed results for arbitrary shuffling schemes, there are better results for single shuffling under Lipschitz smoothness, for example Ahn et al. (2020) proved O( 1 n T 2 ) convergence rate for strongly convex objectives. It is interesting to see whether we can achieve the same convergence rate with ℓ-smoothness as well.

The results in Theorem 4.9 and Theorem 4.12 depends on constant G that can be potentially very large and

Revisiting Convergence: Shuffling Complexity Beyond Lipschitz Smoothness

hard to verify. In the absence of both Lipschitz smoothness and bounded variance, the behavior of gradients can be hard to track. We hope our results here can be a first step for future work.

Lastly, shuffling gradient methods have been integrated with variance reduction techniques (Malinovsky et al., 2023). Exploring the performance of these algorithms under relaxed smoothness assumptions is another promising direction for future work.

5. Numerical Experiments

We compare reshuffling gradient algorithm (Algorithm 1) with SGD on multiple ℓ-smooth optimization problems to prove its effectiveness. Experiments are conducted with different shuffling schemes, on convex, strongly convex and nonconvex objective functions, including synthetic functions, phase retrieval, distributionally robust optimization (DRO) and image classification.

5.1. Convex and Strongly Convex Settings

We first consider convex functions fi,k(x) = x4 i + kxi of x R50 for all (i, k) E := {1, 2, . . . , 50} { 10, 9, . . . , 9, 10}, as well as their sample average f(x) = 1 1050 P

(k,i) E fi,k(x) = 1 50 P50 i=1 x4 i . It can be easily verified that f and all fi,k are convex but not strongly convex, and ℓ-smooth (with ℓ(u) = 3u2/3) but not Lipschitz-smooth. Then we compare reshuffling gradient algorithm (Algorithm 1) with SGD on the objective minx R50 f(x). Specifically, for each SGD update x x η fk,i(x), (k, i) E is obtained uniformly at random. For Algorithm 1, we adopt three shuffling schemes as elaborated in Section 3. The fixed-shuffling scheme and shuffling-once fix all permutations π(t) respectively to be the natural sequence (1, 10), (1, 9), . . . (50, 10) and its random permutation at the beginning, while the uniformshuffling scheme obtains permutations π(t) uniformly at random and independently for all iterations t. We implement each algorithm 100 times with initialization x0 = [1, . . . , 1] and fine-tuned stepsizes 0.01 (i.e., η = 0.01 for SGD and ηt

n = 0.01 for Algorithm 1), which takes around 3 minutes in total. We plot the learning curves of f(xt) averaged among the 100 times, as well as the 95% and 5% percentiles in the left of Figure 1, which shows that Algorithm 1 with all shuffling schemes converges faster than SGD.

Then we consider strongly convex functions fj,k(x) = exp(xj k) + exp(k xj) + 1

2||x||2 for (i, k) E and

Figure 1: Experimental Results on Convex (up) and Strongly-convex (down) Objective Functions.

their sample average below.

f(x) = 1 1050

(k,i) E fi,k(x) = 1

exp(n + 1) exp( n)

1050[exp(1) 1]

j=1 [exp(xj) + exp( xj)].

All these functions fj,k and f are 1-strongly convex and ℓsmooth (with ℓ(u) = 5u + 5) but not Lipschitz-smooth. We repeat the experiment in the same procedure above, except that all the stepsizes are fine-tuned to be 10 5. The result is shown in the right of Figure 1, which also shows that Algorithm 1 with all shuffling schemes converges faster than SGD.

5.2. Application to Phase Retrieval and DRO

We compare SGD with Algorithm 1 on phase retrieval and distributionally robust optimization (DRO), which are ℓsmooth but not Lipschitz smooth. We use similar setup as in (Chen et al., 2023).

In the phase retrieval problem (2), we select m = 3000 and d = 100, and generate independent Gaussian variables x, ar N(0, 0.5Id), initialization z0 N(5, 0.5Id), as well as yi = |a r z|2 + ni with noise ni N(0, 42) for i = 1, ..., m. We select constant stepsizes 2 10 6 and η(t) j 0.007

m for SGD and Algorithm 1 respectively by fine-tuning and implement each algorithm 100 times. For Algorithm 1,

Revisiting Convergence: Shuffling Complexity Beyond Lipschitz Smoothness

Figure 2: Experimental Results on Phase Retrieval (up) and DRO (down).

we adopt three shuffling schemes as elaborated in Section 3. The fixed-shuffling scheme and shuffling-once fix all permutations π(t) respectively to be the natural sequence 1, 2, . . . , 3000 and its random permutation at the beginning, while the uniform-shuffling scheme obtains permutations π(t) uniformly at random and independently for all iterations t. We plot the learning curves of the objective function values averaged among the 100 times, as well as the 95% and 5% percentiles in the left of Figure 2, which shows that Algorithm 1 with shuffle-once and uniform-shuffling schemes converge faster than SGD.

In the DRO problem (1), we select λ = 0.01 and ψ (t) = 1 4(t+2)2 + 1 (corresponding to ψ being χ2 divergence). For the stochastic samples ξ, we use the life expectancy data1

designed for regression task between the life expectancy (target) and its factors (features) of 2413 people, and preprocess the data by filling the missing values with the median of the corresponding features, censorizing and normaliz-

1https://www.kaggle.com/datasets/ kumarajarshi/life-expectancy-who?resource= download

Figure 3: Experimental Results on Cifar 10 Dataset.

ing all the features 2, removing two categorical features ( country and status ), and adding standard Gaussian noise to the target to get robust model. We use the first 2000 samples {xi, yi}2000 i=1 with features xi R34 and targets yi R for training. We use the loss function ℓξ(w) = 1 2(yξ x ξ w)2 + 0.1 P34 j=1 ln 1 + |w(j)|) of w = [w(1); . . . ; w(34)] R34 for any sample xξ, yξ. We use initialization η0 = 0.1 and w0 R34 from standard Gaussian distribution.

Then similar to phase retrieval, we implement both SGD and the three sampling schemes of Algorithm 1 100 times with stepsizes η(t) j = ηt

n = 10 7.We evaluate Ψ(xt) := minη R L(xt, η) every 10 iterations. The average, 5% and 95% percentiles of Ψ(xt) among the 100 implementations are plotted in the right of Figure 2, which shows that Algorithm 1 with fixed shuffling converges faster than SGD.

5.3. Application to Image Classification

We train Resnet18 (He et al., 2016) with cross-entropy loss for image classification task on Cifar 10 dataset (Krizhevsky, 2009), using SGD and Algorithm 1 with three shuffling schemes. We implement each algorithm 100 times with batchsize 200 and stepsize 10 3. After every 250 iterations, we evaluate the sample-average loss value as well as clas-

2The detailed process of filling missing values and censorization: https://thecleverprogrammer.com/2021/01/ 06/life-expectancy-analysis-with-python/

Revisiting Convergence: Shuffling Complexity Beyond Lipschitz Smoothness

sification accuracy on the whole training dataset and test dataset. The average, 5% and 95% percentiles of these evaluated metrics among the 100 implementations are plotted in Figure 3, which shows that Algorithm 1 with fixed-shuffling scheme outperforms SGD on both training and test data, and Algorithm 1 with the other two shuffling schemes outperforms SGD on training data.

6. Conclusion

We revisited the convergence of shuffling-type gradient algorithms under relaxed smoothness assumptions, establishing their convergence for nonconvex, strongly convex, and nonstrongly convex settings. By introducing a more general smoothness condition, we demonstrated that these methods achieve competitive convergence rates without requiring Lipschitz smoothness, extending their applicability to a broader range of optimization problems. Our analysis covers both random reshuffling and arbitrary shuffling schemes, showing that properly chosen step sizes can ensure efficient convergence in both cases. Numerical experiments further validate our theoretical findings, demonstrating that shuffling-type methods outperform SGD in non-Lipschitz scenarios. These results provide a foundation for future work on shuffling-based optimization, including its integration with variance reduction techniques, possible tighter bounds in certain situations and its application to large-scale machine learning problems.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here.

Acknowledgments

This work was partially supported by NSF IIS 2347592, 2348169, DBI 2405416, CCF 2348306, CNS 2347617.

K. Ahn, C. Yun, and S. Sra. Sgd with shuffling: optimal rates without component convexity and large epoch requirements. Advances in Neural Information Processing Systems, 33:17526 17535, 2020.

Y. Arjevani, Y. Carmon, J. C. Duchi, D. J. Foster, N. Srebro, and B. Woodworth. Lower bounds for non-convex stochastic optimization. Mathematical Programming, 199 (1):165 214, 2023.

L. Bottou. Curiously fast convergence of some stochastic gradient descent algorithms. 2009. URL https:

//api.semanticscholar.org/Corpus ID: 16822133.

L. Bottou. Stochastic Gradient Descent Tricks, pages 421 436. Springer Berlin Heidelberg, Berlin, Heidelberg, 2012. ISBN 978-3-642-35289-8. doi: 10.1007/ 978-3-642-35289-8_25. URL https://doi.org/ 10.1007/978-3-642-35289-8_25.

Z. Chen, Y. Zhou, Y. Liang, and Z. Lu. Generalized-smooth nonconvex optimization is as efficient as smooth nonconvex optimization. ar Xiv preprint ar Xiv:2303.02854, 2023.

M. Crawshaw, M. Liu, F. Orabona, W. Zhang, and Z. Zhuang. Robustness to unbounded smoothness of generalized signsgd. Advances in neural information processing systems, 35:9955 9968, 2022.

S. De and T. Goldstein. Efficient distributed SGD with variance reduction. In F. Bonchi, J. Domingo-Ferrer, R. Baeza-Yates, Z. Zhou, and X. Wu, editors, IEEE 16th International Conference on Data Mining, ICDM 2016, December 12-15, 2016, Barcelona, Spain, pages 111 120. IEEE Computer Society, 2016. doi: 10.1109/ICDM.2016. 0022. URL https://doi.org/10.1109/ICDM. 2016.0022.

J. Drenth. Principles of protein X-ray crystallography. Springer Science & Business Media, 2007.

A. d Aspremont, D. Scieur, A. Taylor, et al. Acceleration methods. Foundations and Trends in Optimization, 5 (1-2):1 245, 2021.

M. Feurer and F. Hutter. Hyperparameter optimization. Automated machine learning: Methods, systems, challenges, pages 3 33, 2019.

S. Ghadimi and G. Lan. Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM journal on optimization, 23(4):2341 2368, 2013.

M. Gürbüzbalaban, A. E. Ozdaglar, and P. A. Parrilo. Why random reshuffling beats stochastic gradient descent. Math. Program., 186(1):49 84, 2021. doi: 10.1007/S10107-019-01440-W. URL https://doi. org/10.1007/s10107-019-01440-w.

J. Z. Hao Chen and S. Sra. Random shuffling beats SGD after finite epochs. In K. Chaudhuri and R. Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pages 2624 2633. PMLR, 2019. URL http://proceedings.mlr.press/ v97/haochen19a.html.

Revisiting Convergence: Shuffling Complexity Beyond Lipschitz Smoothness

K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770 778, 2016.

J. Jin, B. Zhang, H. Wang, and L. Wang. Non-convex distributionally robust optimization: Non-asymptotic analysis. Advances in Neural Information Processing Systems, 34: 2771 2782, 2021.

A. Koloskova, N. Doikov, S. U. Stich, and M. Jaggi. Shuffle sgd is always better than sgd: improved analysis of sgd with arbitrary data orders. ar Xiv preprint ar Xiv:2305.19259, 2023.

A. Krizhevsky. Learning multiple layers of features from tiny images. Master s thesis, University of Toronto, 2009.

D. Levy, Y. Carmon, J. C. Duchi, and A. Sidford. Largescale methods for distributionally robust optimization. Advances in Neural Information Processing Systems, 33: 8847 8860, 2020.

H. Li, J. Qian, Y. Tian, A. Rakhlin, and A. Jadbabaie. Convex and non-convex optimization under generalized smoothness. In Thirty-seventh Conference on Neural Information Processing Systems, 2023a.

H. Li, A. Rakhlin, and A. Jadbabaie. Convergence of adam under relaxed assumptions, 2023b.

G. Malinovsky, A. Sailanbayev, and P. Richtárik. Random reshuffling with variance reduction: New analysis and better rates. In Uncertainty in Artificial Intelligence, pages 1347 1357. PMLR, 2023.

J. Miao, P. Charalambous, J. Kirz, and D. Sayre. Extending the methodology of x-ray crystallography to allow imaging of micrometre-sized non-crystalline specimens. Nature, 400(6742):342 344, 1999.

K. Mishchenko, A. Khaled, and P. Richtárik. Random reshuffling: Simple analysis with vast improvements. Advances in Neural Information Processing Systems, 33: 17309 17320, 2020.

A. S. Nemirovskij and D. B. Yudin. Problem complexity and method efficiency in optimization. 1983.

Y. Nesterov. Gradient methods for minimizing composite functions. Mathematical programming, 140(1):125 161, 2013.

L. M. Nguyen, Q. Tran-Dinh, D. T. Phan, P. H. Nguyen, and M. Van Dijk. A unified convergence analysis for shuffling-type gradient methods. The Journal of Machine Learning Research, 22(1):9397 9440, 2021.

A. Radford, K. Narasimhan, T. Salimans, I. Sutskever, et al. Improving language understanding by generative pretraining. 2018.

A. Reisizadeh, H. Li, S. Das, and A. Jadbabaie. Variancereduced clipping for non-convex optimization. ar Xiv preprint ar Xiv:2303.00883, 2023.

I. Safran and O. Shamir. How good is SGD with random shuffling? In J. D. Abernethy and S. Agarwal, editors, Conference on Learning Theory, COLT 2020, 9-12 July 2020, Virtual Event [Graz, Austria], volume 125 of Proceedings of Machine Learning Research, pages 3250 3284. PMLR, 2020. URL http://proceedings. mlr.press/v125/safran20a.html.

O. Shamir. Without-replacement sampling for stochastic gradient methods: Convergence results and application to distributed optimization. Co RR, abs/1603.00570, 2016. URL http://arxiv.org/abs/1603.00570.

R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT press, 2018.

B. Wang, Y. Zhang, H. Zhang, Q. Meng, Z.-M. Ma, T.-Y. Liu, and W. Chen. Provable adaptivity in adam. ar Xiv preprint ar Xiv:2208.09900, 2022.

B. Wang, Y. Zhang, H. Zhang, Q. Meng, R. Sun, Z.-M. Ma, T.-Y. Liu, Z.-Q. Luo, and W. Chen. Provable adaptivity of adam under non-uniform smoothness. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 2960 2969, 2024.

W. Xian, Z. Chen, and H. Huang. Delving into the convergence of generalized smooth minimax optimization. In Forty-first International Conference on Machine Learning.

B. Zhang, J. Jin, C. Fang, and L. Wang. Improved analysis of clipping algorithms for non-convex optimization. Advances in Neural Information Processing Systems, 33: 15511 15521, 2020.

J. Zhang, T. He, S. Sra, and A. Jadbabaie. Why gradient clipping accelerates training: A theoretical justification for adaptivity. ar Xiv preprint ar Xiv:1905.11881, 2019.

Q. Zhang, P. Xiao, K. Ji, and S. Zou. On the convergence of multi-objective optimization under generalized smoothness. ar Xiv preprint ar Xiv:2405.19440, 2024.

Revisiting Convergence: Shuffling Complexity Beyond Lipschitz Smoothness

A. Appendix / supplemental material

A.1. Nonconvex Case Analysis

In this section we prove the theorems in section 4.1.1.

A.1.1. LEMMAS

In this part we use notations as defined in Theorem 4.4, for completeness we repeat them here:

δ , G := sup{u 0|u2 2ℓ(2u) H} < ,

2(1 + n A)G +

2nσ, L := ℓ(2G ).

We first state some lemmas that are useful in our proof. The following lemma is a natural corollary of Definition 2.2, by the fact that ℓis non-decreasing.

Lemma A.1. If F is ℓ-smooth, for any w dom(F) satisfying F(w) G, we have B(w, G/ℓ(2G)) dom(F). For any w1, w2 B(w, G/ℓ(2G))), F(w1) F(w2) ℓ(2G) w1 w2 ,

F(w1) F(w2) + F(w2), w1 w2 + ℓ(2G)

The following lemma gives relationship between f(w; i) and F(w) .

Lemma A.2. If Assumption 4.3 is true, we have

2(1 + n A) F(w) +

Proof. From Assumption 4.3 we have that

f(w; i) 2 2 f(w; i) F(w) 2 + 2 F(w) 2

i=1 f(w; i) F(w) 2 + 2 F(w) 2

2n A F(w) 2 + 2nσ2 + 2 F(w) 2

= 2(1 + n A) F(w) 2 + 2nσ2.

Taking square root on both sides and notice that F(w) 0, σ > 0 we have the conclusion.

According to Lemma A.2, for w such that F(w) G is true, we have

2(1 + n A)G +

holds for all i [n].

In our proof, we want that with high probability, L-Lipschitz smoothness in Lemma A.1, for both F(w) and f(w; i), between w(t) 0 and w(t) j is true, for t [T], i, j [n]. For that purpose, we can prove the following inequalities with high probability, for t [T]:

F(w(t) 0 ) G;

w(t) 0 w(t) n G /ℓ(G + G );

f(w(t) 0 ; i) G , i [n];

w(t) 0 w(t) j G /ℓ(2G ), j [n]. (4)

Revisiting Convergence: Shuffling Complexity Beyond Lipschitz Smoothness

By Lemma A.2 we know that the third inequality in (4) holds if the first inequality is true. Noticing that ℓ(G+G ) ℓ(2G ), it suffices to prove that, for t [T],

F(w(t) 0 ) G, w(t) 0 w(t) j G /ℓ(2G ), j [n]. (5)

For the first inequality, it can be hard to bound the gradient norm directly. The following lemma states the connection between gradient norm and function value of an ℓ-smooth function.

Lemma A.3. (Lemma 3.5 in Li et al. (2023a)) If F is ℓ-smooth, then

F(w) 2 2ℓ(2 F(w) ) (F(w) F )

for any w dom(F).

Since ℓis sub-quadratic, with Lemma A.3 we can bound the gradient norm by bounding the difference between the function value and the optimal value. To ease the proof, let us define the following stopping time:

τ := min{t|F(w(t) 0 ) F > H} (T + 1).

For t < τ, we have F(w(t) 0 ) G based on the definition of τ and Lemma A.3, so the first inequality in (5) is satisfied. The following lemma proves that the other inequality in (5) is true for t < τ as well, therefore guarantees the Lipschitz smoothness before τ.

Lemma A.4. For t < τ, ηt 1 2L, we have for all k [n] and t [T], w(t) 0 w(t) k 2 G /ℓ(2G ).

Proof. We use induction to prove that

w(t) j B(w(t) 0 , G

ℓ(2G )), j = 0, 1, . . . , n.

First of all, this claim is true for j = 0. Now suppose the claim is true for j k 1, i.e.,

w(t) 0 w(t) j G

ℓ(2G ), j = 0, 1, . . . , k 1,

we try to prove it for w(t) k . From Lemma A.1, we have Lipschitz smoothness, for all f(w; i), between w(t) 0 and w(t) j , if j k 1.

Since we have f(w(t) 0 ; i) G , i [n],

for any i [n] and j [k 1] we have

f(w(t) j ; i) f(w(t) 0 ; i) + f(w(t) j ; i) f(w(t) 0 ; i) G + L w(t) j w(t) 0 2G .

Hence, by the algorithm design we have

w(t) k w(t) 0 =

n f(w(t) j ; π(t) j )

n f(w(t) j ; π(t) j )

where the third inequality uses k n and ηt 1 2L. By induction, the claim is true.

Therefore, we have the desired Lipschitz smoothness property in Lemma A.1 for t < τ. Our next target is to bound P(τ T).

To simplify the notations, let us define

ϵ(t) k := 1

j=0 ( f(w(t) j ; π(t) j+1) f(w(t) 0 ; π(t) j+1))

Revisiting Convergence: Shuffling Complexity Beyond Lipschitz Smoothness

as the average of differences between the gradients at the start of iteration t and the actual gradients we used until step j in the t-th outer iteration. It is worth mentioning that the actual step in t-th outer iteration is ηt[ F(w(t) 0 ) + ϵ(t) n ].

Now we bound the probability of event {τ T} by bounding the expectation of function value at the stopping time.

Lemma A.5. With parameters chosen in Theorem 4.4, we have

E[F(w(τ) 0 ) F ] 2 1.

Proof. For any t < τ,

F(w(t+1) 0 ) F(w(t) 0 )

F(w(t) 0 ), w(t+1) 0 w(t) 0 + L

2 w(t+1) 0 w(t) 0 2

= ηt F(w(t) 0 ), F(w(t) 0 ) + ϵ(t) n + Lη2 t 2 F(w(t) 0 ) + ϵ(t) n 2

2 ( F(w(t) 0 ) 2 + F(w(t) 0 ) + ϵ(t) n 2 ϵ(t) n 2) + Lη2 t 2 F(w(t) 0 ) + ϵ(t) n 2

2 F(w(t) 0 ) 2 + ηt

2 F(w(t) 0 ) 2 + ηt L2

k=0 w(t) k w(t) 0 2. (6)

Here the first and last inequalities are from Lemma A.1 and the second is because ηt 1 2L. Taking summation from t = 1 to t = τ 1 and taking expectation we have

E[F(w(τ) 0 ) F ] 1 E[

2 F(w(t) 0 ) 2] + E[

k=0 w(t) k w(t) 0 2]. (7)

Now let us get a bound for the last term on the right hand side. For any t [T], k [n], from Algorithm 1 and Cauchy-Schwarz inequality we have

w(t) k w(t) 0 2 =k2η2 t n2

j=0 f(w(t) j ; π(t) j+1) 2

3k2η2 t n2 1

j=0 ( f(w(t) 0 ; π(t) j+1) F(w(t) 0 )) 2 + 3k2η2 t n2 F(w(t) 0 ) 2

+ 3kη2 t n2

j=0 f(w(t) j ; π(t) j+1) f(w(t) 0 ; π(t) j+1) 2.

Let us denote the 3 terms on the RHS as A1(t, k), A2(t, k) and A3(t, k), i.e. w(t) k w(t) 0 2 A1(t, k)+A2(t, k)+A3(t, k). Since we are interested in E[Pτ 1 t=1 ηt L2

2n Pn 1 k=0 w(t) k w(t) 0 2], we need to bound E[Pτ 1 t=0 ηt L2

2n Pn 1 k=1 Ai(t, k)] for i = 1, 2, 3.

For A1(t, k), since π(t) is randomly chosen, let Ft := σ(π(1), , π(t)) be the σ-algebra generated in Algorithm 1, for

Revisiting Convergence: Shuffling Complexity Beyond Lipschitz Smoothness

t [T] we have

k=0 A1(t, k)|Ft 1] = ηt L2

3k2η2 t n2 E[ 1

j=0 f(w(t) 0 ; π(t) j+1) F(w(t) 0 ) 2|Ft 1]

3k2η2 t n2 n k k(n 1) 1 n

i=0 f(w(t) 0 ; i + 1) F(w(t) 0 ) 2

3η2 t k(n k) n2(n 1) (A F(w(t) 0 ) 2 + σ2)

2n (A F(w(t) 0 ) 2 + σ2).

Here the second equation comes from variance of randomized reshuffling variables, (Lemma 1 in Mishchenko et al. (2020)); the first inequality is from assumption 4.3; the last inequality is because Pn 1 k=0 k(n k) = (n 1)n(n+1)

Let {Zt}t T be a sequence such that Z1 = 0 and for any t [2, T],

Zt Zt 1 = η3 t L2

2n (A F(w(t) 0 ) 2 + σ2) + ηt L2

k=0 A1(t 1, k).

We know {Zt} is a supermartingale. Since τ is a bounded stopping time, by optional stopping theorem, we have E[Zτ] E[Z1], which leads to

k=0 A1(t, k)] E[

2n (A F(w(t) 0 ) 2 + σ2)].

For A2(t, k), for any t [T], taking summation over k we have Pn 1 k=0 A2(t, k) nη2 t F(w(t) 0 ) 2, therefore

k=0 A2(t, k)] E[

2 F(w(t) 0 ) 2].

For A3(t, k), for any t < τ, by Lemma A.1 we have

A3(t, k) 3k L2η2 t n2

j=0 w(t) j w(t) 0 2.

Taking summation over k, taking expectation we have

k=0 A3(t, k)] E[

j=0 w(t) j w(t) 0 2].

Now putting these together, we have

j=0 w(t) j w(t) 0 2] E[

2n (A F(w(t) 0 ) 2 + σ2)] + E[

2 F(w(t) 0 ) 2]

j=0 w(t) j w(t) 0 2].

Since ηt 1 2L < 1

3L we have 3η3 t L4

4n , rearranging the terms we have

j=0 w(t) j w(t) 0 2] E[

t=1 2η2 t σ2] + 2n E[

t=1 η2 t (A

n + 1) F(w(t) 0 ) 2]. (8)

Revisiting Convergence: Shuffling Complexity Beyond Lipschitz Smoothness

Put this into (7) we have,

E[F(w(τ) 0 ) F ]

1 + E h τ 1 X

2 F(w(t) 0 ) 2 + ηt L2

j=0 w(t) j w(t) 0 2 i

(8) 1 + E h τ 1 X

n + 1)η3 t L2) F(w(t) 0 ) 2 i

1 + E h τ 1 X

4 F(w(t) 0 ) 2 i (9)

Here the third inequality is from ηt 1 2L

A n +1 and the last inequality is because ηt > 0 and τ T + 1.

Since PT t=1 η3 t n 1

σ2L2 , we have E[F(w(τ) 0 ) F ] 2 1.

Now we can bound the probability that τ = T + 1. Lemma A.6. With the parameters in Theorem 4.4, we have

P(τ T) δ/2.

Proof. From Lemma A.5 and the value of H we have

P(τ T) P(F(w(τ) 0 ) F > H) E[F(w(τ) 0 ) F ]

A.1.2. PROOF FOR THEOREMS IN NONCONVEX CASES

Proof for Theorem 4.4

Proof. From (9) we have

E[F(wτ 0) F ] + E h τ 1 X

4 F(w(t) 0 ) 2i 1 + L2σ2

t=1 η3 t 2 1. (10)

Therefore, since δ 1 we have

ηT E h τ 1 X

t=1 F(w(t) 0 ) 2i

P(τ = T + 1)E h T X

t=1 F(w(t) 0 ) 2|τ = T + 1]

t=1 F(w(t) 0 ) 2|τ = T + 1 i .

By Markov s inequality and our choice of T, we have

t=1 F(w(t) 0 ) 2 > ϵ2|τ = T + 1 16 1

Revisiting Convergence: Shuffling Complexity Beyond Lipschitz Smoothness

From Lemma (A.6) we have P(τ T) δ

2. Therefore,

t=1 F(w(t) 0 ) 2 > ϵ2} {τ T}

P(τ T) + P 1

t=1 F(w(t) 0 ) 2 > ϵ2|τ = T + 1)

Since L = ℓ(2G ) = Ω(G p) = Ω(n p 2 ), with η = O( 3q

T ) and T = O( n p 1

2 ϵ3 ) we have the complexity.

The following lemma is useful in the proof of arbitrary scheme.

Lemma A.7. (lemma 6 in (Nguyen et al., 2021)) For t < τ and 0 < ηt 1 L

j=0 w(t) j w(t) 0 2 nη2 t [(3A + 2) F(w(t) 0 ) 2 + 3σ2].

Proof for Theorem 4.6

Proof. From inequality (6) we have for any t < τ,

F(w(t+1) 0 ) F(w(t) 0 )

2 F(w(t) 0 ) 2 + ηt L2

k=0 w(t) k w(t) 0 2

2 F(w(t) 0 ) 2 + η3 t L2[(3A + 2) F(w(t) 0 ) 2 + 3σ2] 2

4 F(w(t) 0 ) 2 + 3η3 t L2σ2

where the second inequality is from Lemma A.7 and the last inequality is from ηt 1 L

2(3A+2). Now taking summation of

t from 1 to τ 1 we have

F(w(τ) 0 ) F F(w(τ) 0 ) F +

4 F(w(t) 0 ) 2 1 + 3L2σ2

t=1 η3 t 2 1,

where the last inequality is because τ T + 1 and the choice of ηt. Therefore we have τ = T + 1 since H 2 1. On the other hand, we also have

t=1 F(w(t) 0 ) 2

t=1 F(w(t) 0 ) 2.

Therefore, we have 1 T

t=1 F(w(t) 0 ) 2 8 1

from our choice of T.

Revisiting Convergence: Shuffling Complexity Beyond Lipschitz Smoothness

A.2. Strongly Convex Case Analysis

Lemma A.8. If we let H 3σ2

ϵ ) + 1 and ηt = η, we have τ 2 µη log( 4

Proof. From inequality (6) we have for t < τ

F(w(t+1) 0 ) F(w(t) 0 ) η

2 F(w(t) 0 ) 2 + ηL2

k=0 w(t) k w(t) 0 2

2 F(w(t) 0 ) 2 + η3L2[(3A + 2) F(w(t) 0 ) 2 + 3σ2] 2

where the last inequality is from η 1 L

2(3A+2) 1 2L. From the definition of τ we have

τ 1 + 8(H 1)

Proof for Theorem 4.8

Proof. From Lemma A.6 and the parameter choices we have P(τ T) δ

Now we try to bound F(w(τ) 0 ) F . In the strongly convex case, for t < τ we have

F(w(t+1) 0 ) F(w(t) 0 ) ηt

2 F(w(t) 0 ) 2 + L2ηt

j=0 w(t) j w(t) 0 2

F(w(t) 0 ) µηt

2 (F(w(t) 0 ) F ) ηt

4 F(w(t) 0 ) 2 + L2ηt

j=0 w(t) j w(t) 0 2,

here the first inequality is from (6) and the second one is from strongly convexity. We can rearrange the items and write the above inequality as

F(w(t+1) 0 ) F 1 µηt

(F(w(t) 0 ) F ) + L2σ2η3 t n + A(t), (11)

where A(t) is defined as

A(t) := L2ηt

j=0 w(t) j w(t) 0 2 ηt

4 F(w(t) 0 ) 2 L2σ2η3 t n . (12)

Let ηt = η := 4 log( n T )

µT , we want 1 µη

2 > 0, therefore we need T log( n T ) 2. Taking expectation and summation we have

E[F(w(τ) 0 ) F ] E[(1 µη

2 )τ 1 1] + 2L2σ2η2

nµ [1 (1 µη

2 )τ 1 t A(t)]

1E[exp( µητ/2)] + 2L2σ2η2

8 + 1 n T 2

1 + L2σ2 log2( n T)

Revisiting Convergence: Shuffling Complexity Beyond Lipschitz Smoothness

where the second inequality is from 1 x exp( x) for x (0, 1) and the last inequality is from Lemma A.8, P(τ T) δ/2 and the value of η. Now if we look at the last item, we can notice from (8), by using η 1 L

2(3A+2) 1 2L

that we already have

t=1 A(t)] 0.

Therefore, we have

8 + 1 n T 2

1 + 8L2σ2 log2( n T)

E[F(w(τ) 0 ) F ]

P(τ = T + 1)E[F(w(T +1) 0 ) F |τ = T + 1]

2E[F(w(T +1) 0 ) F |τ = T + 1].

P(F(w(T +1) 0 ) F > ϵ|τ = T + 1) E[F(w(T +1) 0 ) F |τ = T + 1]

4 + 2 ϵn T 2

1 + 8L2σ2 log2( n T)

where the last line is from the constraint on T.

Proof for Theorem 4.9

Proof. The algorithm starts from w(1) 0 and we define S = {w|F(w) F(w(1) 0 )}. Since F is strongly-convex, we have S being compact. Therefore, we can define G = maxw{ f(w; i) |w S, i [n]} < .

If we have w(t) 0 S for all t [T], we have f(w(t) 0 ; i) G for t [T] and i [n]. On the other hand, by definition of F we have F(w(t) 0 ) G for t [T]. Therefore, by Lemma A.4 we have Lipschitz smoothness between w(t) 0 and w(t) j , for both F(w) and f(w; i), for t [T], i, j [n]. The rest of the proof then follows the one in Lipschitz smoothness case (theorem 1 in Nguyen et al. (2021)).

Now we prove that w(t) 0 S, for t T. The statement is obviously true for t = 1. Now for t [2, T], assume that we already proved the conclusion for 1, , t 1, we can use Lipschitz smoothness in the first t 1 iterations. Therefore, from theorem 1 in Nguyen et al. (2021) we have that

F(w(t) 0 ) F(w ) (1 ρη)t 1 1 + Dη2

where ρ = µ

3 , D = (µ2 + L2)σ2 . On the other hand, since η 1ρ2

(1 ρη)t 1 1 + Dη2

ρ (1 ρη) 1 + Dη2

Therefore, we have F(w(t) 0 ) F(w(1) 0 ), which means w(t) 0 S.

Revisiting Convergence: Shuffling Complexity Beyond Lipschitz Smoothness

A.3. Non-strongly Convex Case Analysis

Proof for theorem 4.11

Proof. From Lemma A.6 we know P(τ T) < δ

For t < τ, if ηt = η, from lemma 7 in (Nguyen et al., 2021) we have that

w(t+1) 0 w 2 w(t) 0 w 2 2η[F(w(t) 0 ) F ] + 2Lη3

j=i f(w ; π(t) j+1) 2, (13)

where w is the optimal solution. If we denote A(t) := Pn 1 i=1 Pn 1 j=i f(w ; π(t) j+1) 2 and let σ := q

1 n Pn i=1 f(w ; i) 2, we have that for any t [T]

i=0 (n i)2E

j=i f(w ; π(t) j+1 F(w )

(n i)2i n(n i)(n 1)

j=0 f(w ; π(t) j+1) 2

= n(n + 1)σ2 6 .

By optional stopping theorem we know that

A(t) n(n + 1)σ2 6

Taking summation from t = 0 to τ 1 for (13) and taking expectation we have

t=1 (F(w(t) 0 ) F )] w(1) 0 w 2 + 2Lη3

n(n + 1)σ2 6 ]

w(1) 0 w 2 + 2LTη3σ2 3n ,

where the second inequality uses τ T + 1. Therefore, we have

w(1) 0 w 2 + 2LTη3σ2 3n

t=1 (F(w(t) 0 ) F )]

t=1 (F(w(t) 0 ) F )|τ = T + 1].

If we define w T = 1

T PT t=1 w(t) 0 , from convexity we have

F( w T ) F 1

t=1 [F(w(t) 0 ) F ].

Revisiting Convergence: Shuffling Complexity Beyond Lipschitz Smoothness

Consider the event F := {F( w T ) F > ϵ}, we have

P(F|τ = T + 1) P( 1

t=1 (F(w(t) 0 ) F ) > ϵ|τ = T + 1)

E h PT t=1(F(w(t) 0 ) F )|τ = T + 1 i

w(1) 0 w 2 + 2LTη3σ2 3n

2 ηTϵ w(1) 0 w 2

where the last two inequalities are from the choices of η and T, separately.

Proof for Theorem 4.12

Proof. Similar to Theorem 4.9, if we have w(t) 0 S for t [T], we have the desired Lipschitz smoothness.

Now we prove the conclusion by trying to prove that w(t) 0 S for t [T]. The statement is obviously true for t = 1. Now for t [2, T], assume that we already proved the conclusion for 1, , t 1, we can use Lipschitz smoothness in the first t 1 iterations. Therefore, from (13) we have

w(t) 0 w 2 w(t 1) 0 w 2 2η[F(w(t 1) 0 ) F ] + 2Lη3

j=i f(w ; π(t 1) j+1 ) 2.

If F(w(t 1) 0 ) F ϵ, we have the desired conclusion.

If F(w(t 1) 0 ) F > ϵ, since η 1 G q

L , we have

w(t+1) 0 w 2 w(t) 0 w 2 2ηϵ + 2Lη3

i=1 (n i)2G 2

w(t) 0 w 2 2ηϵ + 2Lη3

w(t) 0 w 2.

Therefore, if F(w(t) 0 ) F ϵ for t [T], we have w(t) 0 S for t [T]. Taking summation we have that

t=1 [F(w(t) 0 ) F(w )] w(1) 0 w 2 + 2LG 2η3T

Therefore we have 1 T

t=1 [F(w(t) 0 ) F(w )] 1 2ηT

w(1) 0 w 2 + 2LG 2η3T

However, this contradict the assumption that F(w(t) 0 ) F ϵ for t [T]. Therefore, there must be t [T] such that F(w(t) 0 ) F ϵ.