# lsvrg_and_lkatyusha_with_adaptive_sampling__a44390f8.pdf Published in Transactions on Machine Learning Research (03/2023) L-SVRG and L-Katyusha with Adaptive Sampling Boxin Zhao boxinz@uchicago.edu Boxiang Lyu blyu@chicagobooth.edu Mladen Kolar mkolar@chicagobooth.edu The University of Chicago Booth School of Business Reviewed on Open Review: https: // openreview. net/ forum? id= 9lyqt3rb Dc Stochastic gradient-based optimization methods, such as L-SVRG and its accelerated variant L-Katyusha (Kovalev et al., 2020), are widely used to train machine learning models. The theoretical and empirical performance of L-SVRG and L-Katyusha can be improved by sampling observations from a non-uniform distribution (Qian et al., 2021). However, designing a desired sampling distribution requires prior knowledge of smoothness constants, which can be computationally intractable to obtain in practice when the dimension of the model parameter is high. To address this issue, we propose an adaptive sampling strategy for L-SVRG and L-Katyusha that can learn the sampling distribution with little computational overhead, while allowing it to change with iterates, and at the same time does not require any prior knowledge of the problem parameters. We prove convergence guarantees for L-SVRG and L-Katyusha for convex objectives when the sampling distribution changes with iterates. Our results show that even without prior information, the proposed adaptive sampling strategy matches, and in some cases even surpasses, the performance of the sampling scheme in Qian et al. (2021). Extensive simulations support our theory and the practical utility of the proposed sampling scheme on real data. 1 Introduction We aim to minimize the following finite-sum problem: min x Rd F(x) := 1 i=1 fi(x), (1) where each fi is convex, differentiable, and Li-smooth see Assumptions 1 and 2 in Section 3. The minimization problem in (1) is ubiquitous in machine learning applications, where fi(x) typically represents the loss function on the i-th data point of a model parameterized by x. We denote the solution to (1) as x . However, due to computational concerns, it is typically solved via a first-order method (Bottou et al., 2018). When the sample size n is large, computing the full gradient F(x) can be computationally expensive, and stochastic first-order methods, such as stochastic gradient descent (SGD) (Robbins & Monro, 1951), are the modern tools of choice for minimizing (1). Since SGD iterates cannot converge to the minimizer without decreasing the stepsize due to nonvanishing variance, a number of variance-reduced methods have been proposed, such as SAG (Schmidt et al., 2017), SAGA (Defazio et al., 2014), SVRG (Johnson & Zhang, 2013), and Katyusha (Allen-Zhu, 2017). Such methods can converge to the optimum of (1) even with a constant stepsize. In this paper, we focus on L-SVRG and L-Katyusha (Kovalev et al., 2020), which improve on SVRG and Katyusha by removing the outer loop in these algorithms and replacing it with a biased coin-flip. This change simplifies parameter selection, leads to better practical performance, and allows for clearer theoretical analysis. Stochastic first-order methods use a computationally inexpensive estimate of the full gradient F(x) when minimizing (1). For example, at the beginning of round t, SGD randomly selects it [n] according to Published in Transactions on Machine Learning Research (03/2023) a sampling distribution pt over [n], and forms an unbiased estimate fit(x) of F(x). Typically, the sampling distribution pt is the uniform distribution, pt = (1/n, , 1/n), for all t. However, using a nonuniform sampling distribution can lead to faster convergence (Zhao & Zhang, 2015; Needell et al., 2016; Qian et al., 2019; Hanzely & Richtárik, 2019; Qian et al., 2021). For instance, when the sampling distribution is p IS = (p IS 1 , , p IS n ), with p IS i = Li/(Pn i=1 Li) = Li/(n L), the convergence rate of L-SVRG and L-Katyusha can be shown to depend on the average smoothness L := (1/n) Pn i=1 Li, instead of the maximum smoothness Lmax := max1 i n Li (Kovalev et al., 2020). Sampling from a non-uniform distribution is commonly referred to as importance sampling (IS). While sampling observations from p IS can improve the speed of convergence, p IS depends on the smoothness constants {Li}i [n]. In general, these constants are not known in advance and need to be estimated, for example, by computing supx Rd λmax( 2fi(x)), i [n], where λmax( ) denotes the largest eigenvalue of a matrix. However, when the dimension d is large, it is computationally prohibitive to estimate the smoothness constants, except in some special cases such as linear and logistic regression. In this paper, we develop a method to design a sequence of sampling distributions that leads to the convergence rate of L-SVRG and L-Katyusha that depends on L, instead of Lmax, without prior knowledge of {Li}i [n]. Instead of designing a fixed sampling distribution, where pt p for all t, we design a dynamic sampling distribution that can change with iterations of the optimization algorithm. We follow a recent line of work that formulates the design of the sampling distribution as an online learning problem (Salehi et al., 2017; Borsos et al., 2019; Namkoong et al., 2017; Hanchi & Stephens, 2020; Zhao et al., 2021). Using the gradient information obtained in each round, we update the sampling distribution with minimal computational overhead. This sampling distribution is subsequently used to adaptively sample the observations used to compute the stochastic gradient. When the sequence of designed distributions is used for importance sampling, we prove convergence guarantees for L-SVRG, under both strongly convex and weakly convex settings, and for L-Katyusha under the strongly convex setting. These convergence guarantees show that it is possible to design a sampling distribution that not only performs as well as p IS but can also improve over it without using prior information. We focus on comparing with p IS as it is the most widely used fixed sampling distribution (Qian et al., 2021) and leads to the best-known convergence rates with fixed sampling distribution (Zhao & Zhang, 2015; Needell et al., 2016). Contributions. Our paper makes the following contributions. We propose an adaptive sampling algorithm for L-SVRG and L-Katyusha that does not require prior information, such as smoothness constants. This is the first practical sampling strategy for these algorithms. We prove convergence guarantees for L-SVRG under both strong and weak convexity, and for L-Katyusha under strong convexity, using a sequence of sampling distributions that changes with iterations. These theoretical results show when the sequence of sampling distributions performs as well as p IS, and even outperforms it in some cases. Our numerical experiments support these findings. We also show that the control variate technique in SVRG and adaptive sampling reduce variance from different aspects, as demonstrated in a simulation. We conduct extensive simulations to provide empirical support for various aspects of our theory and real data experiments to demonstrate the practical benefits of adaptive sampling. Given its low computational cost and superior empirical performance, we suggest that our adaptive sampling should be considered as the default alternative to the uniform sampling used in L-SVRG and L-Katyusha. Related work. Our paper contributes to the literature on non-uniform sampling in first-order stochastic optimization methods. Previous work, such as Zhao & Zhang (2015), Needell et al. (2016), and Qian et al. (2021), studied non-uniform sampling in SGD, stochastic coordinate descent, and L-SVRG and LKatyusha, respectively, but focused on sampling from a fixed distribution. In contrast, we allow the sampling distribution to change with iterates, which is important as the best sampling distribution changes with iterations. Shen et al. (2016) studied adaptive sampling methods for variance-reducing stochastic methods, such as SVRG and SAGA, but their approach requires computing all gradients { fi(xt)}n i=1 at each step, which is impractical. Our method only requires computing the stochastic gradient fit(xt). The sampling distribution can be designed adaptively using an online learning framework (Namkoong et al., 2017; Salehi et al., 2017; Borsos et al., 2018; 2019; Hanchi & Stephens, 2020; Zhao et al., 2021). We call this process adaptive sampling, and its goal is to minimize the cumulative sampling variance, which appears in the convergence rates of L-SVRG and L-Katyusha (see Section 3). More specifically, Namkoong et al. (2017) Published in Transactions on Machine Learning Research (03/2023) and Salehi et al. (2017) designed the sampling distribution by solving a multi-armed bandit problem with the EXP3 algorithm. Borsos et al. (2018) took an online convex optimization approach and made updates to the sampling distribution using the follow-the-regularized-leader algorithm. Borsos et al. (2019) considered the class of distributions that is a linear combination of a set of given distributions and used an online Newton method to update the weights. Hanchi & Stephens (2020) and Zhao et al. (2021) investigated nonstationary approaches to learning sampling distributions. Among these works, Zhao et al. (2021) is the only one that compared their sampling distribution to a dynamic comparator that can change with iterations without requiring stepsize decay. While our theory quantifies the effect of any sampling distribution on the convergence rate of L-SVRG and L-Katyusha, we use the OSMD sampler and Ada OSMD sampler from Zhao et al. (2021), as they lead to the best upper bound and yield the best empirical performance. Notation. For a positive integer n, let [n] := {1, , n}. We use to denote the l2-norm in the Euclidean space. Let Pn 1 = {x Rn : Pn i=1 xi = 1, xj 0, j [n]} be the (n 1)-dimensional simplex. For a symmetric matrix A Rd d, we use λmax(A) to denote its largest eigenvalue. For a vector x Rd, we use xj or x[j] to denote its j-th entry. For two sequences {an} and {bn}, an = O(bn) if there exists C > 0 such that |an/bn| C for all n large enough; an = Θ(bn) if an = O(bn) and bn = O(an) simultaneously. Organization of the paper. In Section 2, we introduce the algorithm for designing the sampling distribution. In Section 3, we give the convergence analysis. Extensive simulations that demonstrate various aspects of our theory are given in Section 4. Section 5 illustrates an application to real world data. Finally, we conclude the paper with Section 6. 2 AS-LSVRG and AS-LKatyusha To solve (1) using SGD, one iteratively samples it uniformly at random from [n] and updates the model parameter by xt+1 xt ηt fit(xt). However, due to the non-vanishing variance V[ fit(xt)], xt cannot converge to x unless one adopts a diminishing step size by letting ηt 0. To address this issue, LSVRG (Kovalev et al., 2020) constructs an adjusted estimated of the gradient gt = fit(xt) fit(wt) + F(wt), where wt is a control variate that is updated to xt with probability ρ in each iteration. Note that gt is still an unbiased estimate of F(xt). Since both xt and wt converge to x , we have V[gt] 0, and thus xt can converge to x even with a constant step size. L-Katyusha incorporates a Nesterov-type acceleration to improve the dependency of the computational complexity on the condition number under the strongly convex setting (Kovalev et al., 2020). Qian et al. (2021) investigated sampling it from [n] using a non-uniform sampling distribution to achieve faster convergence. Given the model parameter xt at iteration t, suppose that it is sampled from the distribution pt = (pt 1, . . . , pt n). Then gt = 1 npt it fit(xt) fit(wt) + F(wt) is an unbiased estimate of F(xt). The variance of gt is V gt = V t e pt F(xt) F(wt) 2 , V t e pt := 1 fi(xt) fi(wt) 2 . (2) We let V t (pt) := V [gt] be the sampling variance of the sampling distribution pt, and V t e (pt) be the effective variance. Therefore, in order to minimize the variance of gt, we can choose pt to minimize V t e (pt). Let pt = arg minp Pn 1 V t e (pt) be the oracle optimal dynamic sampling distribution at the t-th iteration, which has the closed form pt , i = fi(xt) fi(wt) Pn j=1 fj(xt) fj(wt) , i [n]. (3) However, we cannot compute pt in each iteration, since computing it requires knowledge of all { fi(xt)}n i=1 and { fi(wt)}n i=1. If that were the case, we could simply use full-gradient descent, and there would be no Published in Transactions on Machine Learning Research (03/2023) Algorithm 1 AS-LSVRG 1: Input: stepsizes {η}t 1, ρ (0, 1]. 2: Initialize: x0 = w0; p0 = (1/n, , 1/n). 3: for t = 0, 1, , T 1 do 4: Sample it from [n] with pt = (pt 1, , pt n). 5: gt = 1 npt it ( fit(xt) fit(wt)) + F(wt). 6: xt+1 = xt ηtgt. ( xt with probability ρ, wt with probability 1 ρ. 8: Update pt to pt+1 by OSMD sampler (Algorithm 3) or Ada OSMD sampler (Algorithm 4). 9: end for Algorithm 2 AS-LKatyusha 1: Input: stepsizes {η}t 1, ρ (0, 1], θ1, θ2 [0, 1], 0 < κ < 1, L > 0. 2: Initialize: v0 = w0 = z0. 3: for t = 0, 1, , T 1 do 4: xt = θ1zt + θ2wt + (1 θ1 θ2)vt. 5: Sample it from [n] with pt = (pt 1, , pt n). 6: gt = 1 npt it ( fit(xt) fit(wt)) + F(wt). 7: zt+1 = 1 1+ηtκ ηtκxt + zt ηt 8: vt+1 = xt + θ1(zt+1 zt). ( vt with probability ρ, wt with probability 1 ρ. 10: Update pt to pt+1 by OSMD sampler (Algorithm 3) or Ada OSMD sampler (Algorithm 4). 11: end for need for either sampling or control variate. Therefore, some kind of approximation of pt is unavoidable for practical purposes. Qian et al. (2021) proposed substituting each fi(xt) fi(wt) with its upper bound. Based on the smoothness assumption (Assumption 2 in Section 3), we have fi(xt) fi(wt) Li xt wt . Thus, by substituting fi(xt) fi(wt) with Li xt wt in (2), we obtain an approximate sampling distribution p IS = (p IS 1 , , p IS n ), with p IS i = Li/(Pn i=1 Li) = Li/(n L). L-SVRG and L-Katyusha that use p IS can achieve faster convergence compared to using uniform sampling (Qian et al., 2021). However, one difficulty of applying p IS in practice is that we need to know Li for all i = 1, . . . , n. While such information can be easy to access in some cases, such as in linear and logistic regression problems, it is generally hard to estimate, especially when the dimension of the model parameter is high. To circumvent this problem, recent work has formulated the design of the sampling distribution as an online learning problem (Salehi et al., 2017; Borsos et al., 2019; Namkoong et al., 2017; Hanchi & Stephens, 2020; Zhao et al., 2021). More specifically, at each iteration t, after sampling it with sampling distribution pt, we can receive information about fit(xt) fit(wt) . Although we cannot have fi(xt) fi(wt) for all i = 1, . . . , n, the partial information obtained from { fis(xs) fis(ws) }t s=0 and {ps}t s=0 is helpful in constructing the sampling distribution pt+1 to minimize V t e (pt). In this paper, we adapt the methods proposed in Zhao et al. (2021) for L-SVRG and L-Katyusha and apply them in our experiments; however, our analysis is not restrictive to this choice and can fit other methods as well. We introduce our modifications of L-SVRG and L-Katyusha that use adaptive sampling, namely Adaptive Sampling L-SVRG (AS-LSVRG, Algorithm 1) and Adaptive Sampling L-Katyusha (AS-LKatyusha, Algorithm 2). The key change here is that instead of using a fixed sampling distribution pt p, t 0, we allow the sampling distribution to change with iterations and adaptively learn it. More specifically, Step 8 of Algorithm 1 and Step 10 of Algorithm 2 use OSMD sampler or Ada OSMD sampler (Zhao et al., 2021) to update the sampling distribution, which are described in Algorithm 3 and Algorithm 4, respectively. While Published in Transactions on Machine Learning Research (03/2023) Algorithm 3 OSMD sampler 1: Input: Learning rate η; parameter α (0, 1], A = PM 1 [α/M, )M; number of iterations T. 2: Output: pt for t = 1, . . . , T. 3: Initialize: p1 = (1/n, . . . , 1/n). 4: for t = 1, 2, . . . , T 1 do 5: Sample it from [n] by pt. Let at it = fit(xt) fit(wt) 2. 6: Compute the sampling loss gradient estimate ˆV t e (pt) Rn: all entries are zero except for the it-th entry, which is h ˆV t e (pt) i n2 at it (pt it)3 . (4) 7: Solve pt+1 = arg min p A η p, ˆV t e (pt) + DΦ p pt using Algorithm 5 with the learning rate η. Algorithm 4 Ada OSMD sampler 1: Input: Meta-algorithm learning rate γ; expert learning rates E = {η1 η2 ηH}; α (0, 1]; A = Pn 1 [α/n, )n. Number of iterations T. 2: Output: pt for t = 1, . . . , T. 3: Set θ1 h = (1 + 1/H)/(h(h + 1)), h [H]. 4: Initialize: p1 h = (1/n, . . . , 1/n) for h [H]. 5: for t = 1, 2, . . . , T 1 do 6: Compute pt = PH h=1 θt hpt h. 7: Sample it from [n] by pt. Let at it = fit(xt) fit(wt) 2. 8: for h = 1, 2, . . . , H do 9: Compute the sampling loss estimate ˆV t e (pt h; pt) = 1 n2 at it pt itpt h,it . (5) 10: Compute the sampling loss gradient estimate ˆV t e (pt h; pt) Rn: all entries are zero except for the it-th entry, which is h ˆV t e (pt h; pt) i n2 at it pt it(pt h,it)2 . (6) 11: Solve pt+1 h = arg minp A ηh p, ˆV t e (pt h; pt) + DΦ (p pt h) using Algorithm 5 with the learning rate ηh. 12: end for 13: Update the weights of each expert θt+1 h = θt h exp n γ ˆV t e (pt h; pt) o PH h=1 θt h exp n γ ˆV te (pt h; pt) o, h [H]. 14: end for the OSMD sampler and Ada OSMD sampler allow for choosing a mini-batch of samples in each iteration, here we focus on choosing only one sample in each iteration. We choose Φ to be the unnormalized negative entropy, that is, Φ(x) = Pn i=1 xi log xi Pn i=1 xi, x = (x1, . . . , xn) [0, )n, with 0 log 0 defined as 0. Additionally, DΦ (x y) = Φ(x) Φ(y) Φ(y), x y is the Bregman divergence between any x, y (0, )n with respect to the function Φ. Published in Transactions on Machine Learning Research (03/2023) Algorithm 5 OSMD Solver: Solve pt+1 = arg minq A η q, ˆut + DΦ(q pt) 1: Input: pt, ˆut, A = Pn 1 [α/n, )n. Learning rate η. 2: Output: pt+1. 3: Let pt+1 i = pt i exp ( ηˆut i) for i [n]. 4: Sort { pt+1 i }n i=1 in a non-decreasing order: pt+1 π(1) . . . pt+1 π(n). 5: Let vi = pt+1 π(i) 1 i 1 n α for i [n]. 6: Let zi = α n Pn j=i pt+1 π(j) for i [n]. 7: Find the smallest i such that vi > zi, denoted as i . 8: Let pt+1 i = ( α/n if π(i) < i (1 ((i 1)/n)α) pt+1 i / Pn j=i pt+1 π(j) otherwise. The key insight of the OSMD Sampler is to use Online Stochastic Mirror Descent (Lattimore & Szepesvári, 2020) to minimize the cumulative sampling loss PT t=1 V t e (pt), where V t e (pt) is defined in(2). To apply OSMD, we first construct an unbiased estimate of the gradient of V t e (pt), which is shown in (4). Then, in Step 7, we update the sampling distribution by taking a mirror descent. Intuitively, the optimization objective in Step 7 involves two terms. The first term encourages the sampling distribution to fit the most recent history, while the second term ensures that it does not deviate too far from the previous decision. By choosing a learning rate η, we keep a trade-off between these two concerns. A larger learning rate implies a stronger fit towards the most recent history. To automatically choose the best learning rate, Ada OSMD uses a set of expert learning rates and combines them using exponentially weighted averaging. Note that the total number of iterations T is assumed to be known and used as an input to Ada OSMD. When the number of iterations T is not known in advance, Zhao et al. (2021) proposed a doubling trick, which could also be used here. The set of expert learning rates is given by h = 1, 2, . . . , H 1 + 4 log(n/α) log n (T 1) + 1. (8) The learning rate in Ada OSMD is set to γ = α 8 T a1 , where a1 = maxi [n] fi(x0) . For all experiments in this paper, we set α = 0.4. The main computational bottleneck of both the OSMD sampler and the Ada OSMD sampler is the mirror descent step. Fortunately, Step 7 of Algorithm 3 and Step 11 of Algorithm 4 can be efficiently solved by Algorithm 5. The main cost of Algorithm 5 comes from sorting the sequence { pt+1 i }n i=1, which can be done with the computational complexity of O(n log n). However, note that we only update one entry of pt to get pt+1 and pt is sorted in the previous iteration. Therefore, most entries of pt+1 are also sorted. Using this observation, we can usually achieve a much faster running time, for example, by using an adaptive sorting algorithm (Estivill-Castro & Wood, 1992). 3 Convergence analysis We provide convergence rates for AS-LSVRG (Algorithm 1) and AS-LKatyusha (Algorithm 2), for any sampling distribution sequence {pt}t 0. We begin by imposing assumptions on the optimization problem in (1). Assumption 1 (Convexity). For each i [n], the function fi( ) is convex and first-order continuously differentiable: fi(x) fi(y) + fi(y), x y for all x, y Rd. Published in Transactions on Machine Learning Research (03/2023) Assumption 2 (Smoothness). For each i [n], the function fi is Li-smooth: fi(x) fi(y) Li x y for all x, y Rd. Furthermore, the function F is LF -smooth: F(x) F(y) LF x y for all x, y Rd. Recall that L = (1/n) Pn i=1 Li and Lmax = max1 i n Li. By the convexity of and Jensen s inequality, we have that LF L. For some results, we will assume that F is strongly convex. Assumption 3 (Strong Convexity). The function F( ) is µ-strongly convex: F(x) F(y) + F(y), x y + µ for all x, y Rd, where µ > 0. Additionally, the optimization heterogeneity is defined as i=1 fi(x ) 2, (9) and the smoothness heterogeneity is defined as Lmax/ L. 3.1 Convergence analysis of AS-LSVRG We begin by providing a convergence rate for AS-LSVRG (Algorithm 1) under strong convexity. Let fi(wt) fi(x ) 2 . (10) Roughly speaking, Dt measures the weighted distance between control-variates wt and the minimizer x , where the weights are the inverse of Lipschitz constants. Theorem 1. Suppose Assumptions 1-3 hold. Let ηt η for all t, where η 1/(6 L + LF ), and let α1 := max n 1 ηµ, 1 ρ E x T x 2 + 4η2 L ρ DT αT 1 E x0 x 2 + 4η2 L ρ D0 + η2 T X t=0 αT t 1 E V t e pt V t e p IS . See proof in Appendix A.1. From the convergence rate in Theorem 1, we observe that a good sampling distribution sequence should minimize the cumulative sampling variance PT t=0 αT t 1 E [V t e (pt)]. This justifies the usage of Ada OSMD to design a sequence of sampling distributions, as its purpose is to minimize the cumulative sampling variance (Zhao et al., 2021). When t=0 αT t 1 E V t e pt V t e p IS = O αT , (11) the iteration complexity to achieve ϵ-accuracy is O(1/(log(1/α1)) log(1/ϵ)). When ρ = 1/n, η = 1/(6 L+LF ), and both L/µ and n are large, this bound is O((n+ L/µ) log(1/ϵ)), which recovers the complexity of L-SVRG when sampling from p IS (Qian et al., 2021). Published in Transactions on Machine Learning Research (03/2023) When (11) holds, we can further compare the iteration complexity of AS-LSVRG with the iteration complexity of SGD with importance sampling from p IS, which is O((σ2 /(µ2ϵ) + L/µ) log(1/ϵ)), where σ2 is defined in (9) (Needell et al., 2016), and the iteration complexity of L-SVRG, which is O((n + Lmax/µ) log(1/ϵ)) (Kovalev et al., 2020). First, we observe that the iteration complexities of AS-LSVRG and L-SVRG do not depend on σ2 , while the iteration complexity of SGD does. This shows that the control-variate improves upon optimization heterogeneity. Second, we observe that both iteration complexities of AS-LSVRG and SGD depend on L, while the iteration complexity of L-SVRG depends on Lmax. This shows that adaptive sampling improves upon smoothness heterogeneity. Based on these two observations, we have the following important takeaway: While both the control-variate and adaptive sampling are reducing the variance of stochastic gradient, the control-variate is improving upon optimization heterogeneity, and adaptive sampling is improving upon smoothness heterogeneity. Another important observation is that when pt = pt , we have V t e (pt ) V t e p IS . Therefore, the performance of the oracle optimal dynamic sampling distribution is at least as good as the fixed sampling distribution p IS. The gains from using a dynamic sampling distribution can be significant, as we show in experiments in Section 4 and Section 5. While the closed form of pt in (3) requires knowledge of fi(xt) fi(wt), which is not available in practice, we can minimize the cumulative sampling variance PT t=1 V t e (pt) sequentially using Ada OSMD, which results in the approximation pt, without the need for prior information. We discuss in Section 3.3 below when this adaptive sampling strategy can perform better than p IS. The following result provides the convergence rate when F(x) is weakly convex. Theorem 2. Suppose Assumptions 1 and 2 hold. Let ηt η for all t, where η 1/(6LF ), and let ˆx T = (1/T) PT t=1 xt. Then E F(ˆx T ) F(x ) 4 T F(x0) F(x ) x0 x 2 + 12η L(1 ρ) 5ρ F(w0) F(x ) + 3η t=0 E V t e pt V t e p IS . See proof in Appendix A.2. In the weakly convex case, the cumulative sampling variance is defined as PT t=0 E [V t e (pt)], and a good sampling distribution sequence should minimize it. When η = 1/(6LF ), ρ = 1/n, and PT t=0 E V t e (pt) V t e p IS = O(T(LF + n)), the iteration complexity to reach ϵ-accuracy is O((LF + n)(1/ϵ)), which recovers the rate of L-SVRG when sampling from p IS Qian et al. (2021). 3.2 Convergence analysis of AS-LKatyusha We prove a convergence rate for AS-LKatyusha (Algorithm 2) under strong convexity. Let Zt := L(1 + ηtκ) F(vt) F(x ) , Wt := θ2(1 + θ1) F(wt) F(x ) , and Ψt := Zt + Vt + Wt. We then have the following theorem. See proof in Appendix A.3. Theorem 3. Suppose Assumptions 1-3 hold. Let ηt η for all t, where η = ((1 + θ2)θ1) 1θ2, and κ = µ/L with L = L. Let θ2 = 1/2, θ1 1/2, and α2 := max 1 1 + ηκ, 1 θ1 2 , 1 ρθ1 1 + θ1 Published in Transactions on Machine Learning Research (03/2023) E ΨT αT 2 Ψ0 + 1 4 Lθ1 t=0 αT t 1 2 E V t e pt V t e p IS . The cumulative sampling variance is defined as PT 1 t=0 αT t 1 2 E [V t e (pt)], and can be used as the minimization objective to design a sequence of sampling distributions. When ρ = 1/n, θ1 = min{ p 2κn/3, 1/2}, and PT 1 t=0 αT t 1 2 E V t e (pt) V t e p IS = O(αT 2 ), then the iteration complexity to reach ϵ-accuracy is O((n + q n L/µ) log(1/ϵ)), which recovers the rate of L-Katyusha when sampling from p IS Qian et al. (2021). Additionally, when compared with the rate of L-Katyusha Kovalev et al. (2020), we see that the dependency on Lmax is improved to L, which is consistent with our conclusion in Section 3.1 that adaptive sampling is responsible for improving smoothness heterogeneity. 3.3 Benefits of adaptive sampling We analyze when adaptive sampling will improve over sampling from p IS. We first emphasize that sampling from p IS requires knowledge of Lipschitz constants {Li}i [n], which, in general, are expensive to compute. On the other hand, the additional computational cost of adaptive sampling is usually comparable to the cost of computing a stochastic gradient. In addition to computational benefits, there are certain settings where adaptive sampling may result in improved convergence, despite not using prior information. A key quantity to understand is t=0 αT E V t e p IS V t e pt , where α {α1, α2, 1}, depending on the algorithm used and the assumptions made. The larger V p1:T is, the more beneficial adaptive sampling is. In the following, we discuss when V (p1:T ) is large. Although p1:T is not available in practice, V (p1:T ) can be used to understand when adaptive sampling methods that approximate pt will be superior to using p IS for importance sampling. In many machine learning applications, fi(x) has the form fi(x) = l(x, ξi), where ξi is the i-th data point. Let x i Rd be such that l(x i , ξi) = 0. Then fi(x) = l(x, ξi) l(x i , ξi) . This way, we see that the variability of norms of gradients of different data points has two sources: the first source is the difference between ξi s, the second source is the difference between x i s. We name the first source as the context-shift and the second source as the concept-shift. When fi(x) is twice continuously differentiable, we have Li = sup x Rd λmax 2fi(x) = sup x Rd λmax 2li(x, ξi) . Thus, when we use p IS to sample, we ignore the concept-shift and only leverage the context-shift with the sampling distribution. As a result, p IS is most useful when the context-shift dominates. On the other hand, adaptive sampling takes both the concept-shift and context-shift into consideration. When the major source of gradient norm differences is the concept-shift, adaptive sampling can perform better than sampling from p IS. This is illustrated in Section 4.3. 4 Synthetic data experiment We use synthetic data to illustrate our theory and compare several different stochastic optimization algorithms. We denote L-SVRG + uniform sampling as L-SVRG, L-SVRG + oracle optimal sampling as Optimal-LSVRG, and L-SVRG + sampling from p IS as IS-LSVRG. Similarly, we define SGD, Optimal-SGD, IS-SGD, L-Katyusha, Optimal-LKatyusha, and IS-LKatyusha. Additionally, AS-LSVRG and AS-LKatyusha refer to Algorithm 1 and Algorithm 2 with the Ada OSMD sampler (Algorithm 4), respectively, except in Section 4.4, where we use the OSMD Sampler (Algorithm 3). Published in Transactions on Machine Learning Research (03/2023) We set ρ = 1/n for all algorithms. The algorithm parameters for L-Katyusha with all sampling strategies are set according to Theorem 3, where L = L for Optimal-LKatyusha and IS-LKatyusha, and L = Lmax for L-SVRG. For AS-LKatyusha, we set L = 0.4Lmax + 0.6 L. As for the parameters of Ada OSMD, they are configured as stated in Section 2; when choosing a mini-batch of samples in each iteration, we set them according to Zhao et al. (2021). Data generation: We generate data from a linear regression model: bi = θ , ai + ζi, where ai i.i.d. N(0, si Σ) with Σ = diag(25 0 d 1 1, , 25 d 1 d 1 1) and si i.i.d. e N(0,ν2), ζi i.i.d. N(0, σ2), and the entries of θ are generated i.i.d. from N(10.0, 3.02). We let fi(x) := l(x; ai, bi), where l(x; ai, bi) := (1/2)(bi x, ai )2 is the square error loss. In this setting, the variance σ2 controls the optimization heterogeneity in (9), with larger σ2 corresponding to larger optimization heterogeneity, while ν controls the smoothness heterogeneity, with larger ν corresponding to larger smoothness heterogeneity. Under this model, the variability of the gradient norms is primarily caused by the differences between bi s, which corresponds to the context-shift. As a result, we expect that sampling according to p IS would perform similarly to oracle optimal sampling. Note that in this setting, we have Li = ai 2, so we set p IS i = ai 2/(Pn j=1 aj 2) for all i = 1, . . . , n. We set n = 100, d = 10, and report results averaged across 10 independent runs. 4.1 SGD v.s. L-SVRG We compare SGD and Optimal-SGD with L-SVRG and Optimal-LSVRG. From the results in Figure 1, we have three main observations. First, with large optimization heterogeneity (rightmost column), Optimal LSVRG converges faster and can achieve a smaller optimal value compared to Optimal-SGD. This observation is consistent with our conclusion in Section 3.1 that the control variate is responsible for improving optimization heterogeneity. Second, Optimal-LSVRG always improves the performance over L-SVRG, with the largest improvement observed when the smoothness heterogeneity is large (bottom row). This observation illustrates our conclusion that importance sampling can improve smoothness heterogeneity. Finally, we observe that L-SVRG is more vulnerable to the smoothness heterogeneity compared to SGD, which can also be seen from the condition on the step size: we need η 1/(6Lmax) for L-SVRG (Theorem 5 of Kovalev et al. (2020)) and we only need η 1/Lmax for SGD (Theorem 2.1 of Needell et al. (2016)) to ensure convergence. 4.2 Non-uniform sampling for L-SVRG and L-Katyusha We compare L-SVRG and L-Katyusha with different sampling strategies. Figure 2 shows results for L-SVRG. We observe that the performances of Optimal-LSVRG and IS-LSVRG are similar, since the context-shift dominates the variability of the gradient norms. Furthermore, we see that adaptive sampling improves the performance of L-SVRG compared to uniform sampling. The improvement is most significant when the smoothness heterogeneity is large (bottom row). Figure 3 shows results for L-Katyusha. We set the step size according to Theorem 3. The oracle optimal sampling distribution results in considerable improvement over sampling from p IS after adding acceleration. In addition, we note that adaptive sampling efficiently improves over uniform sampling. 4.3 Importance sampling v.s. adaptive sampling We provide an example where adaptive sampling can perform better than sampling from p IS. We generate data from a linear regression model bi = θ , ai + ζi, where ζi i.i.d. N(0, 0.52) and, for each ai Rd, we choose uniformly at random one dimension, denoted as supp(i) [d], and set it to a nonzero value, while the remaining dimensions are set to zero. The nonzero value ai[supp(i)] is generated from N(1.0, 0.12). The entries of θ are generated i.i.d. from e N(0,ν2). Therefore, ν controls the variance of entries of θ . We let n = 300 and d = 30. In this setting, we have Li = ai 2 = |ai[supp(i)]|2 1.0, and thus sampling from p IS will perform similarly to uniform sampling. On the other hand, we have fi(x) = |(x θ ) [supp(i)] ai[supp(i)] + ζi| . Published in Transactions on Machine Learning Research (03/2023) Figure 1: Comparison of four methods: SGD, Optimal-SGD, L-SVRG, and Optimal-LSVRG. Columns correspond to different σ values, while rows correspond to different ν values. The stepsize the same for all algorithms, and is 0.1 when ν = 0, is 0.05 when ν = 0.5, and is 0.005 when ν = 1.0. Thus, the variability of the gradient norms is mainly determined by the variance of entries of θ . For each i [n], we can understand fi as a separate univariate quadratic function with the minimizer θ [supp(i)], and the variance of entries of θ can be understood as the concept-shift. In this case, we expect that sampling from p IS will not perform as well as oracle optimal sampling or adaptive sampling. We implement Optimal-LSVRG, IS-LSVRG, and AS-LSVRG with the stochastic gradient obtained from a mini-batch of size 5, rather than choosing only one random sample, to allow adaptive sampling to explore more efficiently.1 The step size is set to 0.3. Figure 4 presents the results. We see that as ν increases, the gap between oracle optimal sampling and sampling from p IS increases as well, due to the concept-shift. In addition, we see that adaptive sampling also performs better than sampling from p IS, despite the fact that it does not use prior knowledge, since adaptive sampling can asymptotically approximate oracle optimal sampling. 1Ada OSMD relies on the feedback obtained by exploration to update the sampling distribution. A larger batch size will allow adaptive sampling to explore more efficiently (in other words, to see more samples in each iteration). Compared with the fixed sampling distribution, where a larger batch size is only reducing the variance of a stochastic gradient, a larger batch size will also help adaptive sampling to make faster updates of the sampling distribution. Therefore, the adaptive sampling strategy is generally more sensitive to batch size than sampling with a fixed distribution. Published in Transactions on Machine Learning Research (03/2023) Figure 2: Comparison of four methods: L-SVRG, Optimal-LSVRG, IS-LSVRG, AS-LSVRG. Columns correspond to different σ values, and rows correspond to different ν values. The stepsize is the same for all algorithms, and is 0.1 when ν = 0, is 0.05 when ν = 0.5, and is 0.005 when ν = 1.0. 4.4 Nonconvex Objective In this section, we compare L-SVRG, IS-LSVRG, and AS-LSVRG with nonconvex objectives under a similar setting as in Section 4.2. We increase d to 100 and n to 1000. Instead of fitting the data with linear regression, we use a two-layer neural network with 10 neurons in the hidden layer. While we still minimize the mean squared error loss, the objective function is now nonconvex due to the nonconvexity of the neural network model. To estimate p IS, we still set p IS i = ai 2/(Pn j=1 aj 2) as in Section 4.2. For AS-LSVRG, we use the OSMD Sampler (Algorithm 3). Both the optimization step size and the learning rate of the OSMD Sampler are tuned such that AS-LSVRG converges at the fastest speed. The result is shown in Figure 5. We see that adaptive sampling still obtains an advantage over uniform sampling and importance sampling, especially when the smoothness heterogeneity is large. It is worth noting that p IS does not perform well in this case. We suspect that this is because ai 2 is a poor estimate of Li in this case; however, it is unclear if there exists an easy way to accurately estimate Li with nonconvex models. This result justifies the motivation of adaptive sampling since it can achieve advantageous performance over uniform sampling without the need to estimate the smoothness constants. Published in Transactions on Machine Learning Research (03/2023) Figure 3: Comparison of four methods: L-Katyusha, Optimal-LKatyusha, IS-LKatyusha, AS-LKatyusha. Columns correspond to different σ values, and rows correspond to different ν values. The stepsizes are set based on Theorem 3. Figure 4: Optimal-LSVRG v.s. IS-LSVRG v.s. AS-LSVRG. Columns correspond to different ν values. Published in Transactions on Machine Learning Research (03/2023) Figure 5: Comparison of L-SVRG, IS-LSVRG and AS-LSVRG with nonconvex objective. Columns correspond to different σ values, and rows correspond to different ν values. The stepsize of each method is tuned such that the method converges in the fastest speed. 5 Real data experiment We use the w8a dataset from Lib SVM classification tasks Zeng et al. (2008); Chang & Lin (2011). On a real dataset, obtaining the theoretically optimal sampling distribution is infeasible, while constructing p IS requires access to Lipschitz constants of each loss function. Therefore, here we only show the performance of L-SVRG and AS-LSVRG on the following logistic regression problem: i=1 (yi log pi + (1 yi) log(1 pi)), where pi(x) = pi = (1 + exp x T zi) 1, yi {0, 1} is the response variable, and zi is the ddimensional feature vector. The stepsizes for both L-SVRG and AS-SVRG are initially tuned over the grid {10 2, 10 1.5, . . . , 102}. The initial search showed us that the optimal stepsize should be in the interval (0, 1). Therefore, we tune the stepsizes over a grid of 20 evenly spaced points on [0.05, 1]. The two algorithms are then used to train the model for 1000 iterations, repeated 10 times, and the best stepsize is chosen by picking the one that corresponds to the lowest loss at the 1000-th iteration. Published in Transactions on Machine Learning Research (03/2023) 0 200 400 600 800 1000 Avg. log(loss) w8a, Batch Size = 1 0 200 400 600 800 1000 1.8 Avg. log(loss) w8a, Batch Size = 5 LSVRG AS-LSVRG Figure 6: LSVRG v.s. AS-LSVRG. Columns correspond to different batch sizes. 0 200 400 600 800 1000 Avg. log(loss) w8a, Batch Size = 1 0 200 400 600 800 1000 Avg. log(loss) w8a, Batch Size = 5 LKatyusha AS-LKatyusha Figure 7: L-Katyusha v.s. AS-LKatyusha. Columns correspond to different batch sizes. The stepsizes are set according to Theorem 3.2 from (Qian et al., 2021) and Theorem 3 in this paper. Figure 6 corresponds to the average log cross entropy loss over 10 runs against the number of iterations. The shaded region corresponds to the standard deviation of the loss. When the batch size is 1, AS-LSVRG and L-SVRG have similar convergence behaviour, but the standard deviation is reduced for AS-LSVRG. When the batch size is 5, AS-LSVRG significantly outperforms L-SVRG. We illustrate the performance of L-Katyusha and AS-LKatyusha by solving the following ℓ2-regularized optimization problem: i=1 (yi log pi + (1 yi) log(1 pi)) + µ where pi = pi(x) has the form as before and µ = 10 7 to ensure that the problem is strongly convex. Figure 7 shows results over 10 runs. AS-LKatyusha significantly outperforms its uniform sampling counterpart. While some of the improvement in performance could be attributed to our superior dependence on the Lipschitz constant, the losses we obtain enjoy slightly reduced variances. Published in Transactions on Machine Learning Research (03/2023) 6 Conclusion and future directions We studied the convergence behavior of L-SVRG and L-Katyusha when non-uniform sampling with a dynamic sampling distribution is used. Compared to previous research, we do not restrict ourselves to a fixed sampling distribution but allow it to change with iterations. This flexibility enables us to design the sampling distribution adaptively using the feedback from sampled observations. We do not need prior information, which can be computationally expensive to obtain in practice, to design a well-performing sampling distribution. Therefore, our algorithm is practically useful. We derive upper bounds on the convergence rate for any sampling distribution sequence for both L-SVRG and L-Katyusha under commonly used assumptions. Our theoretical results justify the usage of online learning to design the sequence of sampling distributions. More interestingly, our theory also explains when adaptive sampling with no prior knowledge can perform better than a fixed sampling distribution designed using prior knowledge. Extensive experiments on both synthetic and real data demonstrate our theoretical findings and illustrate the practical value of the methodology. We plan to extend the adaptive sampling strategy to a broader class of stochastic optimization algorithms. For example, stochastic coordinate descent (Zhu et al., 2016) and stochastic non-convex optimization algorithms (Fang et al., 2018). In addition, exploring adaptive sampling with second-order methods, such as the stochastic Quasi-Newton method (Byrd et al., 2016), could be a fruitful future direction. Published in Transactions on Machine Learning Research (03/2023) Zeyuan Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. Journal of Machine Learning Research, 18:221:1 221:51, 2017. Zalan Borsos, Andreas Krause, and Kfir Y. Levy. Online variance reduction for stochastic optimization. In Conference On Learning Theory, 2018. Zalán Borsos, Sebastian Curi, Kfir Yehuda Levy, and Andreas Krause. Online variance reduction with mixtures. In International Conference on Machine Learning, 2019. Léon Bottou, Frank E. Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223 311, 2018. Richard H. Byrd, S. L. Hansen, Jorge Nocedal, and Yoram Singer. A stochastic quasi-newton method for large-scale optimization. SIAM Journal on Optimization, 26(2):1008 1031, 2016. Chih-Chung Chang and Chih-Jen Lin. Libsvm: a library for support vector machines. ACM transactions on intelligent systems and technology (TIST), 2(3):1 27, 2011. Aaron Defazio, Francis R. Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, 2014. Vladimir Estivill-Castro and Derick Wood. A survey of adaptive sorting algorithms. ACM Computing Surveys (CSUR), 24(4):441 476, 1992. Cong Fang, Chris Junchi Li, Zhouchen Lin, and Tong Zhang. SPIDER: near-optimal non-convex optimization via stochastic path-integrated differential estimator. In Advances in Neural Information Processing Systems, 2018. Ayoub El Hanchi and David A. Stephens. Adaptive importance sampling for finite-sum optimization and sampling with decreasing step-sizes. In Advances in Neural Information Processing Systems, 2020. Filip Hanzely and Peter Richtárik. Accelerated coordinate descent with arbitrary sampling and best rates for minibatches. In International Conference on Artificial Intelligence and Statistics, 2019. Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, 2013. Dmitry Kovalev, Samuel Horváth, and Peter Richtárik. Don t jump through hoops and remove those loops: SVRG and katyusha are better without the outer loop. In Algorithmic Learning Theory, 2020. Tor Lattimore and Csaba Szepesvári. Bandit algorithms. Cambridge University Press, 2020. Hongseok Namkoong, Aman Sinha, Steve Yadlowsky, and John C. Duchi. Adaptive sampling probabilities for non-smooth optimization. In International Conference on Machine Learning, 2017. Deanna Needell, Nathan Srebro, and Rachel Ward. Stochastic gradient descent, weighted sampling, and the randomized kaczmarz algorithm. Mathematical Programming, 155(1-2):549 573, 2016. Yurii Nesterov. Lectures on convex optimization. Springer, 2018. Xun Qian, Zheng Qu, and Peter Richtárik. SAGA with arbitrary sampling. In International Conference on Machine Learning, 2019. Xun Qian, Zheng Qu, and Peter Richtárik. L-svrg and l-katyusha with arbitrary sampling. Journal of Machine Learning Research, 22(112):1 47, 2021. Herbert Robbins and Sutton Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22:400 407, 1951. Published in Transactions on Machine Learning Research (03/2023) Farnood Salehi, L Elisa Celis, and Patrick Thiran. Stochastic optimization with bandit sampling. ar Xiv preprint ar Xiv:1708.02544, 2017. Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1-2):83 112, 2017. Zebang Shen, Hui Qian, Tengfei Zhou, and Tongzhou Mu. Adaptive variance reducing for stochastic gradient descent. In International Joint Conference on Artificial Intelligence, 2016. Zhi-Qiang Zeng, Hong-Bin Yu, Hua-Rong Xu, Yan-Qi Xie, and Ji Gao. Fast training support vector machines using parallel sequential minimal optimization. In International Conference on Intelligent System and Knowledge Engineering, 2008. Boxin Zhao, Ziqi Liu, Chaochao Chen, Mladen Kolar, Zhiqiang Zhang, and Jun Zhou. Adaptive client sampling in federated learning via online learning with bandit feedback. ar Xiv preprint ar Xiv:2112.14332, 2021. Peilin Zhao and Tong Zhang. Stochastic optimization with importance sampling for regularized loss minimization. In International Conference on Machine Learning, 2015. Zeyuan Allen Zhu, Zheng Qu, Peter Richtárik, and Yang Yuan. Even faster accelerated coordinate descent using non-uniform sampling. In International Conference on Machine Learning, 2016. Published in Transactions on Machine Learning Research (03/2023) A Proof of Main Theorems A.1 Proof of Theorem 1 We use the proof technique from Theorem 5 of Kovalev et al. (2020). The key step is to decompose the variance of the stochastic gradient. Let Ft = σ(x0, w0, x1, w1, , xt, wt) be the σ-algebra generated by x0, w0, x1, w1, , xt, wt, and let Et[ ] := E[ | Ft] be the conditional expectation given Ft. Note that Et[gt] = F(xt). By Assumption 3, we have Et h xt+1 x 2i = Et h xt ηgt x 2i = xt x 2 2η F(xt), xt x + η2Et h gt 2i xt x 2 2η F(xt) F(x ) µ 2 xt x 2 + η2Et h gt 2i = (1 ηµ) xt x 2 2η F(xt) F(x ) + η2Et h gt 2i . (13) Furthermore, we have Et h gt 2i = Et h gt Et gt 2i + Et gt 2 = V t e pt F(xt) F(wt) 2 + F(xt) 2 = V t e p IS F(xt) F(wt) 2 + F(xt) 2 + V t e pt V t e p IS fi(xt) fi(wt) 2 F(xt) F(wt) 2 + F(xt) 2 + V t e pt V t e p IS fi(xt) fi(wt) 2 + F(xt) 2 + V t e pt V t e p IS . (14) By Assumption 1 and Assumption 2 that F( ) is convex and LF -smooth, we have F(xt) 2 = F(xt) F(x ) 2 2LF F(xt) F(x ) . (15) With Dt in (10), we have fi(xt) fi(wt) 2 fi(xt) fi(x ) 2 + 2 L fi(wt) fi(x ) 2 1 Li (2Li) fi(xt) fi(x ) fi(x ), xt x + 2 LDt = 4 L F(xt) F(x ) + 2 LDt. (16) Combining (14)-(16), we have Et h gt 2i 4 L F(xt) F(x ) + 2LF F(xt) F(x ) + 2 LDt + V t e pt V t e p IS . (17) Combining (17) and (13), we have Et h xt+1 x 2i (1 ηµ) xt x 2 2η(1 2η L ηLF ) F(xt) F(x ) + 2η2 LDt + η2 V t e pt V t e p IS . Published in Transactions on Machine Learning Research (03/2023) Using Lemma 4, for any β > 0, we have Et h xt+1 x 2i + βEt Dt+1 (1 ηµ) xt x 2 2η(1 2η L ηLF ) 2βρ F(xt) F(x ) + 2η2 L + β(1 ρ) Dt + η2 V t e pt V t e p IS . With β = 4η2 L/ρ, we have Et h xt+1 x 2i + 4η2 L (1 ηµ) xt x 2 2η(1 6η L ηLF ) F(xt) F(x ) Dt + η2 V t e pt V t e p IS . Since η 1/(6 L + LF ), we further have Et h xt+1 x 2i + 4η2 L ρ Et Dt+1 (1 ηµ) xt x 2 + 4η2 L Dt + η2 V t e pt V t e p IS . Recalling that α1 := max n 1 ηµ, 1 ρ xt+1 x 2 + 4η2 L xt x 2 + 4η2 L ρ Dt + η2 V t e pt V t e p IS . Taking the full expectation on both sides and recursively repeating the above relationship from t = T 1 to t = 0, we have E x T x 2 + 4η2 L α1E x T 1 x 2 + 4η2 L ρ DT 1 + η2E V T 1 e p T 1 V T 1 e p IS αT 1 E x0 x 2 + 4η2 L ρ D0 + η2 T X t=0 αT t 1 E V t e pt V t e p IS . A.2 Proof of Theorem 2 We use the technique from Theorem 17 of Qian et al. (2021). The key difference here is the decomposition of the variance of the stochastic gradient. Let Ξt := 1 2ηt xt x 2 + 6ηt L(1 ρ) 5ρ Dt. (18) Let Ft = σ(x0, w0, x1, w1, , xt, wt) be the σ-algebra generated by x0, w0, x1, w1, , xt, wt, and let Et[ ] := E[ | Ft] be the conditional expectation given Ft. Note that Et[gt] = F(xt). We have F(x ) F(xt) + F(xt), x xt = F(xt) + Et gt, x xt = F(xt) + Et gt, x xt+1 + Et gt, xt+1 xt = F(xt) + Et gt, x xt+1 + Et gt F(xt), xt+1 xt + Et F(xt), xt+1 xt . (19) Published in Transactions on Machine Learning Research (03/2023) By Assumption 1 and 2, we have F(xt+1) F(xt) F(xt), xt+1 xt LF xt+1 xt 2 . F(xt) + F(xt), xt+1 xt F(xt+1) LF xt+1 xt 2 . Combined with (19), we have F(x ) Et F(xt+1) LF 2 Et h xt+1 xt 2i + Et gt F(xt), xt+1 xt + Et gt, x xt+1 . (20) Since a, b 1 2β a 2 + β 2 b 2 for all a, b Rd and β > 0 by Young s inequality, we have Et gt F(xt), xt xt+1 β 2 Et h gt F(xt) 2i + 1 2β Et h xt xt+1 2i , β > 0. Equivalently, Et gt F(xt), xt+1 xt β 2 Et h gt F(xt) 2i 1 2β Et h xt+1 xt 2i , β > 0. By Lemma 3, we then have Et gt F(xt), xt+1 xt 2β L F(xt) F(x ) β L fi(wt) fi(x ) 2 2 V t e pt V t e p IS 1 2β Et h xt+1 xt 2i . (21) Combine (20)-(21) and noting that gt, x xt+1 = 1 η xt+1 xt, x xt+1 = 1 xt xt+1 2 + 1 F(x ) Et F(xt+1) LF 2 Et h xt+1 xt 2i β L fi(wt) fi(x ) 2 2 V t e pt V t e p IS 1 2β Et h xt+1 xt 2i 2η Et h xt xt+1 2i + 1 2η Et h xt+1 x 2i 1 = Et F(xt+1) + 1 Et h xt xt+1 2i + 1 2η Et h xt+1 x 2i 1 2β L F(xt) F(x ) β L fi(wt) fi(x ) 2 β 2 V t e pt V t e p IS . 2β L F(xt) F(x ) + β 2 V t e pt V t e p IS + 1 Et F(xt+1) F(x ) + 1 Et h xt xt+1 2i 2η Et h xt+1 x 2i β L fi(wt) fi(x ) 2 . Published in Transactions on Machine Learning Research (03/2023) Then by definition of Dt in (10) and Lemma 4, for any α > 0, we have 2(β L + αρ) F(xt) F(x ) + β 2 V t e pt V t e p IS + 1 xt x 2 + α(1 ρ)Dt Et F(xt+1) F(x ) + 1 Et h xt xt+1 2i 2η Et h xt+1 x 2i + α β L Et Dt+1 . 5η and α = β L 5ρ . Since η 1 6LF , we have 1 β = 1 6η LF 0. Then 4 5 F(xt) F(x ) + 3 5η V t e pt V t e p IS + 1 xt x 2 + 6η L(1 ρ) 5 η L F(xt) F(x ) + 3 5η V t e pt V t e p IS + 1 xt x 2 + 6η L(1 ρ) Et F(xt+1) F(x ) + 1 2η Et h xt+1 x 2i + 6η L(1 ρ) 5ρ Et Dt+1 . From the definition of Ξt in (18), we have Et F(xt+1) F(x ) + Et Ξt+1 Ξt 4 5 F(xt) F(x ) + 3 5η V t e pt V t e p IS Taking the full expectation on both sides and recursively repeating the above relationship from t = T to t = 0, we have t=0 E F(xt+1) F(x ) + Ξt+1 Ξ0 4 t=0 E F(xt) F(x ) + 3 t=0 E V t e pt V t e p IS , which implies that t=1 E F(xt) F(x ) E F(x T +1) F(x ) + ΞT +1 + 1 t=1 E F(xt) F(x ) 5 F(x0) F(x ) + Ξ0 + 3 t=0 E V t e pt V t e p IS . By convexity of F( ) and since ˆx T = (1/T) PT t=1 xt, we have E F(ˆx T ) F(x ) 4 T F(x0) F(x ) + 5Ξ0 t=0 E V t e pt V t e p IS . Finally, by Lemma 1, we have x0 x 2 + 6η L(1 ρ) x0 x 2 + 6η L(1 ρ) 1 Li (2Li) fi(w0) fi(x ) fi(x ), xt x x0 x 2 + 12η L(1 ρ) 5ρ F(w0) F(x ) . Published in Transactions on Machine Learning Research (03/2023) Thus, we have E F(ˆx T ) F(x ) 4 T F(x0) F(x ) x0 x 2 + 12η L(1 ρ) 5ρ F(w0) F(x ) + 3η t=0 E V t e pt V t e p IS . A.3 Proof of Theorem 3 We use the proof technique of Theorem 11 ion Kovalev et al. (2020). The key step is to decompose the variance of the stochastic gradient. We let Ft = σ(x0, w0, v0, z0, , xt, wt, vt, zt) be the σ-algebra generated by x0, w0, v0, z0, , xt, wt, vt, zt, and let Et[ ] := E[ | Ft] be the conditional expectation given Ft. By Assumption 3, we have F(x ) F(xt) + F(xt), x xt + µ = F(xt) + µ xt x 2 + F(xt), x zt + F(xt), zt xt . (22) Note that xt = θ1zt + θ2wt + (1 θ1 θ2)vt. θ1 wt 1 θ1 θ2 zt xt = 1 θ1 θ1 wt 1 θ1 θ2 xt wt + 1 θ1 θ2 Since Et[gt] = F(xt), combining the above relationships with (22), we have F(x ) F(xt) + µ xt x 2 + F(xt), x zt F(xt), xt wt + 1 θ1 θ2 F(xt), xt vt = F(xt) + θ2 F(xt), xt wt + 1 θ1 θ2 F(xt), xt vt xt x 2 + gt, x zt i = F(xt) + θ2 F(xt), xt wt + 1 θ1 θ2 F(xt), xt vt xt x 2 + gt, x zt+1 + gt, zt+1 zt i . By Lemma 5, we have gt, x zt+1 + µ xt x 2 L 2η zt zt+1 2 + Zt+1 1 1 + ηκZt. F(x ) F(xt) + θ2 F(xt), xt wt + 1 θ1 θ2 F(xt), xt vt Zt+1 1 1 + ηκZt + Et gt, zt+1 zt + L 2η zt zt+1 2 . (23) Published in Transactions on Machine Learning Research (03/2023) By Lemma 6, we have zt+1 zt 2 + gt, zt+1 zt 1 F(vt+1) F(xt) η 2 L(1 ηθ1) gt F(xt) 2 . Note that η = θ2 (1+θ2)θ1 . Thus η 2 L(1 ηθ1) = θ2 2 Lθ1 . Then, by (23), we have F(x ) F(xt) + θ2 F(xt), xt wt + 1 θ1 θ2 F(xt), xt vt Zt+1 1 1 + ηκZt + Et F(vt+1) F(xt) θ2 2 Lθ1 = F(xt) + θ2 F(xt), xt wt + 1 θ1 θ2 F(xt), xt vt Zt+1 1 1 + ηκZt + Et F(vt+1) F(xt) θ2 2 Lθ1 V t e pt + θ2 2 Lθ1 F(xt) F(wt) 2 F(xt), xt wt + 1 θ1 θ2 F(xt), xt vt Zt+1 1 1 + ηκZt + Et F(vt+1) F(xt) θ2 2 Lθ1 V t e pt = F(xt) + θ2 F(xt), xt wt + 1 θ1 θ2 F(xt), xt vt Zt+1 1 1 + ηκZt + Et F(vt+1) F(xt) θ2 2 Lθ1 V t e p IS θ2 2 Lθ1 V t e pt V t e p IS = F(xt) + θ2 F(xt), xt wt + 1 θ1 θ2 F(xt), xt vt Zt+1 1 1 + ηκZt + Et F(vt+1) F(xt) fi(xt) fi(wt) 2 θ2 2 Lθ1 V t e pt V t e p IS . By Assumption 1 and 2, and Lemma 2, we have fi(xt) fi(wt) 2 1 1 Li (2Li) fi(wt) fi(xt) fi(xt), wt xt = 2 F(wt) F(xt) F(xt), wt xt . On the other hand, note that F(xt), xt vt F(xt) F(vt). Thus, we further have F(x ) F(xt) + θ2 F(xt), xt wt + 1 θ1 θ2 F(xt), xt vt Zt+1 1 1 + ηκZt + Et F(vt+1) F(xt) F(wt) F(xt) F(xt), wt xt θ2 2 Lθ1 V t e pt V t e p IS = F(xt) + 1 θ1 θ2 F(xt) F(vt) 1 1 + ηκZt θ2 F(wt) F(xt) Published in Transactions on Machine Learning Research (03/2023) F(vt+1) F(xt) θ2 2 Lθ1 V t e pt V t e p IS θ1 F(vt) 1 1 + ηκZt θ2 θ1 F(vt+1) θ2 2 Lθ1 V t e pt V t e p IS = F(x ) 1 θ1 θ2 F(vt) F(x ) 1 1 + ηκZt θ2 F(wt) F(x ) F(vt+1) F(x ) θ2 2 Lθ1 V t e pt V t e p IS . Recalling the definition of Vt in (12), we have Et Zt+1 + Vt+1 (1 θ1 θ2)Vt + 1 1 + ηκZt + θ2 F(wt) F(x ) + θ2 2 Lθ1 V t e pt V t e p IS . Et F(wt+1) F(x ) = (1 ρ) F(wt) F(x ) + ρ F(vt) F(x ) = (1 ρ) F(wt) F(x ) + θ1ρVt, recalling the definition of Wt in (12), we have Et Zt+1 + Vt+1 + Wt+1 (1 θ1 θ2)Vt + 1 1 + ηκZt + θ2 F(wt) F(x ) + θ2 2 Lθ1 V t e pt V t e p IS + θ2(1 + θ1) (1 ρ) F(wt) F(x ) + θ1ρVt + θ2 2 Lθ1 V t e pt V t e p IS = 1 1 + ηκZt + (1 θ1(1 θ2)) Vt + 1 ρθ1 1 + θ1 Wt + θ2 2 Lθ1 V t e pt V t e p IS . By the definition of α2 in Theorem 3 and since θ2 = 1/2, taking the full expectation on both sides, we have E Zt+1 + Vt+1 + Wt+1 α2E Zt + Vt + Wt + 1 4 Lθ1 E V t e pt V t e p IS . Recursively repeating the above relationship from t = T 1 to t = 0, we have E ΨT α2E ΨT 1 + 1 4 Lθ1 E V T 1 e p T 1 V T 1 e p IS αT 2 Ψ0 + 1 4 Lθ1 t=0 αT t 1 2 E V t e pt V t e p IS B Useful Lemmas We state and prove technical lemmas that are used to prove the main theorems. Lemma 1. Let F( ) be defined in (1). Suppose Assumption 1 and Assumption 2 hold. Then F( ) is convex and L-smooth, where L = (1/n) Pn i=1 Li. Proof. Under Assumption 1, F( ) is a linear combination of convex functions and, thus, is convex. To prove that it is L-smooth, we only need to note that F(x) F(y) 1 i=1 fi(x) fi(y) 1 i=1 Li x y = L x y , x, y Rd, where the first inequality follows from the Jensen s inequality and the second inequality follows from Assumption 2. Published in Transactions on Machine Learning Research (03/2023) Lemma 2. Assume that f( ) is a differentiable convex function on Rd and is L-smooth. Then, for all x, y Rd, we have 0 f(y) f(x) f(x), y x L 2 x y 2, (24) f(y) f(x) f(x), y x 1 2L f(x) f(y) 2. (25) Proof. See Theorem 2.1.5 of Nesterov (2018). Lemma 3. Suppose Assumption 1 and Assumption 2 hold. Let xt, wt, gt and pt be defined as in Algorithm 1. We have Et h gt F(xt) 2i 4 L F(xt) F(x ) + 4 L F(wt) F(x ) + V t e pt V t e p IS . Proof. Note that E x E[x] 2 = E x 2 E[x] 2 for any random vector x Rd. Thus we have Et h gt F(xt) 2i = Et fi(xt) fi(wt) F(xt) F(wt) fi(xt) fi(wt) F(xt) F(wt) 2 = V t e pt F(xt) F(wt) 2 = V t e p IS + V t e pt V t e p IS , (26) where V t e (pt) is defined in (2). On the other hand, note that V t e p IS = L n fi(xt) fi(wt) 2 fi(xt) fi(x ) 2 + fi(wt) fi(x ) 2 ) 1 Li (2Li) fi(xt) fi(x ) fi(x ), xt x + fi(wt) fi(x ) 2 ) 4 L F(xt) F(x ) + 2 L fi(wt) fi(x ) 2 , (27) where the second inequality follows Assumption 1, Assumption 2 and Lemma 2, and the last inequality follows from that F(x ) = 0. Combining (26) and (27), we have Et h gt F(xt) 2i 4 L F(xt) F(x ) + 2 L fi(wt) fi(x ) 2 + V t e pt V t e p IS . Lemma 4. Suppose Assumption 1 and Assumption 2 hold. Let Dt be defined as in (10). We have Et Dt+1 2ρ F(xt) F(x ) + (1 ρ)Dt. Published in Transactions on Machine Learning Research (03/2023) Proof. By the update rule of wt, we have fi(wt+1) fi(x ) 2 # fi(wt) fi(x ) 2 + ρ fi(xt) fi(x ) 2 1 Li (2Li) fi(xt) fi(x ) fi(x ), xt x + 1 ρ fi(wt) fi(x ) 2 = 2ρ F(xt) F(x ) + 1 ρ fi(wt) fi(x ) 2 , where the second inequality follows Assumption 1, Assumption 2, and (24) of Lemma 2, and the last inequality follows from F(x ) = 0. Lemma 5. Suppose the conditions of Theorem 3 hold. Then gt, x zt+1 + µ xt x 2 L 2η zt zt+1 2 + Zt+1 1 1 + ηκZt, where Zt is defined in (12). Proof. Note that zt+1 = 1 1 + ηκ ηκxt + zt η where κ = µ/ L. Thus, gt = µ xt zt + L η zt zt+1 , which implies that gt, zt+1 x = µ xt zt+1, zt+1 x + L η zt zt+1, zt+1 x xt x 2 xt zt+1 2 zt+1 x 2 zt x 2 zt zt+1 2 zt+1 x 2 xt x 2 + L 2η zt x 2 (1 + ηκ) zt+1 x 2 L 2η zt zt+1 2 . Combining with the definition of Zt, we then have the final result. Lemma 6. Suppose that the conditions of Theorem 3 hold. Then zt+1 zt 2 + gt, zt+1 zt 1 F(vt+1) F(xt) η 2 L(1 ηθ1) gt F(xt) 2 . Published in Transactions on Machine Learning Research (03/2023) Proof. By the definition of vt+1, we have zt+1 zt 2 + gt, zt+1 zt θ1 zt+1 zt 2 + gt, θ1 zt+1 zt vt+1 xt 2 + gt, vt+1 xt vt+1 xt 2 + F(xt), vt+1 xt + gt F(xt), vt+1 xt vt+1 xt 2 + F(xt), vt+1 xt + L 2 ηθ1 1 vt+1 xt 2 + gt F(xt), vt+1 xt F(vt+1) F(xt) + L 2 ηθ1 1 vt+1 xt 2 + gt F(xt), vt+1 xt , where the last inequality follows Lemma 1 and Lemma 2. By Young s inequality, a, b a 2 β = ηθ1 L(1 ηθ1), we have zt+1 zt 2 + gt, zt+1 zt F(vt+1) F(xt) + L 2 ηθ1 1 vt+1 xt 2 ηθ1 2 L(1 ηθ1) ηθ1 1 vt+1 xt 2 F(vt+1) F(xt) η 2 L(1 ηθ1) gt F(xt) 2 .