# ggd_grafting_gradient_descent__1f7473cf.pdf

Journal of Machine Learning Research 25 (2024) 1-87 Submitted 10/22; Revised 8/24; Published 10/24

GGD: Grafting Gradient Descent

Yanjing Feng yjfeng@mail.nankai.edu.cn NITFID, School of Statistics and Data science Nankai University Tianjin 300071, China

Yongdao Zhou ydzhou@nankai.edu.cn NITFID, School of Statistics and Data science Nankai University Tianjin 300071, China

Editor: Moritz Hardt

Simple random sampling has been widely used in traditional stochastic optimization algorithms. Although the gradient sampled by simple random sampling is a descent direction in expectation, it may have a relatively high variance which will cause the descent curve wiggling and slow down the optimization process. In this paper, we propose a novel stochastic optimization method called grafting gradient descent (GGD), which combines the strength from minibatching and importance sampling, and provide the convergence results of GGD. We show that the grafting gradient possesses a doubly robust property which ensures that the performance of GGD method is superior to the worse one of SGD with importance sampling method and mini-batch SGD method. Combined with advanced variance reduction techniques such as stochastic variance reduced gradient and adaptive stepsize methods such as Adam, these composite GGD-based methods and their theoretical bounds are provided. The real data studies also show that GGD achieves an intermediate performance among SGD with importance sampling and mini-batch SGD, and outperforms original SGD method. Then the proposed GGD is a better and more robust stochastic optimization framework in practice. Keywords: stochastic optimization, importance sampling, minibatching, variance reduction, adaptive stepsize method

1 Introduction

One fundamental problem studied in machine learning is how to fit the model to large data set. The most popular approach is via the empirical risk minimization (ERM), that is,

i=1 fi(x) ,

where x is the d parameters in a pre-defined model, and fi(x) is the loss function of the sample i, i = 1, 2, ..., n, such as the square error loss or hinge error loss. Optimizing the objective function f is of paramount importance. The most well-known method is via the stochastic gradient descent, whose update rule for the model parameters x can be written

2024 Yanjing Feng and Yongdao Zhou.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v25/22-1236.html.

Feng and Zhou

xk+1 = xk γ fik(xk),

where ik is sampled from [n] = {1, 2, ...n} uniformly, γ is a suitable stepsize, fik(xk) = fik

x1 (xk), ..., fik

xd (xk) . SGD has played a central role in large-scale machine learning (Robbins and Monro, 1951; Shalev-Shwartz et al., 2011; Hardt et al., 2016; Bottou et al., 2018; Gorbunov et al., 2020), since it tremendously reduces the computational cost compared with the gradient descent (GD). Unfortunately, SGD suffers from the variance brought by the sample gradient fik(x). In terms of practice, main problem with SGD is countering its variance to accelerate the training process.

1.1 Direct Approaches

Some techniques can be used to directly tackle the problem of variance. They mainly fall into three categories. Minibatching. Minibatching can reduce the variance by a constant factor. Using this strategy does not result in an improvement of convergence rate (Shalev-Shwartz et al., 2011; Bottou et al., 2018), but can lead to acceleration. Minibatching is commonly used in modern deep learning settings since it can be running in a parallel fashion which is computational friendly for implementation. Importance sampling. Importance sampling refers to the technique of assigning carefully designed non-uniform probabilities to data samples and using these probabilities to select the data point during the iterative training process (Needell et al., 2014; Zhao and Zhang, 2015; El Hanchi and Stephens, 2020). Although effective, it often requires more computational resources to set up the sampling mechanism. With the help of importance sampling, the variance of stochastic gradient can be reduced by a factor as well. Variance reduction methods Although above two primitive techniques can be implemented easily and reduce the variance to some extent, they can not eliminate the variance completely. To improve the convergence rate of stochastic optimization methods with fixed stepsize, lots of advanced algorithms concerning the variance reduction have been proposed in recent years (Gower et al., 2020), such as stochastic average gradient (SAG, Le Roux et al., 2012; Schmidt et al., 2017), SAGA (Defazio et al., 2014), stochastic variance reduced gradient (SVRG, Johnson and Zhang, 2013), stochastic recursive gradient algorithm (SARAH, Nguyen et al., 2017), Katyusha (Allen-Zhu, 2017), variance reduced stochastic gradient descent (VR-SGD, Shang et al., 2018), and the simple stochastic variance reduced algorithm (Mi G, Zhou et al., 2018). All of those methods have modified the original stochastic sample gradient fik(x) in each step to progressively reduce its variance as an estimator of the full gradient. Among them, SVRG first introduces the bi-loop structure where the parameters are updated in the inner loop and the reference point and the full gradient are updated in the outer loop. Compared with the concurrently proposed methods SAG and SAGA, SVRG does not require storing a Jacobian matrix in the training process and produces an unbiased estimator of the full gradient in each step. It is worth noting that another weakness of variance reduction methods is that they are inefficient and the reduction in variance is insignificant for deep learning (Defazio and Bottou, 2019).

GGD: Grafting Gradient Descent

Minibatching, importance sampling and advanced variance reduction methods are often combined to be used for achieving an amplification effect. Thought of combining these techniques is by no means new. We list some examples as follows. Minibatching and importance sampling. The most natural way is to predefine a sampling probability of each data sample and uses a mini batch of sampled gradients (averaging multiple sampled gradients) to update the model parameters (Zhao and Zhang, 2015; Qian et al., 2019). Csiba and Richt arik (2018) propose importance sampling for mini-batches, which gives the answer to the problem which subset should we choose in every iteration. A key characteristic of those proposed methods is that they will define a sampling probability on the entire data set and then do the sampling step to accelerate the training process. Minibatching and variance reduction. Reddi et al. (2016) give the experiment and convergence results for mini-batch nonconvex SVRG, and concurrently Allen-Zhu and Hazan (2016) provide the convergence results for non-convex SVRG and claim that their convergence results can be extended to the mini-batch setting. Yang et al. (2021) study the mini-batch SARAH with random Barzilai-Borwein method. Gazagnadou et al. (2019) study the optimal mini-batch size for SAGA. m S2GD proposed by Koneˇcn y et al. (2015) is another example of this kind of hybrid. Minibatching, importance sampling and variance reduction. Horv ath and Richt arik (2019) pioneer the study of variance reduction method with minibatching and importance sampling in the nonconvex problem.

1.2 Indirect Approaches

Different from the methods which directly reduce the variance of stochastic gradient. Another way to accelerate the training process is tunning the stepsize. Diminishing stepsize sequences and adaptive stepsize methods are two representative examples of this kind. They are introduced as follows. Diminishing stepsize sequences. Although using a diminishing stepsize sequence (Cotter et al., 2011) can eliminate the variance gradually through iterations, it also slows down the convergence rate of an algorithm as what we illustrate in Section 4. Moreover, the performance of SGD algorithm can be deteriorated by a wrongly hand-picked sequence. A suitable stepsize sequence can not be obtained without the trial and error. Adaptive stepsize methods. Unlike the diminishing stepsize sequences which put the same stepsize on each dimension of the model parameters. Adaptive stepsize methods assign different stepsizes for each dimension of the model parameters and update them separately. Many stochastic first-order optimization methods with adaptive stepsize and momentum has been proven both theoretically and empirically that they can accelerate the training process such as SGD with momentum (Liu et al., 2020), RMSprop (Tieleman et al., 2012; Zou et al., 2019), Adadelta (Zeiler, 2012), Adagrad (Duchi et al., 2011; Ward et al., 2020), Adam (Kingma and Ba, 2014) and AMSgrad (Reddi et al., 2019; Tran et al., 2019). Among them, Kingma and Ba (2014) decide to use exponential moving average to cumulate the past gradients with heavy-ball style momentum which is used to determine the direction and the original magnitude of the update, and as well cumulate their elementwise square with the corrective terms to modify the original magnitude of the update so

Feng and Zhou

that the element-wise second moment of the update can be regularized to approximate 1. The proposed Adam is now one of the most popular adaptive stepsize method used in deep learning community.

1.3 Our Contributions

Motivated by these composite methods, we propose a novel stochastic optimization method, grafting gradient descent (GGD), which borrows the strength from minibatching and importance sampling. Since stochastic sample gradient has intrinsic high variance, we replace the stochastic gradient with a brand new grafting gradient to update the model parameters, and integrate the grafting gradient with the high-level variance reduction methods and adaptive stepsize methods respectively. Our contributions of this work are as follows:

Grafting gradient has a smaller noise variance compared with the stochastic sample gradient and can be calculated in a parallel fashion to speed up the training process.

Two types of GGD methods are proposed, GGD using sampling with replacement and GGD using sampling without replacement. For the former one, we prove that the noise variance of grafting gradient can be written as the weighted sum of the noise variance of SGD with importance sampling and the noise variance of mini-batch SGD which guarantees grafting gradient a doubly robust estimator with respect to the full gradient.

For the latter one, we show that vanilla SGD, SGD with importance sampling, minibatch SGD can all be regarded as the special cases of GGD using sampling without replacement. A unified bound is obtained through the convergence analysis of GGD using sampling without replacement.

Two types of GGD methods both have a sublinear rate of convergence under stronglyconvex assumption when using a diminishing stepsize sequence and a linear convergence rate up to some noise level with the fixed stepsize. We also provide the convergence analysis for GGD in the general convex and non-convex cases.

Grafting gradient is also compatible with advanced variance reduction method and adaptive stepsize method respectively. The convergence analysis of GGD-based variance reduction method and GGD-based adaptive stepsize method are provided and we show that these methods converge much faster than the original GGD-based methods.

The rest of this paper is organized as follows. Section 2 presents some definitions and notations for further convergence analysis. Section 3 introduces the grafting gradient and the detail of two type GGD algorithms. Section 4 gives the convergence results of GGD for strongly-convex, convex and non-convex objective function respectively. In Section 5, we hybridize SVRG and Adam with the grafting gradient using sampling with replacement (WR), introduce GGD-WR-SVRG and GGD-WR-Adam methods, and provide their theoretical properties. Section 6 gives the experimental results. Section 7 gives some conclusions and discussions for the future research. All the proofs are listed in the Appendix. All the codes are available at https://github.com/oo0mmmm/GGD.

GGD: Grafting Gradient Descent

2 Background and Problem Setup

We first introduce some notations which are repeatedly used in the rest of this paper. Let

i S fi(x), LS = 1

i S Li, and f S,min = 1

i S fi,min,

where S is a subset of training set, Li is the smoothness constant and fi,min is a lower bound of fi(x). To proceed with the convergence analysis, we also presents some basic definitions which are widely used in stochastic optimization as follows.

Definition 1 (L-smoothness) Function f: Rd R, is L-smooth if it is continuously differentiable and the gradient function of f, namely, f : Rd Rd, is Lipschitz continuous with Lipschitz constant L > 0, i.e.,

fi(x) fi( x) 2 L x x 2, for all (x, x) Rd Rd.

Intuitively, Definition 1 says that the gradient of function f does not change arbitrarily quickly with respect to the parameters. Smoothness assumption is essential for the convergence analysis of the most gradient-based methods. For simplicity, we will use to represent the L2-norm throughout this paper.

Definition 2 (µ-strongly convex) Function f : Rd R, is µ-strongly convex if there exists a constant µ > 0 such that

f( x) f(x) + f(x)T ( x x) + µ

2 x x 2, for all ( x, x) Rd Rd.

Definition 3 (convex) Function f : Rd R, is convex if

f( x) f(x) + f(x)T ( x x), for all ( x, x) Rd Rd.

The convexity assumptions are also essential for the most of convergence analysis in this paper. Note that µ-strongly convexity implies convexity but not vice versa. Throughout the rest of this paper, we assume that under strongly-convex or convex assumptions, there exists an optimal solution x such that

x = arg min x Rd f(x).

3 Grafting Gradient Descent Algorithm

The key insight of our work is that minibatching and importance sampling can collaborate in a different way. This new technique is called importance resampling which successively employs importance sampling on batch of the subsampled sets. Let Dm = {Sm | Sm {1, 2, ..., n}, |Sm| = m}. Technically importance resampling consists of three steps:

First sample a batch of sets Smr Dm. Denote this batch of subsets by Sb m = {Sm1, Sm2, ..., Smb}.

Feng and Zhou

Algorithm 1: Grafting Gradient Descent

Input: The batch size b, subsampled set size m and the learning rate γ. Initialize: x0

for k = 0, 1, 2, ..., T 1 do Option (a): Sample Sb m = {Sm1, ..., Smb} with replacement from Dm. Option (b): Sample Sb m = {Sm1, ..., Smb} without replacement from Dm.

Compute PSmi =

f Smi (xk) f Smi ,min

Pb j=1 f Smj (xk) f Smj ,min for i = 1, 2, ..., b.

Denote P = (PSm1, ..., PSmb) . Resample {Smr1, ..., Smrd} from Sb m based on the resampling distribution P. Compute the grafting gradient as

1 m P i Smr1 fi x1 (xk)

1 m P i Smrd

Update: xk+1 = xk γgm,b(xk)

Put a probability measure P on each element of set Sb m based on some values.

Use this probability measure P to resample d (equals to the dimension of model parameters x) examples from set Sb m.

The resampling result actually indicates which subset we should use when calculating the partial derivatives in each dimension. As an illustration, assuming that the resampling result is {Smr1, Smr2, ..., Smrd} which means that for all k = 1, 2, ..., d, we need to calculate the k-th component of the sample average gradient with respect to the subset Smrk, that is, f Smrk / xk. Combining these total d average partial derivatives, we now construct a grafting gradient which can be used to update the model parameters. The word grafting means that this gradient is synthesis and the components of this gradient are determined in a particular way for the purpose of variance reduction. If set Smr is sampled from Dm without replacement, then the corresponding algorithm is called grafting gradient descent using sampling without replacement (GGD-Wo R). Its counterpart is called grafting gradient descent using sampling with replacement (GGD-WR) when Smr is sampled from Dm independently. The detailed procedure of GGD is shown in Algorithm 1. From Algorithm 1, intuitively we can get some insights on why GGD may outperform mini-batch SGD and SGD with importance sampling. For the one hand, mini-batch SGD only select one batch of data samples whereas GGD uses important subsets from candidate set Sb m, which may provide more useful estimations of the full gradient. On the other hand, importance resampling injects additional randomness into GGD compared with SGD with importance sampling, and promotes the diversity of selected data samples, which may profit

GGD: Grafting Gradient Descent

Figure 1: Toy example illustrates the effectiveness of importance resampling

the estimation of the full gradient as well. To get a comprehensive view of GGD method, let us set aside the discussion and focus on a toy example where we have a set of univariate functions Fs = {f1(x), ..., f22(x)} which are defined as

fi(x) = (0.15 + 0.1(i 1))x2 , for i = 1, ..., 20 and fi(x) = 11x , for i = 21, 22,

and our goal is to minimize the average of these functions which is depicted in Figures 1(a) and 1(b) by black quadratic curves, that is,

i=1 fi(x) = x2 x.

Suppose that the initial starting point is x = 1/4 and the learning rate is γ = 1/2. The best approach that we can take is gradient descent which calculate the full gradient f(x) at x = 1/4 and update x by x x γ f(x) = 1/4+(1/2) (1/2) = 1/2 as shown by violet solid line with arrow in Figure 1(a). It is obvious that after one gradient step, we easily find the global minimum of f(x) with the proper learning rate as indicated by the violet dash line in Figure 1(b). Except for gradient descent, we can achieve this goal by other approaches such as mini-batch SGD, mini-batch SGD with important sampling and GGD at a lower computational cost. Now let us dive a little deeper to see how these method will perform in this toy example. Suppose that for these three methods, we are only allowed to use a size 2 subset of function set Fs. Then for mini-batch SGD, by sampling without replacement, we are likely obtaining a mini batch without f21(x) and f22(x) such like S0 = {f4(x), f17(x)}. Based on that, mini-batch SGD calculate the gradient with respect to S0 at x = 1/4 and update x by x x γ f S0(x) = 1/4 11/40 = 1/40 as shown by green solid line with arrow in Figure 1(a). It moves x in the opposite way of the descent direction and increases the objective function value as indicated by green dash line in Figure 1(b). For mini-batch SGD with importance sampling, supposing that f21(x) and f22(x) are of great importance so that for a randomly selected index ir {1, ..., 22}, P21 = P(ir =

Feng and Zhou

21) = 0.45, P22 = P(ir = 22) = 0.45 and Pj = P(ir = j) = 0.005 for j = 1, ..., 20. Based on that, through importance sampling we are likely obtaining {f21(x), f22(x)} and as shown by cyan solid line with arrow in Figure 1(a), SGD with importance sampling may update x following

1 n P21 f21(x) + 1 n P22 f22(x) = 1

4 + 11 22 0.45 = 29

Although SGD with importance sampling moves x along the descent direction, it takes a giant step so that x arrives at 29/36 which is even more far from the optimum point x = 1/2 as indicated by cyan dash line in Figure 1(b). The reason why these two methods fail to minimize the objective function in one step is that sampling without replacement used by mini-batch SGD ignores the valuable information contained by important compenents such as f21(x) and f22(x) which determines the descent direction, and importance sampling lacks the randomness to some extent so that the selected components are individually very informative but not necessarily so jointly. Using importance sampling may form a batch of important but redundant components which lacks diversity and may negatively impact the performance of mini-batch SGD with importance sampling. For GGD, if we implement importance resampling which only resamples one subset out of b = 10 randomly selected subsets {Sm1, ..., Sm10} in our toy example, then 10 randomly selected subsets may contain only one subset such like Sm3 = {f11(x), f21(x)} with important component f21(x) since the probability of 10 randomly selected subsets containing at least one important component is

P Smj {Sm1, ..., Sm10} s.t. f21(x) Smj or f22(x) Smj = 1 C2 20 C2 22

where Ck n = n!/(k!(n k)!) is the number of k-combinations from a given set of n elements. Then through the resampling procedure, this subset {f11(x), f21(x)} is likely to be the chosen one which is used to calculate the grafting gradient, and as shown by red solid line with arrow in Figure 1(a) GGD may update x following

x x γ 1 10PSm3

2 f11(x) + 1

2 f21(x) 0.5629.

Although GGD fails to find the global minimizer in one gradient step, it moves x quite near to the optimum point and decreases the objective function value most compared with minibatch SGD and SGD with important sampling as indicated by red dash line in Figure 1(b). Importance resampling can guarantee us a subset which contains important and diverse components with high probability, and consequently GGD can provide more informative and accurate estimation of the full gradient. The relationship among mini-batch SGD, SGD with importance sampling and GGD reminds us of the notion of exploration vs exploitation in many other domains. From this perspective, GGD can be regarded as the trade off between the exploitation and exploration , and we hope that GGD could achieve comparable results when applying to more complex and high-dimensional problems.

GGD: Grafting Gradient Descent

Now let us get back to technical work and put some remarks on the resampling distribution. In the GGD method, the noise variance comes from E gm,b(xk) 2 which is equivelant to

E gm,b(xk) 2 = E

1 PSmi f Smi(xk) 2 #

Given xk and Sb m = {Sm1, ..., Smb}, denoting that b is the b-dimensional simplex, we know that the solution of

1 PSmi f Smi(xk) 2 (1)

P opt Smi = f Smi(xk) Pb j=1 f Smj (xk) , i = 1, 2, ..., b. (2)

The optimal resampling distribution defined in (2) minimizes the noise variance of grafting gradient. However, deriving this optimal resampling distribution requires evaluations of bmd partial derivatives. In analogy with Zhao and Zhang (2015), if we assume that the individual loss function fi is Li-smooth, bounded below by fi,min, then f Smi(xk) 2 can be bounded by

f Smi(xk) 2 2LSmi

f Smi(xk) f Smi,min 2Lmax f Smi(xk) f Smi,min ,

where Lmax = maxi [n]{Li}. f Smi,min can be easily estimated if we know a uniform lower bound for fi(xk) and fortunately if loss function fi is non-negative, then we can let f Smi,min = 0 although it may rather be pessimistic. Noting that f Smi(xk) 2 can be

bounded by 2Lmax f Smi(xk) f Smi,min , then we can relax the optimization problem (1) following the analysis provided by Zhao and Zhang (2015) as

1 PSmi f Smi(xk) 2 min P b

f Smi(xk) f Smi,min .

Since Lmax is independent of timestep and subset selection, the optimal resampling proba-

bility can be approximated by P Smi f Smi(xk) f Smi,min 1/2 . To simplify our analysis, we use a resampling probability in square form instead, that is,

f Smi(xk) f Smi,min

Pb j=1 f Smj (xk) f Smj ,min . (3)

Feng and Zhou

It is clear that in one iteration, Algorithm 1 only requires evaluations of mb loss function values and md partial derivatives which is computationally cheaper than deriving the optimal resampling probability. We do not adopt the one-shot resampling probability such as Zhao and Zhang (2015) where PSmi LSmi because one drawback of the one-shot resampling probability, which is defined in terms of the L-smoothness constants, is that it may not be applicable in neural network setting where the individual L-smoothness constant does not has a closed form and requires additional techniques to estimate. On the contrary, since the individual loss function value can be explicitly obtained through the forward-propagtion of neural network, the resampling probability PSmi redresses the flaw of one-shot resampling probability albeit at a higher computational cost. For numerical stability, without loss of generality, we assume that PSmi > 0 for i = 1, ..., b. If there exist some Smp {Sm1, ..., Smb} such that f Smp(xk) f Smp,min = 0, then the corresponding sample will not show up in the grafting gradient. When such a situation happens, we should remove this sample and add a new one until we put non-zero mass on every element in the set Sb m. Furthermore, if there is no set Sb m which satisfies this constraint, we just put equal mass on every sample and resample them to construct the grafting gradient. We also notice that the components of the grafting gradient in Algorithm 1 are reweighted by the batch size and corresponding probabilities to ensure the unbiasedness of grafting gradient with respect to the full gradient.

4 Convergence Results for GGD

Now we can present the convergence results for Algorithm 1 under different assumptions. Convergence results for GGD-WR and GGD-Wo R are presented in tandem.

4.1 Convergence Result for GGD-WR under Strongly-convex Assumption

Under the assumptions of strong convexity and L-smoothness, a theoretical bound can be obtained as follows.

Theorem 4 Suppose that the objective function f is L-smooth, the individual loss function fi is µ-strongly convex, Li-smooth, bounded below by fi,min for all i [n]. When Algorithm 1 is run with the fixed stepsize where γ < min{2/µ, 2b/ C + 2 L(b 1) } and option (a), the iterates generated by Algorithm 1 satisfy

E h x T x 2i (1 µγ)T x0 x 2 + 2γR

where L = (1/n) Pn i=1 Li, R = f(x ) fmin, fmin = (1/n) Pn i=1 fi,min,

C = 2Lmax(n m) + 2n(m 1)L

and D = Lmax(n m)

m(n 1) + (b 1) L 1 ,

are constants which are independent of iteration number T.

From Theorem 4, we know that the iterates generated by GGD-WR with the fixed stepsize converge at a linear rate up to some noise level. Let us take a deep look at the noise level 2γR/µb D. It seems that some constants may explode such as C and D since they

GGD: Grafting Gradient Descent

show dependence on the size of training set n, but actually not for any reasonable values of m and b. For fixed n, C is monotonically decreasing with respect to the subsampled set size m, thus C can be bounded by 2Lmax. As for D, since (n m)/m is decreasing with respect to m, then 1/b D can be bounded by Lmax + L, which proves that 2γR/µb D does not explode for any possible values of m and b. To see the superiority and robustness of GGD-WR method, we derive the theoretical bounds for vanilla SGD, mini-batch SGD and SGD with importance sampling under the same assumptions. For the iterates generated by SGD, they satisfy

E x T SGD x 2 (1 µγ)T E x0 x 2 + 2γLmax R

The derivation for equation (4) can be found in Appendix A.2. After some straightforward calculations, we know that 2R/b D 2Lmax R holds for any b and m N, which implies that by importance resampling, variance brought by the grafting gradient is always less than the variance brought by the stochastic gradient in every iteration. Hence the training process can be boosted when replacing stochastic gradient with grafting gradient in each step. Although GGD-WR has the same convergence rate as SGD, it reduces the noise variance to some extent. In other words, the solution path found by the GGD-WR algorithm fluctuates in a smaller neighborhood of the optimum value compared with the vanilla SGD method. Batch size b and the size of subsampled sets m influences the radius of this neighborhood. The larger m is, the smaller the radius is. The effect brought by b depends on the relationship between Lmax(n m)/m(n 1) and L. That is, if L Lmax(n m)/m(n 1), then increasing b will enlarge the radius of this neighborhood, otherwise, increasing b leads to narrowing the radius of this neighborhood. We also derive the bounds for mini-batch SGD and SGD with importance sampling. For the latter one, the sampling probability is given by Pi = Li/ Pn j=1 Lj for all i [n]. Under the same assumptions, their convergence bounds are given as follows.

For mini-batch SGD, the iterates satisfy

E x T m SGD x 2 (1 µγ)T E x0 x 2 + 2γRLmax(n m)

m(n 1)µ . (5)

For SGD with importance sampling, the iterates satisfy

E x T ISSGD x 2 (1 µγ)T E x0 x 2 + 2γ LR

From (5), (6) and Theorem 4, we can see that the noise variance variance of GGD-WR is the weighted sum of the noise variance of mini-batch SGD and the noise variance of SGD with importance sampling with weights (1/b, (b 1)/b). This result implies two properties of GGD-WR.

Robustness: 2R/b D is not greater than max{2Lmax R(n m)/m(n 1), 2 LR}.

Tendency: Batch size b controls the weights. The larger b is, the closer the noise variance of GGD approaches to the noise variance of SGD with importance sampling.

Feng and Zhou

For the former one, we say that grafting gradient is a doubly robust estimator with respect to the full gradient in sense that its theoretical bound is superior to the worse one of mini-batch SGD and SGD with importance sampling. Doubly robust property may be of great help when we do not know whether mini-batch SGD or SGD with importance sampling will surpass for real problems. In section 6, we empirically compare these four method (vanilla SGD, GGD, mini-batch SGD and SGD with importance sampling) and bring out more discussions about their superiorities and applicabilities. For the latter one, Although in GGD-WR algorithm b can be arbitrarily large so that the tremendous computational cost brought by loss function evaluation may be unaffordable, the tendency property of GGD suggests that GGD does not benefit and even can be harmed from blindly increasing batch size b when the noise variance of mini-batch SGD is smaller than the noise variance of SGD with importance sampling. Thus for the rest of our paper, unless specifying, batch size b is assigned by default a relatively small value such like b {2, 3, 4} which balances the trade off between the computational cost and the performance, and moreover implies that the computational cost of loss function evaluations is insignificant compared with the computational cost of partial derivatives evaluations. The convergence analysis would be incomplete without considering how theoretical results impact on the computational workload when the stochastic optimization methods are applied for the real problems. Complexity results give the bound for the total number of partial derivatives evaluations as the main computational complexity to achieve the ϵ-optimality in expectation. The ϵ-optimality in expectation is defined as a point x satisfies

E f(x) 2 ϵ, or E x x 2 ϵ, or E [f(x) f(x )] ϵ,

where x is assumed to be the global minimizer of f. Unless specifying, convergence and complexity results will be provided in tandem for the rest of this paper.

Corollary 5 If we choose stepsize γ = min{2/µ, 2b/ C + 2 L(b 1) , ϵµb D/4R}, then to achieve ϵ-optimality, the total iteration number T should satisfy

2, (C + (b 1)2 L)

2bµ , 4R ϵµ2b D

ln 2E x0 x 2

Hence the total complexity to achieve ϵ-optimality is

2, (C + (b 1)2 L)

2bµ , 4R ϵµ2b D

ln 2E x0 x 2

Suppose that m n, then the iteration complexity result will be O (d/ϵ ln(1/ϵ)). Combining this result with Theorem 4, we know that compared with vanilla SGD method, the noise variance of GGD-WR method shrinks by a constant factor so that GGD-WR improves a non-dominant term in complexity. A diminishing stepsize sequence is another choice to reduce the noise variance. Its convergence result is shown as follows.

Theorem 6 Suppose that the objective function f is L-smooth, the individual loss function fi is µ-strongly convex and Li-smooth, bounded below by fi,min for all i [n]. When

GGD: Grafting Gradient Descent

Algorithm 1 is run with option (a) and a stepsize sequence which satisfies for all k = 0, 1, 2, ...,

γk = p q + k for some p > 1

µ and q > 0 such that p q < 2b (C + 2(b 1) L).

Then for each k, the expected optimality gap satisfies

E xk x 2 v q + k,

where v = max{2p2R/(pµ 1)b D, p E x0 x 2 }.

Since 1/b D can be bounded for any possible values of m and b, v will not explode and the expected optimality gap can be controlled with large enough iteration number T. Theorem 6 shows that when using a diminishing stepsize sequence, the iterates generated by Algorithm 1 with option (a) can converge to the optimum point x at a sublinear rate. Although the convergence rate is much slower than we have obtained in Theorem 4, it eliminates the variance brought by grafting gradient and reduces the training time. Theorem 6 also indicates that the total iteration number T should satisfy T (v/ϵ) q to achieve ϵoptimality.

Corollary 7 ϵ > 0, the total complexity to achieve ϵ-optimality is

Again if m n, then the complexity result will be O(d/ϵ) which indicates that using the grafting gradient only improves a non-dominant term in complexity. Although the convergence rate is sublinear, the complexity bound improves from O ((d/ϵ) ln 1/ϵ) to O (d/ϵ), which means that GGD-WR with the diminishing stepsize sequence requires less iterations to achieve the same level ϵ-optimality. The cause behind this counterintuitive phenomenon is that to achieve ϵ-optimality, the fixed stepsize should keep the same order as ϵ, while for the diminishing stepsize sequence, it begins with p/q and gradually decreases.

4.2 Convergence Result for GGD-WR under Convex Assumption

In this section, we study the convergence property of the GGD-WR method for the general convex individual loss function. For simplicity, we do not compare the theoretical bounds of GGD-WR, SGD, mini-batch SGD and SGD with importance sampling under convex or non-convex assumptions.

Theorem 8 Suppose that the objective function f is L-smooth, the individual loss function fi is Li-smooth and convex, bounded below by fi,min for all i [n]. When Algorithm 1 is run with γ < 3b/2 C + 2 L(b 1) and option (a), denoting ˆx = 1

T PT 1 k=0 xk, then it satisfies

E [f(ˆx) f(x )] 2E x0 x 2

Feng and Zhou

We notice that when strongly-convex assumption does not hold, the expected difference in objective function value between any iterate and the optimum point x , E f(xk) f(x ) , does not have a quadratic lower bound, that is, can not be bounded below by O E xk x 2 . The non-existence of a quadratic lower bound influences the convergence property of GGDWR method. As derived in Theorem 8, when a fixed stepsize is used, the GGD-WR method only converges at a sublinear rate up to some noise level for general convex objective function. Consequently, we have the complexity result in the general convex case.

Corollary 9 If we take the stepsize which satisfies

γ = min{3b/2(C + (b 1)2 L), ϵb D/8R},

to achieve ϵ-optimality, then total iteration number T should satisfy

T 4E x0 x 2

ϵ min{3b/2(C + (b 1)2 L), ϵb D/8R}.

Thus the total complexity is

md 4E x0 x 2

ϵ min{3b/2(C + (b 1)2 L), ϵb D/8R}.

From Corollary 5 and 9, we see how the convergence results impact on the complexity results. Since the GGD method does not possess a linear convergence rate under general convex assumption, for m n, the complexity is O(d/ϵ2) which is far larger than O(d/ϵ).

4.3 Convergence Result for GGD-WR under Non-convex Assumption

Following the analysis provided by Khaled and Richt arik (2020), we can take a step towards the theoretical bound for GGD-WR method without any additional assumption at all and provide the following convergence results.

Theorem 10 Suppose that objective function f is L-smooth, the individual loss function fi is Li-smooth, bounded below by fi,min for all i [n]. When Algorithm 1 is run with option (a) and fixed stepsize γ (2b 1)/L, then the iterates generated by Algorithm 1 satisfy

min k=0,1,...,T 1 E f(xk) 2 2γL

1 + b D γ2LT

where δ0 = E f(x0) fmin is a constant.

Non-convex GGD-WR shows a sublinear convergence rate up to some noise level as well. Difference between convex GGD-WR and non-convex GGD-WR is that the theoretical bound of non-convex GGD-WR shows possible divergence since 1/D is monotonically increasing with respect to b. However, the minimum of expected gradient norm can be controlled arbitrarily small by manually assigning b a small value and using a designated stepsize. This bound is in fact optimal for GGD-WR without additional assumptions on second-order smoothness such like Polyak Lojasiewicz condition since the convergence rate of non-convex GGD-WR can not exceed the convergence rate of convex GGD-WR. The following corollary states the complexity results to attain the ϵ-stationary point.

GGD: Grafting Gradient Descent

Corollary 11 With the stepsize

γ = min 2b 1

Algorithm 1 with option (a) can still achieve ϵ-optimality as long as the iteration number T satisfies

ϵ max L 2b 1, 4Lδ0

Hence to achieve ϵ-optimality, the total complexity is

ϵ max L 2b 1, 4Lδ0

When the strongly convex assumption does not hold, for m n, the complexity bound degrades from O(d/ϵ) to O(d/ϵ2). In other words, it takes more time to achieve the same level ϵ-optimality for general convex or non-convex functions. Although the complexity bounds of convex GGD-WR and non-convex GGD-WR are identical for m n, the average objective function value can be bounded under convex assumption, while we can only control the minimum expected gradient norm for non-convex objective function.

4.4 Convergence Result for GGD-Wo R under Strongly-convex Assumption

In the rest of Section 4, we mainly focus on the convergence results of GGD-Wo R under different assumptions. Through the analysis of GGD-Wo R, we find that while the doubly robust property does not hold anymore, GGD-Wo R can be considered as a more general framework since it can include vanilla SGD, mini-batch SGD and SGD with importance sampling as special cases. We first derive the unified theoretical bound under stronglyconvex assumption and show how GGD-Wo R connects those classic stochastic optimization methods.

Theorem 12 Suppose that the objective function f is L-smooth, the individual loss function fi is µ-strongly convex, Li-smooth, bounded below by fi,min for all i [n]. When Algorithm 1 is run with the fixed stepsize where γ min 2/µ, b2m2/M and option (b), the iterates generated by Algorithm 1 satisfy

E h x T x 2i (1 µγ)T x0 x 2 + 2γRM

µb2m2 , (8)

where M = n2 M2 L + n(M1 M2) L ,

n + mb(Cm 1 n 1 1)(b 1) n(Cm n 1) , M2 = mb(b 1)Cm 1 n 1 n(Cm n 1) + mb(m 1)(Cm n b) n(n 1)(Cm n 1) ,

L = I{M1 M2} Lmax + I{M1<M2} Lmin and Lmin = mini [n] Li are constants which are independent of iteration number T.

Feng and Zhou

Theorem 12 suggests that the iterates generated by GGD-Wo R converge at a linear rate up to some noise level similar to that of GGD-WR algorithm. To see whether M/b2m2 will explode for any reasonable values of m, b and n, we first study two quantities M1 and M2. M1 = O(m2b2/n2) and M2 = O(m2b2/n2) implies that M/b2m2 = O (1 + 1/n), thus the noise variance does not explode under any circumstances. It is also clear that since the noise variance of GGD-Wo R is related to the minimum of the individual smoothness constants, doubly robust property does not hold for an arbitrary choice of m N +, b [Cm n ]. Luckily, we find that for certain configurations of b and m, the procedures of GGD-Wo R are identical to that of vanilla SGD, mini-batch SGD and the theoretical bound of GGD-Wo R is identical to that of SGD with importance sampling in expectation. We first verify the case for vanilla SGD.

Proposition 13 If m = 1 and b = 1, then the bound given by Theorem 12 is equivalent to (4).

When m = 1 and b = 1, intuitively, this configuration indicates that there is no resampling at all and only one stochastic sampled gradient can be used, which is exactly how vanilla SGD works in practice. By some straightforward calculations, we can also obtain that M1 = 1/n and M2 = 0, then the theoretical upper bound for the noise variance of GGDWo R is 2γRLmax/µ, the same as the result derived in (4).

Proposition 14 If m N + and b = 1, then the bound given by Theorem 12 is an upper bound of (5).

We use N + to represent the positive natural numbers. Likewise, for any m N + and b = 1, the procedure of GGD-Wo R algorithm is identical to the procedure of mini-batch SGD with size m. As for the theoretical bounds, they are not identical since if we substitute M1 = m/n, M2 = m(m 1)/n(n 1) into (8), then we have

E h x T x 2i (1 µγ)T x0 x 2 + 2γR

n 1 L + m(n m)

Apparently, the right side of (9) is an upper bound for the right side of (5). From Lemma 32 in Appendix, we know that if we further bound f(xk) 2 by 2 L f(xk) fmin , then we can derive the same bound as (9). Under this configuration, (8) gives the sub-optimal upper bound for the expected optimality gap, this bound can be greatly improved by some technical tricks as provided by proof of Theorem 4 in Appendix.

Proposition 15 If m = 1 and b = n, then the bound given by Theorem 12 is equivalent to (6).

Although the sampling probability given in Algorithm 1 and even the procedure of GGDWo R are different from that of SGD with importance sampling, the theoretical bound given by (8) and (6) are identical since M1 = M2 = 1. In that sense, GGD-Wo R with m = 1 and b = n is identical to SGD with importance sampling in expectation. These three special cases prove that GGD-Wo R can be regarded as a more general stochastic optimization framework. The following statement gives the complexity result for GGDWo R under strongly-convex assumption.

GGD: Grafting Gradient Descent

Corollary 16 If we choose stepsize

then to achieve ϵ-optimality, the total iteration number T should satisfy

2, M b2m2µ, 4RM ϵµ2b2m2

ln 2E x0 x 2

Hence the total complexity to achieve ϵ-optimality is

2, M b2m2µ, 4RM ϵµ2b2m2

ln 2E x0 x 2

The unified complexity result also include vanilla SGD, mini-batch SGD and SGD with importance sampling as special cases. For example, if m = 1, b = 1, then the complexity result will be O ((d/ϵ) ln(1/ϵ)) which is identical to that of vanilla SGD in other literatures (Gower et al., 2019). It is worth noting that although Corollary 16 give the right complexity of SGD with importance sampling, it does not give the proper complexity for GGD-Wo R itself. In contrast with one-shot sampling probability given by Zhao and Zhang (2015), the resampling probability PSmi f Smi(xk) f Smi,min , which is iterate-dependent, needs to be calculated in every iteration. Hence when m = 1, b = n, the computational cost for evaluating n loss function values in one iteration can not be ignored and thus for extremely large b, the complexity results of GGD should be the bound for partial derivatives and loss function evaluations. Under this definition, the complexity results of GGD with b = n and m = 1 should be

(n + d) max 1

2, L µ, 4R L

ln 2E x0 x 2

which is far larger than the bound obtained by Corollary 16. In analogy with GGD-WR, the noise variance can be further reduced by a diminishing stepsize sequence. For simplicity, we do not present the convergence and complexity results for GGD-Wo R using a diminishing stepsize sequence. Similar result can be obtained following the analysis provided in Theorem 6.

4.5 Convergence Result for GGD-Wo R under Convex Assumption

In analogy with Theorem 12, we can obtain a unified theoretical bound for GGD-Wo R under convex assumption.

Theorem 17 Suppose that the objective function f is L-smooth, the individual loss function fi is Li-smooth and convex, bounded below by fi,min for all i [n]. When Algorithm 1 is run with γ < b2m2/2M and option (b), denoting ˆx = 1

T PT 1 k=0 xk, then it satisfies

E [f(ˆx) f(x )] E x0 x 2

Feng and Zhou

Theorem 17 shows that under convex assumption, GGD-Wo R can converge at a sublinear rate up to some noise level. Likewise from Theorem 17, we can also obtain the convergence resuls for vanilla SGD, mini-batch SGD and SGD with importance sampling under convex assumption. For simplicity, we do not illustrate these results in detail. The corresponding complexity result for GGD-Wo R method is given as follows.

Corollary 18 If we take the stepsize which satisfies

2M min n 1, ϵ

to achieve ϵ-optimality, then the total iteration number T should satisfy

T 4ME x0 x 2

ϵb2m2 min{1, ϵ/2R}.

Thus the total complexity is

d 4ME x0 x 2

ϵb2m min{1, ϵ/2R}.

Corollary 18 shows that for m n, the complexity result for GGD-Wo R will be O(d/ϵ2) which is same as that of vanilla SGD, mini-batch SGD with same mini batch size m and SGD with importance sampling since these improved methods all shrink the noise variance of the stochastic sampled gradient by some constant factors and only improve some non-dominant terms in complexity.

4.6 Convergence Result for GGD-Wo R under Non-convex Assumption

Finally, we can formally give the convergence result for GGD-Wo R under non-convex assumption.

Theorem 19 Suppose that objective function f is L-smooth, the individual loss function fi is Li-smooth, bounded below by fi,min for all i [n]. When Algorithm 1 is run with option (b) and fixed stepsize γ ϵb2m2/2LMδ0 where ϵ > 0, then the iterates generated by Algorithm 1 satisfy

min k=0,1,...,T 1 E f(xk) 2 ϵ

where δ0 = E f(x0) fmin is a constant.

Theorem 19 shows that with a hand-picked stepsize, GGD-Wo R can still converge at a sublinear rate up to an arbitrary small noise level, which indicates that the minimum expected gradient norm can be effectively controlled by increasing iteration number T. The complexity results for GGD-Wo R can be directly derived from the above theoretical bound.

Corollary 20 With the stepsize

GGD: Grafting Gradient Descent

Algorithm 1 with option (b) can still achieve ϵ-optimality as long as the iteration number T satisfies

T 4LMδ2 0 ϵ2b2m2 .

Hence to achieve ϵ-optimality, the total complexity is

d 4LMδ2 0 ϵ2b2m .

For m n, the complexity result of GGD-Wo R will be O(d/ϵ2), which indicates that it is much slower for GGD-Wo R method to reach the ϵ-optimality without convexity and any extra asuumption on smoothness. Combining all the theorems and corollaries provided in Section 4, we find that although GGD-WR and GGD-Wo R have idiosyncratic noise levels with the fixed stepsize, their convergence rates and corresponding complexity results are identical under same assumptions: O((d/ϵ) ln(1/ϵ)) for strongly-convex objective function and O(d/ϵ2) for convex or non-convex objective function. These results suggest that the performances of GGD-WR and GGD-Wo R methods may be quite indistinguishable for identical m and b when solving the real problems.

5 Variance Reduction Method and Adaptive Stepsize Method

In Section 4, we know that to achieve ϵ-optimality, GGD-WR and GGD-Wo R methods rely on a small stepsize or diminishing stepsize sequence which results in a pretty slow convergence in practice. If we insist on using the fixed stepsize, then a better technique that reduces the variance with a fixed stepsize is required. Recalling that based on the SGD framework, lots of advanced variance reduction methods like SVRG can significantly improve the performance compared with the vanilla SGD method. In the first part of this section, we give one example to show the compatibility of variance reduction method and grafting gradient, and illustrate the convergence results of this composite method. If the fixed stepsize is abnegated, another way to accelerate the training process is through adaptive stepsize methods like Adam. We hybridize GGD-WR with Adam, propose GGDWR-Adam method and provide its convergence results with additional assumptions in the other part of this section.

5.1 GGD-WR-SVRG Method

The key idea of SVRG adopts from a variance reduction technique which is commonly used in sampling theory called control variates. In the SVRG method, control variates is used to modify the stochastic sample gradient so that SVRG method shows a linear convergence rate. Since the grafting gradient can play the same role as stochastic gradient, we can bring out a modified grafting gradient to update the parameters as well. The proposed GGD-WR-SVRG is shown in Algorithm 2. Option (a) was first proposed in Free-SVRG (Sebbouh et al., 2019) and pk, k {0, ..., q 1} are defined as follows.

k=0 (1 γµ)q 1 k and pk = (1 γµ)q 1 k

Vq , for k = 0, ..., q 1, (10)

Feng and Zhou

Algorithm 2: GGD-WR-SVRG method

Input: Batch size b, subsampled set size m, learning rate γ and update period q. Initialize: xq 0 and set x0 = xq 0, x0 1 = xq 0 for s = 1, 2, ..., T do x = xs 1 Compute µ = f( x) for k = 0, ..., q 1 do Sample Sb m = {Sm1, ..., Smb} with replacement from Dm

Compute PSmi = f Smi (xk s) f Smi ( x) Pb j=1 f Smj (xks) f Smj ( x) for i = 1, 2, ...b. Denote

P = (Pr1, ..., Prb)

Resample {Smr1, ..., Smrd} from Sb m based on the resampling distribution P Compute the grafting gradient as

1 m P i Smr1 fi x1 (x)

1 m P i Smrd

xk+1 s = xk s γ gm,b(xk s) gm,b( x) + µ xk s γ gk m,b

end Option (a): Set xs = Pq 1 k=0 pkxk s and set x0 s+1 = xq s Option (b): Set xs = xk s for randomly chosen k {0, ..., q 1} and set x0 s+1 = xq s Option (c): Set xs = xq s and set x0 s+1 = xq s end

where γ is the fixed learning rate and µ is the strong convexity constant. Among these three options, we prefer option (c) at the end of the inner loop since it is easier to implement and more intuitive for practitioners applying to real problems. These three options are all indispensable for the convergence analysis of the GGD-WR-SVRG method. Similar to SVRG, the bi-loop (inner loop and outer loop) structure and the modified grafting gradient gk m,b are the key ingredients of variance reduction property. Without the outer loop, the noise variance of the modified grafting gradient E gk m,b 2 deems to diverge. If PSmi = 1/b for Smi Sb m, then the noise variance of modified grafting gradient E gk m,b 2 equals to the noise variance of mini-batch stochastic variance reduced gradient E f Smri (xk s) f Smri ( x) + µ 2 proposed by Johnson and Zhang (2013). This result suggests that since the resampling distribution P provided in Algorithm 2 is optimal in sense of minimizing E gk m,b 2, the noise variance of the modified grafting gradient is further reduced compared with the original mini-batch stochastic variance reduced gradient. It is

GGD: Grafting Gradient Descent

also worth noting that using resampling probability given in Algorithm 2 results in a heavier computational burden as we have to calculate bmd partial derivatives in one iteration. As always, the convergence result of GGD-WR-SVRG under strongly-convex assumption is provided at the first place following the analysis provided by Sebbouh et al. (2019).

Theorem 21 Suppose that the objective function f and the individual loss function fi are L-smooth and fi is µ-strongly convex for all i [n]. When Algorithm 2 is run with option (a), the fixed stepsize γ < 1/16L and the update period q, then the iterates generated by the outer loop satisfy E xq s x 2 ρs(1 + 12L2γ2Vq)E xq 0 x 2,

ρ = max (1 γµ)q , 1

Theorem 21 indicates that the iterates generated by GGD-WR-SVRG converge to the optimal point x at a linear rate since ρ 1/2. The bound we obtained here is comparable to those in Johnson and Zhang (2013) and Sebbouh et al. (2019). Denote by κ = L/µ the conditional number of objective function f. We next give the total complexity result of GGD-WR-SVRG method.

Corollary 22 If we set the stepsize γ = 1/16L and the update period q = n. Then to achieve ϵ-optimality, the iteration number for the outer loop T should satisfy

n , 2 ln (64 + 3Vq)E xq 0 x 2

thus the total complexity is

2(1 + bm)d max {8κ, n} ln (64 + 3Vq)E xq 0 x 2

From Theorem 21 and Corollary 22, we can see that GGD-WR-SVRG method can achieve ϵ-optimality with a pretty large fixed stepsize 1/16L compared with the stepsize used in Corollary 5. But this improvement comes with a price which is that we have to evaluate the full gradient f(x) once at the begining of the outer loop. This evaluation influences the complexity which is now related to the training set size n. In other words, when using a large data set, GGD-WR-SVRG algorithm may take longer time to achieve ϵ-optimality. Supposing that m q and b q, then the complexity will be O ((n + κ)d ln(1/ϵ)) which is same as the complexity of the original SVRG method. The convergence results can also be obtained in the general convex case.

Theorem 23 Suppose that the objective function f and the individual loss function fi are L-smooth. When Algorithm 2 is run with option (b) and the fixed stepsize γ < 1/10L. Denoting ˆx = 1 q T PT s=1 Pq 1 k=0 xk s, then it satisfies

E [f(ˆx) f(x )] P 0

2q Tγ(1 10Lγ)

where P 0 = E x0 x 2 + 12Lγ2q E [f( x0) f(x )].

Feng and Zhou

The GGD-WR-SVRG algorithm has a sublinear convergence rate in the general convex case with the fixed stepsize. We can also find that if the outer loop did not exist, then the convex GGD-WR-SVRG would converge at sublinear rate up to some noise level like the convergence results of original GGD methods. This result suggests that the bi-loop structure is responsible for the variance reduction property. From Theorem 23, we can give the total complexity as follows.

Corollary 24 If we set the stepsize γ = 0.05/L and the update period q = n, the iteration number of the outer loop T to achieve ϵ-optimality should satisfy

T max 40LE x0 x 2

nϵ , 1.2E [f( x0) f(x )]

then the complexity is

nd(1 + bm) max 40LE x0 x 2

nϵ , 1.2E [f( x0) f(x )]

If the conditional number κ is not ill-conditioned, b n and m n, then the complexity result is O(nd ln 1/ϵ) for strongly convex functions and O(nd/ϵ) for general convex functions with relatively large stepsize γ = 0.05/L. It is also worth noting that if we set the fixed stepsize γ = 1/n1/2 and the update period q = n, then from Theorem 23, we can derive an improved bound for the complexity, that is, O(n1/2d/ϵ) for m n, b n. Recall the complexity results for the GGD method with small fixed stepsize which is dependent on ϵ: O((d/ϵ) ln 1/ϵ) for strongly convex functions and O(d/ϵ2) for general convex functions. We can see that if the training set size n is not extremely large, GGD-WR-SVRG may require less iterations to achieve ϵ-optimality with fixed stepsize. Finally we state the convergence result for non-convex GGD-WR-SVRG.

Theorem 25 Suppose that the objective function f and the individual loss function fi are L-smooth, the objective function f is also bounded below by fmin. Define a decreasing sequence {ηk}q k=0,

ηk = 3γ2L3 + ηk+1(6γ2L2 + 1 + τγ), with ηq = 0, (11)

where fixed stepsize γ and constant τ > 0 satisfying

τ > η0 1 2γL 4γη0 .

When Algorithm 2 is run with option (c) and fixed stepsize, then the iterates generated by Algorithm 2 satisfy

min k=0,1,...,q 1 s=1,2,...,T E f(xk s) 2 E f(x0 1) fmin

q Tγ(1 2γL 4γη0 η0/τ).

Theorem 25 indicates that as q T increasing, the minimum expected square L2-norm of the full gradient can not stay bounded away from zero, which implies that the non-convex GGD-WR-SVRG has a sublinear convergence rate. With proper choice of γ and τ, we can derive the complexity result for non-convex GGD-WR-SVRG.

GGD: Grafting Gradient Descent

Corollary 26 If we set the stepsize γ = ψ/(Ln2/3), constant τ = L/n1/3 and update period q = n/7ψ , where ψ is a constant satisfying

( 1 12(e 1), 1

n2/3 + 12(e 1)

then to achieve ϵ-optimality, the iteration number of outer loop T should satisfy

T 14LE [f( x0) fmin]

Thus the total complexity is

d (n + n/7ψ mb) 14LE [f( x0) fmin]

From Corollary 26, we can see that non-convex GGD-WR-SVRG can still achieve ϵ-optimality with a stepsize γ = ψ/(Ln2/3) which is related to the training set size n. The complexity will be O(dn2/3/ϵ) for m n, b n. Comparing Corollary 26 with Corollary 24, we can find that when convex assumption does not hold, the complexity will degrade from O(dn1/2/ϵ) to O(dn2/3/ϵ) which is still better than O(dn/ϵ), the complexity result for convex GGD-WR-SVRG with fixed stepsize that is independent of n. This result suggests that a stepsize which is dependent on n may be more proper to use in GGD-WR-SVRG algorithm for convex or non-convex cases.

5.2 GGD-WR-Adam Method

Although variance reduced GGD-based method can achieve a linear convergence rate under strongly-convex assumption, it is potentially burdened by the expensive computational cost and the inadequacy that the iterates generated by variance reduction methods are prone to be stuck in the local minima. As mentioned before, adaptive stepsize methods can accelerate the training process and empirically outperform the original SGD method. Luckily, GGD is also compatible with adaptive stepsize methods. In the rest of this section, we propose GGD-WR-Adam which builds upon Adam and replace the stochastic sample gradient with grafting gradient using sampling with replacement. Before diving into the detail of GGDWR-Adam algorithm, we first give one additional assumption.

Assumption 27 The L -norm of the grafting gradients is uniformly almost sure bounded, i.e., there is a constant R σ so that

gm,b(x) R σ, for all x Rd a.s..

This assumption is essential to the convergence analysis of GGD-WR-Adam. σ is used to simplify the final bound as remarked in D efossez et al. (2020). In Algorithm 3, g2 k = gk gk indicates the element-wise square, and for a sequence of vectors {νk}, we denote νk,(i) the i-th component of k-th vector in this sequence. We also assume that we have an access to the oracle GGDm,b, i.e., the first few steps in GGD-WR algorithm, which can provide i.i.d grafting gradient samples given m, b, and xk. Good

Feng and Zhou

Algorithm 3: GGD-WR-Adam method

Input: Batch size b, subsampled set size m, learning rate γ, an orcale GGDm,b, exponential decay rate for the first moment estimates β1, exponential decay rate for the second moment estimates β2 and σ. Initialize: x0 = 0, h0 = 0 and v0 = 0. for k = 1, 2, ..., T do gk = GGDm,b(xk 1) hk = β1hk 1 + gk vk = β2vk 1 + g2 k γk = γ 1 β1 (1 β2)1/2 (1 βk 2 )1/2

1 βk 1 for i = 1, 2, ..., d do Update: xk (i) = xk 1 (i) γk hk,(i) (σ+vk,(i))1/2 end end

default settings for the hyperparameters are b = 2, m = 2k for k N +, γ = 0.001, β1 = 0.9, β2 = 0.999 and σ = 10 8 which is used for numerical stability. Following the analysis provided by (D efossez et al., 2020), we only present the theoretical bound and complexity result for non-convex GGD-WR-Adam method.

Theorem 28 Suppose that Assumption 27 holds, the objective function f and the individual loss function fi are L-smooth, the objective function f is bounded below by fmin. When Algorithm 3 is run with the fixed stepsize γ > 0, 0 < β2 < 1, 0 β1 < β2, then for any T N + such that T > β1/(1 β1), the iterates generated by Algorithm 3 satisfy

E f(xω) 2 2RE f(x0) fmin

where T = T β1/(1 β1),

J = γd RL (1 β1)(1 β2)(1 β1/β2) + γ2d L2

2(1 β2)3/2(1 β1)5/2(1 β1/β2)

(1 β1)3/2(1 β2)1/2(1 β1/β2)3 ,

and ω is a random index taking values from {0, 1, ..., T 1} with probability

k N, k < T, P(ω = k) 1 βT k 1 .

The bound derived in Theorem 28 is different from the bound given by D efossez et al. (2020) since we do not leave the corrective term for the first moment estimates as the original Adam algorithm. This theoretical bound seems too complicated to acknowledge that GGD-WR-Adam can converge with careful choice of hyperparameters. So we first give the complexity result and then bring out some discussions about these results. If we set

GGD: Grafting Gradient Descent

T, β2 = 1 1/T, and assuming that β1/(1 β1) T and β1/β2 β1 (These two assumptions can easily hold when iteration number T is extremely large), then the theoretical bound given in Theorem 28 can be approximated by

E h f (xω) 2i 2RE[f x0 fmin]

γd RL (1 β1)2 + γ2d L2

2(1 β1)7/2 + 6d R2

K = γd RL (1 β1)2 + γ2d L2

2(1 β1)7/2 + 6d R2

then we can obtain

E h f (xω) 2i 2RE[f x0 fmin]

Corollary 29 To achieve ϵ-optimality of non-convex GGD-WR-Adam method, for some constant ϕ (0, 1/2), the iteration number T should satisfy

( 36R2 Ef(x0) fmin 2

2 1 2ϕ 1 + R2

Then the total complexity is

( 36R2 Ef(x0) fmin 2

2 1 2ϕ 1 + R2

Now we put some remarks on the theoretical results derived in Theorem 28 and Corollary 29. For m n, non-convex GGD-WR-Adam achieves ϵ-optimality albeit with the complexity of O(d1+2/(1 2ϕ)/ϵ2/(1 2ϕ)) which is larger than the complexity of non-convex GGD since non-convex GGD-WR-Adam has a slower convergence rate O(ln(T)/

T). Recalling that we assume that there is a uniform almost sure bound for the L -norm of grafting gradients and apply anisotropic stepsizes to each dimension of model parameters, parameter dimension d is introduced into theoretical bounds and the complexity result is dependent of d1+2/(1 2ϕ)

as d R2 is a natural bound for the L2-norm of grafting gradient. It is also noteworthy that β1 does not play an important role in the theoretical bounds as it is regarded as a constant that can be absorbed by the increasing iteration number T. However, β1 serves the purpose of deciding random index ω crucially. For one hand, β1 > 0 implies that Algorithm 3 is run with heavy-ball style momentum. The closer β1 approaches 1, the less momentum hk changes in one iteration, that is, the last few grafting gradients barely influence the direction of the momentum. So the first few iterations are more likely to be selected since they are cumulated through time and are more important in deciding the direction of momentum for β1 1. On the other hand, β1 = 0 implies that ω is uniformly picked from {0, 1, ..., T 1}. This is expected as well since β1 = 0 also implies that there is no momentum and every iterate contributes evenly to the direction of update.

Feng and Zhou

Data set Dim ntr (train) Sparsity nte (test) L max L κ

covtype 54 290,506 22.22% 290,506 1.2258 1.8921 3.56102 105

ijcnn1 22 49,990 59.09% 91,071 0.3763 0.9842 1.88112 104

a9a 123 32,561 11.28% 9,865 3.4673 3.5000 1.12898 105

rcv1 47,236 20,242 0.1549% 677,399 0.2441 0.2500 4.94107 103

Table 1: Summary of data sets

6 Experiment Results

Our empirical results are presented in this section. We evaluate the performance of grafting gradient based algorithms on solving strongly-convex and non-convex problems, and compare their performance with vanilla SGD, SGD with importance sampling, mini-batch SGD, variance reduction method SVRG and adaptive stepsize method Adam.

6.1 Binary Classification Problems

We first run experiments on the L2-regularized logistic regression problem given by

1 + exp a i x

+ (1 bi) ln

1 + exp a i x

where (ai, bi) Rd {0, 1}, i = 1, ..., ntr are the data samples from covtype, ijcnn1, a9a and rcv1. All data sets are available on http://www.csie.ntu.edu.tw/~cjlin/libsvm, and are widely used in other literatures (Nguyen et al., 2017; Qian et al., 2019; Sebbouh et al., 2019; Mishchenko et al., 2020; Huang et al., 2021; Malinovsky et al., 2021). Relevant statistics of data and loss function are summarized in Table 1. In Table 1, Dim denotes the features number of the training data, ntr, nte are the number of data used for training and testing respectively, sparsity is the proportion of non-zero values in training data features. L is the average of smoothness constants, max L is the maximum of smoothness constants and κ is the conditional number of objective function f. They can be calculated explicitly since the loss function of L2-regularized logistic regression problem is µ-strongly convex with µ = λ and Li-smooth which admits a closed form expression Li = ai 2/4 + λ. For ijcnn1, a9a and rcv1, we use the predefined training set and testing set. covtype does not have a testing set. In that case, we randomly split the data set into the training set and the testing set with 50% for training and 50% for testing. The penalty parameter λ is set to be 1/ntr for all the experiments on different data sets. Since the stepsizes for stochastic optimization method are critical, we adopt the popular tinverse learning schedule γk = γ0(1 + γd k/ntr ) 1 (Johnson and Zhang, 2013; Reddi et al., 2016), where k is the iteration number and γ0, γd are chosen so that the corresponding algorithms give the best performance. When a fixed stepsize is used, we set γd = 0. For the methods which use grafting gradients to update the parameters in one iteration, unless specifying, the size of subsampled set m is set to be 16 for ijcnn1, a9a and rcv1 and 256 for covtype since its size is way bigger than other training sets. The batch size b used in the

GGD: Grafting Gradient Descent

grafting gradient based methods is set to be 2 since b = 2 gives the best performance and saves the most computational cost.

In the practical implementation, since most machine learning libraries calculate partial derivatives through backpropagation and the chain rule, to construct the grafting gradient, we actually obtain the entire b m d partial derivatives through backpropagation and discard the unused partial derivatives. This implementation implies that in one iteration, the actual computational complexity of GGD is approximately b times larger than that of mini-batch SGD with the same mini-batch size m, and b m times larger than that of vanilla SGD. So for fair comparison, the total iteration number of vanilla SGD and SGD with importance sampling is set to be b m larger than that of GGD and the mini-batch size of mini-batch SGD, SVRG and Adam is set to be b m in all these experiments to maintain the total computational complexity at the same level.

Figure 2: Comparisons of the train loss between vanilla SGD and GGD methods on ijcnn1, a9a, covtype and rcv1.

Feng and Zhou

Comparison between vanilla SGD and GGD. In all these experiments, we compare the performances of different methods in terms of train loss, the square L2-norm of the full gradient and the test loss (results are listed in Appendix). The horizontal axis of these figures denotes the number of effective passes. Usually, one effective pass over data is considered as computing one full gradient or evaluating ntr times gradients (total ntr d partial derivatives). Suffix -as means that the diminishing stepsize sequences are used in the algorithm. Take a deep look at Figure 2. Compared with the stochastic sampled gradient, the improvements brought by the grafting gradient mainly lie in two aspects: One is that the iterates generated by GGD can sometimes decrease faster than vanilla SGD and the other is that the iterates can fluctuate in a small neighborhood of the optimum point. For the former one, since GGD which uses a batch of subsampled sets to update the parameters, compared with the best-tuned SGD, the best-tuned GGD can fit a larger fixed stepsize which possibly results in a faster convergence as shown in Figures 2(a), 2(c) and 2(d). For the latter

Figure 3: Comparisons of the square L2-norm of full gradient f(xk) 2 between vanilla SGD and GGD methods on ijcnn1, a9a, covtype and rcv1.

GGD: Grafting Gradient Descent

Figure 4: Comparisons of the train loss between GGD-as, mini-batch SGD-as and SGD-as with importance sampling methods on ijcnn1, a9a, covtype and rcv1.

one, it coincides with the theoretical result derived in Theorem 4 that the noise variance of grafting gradient is reduced by a constant factor. Figure 3 presents the comparison in terms of f(xk) 2. We can observe that although the comparison is less clear, GGD with diminishing stepsize sequence still outperforms its competitors in Figures 3(a) and 3(b) as it achieves a lower value of f(xk) 2, which implies that the iterates generated by GGD with diminishing stepsize sequence are more closer to the optimum point due to the strong convexity of the objective function. From Figures 3(c) and 3(d), we can see after about 10 epoches of training, the iterates generated by GGD with diminishing stepsize sequence fluctuate more smoothly compared with its competitors, which demonstrate the variance reduction property possessed by GGD method. In a short word, these results suggest that using GGD can be more beneficial than using vanilla SGD for solving L2-regularized logistic regression problems.

Feng and Zhou

Figure 5: Comparisons of the square L2-norm of full gradient f(xk) 2 between GGD-as, mini-batch SGD-as and SGD-as with importance sampling methods on ijcnn1, a9a, covtype and rcv1.

Comparison among GGD, MBSGD and SGD-IS. Figures 4 and 5 report the performance of GGD, mini-batch SGD (mbsgd) and SGD with importance sampling (sgdis). SGD-IS is run without minibatching, that is, using one gradient sampled from population with probability Pri Lri. For the performances of mini-batch SGD with importance sampling, they are presented in Appendix C. In view of the algorithms with fixed stepsize being inferior to the algorithms with diminishing stepsize sequence, we do not report the performance of GGD, MBSGD and SGD-IS with fixed stepsize. It is clear that although SGD-IS may outperform MBSGD as shown in Figures 4(a), 4(b) and 5(b), and vice versa such like Figure 4(c), the performances of GGD algorithm with diminishing stepsize sequence are pretty robust and close to the better one out of MBSGD and SGD-IS. For example in Figure 4(a), GGD, MBSGD and SGD-IS show a same decreasing rate for the first few epoches, but they begin to differ after three epoches.

GGD: Grafting Gradient Descent

Figure 6: Comparisons of the train loss between GGD-WR-Adam and Adam methods on ijcnn1, a9a, covtype and rcv1.

SGD-IS decreases faster than MBSGD and GGD, and eventually achieves a lower train loss. Although GGD performs quite similarily to MBSGD in the first few epoches, it gradually outperforms MBSGD and catches up with SGD-IS. Robustness can also be confirmed by Figure 4(c) where MBSGD and GGD decrease in a much stable manner compared with SGD-IS. These results confirm the doubly robust property which we have proven in Section 4 and empirically show that GGD can obtain robust and comparable results when training the logistic regression models with L2-regularization. Comparison between Adam and GAdam. To compare the performance of Adam and GGD-WR-Adam (gadam) methods, we use minibatching technique to update the parameters both for Adam and GAdam, set fixed learning rates and the hyperparameters β1,

Feng and Zhou

Figure 7: Comparisons of the square L2-norm of full gradient f(xk) 2 between GGDWR-Adam and Adam methods on ijcnn1, a9a, covtype and rcv1.

β2 and σ are set by default for these two methods. From the results in Figure 6, we can see that in most cases, the difference between the performances of Adam and GAdam is quite nuanced especially for rcv1 data set where descent curves overlap each other. GAdam and Adam both show a similar decreasing rate and achieve a low level of train loss except for a9a data set. In Figure 6(b), the iterates of GAdam algorithm decrease in a much stable manner compared with Adam method. Recalling that SGD-IS empirically outperforms MBSGD on a9a data set, we can infer that since using importance sampling technique may be more beneficial than using mini-batching technique for training L2-regularized logistic regression model on a9a data set.

GGD: Grafting Gradient Descent

Figure 8: Comparisons of the train loss between GGD-WR-SVRG and SVRG methods on ijcnn1, a9a, covtype and rcv1.

From Figures 7(b), 7(c) and 7(d), we can see that in terms of f(xk) 2, the iterates generated by Adam fluctuate more greatly than the iterates generated by GAdam. These results indicate that empirically the noise variance of GAdam may be smaller than the noise variance of Adam.

Comparison between SVRG and GSVRG. In these experiments, GGD-WR-SVRG (gsvrg) is competing against the mini-batch SVRG. The update period q is set to be 0.26n for ijcnn1, 1.95n for a9a, 0.62n for covtype and 0.14n for rcv1 as suggested in Sebbouh et al. (2019). We can see that GSVRG outshines the original mini-batch SVRG method except for covtype data set where GSVRG only holds a slender lead and the improvement is pretty much negligible. Results in Figure 8 show that GSVRG can achieve a lower train loss compared with SVRG method for ijcnn data set. For a9a and rcv1 data sets, although the performances of GSVRG are a liitle worse than the performances of SVRG at the beginning,

Feng and Zhou

Figure 9: Comparisons of the square L2-norm of full gradient f(xk) 2 between GGDWR-SVRG and SVRG methods on ijcnn1, a9a, covtype and rcv1.

the decreasing rate of GSVRG gradually catches up with SVRG and becomes a bit faster than SVRG in the last few epoches. In terms of f(xk) 2, there is no significant difference between the performances of GSVRG and SVRG except that the comparison in a9a data set is quite discernible. Results in Figures 4(b), 5(b), 7(b) and 9(b) suggest that methods, which assign non-uniform sampling probability to data sample, may outperform these with uniform sampling probability for training L2-regularized logistic regression model on a9a data set.

6.2 Multiclass Classification Problems

We then run experiments to solve the multiclass classification problems via training convolution neural networks which is one representative non-convex problem encountered in machine learning. We use two common data sets, MNIST (Le Cun et al., 2010) and CIFAR-10 (Krizhevsky et al., 2009) to train two convolution neural networks with different structures.

GGD: Grafting Gradient Descent

For the former one, we train the classic Le Net-5 (Le Cun et al., 1998) with minimal modification, which consists of two convolution layers with batch normalization and relu activation function, two fully-connected layer with relu activation function and one fully-connected layer with softmax activation function to output the predicted values. For the latter one, we use the cifar10-nv architecture proposed by Gitman and Ginsburg (2017) which achieves close to the state-of-the-art performance in less training time. The complete network architectures are presented in Appendix C. L2-regularization are used for preventing overfitting in these experiments and the penalty parameter λ is set to be 10 4. Features in data sets are normalized to the interval [0, 1] for all the experiments and the images from MNIST are resized into 32 32 to fit the Le Net-5 architecture. Since mini-batch SGD is much more efficient than vanilla SGD in neural network training, we do not present the result of vanilla SGD methods. For the convolution neural network training, the L-smoothness constant of individual loss function is not available and hence the result of SGD with importance sampling is not provided either. In previous experiments, we empirically show that the methods with diminishing stepsize sequence converge much better than those methods with fixed stepsize. Hence only the algorithms with diminishing stepsize sequence are compared in solving multiclass classification problems. We train all the networks with the subsampled set size m = 128 for GGD based methods. For fair comparison, the mini-batch size is set to be 256 for mini-batch SGD, SVRG and Adam as b = 2. We adopt the linear decay learning rate which is defined as

γk = γ0 + γT 1 γ0

where γT 1 is the final learning rate, γ0 is the initial learning rate, T 1 is the total number of effective passes and k is the current number of effective passes. The final learning rate is 10 5 and the initial learning rate is 10 2 for MBSGD and GGD. The final learning rate is 10 6 and the initial learning rate is 10 4 for Adam and GAdam. As suggested in Section 5, the final learning rate is 1/n2/3 tr and the initial learning rate is 100/n2/3 tr for SVRG and GSVRG methods. The batch size b for grafting gradient based methods is set to be 2 and the update period for variance reduction methods q are set to be 3ntr/m. Since Katharopoulos and Fleuret (2018); Johnson and Guestrin (2018); M uller et al. (2019) all show that putting non-uniform sampling probability on data samples can benefit the neural network training, we hope that grafting gradient based methods can outperform stochastic gradient based methods for training CNNs. Experiment results are presented in Figures 10 and 11. From Figure 10, we can see that due to the importance resampling technique, train loss obtained by GGD decreases in a less fluctuating way especially at the beginning of the training process when the relatively large stepsizes are used. As the number of effective passes grows, GGD eventually outperforms and hold a considerable lead over MBSGD. The comparison between GAdam and Adam is more obvious as Adam does not procede GAdam during the entire training process. As for the comparison between GSVRG and SVRG, although result in Figure 10(e) indicates that the performances of GSVRG and SVRG are quite indistinguishable before 80 epoches, GSVRG gradually emerges as a more competitive stochastic optimization method for training Le Net-5 on MNIST data set near the end of training process. In conclusion, GGD, GAdam and GSVRG all achieve a lower train loss compared with MBSGD, Adam and SVRG respectively.

Feng and Zhou

Figure 10: Comparisons of the train loss (left) and the square L2-norm of full gradient f(xk) 2 (right) between MBSGD and GGD, Adam and GAdam, SVRG and GSVRG methods on MNIST.

GGD: Grafting Gradient Descent

Figure 11: Comparisons of the train loss (left) and the square L2-norm of full gradient f(xk) 2 (right) between MBSGD and GGD, Adam and GAdam, SVRG and GSVRG methods on CIFAR-10.

Feng and Zhou

Methods Adam GAdam MBSGD GGD SVRG GSVRG scale

Values 5.1588 5.2779 5.3185 5.8986 0.13407 0.15203 10 3

Table 2: Minimum square L2-norm of full gradient on MNIST

Methods Adam GAdam MBSGD GGD SVRG GSVRG scale

Values 1.6565 1.4659 634.79 2681.0 2.0849 3.7647 10 4

Table 3: Minimum square L2-norm of full gradient on CIFAR-10

For CIFAR-10 data set, it is intriguing if we take a deep look at Figures 11(a) and 11(c), we will find that the descent curves of objective function values have a same pattern, that is, they all move down slowly, become flattened and slump at some epoches. Figures 11(b) and 11(d) also reflect this descent pattern to a certain extent as f(xk) 2 becomes more fluctuated before and after the epoches where the slump happens. In our opinion, the cause behind this descent pattern may be the complexity of network architecture. As for the comparison between different methods, Figure 11(a) shows that after the first few epoches, GGD with linear decay learning rate outperforms MBSGD completely. Even for some train losses of the GGD method before the slump, they are slightly smaller than the minimum train loss found by MBSGD. The comparisons between Adam and GAdam, SVRG and GSVRG are more clear as the descent curves of Adam and SVRG lie above the descent curves of GAdam and GSVRG almost entirely during the training process. Another interesting fact is that although GSVRG and SVRG has a much lower f(xk) 2

as expected, the lowest train loss is obtained by the adpative stepsize method, GAdam. This result suggests that one critical problem with the variance reduction methods in minimizing non-convex objective functions is that due to the lack of randomness, the iterates generated by variance reduction methods are likely to be trapped in local minimum. It may also explain why there is no slump in Figure 11(e).

It is worth noting that the comparisons in terms of f(xk) 2 are both ambiguous on these two data sets except for Figures 10(b) and 10(d) where grafting gradient based methods obtain a lower f(xk) 2 for the most time. Recalling that we derive the theoretical bounds for min E f(xk) 2 under non-convex assumption, these minimum values are also reported in Tables 2 and 3 for comparison. From these results, we can see that although the minimum values obtained by grafting gradient based methods are larger than that of stochastic sampled gradient based methods (MBSGD, Adam, SVRG), the train losses obtained by grafting gradient based methods are lower. In other words, stochastic sampled gradient based methods may converge much closer to some stationary points and grafting gradient based methods may converge less closer to some stationary points but with a lower objective function value. Combining all these empirical results, we can conclude that using grafting gradient to update the parameters is more robust and promising for empirical risk minimization. Moreover, for training CNNs on MNIST and CIFAR-10 data sets, introducing importance resampling technique can further improve the performance of original stochastic optimization methods.

GGD: Grafting Gradient Descent

7 Conclusions and Discussions

We propose a novel stochastic optimization method which employs importance resampling and constructs grafting gradient to update the model parameters. Based on the different sampling techniques, GGD-WR and GGD-Wo R are proposed. For the former one, we prove that the grafting gradient using sampling with replacement possesses a doubly robust property which ensures that the performance of GGD-WR will fall between the performances of mini-batch SGD and SGD with importance sampling in sense of expectation. For the latter one, we show that GGD-Wo R can be regarded as a more generalized stochastic optimization framework since it includes vanilla SGD, mini-batch SGD and SGD with importance sampling as special cases. Under different assumptions, we provide the convergence analysis for GGD-WR and GGD-Wo R methods. Compared with the vanilla SGD method, GGD reduces the noise variance by a constant factor and has a better performance both theoretically and empirically. Compared with mini-batch SGD and SGD with importance sampling, results in Sections 4 and 6 show the unique robustness property possessed by GGD methods. Based on the grafting gradient, we further combine it with high-level variance reduction technique and adaptive stepsize method to improve upon the original GGD methods. The theoretical results of GGD-WR-SVRG and GGD-WR-Adam are presented and the empirical results of GGD-WR-SVRG and GGD-WR-Adam show that they are doing great jobs for solving strongly-convex or non-convex problems. It is worth noting that the performances of coordinate descent and its variants are not compared in Section 6 as they are less relevant to GGD in two counts. One is that grafting gradient updates the parameters in a like manner as SGD and its variants because they update all the parameters in one iteration instead of updating one parameter while keeping all other fixed. On the other hand, a CD is not guaranteed to converge when applied to minimize any given continuously differentiable function. Powell (1973) gave an example of a non-convex continuously differentiable function of three variables where a cyclic CD can not converge to a solution. On the contrary, gradient based methods, including GGD, SGD even gradient descent are guaranteed convergence to a stationary point when objective function is non-convex. Hence we only compare GGD with the most relevant methods, that is, vanilla SGD, mini-batch SGD and SGD with importance sampling in Section 6.

The grafting gradient based methods can be improved in several directions. First, for GGD-WR-SVRG, the optimal resampling probability given in Algorithm 2 is more computational expensive than the resampling probability given in Algorithm 1. It will be more satisfactory if an approximate resampling probability which is defined in terms of objective function values instead of gradient function is adopted in Algorithm 2 with provable theoretical guarantees. Second, although GGD-WR-SVRG empirically outperforms the original SVRG, the theoretical bound derived in Theorem 21 is worse than that of mini-batch SVRG for any values of update period and stepsize. This contradiction indicates that the theoretical bound of GGD-WR-SVRG may be further improved by some technical tricks. Third, GGD-WR-SVRG and SVRG both show that they are less likely to escape from the stationary point. It will be more favorable if some modifications can be made to the procedure of GGD-WR-SVRG so that the additional randomness injected by importance resampling can help the iterates escape from the stationary point.

Feng and Zhou

An promising extension of GGD may be the implementation of GGD in federated learning. In the centralized federated learning (Blanchard et al., 2017; Yin et al., 2018), a trustworthy parameter central server is used to orchestrate all the participating clients and update the parameters with gradients received from clients. Due to the limits in network bandwidth and computing power, sending a complete gradient may be time-consuming. Hence intuitively using grafting gradient update in central parameter server may be more friendly to those clients with low-quality device. The convergence results of federated GGD require further discussions as one fundamental challenge in federated optimization method is the presence of non-IID data.

Acknowledgments and Disclosure of Funding

The authors thank the action editor Prof. Moritz Hardt and the referee for their valuable comments that greatly improved the presentation of the paper. This work was partial supported by the National Natural Science Foundation of China (12131001), the Fundamental Research Funds for the Central Universities, LPMC, and KLMDASR. Yongdao Zhou is the corresponding author.

GGD: Grafting Gradient Descent

Appendix A. Proofs of Theorems and Corollaries in Section 4

In this appendix we prove the following theorems and corollaries in Section 4. We first prove one useful lemma which are important to the convergence analysis.

Lemma 30 (Nesterov, 2003) If the function f is L-smooth, then

f(x) f(y) + f(y), x y + L

2 x y 2, (13)

f(x) f(y) + f(y), x y + 1

2L f(x) f(y) 2, (14)

where x, y = x y denotes the inner product of two vectors x and y.

A direct result of (13) is that if we let x = y f(y) /L and assume that f is bounded below by fmin, then

fmin f(y f(y) /L) f(y) f(y) 2

Rearranging the terms, we have

f(y) 2 2L (f(y) fmin) , (15)

which is quite useful in our convergence analysis. The following Lemma 31 states the relationship between smoothness constant and strong convexity constant, which is useful in the proof of GGD-WR-SVRG.

Lemma 31 If the function f is both L-smooth and µ-strongly convex, then we have L µ.

Proof From Definition 1, if x is the global minimizer of f, we have

f(x) f(x ) + µ

Due to the L-smoothness of f, from Theorem 2.1.5 in Nesterov (2003), we know that

f(x) f(x ) L

Combining these two results, we have

2 x x 2 f(x) f(x ) L

We prove Lemma 31 as x x 2 is non-negative.

Feng and Zhou

A.1 Proof of Theorem 4

Proof Unless specifying, we write E[ | xk] as E[ ] for convenience throughout the rest of this paper. We first show that the grafting gradient gm,b(xk) is an unbiased estimator of f(xk) conditional on xk. Without loss of generality, we take a look at the first dimension of the grafting gradient gm,b(xk),

E 1 b PSmr1

" 1 b PSmr1

which proves that the grafting gradient is an unbiased estimator of f(xk) since Egm,b(xk) = f(xk). The fourth equality holds because {Sm1, ..., Smb} are sampled independently from Dm. For the most gradient-based stochastic optimization algorithms, it makes sense to study the following recursion,

E xk+1 x 2 = E xk x 2 2γ xk x , f(xk) + γ2E gm,b(xk) 2

(1 µγ) xk x 2 2γ f(xk) f(x ) + γ2E gm,b(xk) 2. (16)

The first inequality relies on the strong convexity of objective function. The last term in (16) can be rewritten as

E gm,b(xk) 2 = E

1 PSmi f Smi(xk) 2 #

f Smi(xk) 2

f Smi(xk) f Smi,min

i=1 f Smi(xk) f Smi,min

i=1 f Smi(xk) 2 !#

f Smq (xk) 2

f Smq (xk) f Smq ,min

f Smp(xk) f Smp,min

b E f Sm1(xk) 2 + b 1

" f Smq (xk) 2

f Smq (xk) f Smq ,min

f Smp(xk) f Smp,min #

b E f Sm1(xk) 2 + b 1

b E 2LSm1 E h f Sm1(xk) f Sm1,min i . (17)

The first inequality holds since (15) and subsets are sampled independently. Now we turn to deal the last term in (17) respectively. Noting that E f Sm1(xk) 2 is the noise variance of mini-batch SGD, we first derive an upper bound for this noise variance.

GGD: Grafting Gradient Descent

Lemma 32 For a subset Sm1 Dm. Given xk, we have

j Sm1 fj(xk) 2

= n m m(n 1) 1

j=1 fj(xk) 2 + n(m 1)

m(n 1) f(xk) 2. (18)

Proof The left side of (18) is

j Sm1 fj(xk) 2

j Sm1 fj(xk) 2 + X

p =q fp(xk) fq(xk)

j=1 fj(xk) 2 + m(m 1)

p =q f p (xk) fq(xk)

j=1 fj(xk) 2 + m 1

p =q f p (xk) fq(xk)

j=1 fj(xk) 2 + n m

j=1 fj(xk) 2

= n m m(n 1) 1

j=1 fj(xk) 2 + n(m 1)

m(n 1) f(xk) 2,

which proves Lemma 32.

Now we can derive the upper bound of the noise variance of mini-batch SGD given xk and size m.

j Sm1 fj(xk) 2

= n m m(n 1) 1 n

j=1 fj(xk) 2 + n(m 1)

m(n 1) f(xk) 2

j=1 2Lj fj(xk) fj,min

f(xk) f(x )

f(xk) f(x ) + R

f(xk) f(x )

= 2 (Lmax(n m) + Ln(m 1))

f(xk) f(x )

+ 2Lmax R(n m)

= C f(xk) f(x ) + 2Lmax R(n m)

m(n 1) . (19)

Feng and Zhou

The first inequality holds due to (15). E 2LSm1 E f Sm1(xk) f Sm1,min is equivalent to

E 2LSm1 E h f Sm1(xk) f Sm1,min i = 2 L f(xk) f(x ) + 2 LR. (20)

Substituting (19) and (20) into (17), we obtain

E gm,b(xk) 2 C

b 2 L f(xk) f(x )

b n m m(n 1)2Lmax R + b 1

b 2 LR . (21)

Substituting (21) into (16), we have

E xk+1 x 2 (1 µγ) xk x 2 2γ f(xk) f(x ) + γ2E gm,b(xk) 2

(1 µγ) xk x 2 2 C

b 2 L γ γ f(xk) f(x )

b n m m(n 1)2Lmax R + b 1

(1 µγ) xk x 2 + γ2 1

b n m m(n 1)2Lmax R + b 1

The last inequality holds because stepsize γ 2b/ C + 2 L(b 1) . Taking the total expectation and unrolling this recursion across T iterations, we can obtain

E x T x 2 (1 µγ)T E x0 x 2

b 2Lmax R(n m)

m(n 1) + b 1

b 2 LR T 1 X

j=0 (1 µγ)j

(1 µγ)T E x0 x 2

b 2Lmax R(n m)

m(n 1) + b 1

j=0 (1 µγ)j

= (1 µγ)T E x0 x 2 + γ

b 2Lmax R(n m)

m(n 1) + b 1

A.2 Proof of equations (4), (5) and (6)

Proof We first derive the upper bounds for per step noise variance of different methods. Equation (15) is repeatedly used in the following proof. Denote a uniformly sampled index

GGD: Grafting Gradient Descent

from [n] in k-th step by ik. The upper bound for per step noise variance of SGD given xk is

E h fik(xk) 2i = 1

j=1 fj(xk) 2 1

j=1 2Lj fj(xk) fj,min

fj(xk) fj,min

= 2Lmax f(xk) f(x ) + 2Lmax R. (22)

We have derived the upper bound for per step noise variance of mini-batch SGD in (19). As for SGD with importance sampling, following the analysis provided by Zhao and Zhang (2015), we know that the upper bound for this noise variance is equivalent to

E 1 (n Pik)2 fik(xk) 2 = 1

1 Pj fj(xk) 2

1 Pj 2Lj fj(xk) fj,min

fj(xk) fj,min

= 2 L f(xk) f(x ) + 2 LR. (23)

Equations (19), (22) and (23) share the common property since they can all be written in form of A f(xk) f(x ) + BR, (24)

with different values of A and B. Replacing the last term in (16) by (24), we can obtain

E xk+1 x 2 (1 µγ) xk x 2 (2γ Aγ2) f(xk) f(x ) + γ2BR. (25)

With the proper choice of stepsize γ, the second term in (25) can be absorbed since it is non-positive. Thus we can obtain

E xk+1 x 2 (1 µγ) xk x 2 + γ2BR.

Unrolling these recursions, we can obtain equations (4), (5) and (6).

A.3 Proof of Corollary 5

Proof As long as (1 µγ)T E x0 x 2 ϵ/2 and 2γR/bµD ϵ/2, the expected optimality gap satisfies E x T x 2 ϵ. From the definition of stepsize γ, we know that 2γR/bµD ϵ/2. Thus we only need to ensure that (1 µγ)T E x0 x 2 ϵ/2, which means

T ln(1 µγ) + ln(2E x0 x 2) ln ϵ.

Feng and Zhou

Rearranging the terms and noticing that ln(1 x) x holds for x [0, 1), as long as we keep

Tγµ ln 2E x0 x 2

the expected optimality gap is less than ϵ, which means that T satisfies

2, 2b µ(C + 2 L(b 1)), 4R ϵµ2b D

ln 2E x0 x 2

A.4 Proof of Theorem 6

Proof Similar to the proof of Theorem 4, we can write a decomposition for xk+1 x with a diminishing stepsize sequence,

E xk+1 x 2 (1 µγk) xk x 2

f(xk) f(x ) + γ2 k 2R b D

(1 µγk) xk x 2 + γ2 k 2R b D.

The last inequality holds because {γk} is decreasing and γ0 < 2b/ C + 2(b 1) L . Taking the total expectation, the rest part will be proven by induction. First, the definition of v ensures that it holds for k = 0. Assuming E xk x 2 v/(q + k) holds for some k 0, it follows that

E xk+1 x 2 (1 µγk)E xk x 2 + 2γ2 k R b D

(1 pµ q + k) v q + k + p2

(q + k)2 2R b D

(q + k)2 v pµ 1

(q + k)2 v + 2p2R (q + k)2b D

(q + k)2 v v q + k + 1,

where the third inequality follows due to the definition of v, and the last inequality follows because (q + k 1)(q + k + 1) (q + k)2.

A.5 Proof of Theorem 8

Proof Using the proof of Theorem 4, we know that

E xk+1 x 2 = E xk x 2 2γ xk x , f(xk) + γ2E gm,b(xk) 2.

GGD: Grafting Gradient Descent

Since the objective function f is convex, we have

E xk+1 x xk x 2 2γ f(xk) f(x ) + γ2E gm,b(xk) 2. (26)

Again using the proof of Theorem 4, the last term of (26) can be bound by

E gm,b(xk) 2 C

b 2 L f(xk) f(x ) + 2R

Substituting (27) into (26), we have

E xk+1 x xk x 2 γ

C + (b 1)2 L γ b

! f(xk) f(x ) + γ2 2R

Rearranging the terms, summing over T iterations and taking the total expectation, we obtain

C + (b 1)2 L γ b

k=0 E h f(xk) f(x ) i E x0 x 2 E x T x 2 + 2γ2TR

E x0 x 2 + 2γ2TR

From the definition of stepsize γ, we know that

k=0 E h f(xk) f(x ) i γ

C + (b 1)2 L γ b

k=0 E h f(xk) f(x ) i ,

thus we have γ 2

k=0 E h f(xk) f(x ) i E x0 x 2 + 2γ2TR

Dividing γT/2 on the both side, due to the convexity of the objective function f, we have

E [f(ˆx) f(x )] 2E x0 x 2

A.6 Proof of Corollary 9

Proof From the definition of the stepsize γ, we can verify that 4γR/b D ϵ/2. To achieve ϵ-optimality, we need to make sure that the first term in (7) is not greater than ϵ/2. That is 2E x0 x 2

Rearranging the terms, we know that T should satisfy

T 4E x0 x 2

ϵ min{3b/2(C + (b 1)2 L), ϵb D/8R}.

Feng and Zhou

A.7 Proof of Theorem 10

Proof From the L-smoothness of the objective function f, we have

f(xk+1) f(xk) + f(xk), xk+1 xk + L

2 xk+1 xk 2.

Take the conditional expectation on xk,

E[f(xk+1)] f(xk) γ f(xk) 2 + 1

2γ2LE gm,b(xk) 2. (28)

From (17), we can bound E gm,b(xk) 2 by,

E gm,b(xk) 2 1

b E f Sm1(xk) 2 + b 1

b 2 L f(xk) fmin . (29)

Substituting (29) into (28), we obtain

E[f(xk+1)] f(xk) γ f(xk) 2 + γ2L

b E f Sm1(xk) 2 + b 1

b 2 L f(xk) fmin

m(n 1) 1 f(xk) 2

n m m(n 1) 1 n

i=1 fi(xk) 2 + (b 1)2 L f(xk) fmin !

m(n 1) 1 f(xk) 2 + γ2L

f(xk) fmin .

The last inequality holds due to (15). Substracting fmin on the both side, we obtain

E[f(xk+1)] fmin γ γL

m(n 1) 1 f(xk) 2 + 1 + γ2L

f(xk) fmin .

Since γ (2b 1)/L and m(n 1)/n(m 1) 1, we have

E[f(xk+1)] fmin γ

2b f(xk) 2 + 1 + γ2L

f(xk) fmin .

Let δk = E f(xk) fmin. Taking total expectation and rearranging the terms, we have

1 E f(xk) 2 δk 1 + γ2L

1 δk+1. (30)

Consider the sequence {αk}k=0, where αk = (1 + γ2L/b D) kα0 and α0 > 0 is a constant. Multiplying αk on the both side of (30), we obtain

1 αk E f(xk) 2 αkδk αk+1δk+1.

GGD: Grafting Gradient Descent

Summing over T iterations and rearranging the terms, we obtain

k=0 αk E f(xk) 2 α0δ0 αT δT α0δ0.

since the series PT 1 k=0 αk is finite, we have

1 min k=0,...,T 1 E f(xk) 2 α0δ0 PT 1 k=0 αk .

Rearranging the terms, we can conclude

min k=0,...,T 1 E f(xk) 2 2b

1 + b D γ2LT

The last inequality holds because (1 + β)n 1 + nβ for n N, n 1 and β > 1.

A.8 Proof of Corollary 11

Proof To achieve ϵ-optimality, T should satisfy

1 + b D γ2LT

which is equivalent to b D γLT + γ ϵD

Since γ ϵD/4Lδ0. As long as inequality

b D γLT + γ b D

Feng and Zhou

holds, we can achieve ϵ-optimality. Rearranging the terms, we can conclude

ϵ max L 2b 1, 4Lδ0

A.9 Proof of Theorem 12

Proof Let Di m = {S | S Dm, i S}. Following the proof of Theorem 4, we first prove that the grafting gradient using sampling without replacement is an unbiased estimator with respect to the full gradient.

E 1 b PSmr1

" 1 b PSmr1

S Dim I{S Sbm}

xk Cm 1 n 1 b Cm n = 1

fi x1 (xk),

which proves that the grafting gradient using sampling without replacement is an unbiased estimator with respect to the full gradient. Recall the recursion we wrote the proof of Theorem 4.

E xk+1 x 2 (1 µγ) xk x 2 2γ f(xk) f(x ) + γ2E gm,b(xk) 2. (31)

Since fi is Li-smooth, the theoretical bound for E gm,k(xk) 2 can be derived as follows.

E gm,b(xk) 2 = E

1 PSmi f Smi(xk) 2 #

1 PSmi 2LSmi

f Smi(xk) f Smi,min #

i=1 f Smi f Smi,min

GGD: Grafting Gradient Descent

S Dim I{S Sbm}

fi(xk) fi,min

S Dim I{S Sbm}

i=1 2Li fi(xk) fi,min

S Dim I{S Sbm}

p =q 2Lp fq(xk) fq,min

U Dp m I{U Sbm}

V Dq m I{V Sbm}

Now we first deal with the term E h P S Dim I{S Sbm} i2 .

S Dim I{S Sbm}

S Dim I{S Sbm} + X

U,V Di m U =V

I{U Sbm} I{V Sbm}

= Cm 1 n 1 b Cm n + Cm 1 n 1 (Cm 1 n 1 1) b(b 1) Cm n (Cm n 1)

1 + (Cm 1 n 1 1)(b 1)

Denote Dp,q m = {S | S Dm and p, q S}. Likewise, the term E P U Dp m I{U Sbm} P V Dq m I{V Sbm} is equivalent to

U,V Dp,q m I{U Sbm}

U Dp m,V Dq m V,U / Dp,q m

I{U Sbm} I{V Sbm}

= Cm 2 n 2 b Cm n + Cm 1 n 1 2 Cm 2 n 2 b(b 1) Cm n (Cm n 1)

n(n 1) + mb(b 1)Cm 1 n 1 n(Cm n 1) m(m 1)b(b 1)

n(n 1)(Cm n 1)

= mb(b 1)Cm 1 n 1 (Cm n 1) + m(m 1)b

1 b 1 Cm n 1

Feng and Zhou

Combining (32), (33) and (34), we have

E gm,b(xk) 2 2 b2m2

i=1 Li fi(xk) fi,min + M2

p =q Lp fq(xk) fq,min

i=1 fi(xk) fi,min

i=1 Li fi(xk) fi,min #

b2m2 M2 2 L f(xk) fmin

+ 2n b2m2 (M1 M2) 1

i=1 Li fi(xk) fi,min

b2m2 M2 2 L f(xk) fmin

+ n b2m2 (M1 M2) 2 L f(xk) fmin . (35)

Substituting (35) into (31), we can obtain

E xk+1 x 2 (1 µγ) xk x 2 2γ f(xk) f(x )

b2m2 M2 2 L f(xk) fmin

+ γ2 n b2m2 (M1 M2) 2 L f(xk) fmin

(1 µγ) xk x 2

n2 L b2m2 M2 + n L

b2m2 (M1 M2)

! f(xk) f(x )

n2 M2 L + n(M1 M2) L . (36)

From the definition of stepsize γ, we know that the second term in the last inequality of (36) can be absorbed. Thus we can derive

E xk+1 x 2 (1 µγ) xk x 2 + 2γ2R

n2 M2 L + n(M1 M2) L .

Taking total expectation and unrolling this recursion across T iterations, we can obtain

E x T x 2 (1 µγ)T E x0 x 2 + 2γRM

GGD: Grafting Gradient Descent

A.10 Proof of Corollary 16

Proof Following the proof of Corollary 5, we know that as long as (1 µγ)T E x0 x 2 ϵ/2. GGD-Wo R can achieve ϵ-optimality under strongly-convex assumption, which means

T ln(1 µγ) + ln 2E x0 x 2 ln ϵ.

Rearranging the terms and noticing that ln(1 x) x holds for x [0, 1), as long as we keep

Tγµ ln 2E x0 x 2

the expected optimality gap is less than ϵ, which means that T should satisfy

2, M b2m2µ, 4RM ϵµ2b2m2

ln 2E x0 x 2

A.11 Proof of Theorem 17

Proof Using the proof of Theorem 8, we have

E xk+1 x 2 xk x 2 2γ f(xk) f(x ) + γ2E gm,b(xk) 2. (37)

From (35), we know that E gm,b(xk) 2 can be bounded by

E gm,b(xk) 2 2M

f(xk) f(x ) + 2RM

b2m2 . (38)

Combining (37) and (38), we can derive

E xk+1 x 2 xk x 2 2γ 1 γM

f(xk) f(x ) + 2γ2RM

Rearranging the terms and noting γ b2m2/2M, we can obtain

γ f(xk) f(x ) xk x 2 E xk+1 x 2 + 2γ2RM

Taking total expectation and summing over T iterations, we have

k=0 E h f(xk) f(x ) i E x0 x 2 + 2γ2RMT

Dividing γT on both side, due to the convexity of the objective function f, we have

E [f(ˆx) f(x )] E x0 x 2

Feng and Zhou

A.12 Proof of Corollary 18

Proof From the definition of stepsize γ, we know that as long as E x0 x 2/Tγ ϵ/2, that is

T 2E x0 x 2

holds, GGD-Wo R can achieve ϵ-optimality. since γ = b2m2/2M min{1, ϵ/2R}, we can obtain

T 4ME x0 x 2

ϵb2m2 min{1, ϵ/2R}.

A.13 Proof of Theorem 19

Proof Using the proof of Theorem 10, we can obtain

E h f(xk+1) i f(xk) γ f(xk) 2 + γ2L

2 E gm,b(xk) 2. (39)

Substituting (38) into (39), we have

E h f(xk+1) i f(xk) γ f(xk) 2 + γ2LM

f(xk) fmin .

Consider the sequence {αk}, where αk = (1 + γ2LM/b2m2)α0 and α0 > 0 is a constant. Recall the technical tricks used in proof of Theorem 10, we can proceed the proof with

1 αk E f(xk) 2 αkδk αk+1δk+1.

Summing over T iterations and rearranging the terms, we have

1 min k=0,...,T 1 E f(xk) 2 α0δ0 PT 1 k=0 αk .

Rearranging the terms, we can conclude

min k=0,...,T 1 E f(xk) 2 1

α0δ0 PT 1 k=0 αk

GGD: Grafting Gradient Descent

Appendix B. Proofs of Theorems and Corollaries in Section 5

In this section, we prove the theorems and corollaries in Section 5.

B.1 Proof of Theorem 21

Proof Recall that the decomposition used in the proof of Theorem 4,

xk+1 s x 2 = xk s γ gk m,b x 2 = xk s x 2 2γ xk s x , gk m,b + γ2 gk m,b 2.

Taking the conditional expectation on xk s and all past, we can obtain

E xk+1 s x 2 = xk s x 2 2γ xk s x , f(xk s) + γ2E gk m,b 2. (40)

Now we deal with the third term in (40),

E gk m,b 2 = E gm,b(xk s) gm,b( x) + f( x) 2 2E gm,b(xk s) gm,b( x) 2 + 2 f( x) 2. (41)

We first deal with E gm,b(xk s) gm,b( x) 2, similar to the proof of Theorem 4,

E gm,b(xk s) gm,b ( x) 2 = E

1 b2P 2 Smri

xi (xk s) f Smri

1 b2P 2 Smri

xi (xk s) f Smri

xi (xk s) f Smj

1 PSmj f Smj (xk s) f Smj ( x) 2

Recalling the resampling probability defined in Algorithm 2, we have

E gm,b(xk) gm,b( x) 2 = E

j=1 f Smj (xk s) f Smj ( x)

j=1 f Smj (xk s) f Smj ( x) 2

u =t f Smu(xk s) f Smu( x) f Smt(xk s) f Smt( x)

b E f Sm1(xk s) f Sm1( x) 2

E f Sm1(xk s) f Sm1( x) 2 . (42)

Feng and Zhou

We first deal with the second term in the last equality of (42).

E f Sm1(xk s) f Sm1( x) 2 E

i Sm1 fi(xk s) fi( x)

i=1 fi(xk s) fi( x)

i=1 fi(xk s) fi( x) 2, (43)

where the last inequality holds due to H older inequality. The last inequality can be further bounded by

i=1 fi(xk s) fi( x) 2 = 1

i=1 fi(xk s) fi(x ) + fi(x ) fi( x) 2

i=1 fi(xk s) fi(x ) 2 + 2

i=1 fi(x ) fi( x) 2

fi(xk s) fi(x ) fi(x ), xk s x

i=1 (fi( x) fi(x ) fi(x ), x x )

= 4L f(xk s) f(x ) + 4L (f( x) f(x )) . (44)

The last inequality holds because of (14). Now we back to the first term in the last equality of (42). From (19), if we replace fi(xk) with fi(xk s) fi( x), then we have

E f Sm1(xk s) f Sm1( x) 2 = E

fi(xk s) fi( x) 2

m(n 1) f(xk s) f( x) 2

+ n m m(n 1) 1 n

i=1 fi(xk s) fi( x) 2. (45)

Substituting (43) and (45) into (42), we have

E gm,b(xk s) gm,b( x) 2 n(m 1)

bm(n 1) f(xk s) f( x) 2

+ n m bm(n 1) + b 1

i=1 fi(xk s) fi( x) 2.

GGD: Grafting Gradient Descent

From L-smoothness and (44), we can obtain

E gm,b(xk s) gm,b( x) 2 4L f(xk s) f(x ) + 4L (f( x) f(x )) . (46)

Substituting (46) into (41), we have

E gk m,b 2 2E gm,b(xk s) gm,b( x) 2 + 2 f( x) 2

8L f(xk s) f(x ) + 12L (f( x) f(x )) . (47)

Substituting (47) into (40), we can obtain

E xk+1 s x 2 xk s x 2 2γ xk s x , f(xk s) + 8Lγ2 f(xk s) f(x )

+ 12Lγ2 (f( x) f(x ))

(1 γµ) xk s x 2 2γ(1 4γL) f(xk s) f(x )

+ 12Lγ2 (f( x) f(x )) . (48)

The second inequality holds due to µ-strong convexity. Using Lemma 31, we know that (1 γµ) > 0 since γ 1/16L and L µ. Taking the total expectation and iterating over k = 0, 1, ..., q 1, we have

E xq s x 2 (1 γµ)q E x0 s x 2

k=0 (1 γµ)q 1 k E h f(xk s) f(x ) i

+ 12Lγ2E [f( x) f(x )]

k=0 (1 γµ)q 1 k.

Considering the option (a) in the outer loop and the definitions of Vq and pk in (10), we have

E xq s x 2 (1 γµ)q E xq s 1 x 2 2γ(1 4Lγ)Vq

k=0 pk E h f(xk s) f(x ) i

+ 12Lγ2Vq E [f( x) f(x )] . (49)

Define Lyapunov function Φs as follows.

Φs = xq s x 2 + Ψs, where Ψs = 24Lγ2Vq E [f( xs) f(x )] .

Noticing that 1 γµ > 0 implies that pk > 0 for k = 0, ..., q 1 and Pq 1 k=0 pk = 1, we have

f( xs) f(x ) = f

k=0 pk f(xk s) f(x ) .

Feng and Zhou

The last inequality holds using Jensen s inequality and the fact that f is convex. Hence, the expectation of Ψs can be bounded by

E[Ψs] 24Lγ2Vq

k=0 pk E h f(xk s) f(x ) i . (50)

Taking the total expectation of Lyapunov function, we have

E[Φs] = E xk s x 2 + E[Ψs]. (51)

Substituting (49) into (51), we have

E[Φs] (1 γµ)q E xq s 1 x 2 2γ(1 4Lγ)Vq

k=0 pk E h f(xk s) f(x ) i

+ 12Lγ2Vq E [f( x) f(x )] + E[Ψs].

Noticing that x = xs 1 and combining (50), we have

E[Φs] (1 γµ)q E xq s 1 x 2 2γ(1 4Lγ)Vq

k=0 pk E h f(xk s) f(x ) i

+ 12Lγ2Vq E [f( xs 1) f(x )] + 24Lγ2Vq

k=0 pk E h f(xk s) f(x ) i

= (1 γµ)q E xq s 1 x 2 + 1

2γ(1 16Lγ)Vq

k=0 pk E h f(xk s) f(x ) i

(1 γµ)q E xq s 1 x 2 + 1

2E[Ψs 1] ρE[Φs 1]

where the second inequality holds due to γ 1/16L and ρ = max{(1 γµ)q, 1/2}. Recursively applying this inequality for s times outer loops, we have

E[Φs] ρs E[Φ0].

Since Ψs 0, we can obtain E xq s x 2 ρs E[Φ0].

Due to the L-smoothness of f, we have

E xq s x 2 ρs(1 + 12L2γ2Vq)E xq 0 x 2.

GGD: Grafting Gradient Descent

B.2 Proof of Corollary 22

Proof To obtain an ϵ-optimal solution, we should ensure

ρT (1 + 12L2γ2Vq)E xq 0 x 2 ϵ.

Since q = n and γ = 1/16L, we can obtain

ρT 64 + 3Vq

E xq 0 x 2 ϵ.

Rearranging the terms and taking logarithm on the both side, we have

T ln ρ ln (64 + 3Vq)E xq 0 x 2

which is equivalent to

T 1 ln(1/ρ) ln (64 + 3Vq)E xq 0 x 2

Recalling that ρ = max{(1 γµ)q, 1/2}, then we have

1 ln(1/ρ) = max

n 1 ln 1 1 16κ , 1

Since ln x x 1 for all x > 0, (52) can be upper bounded by

n 1 ln 1 1 16κ , 1

= 1 ln(1/ρ).

Then as long as the iteration number for the outer loop T satisfies

n , 2 ln (64 + 3Vq)E xq 0 x 2

GGD-WR-SVRG can achieve the ϵ-optimality. Noticing that within one outer loop, the number of partial derivative evaluations is n(1 + mb)d. Then the total complexity is

2(1 + bm)d max {8κ, n} ln (64 + 3Vq)E xq 0 x 2

Feng and Zhou

B.3 Proof of Theorem 23

Proof From the first inequality of (48) and the convexity of f, we know that

E xk+1 s x 2 xk s x 2 2γ (1 4Lγ) f(xk s) f(x ) + 12Lγ2 (f( x) f(x ))

xk s x 2 2γ(1 10Lγ) f(xk s) f(x ) + 12Lγ2 (f( x) f(x ))

12Lγ2 f(xk s) f(x ) .

Rearranging the terms, taking the total expectation, we obtain

2γ(1 10Lγ)E h f(xk s) f(x ) i E xk s x 2 + 12Lγ2E [f( x) f(x )]

E xk+1 s x 2 12Lγ2E h f(xk s) f(x ) i . (53)

Consider the option (b) in the outer loop and define Lyapunov function P s as follows,

P s E x0 s+1 x 2 + 12Lγ2q E [f( xs) f(x )] 0.

Sum (53) recursively over k = 0, 1, ...q 1. We obtain

k=0 E h f(xk s) f(x ) i P s 1 P s.

Summing over T outer loop, we have

k=0 E h f(xk s) f(x ) i P 0 P T P 0.

Dividing q T on the both side, we obtain

f(xk s) f(x ) #

2q Tγ(1 10Lγ).

Denoting ˆx = 1 q T PT s=1 Pq 1 k=0 xk s, due to the convexity of the function f, we can conclude

E [f(ˆx) f(x )] P 0

2q Tγ(1 10Lγ) = E x0 x 2 + 12Lγ2q E [f( x0) f(x )]

2q Tγ(1 10Lγ) . (54)

B.4 Proof of Corollary 24

Proof Substitute γ = 0.05/L into (54), we obtain

E [f(ˆx) f(x )] 20LE x0 x 2 + 0.6q E [f(x0) f(x )]

GGD: Grafting Gradient Descent

If 20LE x0 x 2/q T ϵ/2 and 0.6 [f(x0) f(x )] /T ϵ/2, we can ensure that the expected optimality gap E [f(ˆx) f(x )] is not greater than any given positive real number. To achieve ϵ-optimality, iteration number of outer loop T should satisfy

T max 40LE x0 x 2

nϵ , 1.2E [f( x0) f(x )]

Since q = n, the total complexity is

nd(1 + bm) max 40LE x0 x 2

nϵ , 1.2E [f( x0) f(x )]

B.5 Proof of Theorem 25

Proof Define Lyapunov function

Rk s = E h f(xk s) + ηk xk s x 2i ,

where ηk is defined in (11). From L-smoothness, we can obtain

f(xk+1 s ) f(xk s) + γ f(xk s), xk+1 s xk s + L

2 xk+1 s xk s 2.

Taking expectation condition on xk s and all past, we have

E h f(xk+1 s ) i h f(xk s) i γ f(xk s) 2 + γ2L

2 E gk m,b 2. (55)

Recalling the proof of Theorem 21, we can bound E gk m,b 2 using (40) where the term E gm,b(xk s) gm,b( x) 2 can be bounded by

E gm,b(xk s) gm,b( x) 2 n(m 1)

bm(n 1) f(xk s) f( x) 2

+ n m bm(n 1) + b 1

i=1 fi(xk s) fi( x) 2

bm(n 1)L2 xk s x 2

+ n m bm(n 1) + b 1

i=1 L2 xk s x 2

= L2 xk s x 2. (56)

The term f( x) 2 can be bounded by

f( x) 2 2 f(xk s) f( x) 2 + 2 f(xk s) 2 2L2 xk s x 2 + 2 f(xk s) 2. (57)

Feng and Zhou

Combining (56) and (57), we can bound E gk m,b 2 by

E gk m,b 2 6L2 xk s x 2 + 4 f(xk s) 2. (58)

Now let us set this result aside, E xk+1 s x 2 can be bounded by

E xk+1 s x 2 = E xk+1 s xk s + xk s x 2

= γ2E gk m,b 2 + xk s x 2 + 2γ f(xk s), xk s x

γ2E gk m,b 2 + xk s x 2 + γ

τ f(xk s) + τγ xk s x 2, (59)

where the last inequality holds because

τ > 0, x, y R, xy 1

From (55) and (59), taking total expectation, we can bound Rk+1 s by

Rk+1 s = E h f(xk+1 s ) i + ηk+1E xk+1 s x 2

E h f(xk s) i γE f(xk s) 2 + γ2L

2 E gk m,b 2

+ ηk+1 γ2E gk m,b 2 + E xk s x 2 + γ

τ E f(xk s) 2 + τγE xk s x 2 .

Combining (58), we can obtain

Rk+1 s E h f(xk s) i γE f(xk s) 2 + 3γ2L3E xk s x 2 + 2γ2LE f(xk s) 2

+ ηk+1γ2 6L2E xk s x 2 + 4E f(xk s) 2

+ (1 + τγ)ηk+1E xk s x 2 + ηk+1γ

τ E f(xk s) 2

= E h f(xk s) i γ 2γ2L 4γ2ηk+1 ηk+1γ

E f(xk s) 2

+ 3γ2L3 + 6γ2L2ηk+1 + (1 + τγ)ηk+1 E xk s x 2

= Rk s γ 2γ2L 4γ2ηk+1 ηk+1γ

E f(xk s) 2.

Rearranging the terms, we have γ 2γ2L 4γ2ηk+1 ηk+1γ

E f(xk s) 2 Rk s Rk+1 s .

Since {ηk} is decreasing, we can obtain γ 2γ2L 4γ2η0 η0γ

E f(xk s) 2 Rk s Rk+1 s .

Summing this recursion over q inner loops, noting that xs 1 = xq s 1 = x0 s and ηq = 0, we have

γ 2γ2L 4γ2η0 η0γ

k=0 E f(xk s) 2 R0 s Rk+1 s = E [f( xs 1) f( xs)] .

GGD: Grafting Gradient Descent

Summing this recursion over T outer loops, we have

γ 2γ2L 4γ2η0 η0γ

k=0 E f(xk s) 2 E f( x0) f(xq T ) E [f( x0) fmin] .

Dividing q T on the both side, rearranging the terms, we can obtain

min k=0,1,...,q 1 s=1,2,...,T E f(xk s) 2 1

k=0 E f(xk s) 2 E f(x0 1) fmin

q Tγ(1 2γL 4γη0 η0/τ).

B.6 Proof of Corollary 26

Proof To achieve the ϵ-optimality, we need to ensure

E f(x0 1) fmin

q Tγ(1 2γL 4γη0 η0/τ) ϵ.

Rearranging the terms, we have

T E f(x0 1) fmin

qγ 1 1 2γL 4γη0 η0/τ . (61)

The second term on the right side of (61) can be upper bounded by

n1/3 . (62)

η0 can be upper bounded by

η0 = 3γ2L3 (1 + τγ + 6γ2L2)q 1

τγ + 6γ2L2 = 3ψ2L

1 + ψ/n + 6ψ2/n4/3 n/7ψ 1

ψ/n + 6ψ2/n4/3

1 + ψ/n + 6ψ2/n4/3 n/7ψ 1

ψn1/3 + 6ψ2

3ψ2L(1 + 7ψ/n) n/7ψ 1

where the first inequality holds because ψ/n+6ψ2/n4/3 7ψ/n for ψ < 1 and ψn1/3+6ψ2 > ψn1/3, the last inequality holds because (1 + 1/x)x e. Combining the definition of ψ, we know that

1 2γL 4γη0 η0/τ 1 2ψ

n2/3 12ψ(e 1)

4 3ψ(e 1) 1

Feng and Zhou

Combining (62) and (63), we can upper bound the right side of (61) by

E f(x0 1) fmin

n1/3 E f(x0 1) fmin

qγ 1 1 2γL 4γη0 η0/τ .

Thus as long as we keep the iteration number of outer loop

T E f(x0 1) fmin

non-convex GGD-WR-SVRG can achieve ϵ-optimality, which indicates that the total complexity is

(n + n/7ψ mb)d 14LE f(x0 1) fmin

B.7 Proof of Theorem 28

Proof Most of this proof follows the analysis provided by D efossez et al. (2020) except that we derive the theoretical bound of GGD-WR-Adam using the original form of stepsize given by Kingma and Ba (2014) instead of a simplified stepsize given by D efossez et al. (2020). Denote

Gn = f(xn 1) and ζn,(i) = hn,(i) pσ + vn,(i) and ξn,(i) = gn,(i) pσ + vn,(i) ,

and define vn,k Rd as

vn,k,(i) = βk 2vn k,(i) + En k 1

j=n k+1 βn j 2 g2 j,(i)

where En k 1[ ] represents the expectation condition on all information before n k-th iteration. Before deriving the theoretical bound for GGD-WR-Adam, we first prove some useful lemmas.

Lemma 33 Suppose that Assumption 27 holds, the objective function f and the individual loss function fi are L-smooth and 0 < β1 < β2 < 1, then for all iterations n N + generated by Algorithm 3, it satisfies

i=1 Gn,(i) hn,(i) pσ + vn,(i)

" G2 n k,(i) pσ + vn,k+1,(i)

γ2 max L2 (1 β1)1/2

k=0 βk 1 k k + 1

l=1 E ζn l 2

3R (1 β)1/2

k (k + 1)E ξn k 2 !

GGD: Grafting Gradient Descent

Proof For some n N +, we have

i=1 Gn,(i) hn,(i) pσ + vn,(i) =

k=0 βk 1Gn k,(i) gn k,(i) pσ + vn,(i) | {z } A

k=0 βk 1 Gn,(i) Gn k,(i) gn k,(i) pσ + vn,(i) | {z } B

τ = (1 β1)1/2

2R(k + 1) , x = |gn k,(i)| pσ + vn,(i) and y = |Gn,(i) Gn k,(i)|.

Using (60), we have

4R(k + 1) Gn,(i) Gn k,(i) 2 + R(k + 1)

(1 β1)1/2 g2 n k,(i) σ + vn,(i)

Since σ + vn,(i) σ + βk 2vn k,(i) βk 2(σ + vn k,(i), we can obtain

4R(k + 1) Gn,(i) Gn k,(i) 2 + R(k + 1)

(1 β1)1/2 ξ2 n k,(i)

From the L-smoothness of objective function f, we have

Gn Gn k 2 L2 xn 1 xn k 1 2 = L2

l=1 γn lζn l

l=1 ζn l 2, (66)

where γmax = max{γk}. Substituting (66) into (65), we have

|B| γ2 max L2 n 1 X

k=0 βk 1 (1 β1)1/2k

+ R (1 β1)1/2

k (k + 1) ξn k 2. (67)

Now we turn to deal with the term A. For simplicity, we drop the indices for now and denote G = Gn k,(i), g = gn k,(i), v = vn,k+1,(i), v = vn,(i),

j=n k βn j 2 g2 j,(i) and r2 = En k 1θ2.

Feng and Zhou

Since v v = r2 θ2, taking total expectation, we can rewrite the inside terms of term A as

E G g σ + v

= E G g σ + v + Gg 1 σ + v 1 σ + v

+ Gg θ2 δ2 σ + v σ + v( σ + v + σ + v)

= E G2 σ + v

+ E[Gg θ2 δ2 σ + v σ + v( σ + v + σ + v) | {z } C

Now we take a look at C,

|C| |Gg| r2

(σ + v)1/2(σ + v) | {z } D

(σ + v)(σ + v)1/2 | {z } E

due to the fact that (σ + v)1/2 + (σ + v)1/2 max{(σ + v), (σ + v)} and |r2 θ2| r2 + θ2. Now using (60) again, if we let

τ = 1 β1 σ + v 2 , x = |g|r2 σ + v σ + v and y = |G| σ + v,

we can obtain

4 σ + v + 1 1 β1

g2r4 σ + v σ + v.

Given that σ + v r2, taking conditional expectation, we have

En k 1[D] G2

4 σ + v + 1 1 β1

r2 σ + v En k 1

Term E can be bounded in the similar way. If we let

τ = 1 β1 σ + v 2r2 , x = |θg|

σ + v and y = |Gθ| σ + v,

we can obtain

r2 + 1 1 β1

r2 σ + v g2θ2

(σ + v)2 . (70)

Considering that σ + v θ2 and E[θ2/r2] = 1, taking conditional expectation, we have

En k 1[E] G2

4 σ + v + 1 1 β1

r2 σ + v En k 1

Noticing that in (70), we possibly divide by zero. It is suffice to notice that if r2 = 0, then θ2 = 0 a.s. so that E = 0 and (71) still holds. Summing (69) and (71), we have

En k 1 [|C|] G2

2 σ + v + 2 1 β1

σ + v En k 1

GGD: Grafting Gradient Descent

Since r σ + v, and from the definition of r, we know that r (k + 1)R. Reintroducing the indices and noticing that σ + vn,(i) σ + βk 2vn k,(i) βk 2(σ + vn k,(i)), we can obtain after taking total expectation,

G2 n k,(i) pσ + vn,k+1,(i)

+ 2R(k + 1) (1 β1)1/2βk 2 E

" g2 n k,(i) σ + vn k,(i)

Substituting (72) into (68), we have

" G2 n k,iβk 1 pσ + vn,k+1,i

" G2 n k,iβk 1 pσ + vn,k,i

+ 2R(k + 1) 1 β1βk 2 E

" g2 n k,iβk 1 σ + vn k,i

" G2 n k,i pσ + vn,k+1,i

k (k + 1)E ξn k 2 !

Combining (67) and (73), we prove Lemma 33.

Lemma 34 Suppose that 0 < β1 < β2 1. For a sequence of real numbers {an}, denoting bn = Pn j=1 βn j 2 a2 j and cn = Pn j=1 βn j 1 aj, then we have

c2 j σ + bj 1 (1 β1)(1 β1/β2)

Proof For some j N +, j n, using Jensen inequality, we have

c2 j 1 1 β1

l=1 βj l 1 a2 l ,

so that c2 j σ + bj 1 1 β1

l=1 βj l 1 a2 l σ + bj .

From the definition of bl, for any l j, we know that σ + bj σ + βj l 2 bl βj l 2 (σ + bl). Then we can obtain c2 j σ + bj 1 1 β1

j l a2 l σ + bl .

Summing over all j [n], we have

c2 j σ + bj 1 1 β1

j l a2 l σ + bl

a2 l σ + bl

1 (1 β1) (1 β1/β2)

a2 l σ + bl . (74)

Feng and Zhou

The main term in summation (74) can be bounded by

a2 l σ + bl ln(σ + bl) ln(σ + bl a2 l )

= ln(σ + bl) ln(σ + β2bl 1)

= ln σ + bl

+ ln σ + bl 1

Then (74) can be further bounded by

c2 j σ + bj 1 (1 β1) (1 β1/β2)

a2 l σ + bl

1 (1 β1)(1 β1/β2)

which proves Lemma 34.

Lemma 35 Given 0 < x < 1, we have

n=0 xnn = x (1 x)2 .

Proof Since 0 < x < 1, we know that the infinite series P n=0 xnn is convergent. Denote S(x) = P n=0 xnn, then for any 0 < x < 1, dividing x on the both side, then we have

0 nxn 1dx =

n=0 sn = s 1 s.

Taking derivatives with respect to s, we have

s = 1 (1 s)2 ,

which proves Lemma 35.

Having finished the proof of three useful lemmas, we now formally prove Theorem 28. For some n N +, using L-smoothness of objective function f, we have

f(xn) f(xn 1) γn G n ζn + γ2 n L

GGD: Grafting Gradient Descent

Taking total expectation and using Lemma 33, we have

E [f (xn)] E f xn 1 γn

" G2 n k,(i) 2pσ + vn,k+1,(i)

2 E h ζn 2i

+ γ3 max L2 1 β1

l=1 E ζn l 2 n 1 X

k=l βk 1 k k + 1

+ 3γn R 1 β1

k (k + 1)E ξn k 2 !

Since for any k N and k < n, we have pσ + vn,k+1,(i) R q Pn 1 j=0 βj 2. Denoting Ωn = q Pn 1 j=0 βj 2, we have

E [f (xn)] E f xn 1 γn 2RΩn

k=0 βk 1E Gn k 2 !

2 E h ζn 2 2 i

+ γ3 max L2 1 β1

l=1 E ζn l 2 2

k=l βk 1 k k + 1

+ 3γn R 1 β1

k (k + 1)E ξn k 2 2

Summing over all T iterations, rearranging the terms, noticing that the objective function is bounded below by fmin, we have

k=0 βk 1E Gn k 2

E f x0 fmin + γ2 max L

+ γ3 max L2(1 β1)1/2

l=1 E ζn l 2 n 1 X

k=l βk 1 k k + 1 | {z } H

+ 3γmax R (1 β1)1/2

k (k + 1)E ξn k 2

Now we proceed these terms sequentially. Note that γn/Ωn in term F can be lower bounded by

γn Ωn = γ 1 β1 (1 β2)1/2 (1 βn 2 )1/2

(1 βn 2 )1/2 = γ 1 β1

1 βn 1 γ(1 β1).

Feng and Zhou

Using the change of index j = n k, we can obtain

j=1 βn j 1 E Gj 2

j=1 E Gj 2 T X

1 βT j+1 1 E Gj 2

1 βT j+1 1 E f xj 1 2

1 βT j 1 E f xj 2 .

Considering the random index ω defined in Theorem 28 and noticing that

j=0 (1 βT j 1 ) = T β1 1 βT 1 1 β1 T β1 1 β1 = T,

2RE f(xω) 2. (76)

Then we take a look at term G. Using Lemma 34, we have

G γ2 max L 2(1 β1)(1 β1/β2)

ln 1 + v T,(i)

Noting that γmax is equivalent to

γ2 n = γ2 (1 β1)2

(1 β2) 1 βn 2 (1 βn 1 )2 γ2 (1 β1)2

(1 β2) 1 (1 βn 1 )2

γ2 1 (1 β2)(1 + β1)2 γ2

1 β2 = γ2 max,

where the last inequality holds because (1 βn 1 )2 = (1 β1)2(Pn 1 k=0 βk 1)2 (1 β1)2(1+β1)2, we can thus obtain

G γ2L 2(1 β1)(1 β2)(1 β1/β2)

ln 1 + v T,(i)

T ln(β2) . (77)

GGD: Grafting Gradient Descent

Next we proceed with term H. Replacing the index j with n l, we have

H = γ3 max L2(1 β1)1/2

j=1 E ζn l 2 n 1 X

k=n j βk 1 k k + 1

= γ3 max L2(1 β1)1/2

j=1 E ζj 2 T X

k=n j βk 1 k k + 1

= γ3 max L2(1 β1)1/2

j=1 E ζj 2 T 1 X

k=0 βk 1 k k + 1

= γ3 max L2(1 β1)1/2

j=1 E ζj 2 T 1 X

γ3 max L2(1 β1)1/2

4R β1 (1 β1)2

4R (1 β1)1/2

(1 β2)3/2 1 (1 β1)2

j=1 E ζj 2 = γ3L2

4R 1 (1 β2)3/2(1 β1)3/2

j=1 E ζj 2,

where the first inequality holds due to Lemma 35. Using Lemma 34 again, we can obtain

4R 1 (1 β2)3/2(1 β1)3/2

4R 1 (1 β2)3/2(1 β1)5/2(1 β1/β2)

ln 1 + v T,(i)

T ln(β2) . (78)

Finally, we can move on to term I. Replacing index j with n k, we can obtain

I 3γmax R (1 β1)1/2

k (k + 1)E ξn k 2

3γmax R (1 β1)1/2

j=1 E ξj 2 T X

n j (n j + 1)

3γR (1 β1)1/2(1 β2)1/2(1 β1/β2)2

j=1 E ξj 2.

Again using Lemma 34, we can obtain

I 3γR (1 β1)3/2(1 β2)1/2(1 β1/β2)3

ln 1 + v T,(i)

T ln(β2) (79)

Noticing that we have v T,(i) R2/1 β2 for any i [d], substituting (76), (77), (78), (79) into (75), we can derive the desired result

E f(xω) 2 2RE f(x0) fmin

Feng and Zhou

J = γd RL (1 β1)(1 β2)(1 β1/β2) + γ2d L2

2(1 β2)3/2(1 β1)5/2(1 β1/β2)

(1 β1)3/2(1 β2)1/2(1 β1/β2)3 .

B.8 Proof of Corollary 29

Proof From (12), we know that as long as

2RE f(x0) fmin

T ln 1 + TR2

we can ensure the ϵ-optimality. The first inequality in (80) indicates that

T 36R2 Ef(x0) fmin 2

γ2ϵ2 . (81)

The left side of second inequality in (80) can be upper bound by

T ln 1 + TR2

= K ln 1 + TR2

1/2 K 1 + R2

1/2 ln 1 + TR2

1/2 ln 1 + TR2

where the last inequality holds due to ln(x)/xϕ 1/(ϕe) for ϕ (0, 1/2). Thus we only need to ensure that 3K ϕeϵ

1/2 (1 + TR2

which is equivalent to

2 1 2ϕ 1 + R2

Since the second inequality can be bounded by ϵ/3, the third inequality can be bounded by ϵ/3 as well. Combining (81) and (82), we have

( 36R2 Ef(x0) fmin 2

2 1 2ϕ 1 + R2

GGD: Grafting Gradient Descent

Data set GGD MBSGD SGD-IS SVRG GSVRG Adam GAdam

covtype 1,288.7 1,077.8 4,512.5 2,249.6 3,042.1 1,002.6 1,388.5 ijcnn1 302.8 230.3 527.2 369.1 496.28 280.1 373.3 a9a 165.7 118.4 318.7 318.7 386.6 125.8 172.6 rcv1 649.7 388.3 762.9 1,320.3 2,280.6 516.5 796.8 MNIST 942.1 766.9 ** 988.2 1,835.4 721.2 982.2 CIFAR-10 9,849.7 7,718.2 ** 15,361.5 19,981.2 8,026.5 10,662.3

Table 4: Summary of the execution time (s)

Appendix C. Additional Experiment Results and Network Architectures

In Appendix C, we present some additional experiment results to complement our discussions and provide the convolution neural network architectures used in Section 6.

C.1 Execution Time and Time Plots

All the experiments are carried out on a personal computer (3.70 GHz 12th Gen Intel Core i5 with 16 GB RAM and NVIDIA RTX 3080). Table 4 presents the execution time of our programs. From these results, we can see that in general, the increasing in the size of data sets and the complexity of model, may lead to a longer time that CPU and GPU charge for execution. The execution time for training CNNs using SGD-IS is not recorded here as the importance sampling probability is not available.

To further explore the relation of execution time and the performance, we present the time plots which depict the trade offs between the execution time (x-axis) and the train loss, square norm of full gradient and test loss (y-axis) respectively in Figures 12 and 13 where we sketchily show the efficiency of different algorithms in practical use. From Figure 12 and Table 4, we can see that for GAdam and Adam, since grafting gradient based methods take more time to finish one epoch of training, GAdam is little worse than Adam at the beginning. However, after about 500 seconds of training, GAdam catches up with Adam and their performances become quite indistinguishable. The comparison between GSVRG and SVRG is slightly different in two counts. One is that GSVRG achieves a lower train loss after about 750 seconds of training and the other is that SVRG achieves a much lower test loss. When comparing GGD with MBSGD and SGD-IS, we find that although GGD decreases slower than those two methods, it achieves comparable performances after 700 seconds of training. For training cifar10-nv on CIFAR-10 data set, generally speaking, grafting gradient based methods are more efficient than stochastic gradient based methods. Especially in Figures 13(g) and 13(i), GGD-WR-SVRG shows significant improvement over SVRG. Combining these two figures and the results in Section 6, we can conclude that in practice, it is worth using grafting gradient to update model parameters as its improvements over stochastic gradient based methods can compensate the time cost to some extent.

Feng and Zhou

Figure 12: Comparisons of the train loss (left), square norm of full gradient (middle) and test loss (right) on rcv1.

C.2 Impact of Minibatching on GGD and SGD-IS

We also run the addtional experiments which are used to compare the impact of the subsampled set size m on GGD and SGD with importance sampling. For the latter one, we use the average of gradients which are sampled from population with probability Pri Lri to update the model parameters. These additional experiments are run to solve the same L2-regularized logistic regression problems on ijcnn and a9a data sets. For fair comparison, t-inverse learning schedule is used both for mini-batch SGD with importance sampling and GGD, the learning rates of two methods with same subsampled set size m are set to be the same. We present the empirical results in Figures 14 and 15.

GGD: Grafting Gradient Descent

Figure 13: Comparisons of the train loss (left), square norm of full gradient (middle) and test loss (right) on CIFAR-10.

The suffix 2 k for k {4, 5, 6, 7} represents the subsampled set size m (ranging from 16 to 128) used in algorithms. From these results, we can see that when the performance of MBSGD is quite close to the performance of SGD with importance sampling as shown in Figure 4(b), GGD does not benefit a lot from increasing mini-batch size as shown in Figure 15. Coincidentally from Figure 14 we can see that GGD with larger subsampled set size m can achieve a lower train loss and test loss when the performance of SGD-IS is better than the performance of mini-batch SGD. As for SGD with importance sampling, results in Figures 14 and 15 show that SGD with importance seems to be insusceptible to minibatching technique. Combining Figures 14, 15 and the corresponding results in Figure

Feng and Zhou

Figure 14: Comparisons of the train loss (left), square norm of full gradient (middle) and test loss (right) on ijcnn.

4, we can conclude that when mini-batch SGD performs quite similarily to SGD with importance sampling, that is, when using minibatching technique is not more profitable than using importance sampling technique solely, GGD does not benefit from minibatching technique more than SGD with importance sampling does. On the contrary, when SGD with importance sampling outperforms the mini-batch SGD, minibatching technique seems to be more impactful when applying to GGD as increasing subsampled set size m can results in the improvement of GGD methods. This is expected since increasing subsampled set size m can reduce the noise variance of mini-batch SGD, that is, improving the worse bound for the noise variance of grafting gradient so that the performance of GGD can be improved as well.

C.3 Additional Experiments on Ridge Regression

To further check the performances of grafting gradient based methods on hard problem, that is, the ill-conditioned problem with extremely large condition number κ. We additionally run experiments to solve ridge regression problem given by

fi(x) = (zi a i x)2 + λ

GGD: Grafting Gradient Descent

Figure 15: Comparisons of the train loss (left), square norm of full gradient (middle) and test loss (right) on a9a.

Data set d n (train) Sparsity n (test) L max L κ

ONR 58 31,715 72.01% 7,929 0.0004 2 1.34221 107

Table 5: Summary of ONR data set

The objective function thus can be written in matrix form as

n(z Ax) (z Ax) + λ

2 x x, (83)

where z(n 1) and A(n d) are data samples from Online News Popularity (ONR for short) data set (Fernandes and Sernadela, 2015) with z indicating the responses to be predicted and A being the data matrix. The features and targets are normalized to the interval [0, 1]. Since ONR does not have a testing set, we randomly split it into the training set and testing set with 80% for training and 20% for testing. We know that the condition number of (83) admits a closed form expression κ = λmax(H)/λmin(H), where H is the hessian matrix of (83), which can be written as H = (2/n)A A + λ I, and λmax, λmin represent the largest and smallest eigenvalue of given matrix. The penalty parameter λ is set to be 10 6 so that the corresponding condition number can be extremely large. We report the condition number of (83) and some other information about ONR data set in Table 5.

Feng and Zhou

Figure 16: Comparisons of the train loss (left), square norm of full gradient (middle) and test loss (right) on ONR.

In the following experiments, we compare the performances of GGD, MBSGD and SGDIS with diminishing stepsize sequences, the performances of Adam and GAdam and the performances of SVRG and GSVRG. For fair comparison, the subsampled set size m is set to be 32 and the batch size b = 2 for grafting gradient based methods. The mini-batch size of mini-batch SGD, SVRG and Adam is set to be 64 and the one-shot importance sampling probability is given by Pri Lri. The t-inverse learning rate schedule is used for methods with diminishing stepsize sequences. For variance reduction methods, the update period is set to be n/m and for adaptive stepsize methods, the hyperparameters are set by default. The results are presented in Figure 16.

GGD: Grafting Gradient Descent

LAYER SHAPE OUTPUT

data layer 1 32 32 conv1-BN-Re LU 6 5 5 6 28 28 MAX-pool2 6 2 2 6 14 14

conv3-BN-Re LU 16 5 5 16 10 10 MAX-pool4 16 2 2 16 5 5 Fully-connect5-Re LU 240 120 Fully-connect6-Re LU 120 84 Fully-connect7 84 10 softmax-loss 10 10

Table 6: Complete architecture for Le Net-5

From these results, we can see that grafting gradient based methods can still achieve comparable performances even when the condition number is extremely large. Results from Figures 16(a) and 16(c) again confirm the doubly robust property possessed by GGD. The performances of MBSGD and GGD are quite similar in terms of train loss, and GGD and SGD-IS achieves a lower test loss. However, SGD-IS does not perform well in terms of train loss. In Figrue 16(a), the improvement of GGD over SGD-IS is significant. In our view, the reason behind the bad performances of SGD-IS on ONR data set may be oversampling. From Table 5, Lmax/ L 5000 indicates that some data of great importance, i.e., with relatively large L constant, may be sampled over and over again. Hence, the training process of SGD-IS is more like minimizing the loss with respect to a small subset which contains important data samples instead of minimizing the average loss of whole training data set. In constrast, over-sampling important samples is less likely to occur in the training process of GGD as illustrated in our toy example. As for the comparison between Adam and GAdam, their performances are quite similar. For the variance reduction methods, the improvement of GSVRG over SVRG is obvious. GSVRG achieves a lower train loss and f(xk) 2. Although GSVRG and SVRG achieve comparable test losses, GSVRG uses less epoches to reach that goal.

C.4 Test Loss Results and Network Architectures

Figures 17, 18 and 19 plot the test losses of previous experiments given in Section 6. From these results we can see that when minimizing the L2-regularized logistic loss, the performances of GGD based methods and SGD based methods are quite similar except that GGD achieves a lower test loss for covtype and rcv1 data set compared with vanilla SGD, and GGD with diminishing stepsize sequence achieve a lower test loss for ijcnn data set compared with SGD with importance sampling. When training CNNs on MNIST and CIFAR-10 data sets, only SVRG achieves lower test loss than GSVRG for MNIST data set and Adam achieves lower test loss than GAdam for CIFAR-10 data set. In the end, the complete architectures for two CNNs are listed in Tables 6 and 7. We use tuple (C H W) in which C represents the number of channels, H is the height in pixels and W is the width in pixels to show the shape of input data and layers of two CNNs.

Feng and Zhou

Figure 17: Comparisons of the test loss between SGD, GGD, mini-batch SGD, SGD with importance sampling methods on ijcnn1, a9a, covtype and rcv1.

GGD: Grafting Gradient Descent

Figure 18: Comparisons of the test loss between Adam, GAdam, SVRG and GSVRG methods on ijcnn1, a9a, covtype and rcv1.

Feng and Zhou

Figure 19: Comparisons of the test loss between MBSGD and GGD, Adam and GAdam, SVRG and GSVRG methods respectively on MNIST and CIFAR-10.

GGD: Grafting Gradient Descent

LAYER SHAPE OUTPUT

data layer 3 28 28 conv1-Re LU 128 3 3 128 28 28 conv2-Re LU 128 3 3 128 28 28 conv3-BN-Re LU 128 3 3 128 28 28 MAX-pool3 128 3 3 128 14 14

conv4-Re LU 256 3 3 256 14 14 conv5-Re LU 256 3 3 256 14 14 conv6-BN-Re LU 256 3 3 256 14 14 MAX-pool6 256 3 3 256 7 7 conv7-Re LU 320 3 3 320 5 5 conv8-Re LU 320 1 1 320 5 5 conv9-Re LU 10 1 1 10 5 5 AVE-pool9 10 5 5 10 1 1 softmax-loss 10 10

Table 7: Complete architecture for cifar10-nv

Zeyuan Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. The Journal of Machine Learning Research, 18(1):8194 8244, 2017.

Zeyuan Allen-Zhu and Elad Hazan. Variance reduction for faster non-convex optimization. In International conference on machine learning, pages 699 707. PMLR, 2016.

Peva Blanchard, El Mahdi El Mhamdi, Rachid Guerraoui, and Julien Stainer. Machine learning with adversaries: Byzantine tolerant gradient descent. Advances in neural information processing systems, 30, 2017.

L eon Bottou, Frank E. Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. Siam Review, 60(2):223 311, 2018.

Andrew Cotter, Ohad Shamir, Nati Srebro, and Karthik Sridharan. Better mini-batch algorithms via accelerated gradient methods. Advances in Neural Information Processing Systems, 24, 2011.

Dominik Csiba and Peter Richt arik. Importance sampling for minibatches. The Journal of Machine Learning Research, 19(1):962 982, 2018.

Aaron Defazio and L eon Bottou. On the ineffectiveness of variance reduced optimization for deep learning. Advances in Neural Information Processing Systems, 32, 2019.

Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. Advances in Neural Information Processing Systems, 27, 2014.

Feng and Zhou

Alexandre D efossez, L eon Bottou, Francis Bach, and Nicolas Usunier. A simple convergence proof of adam and adagrad. ar Xiv preprint ar Xiv:2003.02395, 2020.

John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(7), 2011.

Ayoub El Hanchi and David Stephens. Adaptive importance sampling for finite-sum optimization and sampling with decreasing step-sizes. Advances in Neural Information Processing Systems, 33:15702 15713, 2020.

Vinagre Pedro Cortez Paulo Fernandes, Kelwin and Pedro Sernadela. Online News Popularity. UCI Machine Learning Repository, 2015. DOI: https://doi.org/10.24432/C5NS3V.

Nidham Gazagnadou, Robert Gower, and Joseph Salmon. Optimal mini-batch and step sizes for saga. In International Conference on Machine Learning, pages 2142 2150. PMLR, 2019.

Igor Gitman and Boris Ginsburg. Comparison of batch normalization and weight normalization algorithms for the large-scale image classification. ar Xiv preprint ar Xiv:1709.08145, 2017.

Eduard Gorbunov, Filip Hanzely, and Peter Richt arik. A unified theory of sgd: Variance reduction, sampling, quantization and coordinate descent. In International Conference on Artificial Intelligence and Statistics, pages 680 690. PMLR, 2020.

Robert M. Gower, Nicolas Loizou, Xun Qian, Alibek Sailanbayev, Egor Shulgin, and Peter Richt arik. Sgd: General analysis and improved rates. In International Conference on Machine Learning, pages 5200 5209. PMLR, 2019.

Robert M. Gower, Mark Schmidt, Francis Bach, and Peter Richt arik. Variance-reduced methods for machine learning. Proceedings of the IEEE, 108(11):1968 1983, 2020.

Moritz Hardt, Ben Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. In International Conference on Machine Learning, pages 1225 1234. PMLR, 2016.

Samuel Horv ath and Peter Richt arik. Nonconvex variance reduced optimization with arbitrary sampling. In International Conference on Machine Learning, pages 2781 2789. PMLR, 2019.

Xinmeng Huang, Kun Yuan, Xianghui Mao, and Wotao Yin. An improved analysis and rates for variance reduction under without-replacement sampling orders. Advances in Neural Information Processing Systems, 34:3232 3243, 2021.

Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. Advances in Neural Information Processing Systems, 26, 2013.

Tyler B Johnson and Carlos Guestrin. Training deep models faster with robust, approximate importance sampling. Advances in Neural Information Processing Systems, 31, 2018.

GGD: Grafting Gradient Descent

Angelos Katharopoulos and Fran cois Fleuret. Not all samples are created equal: Deep learning with importance sampling. In International conference on machine learning, pages 2525 2534. PMLR, 2018.

Ahmed Khaled and Peter Richt arik. Better theory for sgd in the nonconvex world. ar Xiv preprint ar Xiv:2002.03329, 2020.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ar Xiv preprint ar Xiv:1412.6980, 2014.

Jakub Koneˇcn y, Jie Liu, Peter Richt arik, and Martin Tak aˇc. Mini-batch semi-stochastic gradient descent in the proximal setting. IEEE Journal of Selected Topics in Signal Processing, 10(2):242 255, 2015.

Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.

Nicolas Le Roux, Mark Schmidt, and Francis Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. Advances in Neural Information Processing Systems, 25, 2012.

Yann Le Cun, L eon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278 2324, 1998.

Yann Le Cun, Corinna Cortes, and CJ Burges. Mnist handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010.

Yanli Liu, Yuan Gao, and Wotao Yin. An improved analysis of stochastic gradient descent with momentum. Advances in Neural Information Processing Systems, 33:18261 18271, 2020.

Grigory Malinovsky, Alibek Sailanbayev, and Peter Richt arik. Random reshuffling with variance reduction: New analysis and better rates. ar Xiv preprint ar Xiv:2104.09342, 2021.

Konstantin Mishchenko, Ahmed Khaled, and Peter Richt arik. Random reshuffling: Simple analysis with vast improvements. Advances in Neural Information Processing Systems, 33:17309 17320, 2020.

Thomas M uller, Brian Mc Williams, Fabrice Rousselle, Markus Gross, and Jan Nov ak. Neural importance sampling. ACM Transactions on Graphics (To G), 38(5):1 19, 2019.

Deanna Needell, Rachel Ward, and Nati Srebro. Stochastic gradient descent, weighted sampling, and the randomized kaczmarz algorithm. Advances in Neural Information Processing Systems, 27, 2014.

Yurii Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2003.

Feng and Zhou

Lam M. Nguyen, Jie Liu, Katya Scheinberg, and Martin Tak aˇc. Sarah: A novel method for machine learning problems using stochastic recursive gradient. In International Conference on Machine Learning, pages 2613 2621. PMLR, 2017.

Michael JD Powell. On search directions for minimization algorithms. Mathematical programming, 4:193 201, 1973.

Xun Qian, Zheng Qu, and Peter Richt arik. Saga with arbitrary sampling. In International Conference on Machine Learning, pages 5190 5199. PMLR, 2019.

Sashank J. Reddi, Ahmed Hefny, Suvrit Sra, Barnab as P oczos, and Alex Smola. Stochastic variance reduction for nonconvex optimization. In International Conference on Machine Learning, pages 314 323. PMLR, 2016.

Sashank J Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of adam and beyond. ar Xiv preprint ar Xiv:1904.09237, 2019.

Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400 407, 1951.

Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1):83 112, 2017.

Othmane Sebbouh, Nidham Gazagnadou, Samy Jelassi, Francis Bach, and Robert Gower. Towards closing the gap between the theory and practice of svrg. Advances in neural information processing systems, 32, 2019.

Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro, and Andrew Cotter. Pegasos: Primal estimated sub-gradient solver for svm. Mathematical Programming, 127(1):3 30, 2011.

Fanhua Shang, Kaiwen Zhou, Hongying Liu, James Cheng, Ivor W. Tsang, Lijun Zhang, Dacheng Tao, and Licheng Jiao. Vr-sgd: A simple stochastic variance reduction method for machine learning. IEEE Transactions on Knowledge and Data Engineering, 32(1): 188 202, 2018.

Tijmen Tieleman, Geoffrey Hinton, et al. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2):26 31, 2012.

Phuong Thi Tran et al. On the convergence proof of amsgrad and a new version. IEEE Access, 7:61706 61716, 2019.

Rachel Ward, Xiaoxia Wu, and Leon Bottou. Adagrad stepsizes: Sharp convergence over nonconvex landscapes. The Journal of Machine Learning Research, 21(1):9047 9076, 2020.

Zhuang Yang, Zengping Chen, and Cheng Wang. Accelerating mini-batch sarah by step size rules. Information Sciences, 558:157 173, 2021.

GGD: Grafting Gradient Descent

Dong Yin, Yudong Chen, Ramchandran Kannan, and Peter Bartlett. Byzantine-robust distributed learning: Towards optimal statistical rates. In International Conference on Machine Learning, pages 5650 5659. PMLR, 2018.

Matthew D. Zeiler. Adadelta: an adaptive learning rate method. ar Xiv preprint ar Xiv:1212.5701, 2012.

Peilin Zhao and Tong Zhang. Stochastic optimization with importance sampling for regularized loss minimization. In International Conference on Machine Learning, pages 1 9. PMLR, 2015.

Kaiwen Zhou, Fanhua Shang, and James Cheng. A simple stochastic variance reduced algorithm with fast convergence rates. In International Conference on Machine Learning, pages 5980 5989. PMLR, 2018.

Fangyu Zou, Li Shen, Zequn Jie, Weizhong Zhang, and Wei Liu. A sufficient condition for convergences of adam and rmsprop. In Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition, pages 11127 11135, 2019.