# Variance Reduction for Faster Non-Convex Optimization

Zeyuan Allen-Zhu (ZEYUAN@CSAIL.MIT.EDU), Princeton University
Elad Hazan (EHAZAN@CS.PRINCETON.EDU), Princeton University

Abstract

We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem the only known theoretical results on first-order non-convex optimization remain full gradient descent, which converges in O(1/ε) iterations for smooth objectives, and stochastic gradient descent, which converges in O(1/ε²) iterations for objectives that are sums of smooth functions.

We provide the first improvement in this line of research. Our result is based on the variance-reduction trick recently introduced to convex optimization, as well as a brand-new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sums of smooth functions, our first-order minibatch stochastic method converges with an O(1/ε) rate and is faster than full gradient descent by a factor of Ω(n^{1/3}). We demonstrate the effectiveness of our methods on empirical risk minimization with non-convex loss functions and on training neural nets.¹

1 Introduction

Numerous machine learning problems are naturally formulated as non-convex optimization problems. Examples include inference in graphical models, unsupervised learning models such as topic models, dictionary learning, and, perhaps most notably, the training of deep neural networks. Indeed, non-convex optimization for machine learning is one of the field's main research frontiers.

¹ The full version of this paper first appeared online on March 17, 2016 on arXiv: http://arxiv.org/abs/1603.05643. Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR: W&CP volume 48. Copyright 2016 by the author(s).

Since global minimization of non-convex functions is NP-hard, various alternative approaches are applied. For some models, probabilistic and other assumptions on the input can be used to give specially designed polynomial-time algorithms (Arora et al., 2014; 2013; Hsu et al., 2012). However, the multitude and diversity of machine learning applications require a robust, generic optimization method that can be applied as a tool rather than reinvented for each specific model. One approach is the design of global non-convex heuristics such as simulated annealing or Bayesian optimization. Although believed to fail in the worst case due to known complexity results, such heuristics often perform well in practice for certain problems.

Another approach, which rests on a more solid theoretical foundation and is gaining in popularity, is to drop the global-optimality requirement and attempt to reach more modest solution concepts. The most popular of these is the use of iterative optimization methods to reach a stationary point. Stochastic first-order methods are the primary focus of this approach, which has become the most common way of training deep neural nets.

Formally, in this paper we consider the unconstrained minimization problem

$\min_{x \in \mathbb{R}^d} \big\{ f(x) \stackrel{\mathrm{def}}{=} \tfrac{1}{n} \sum_{i=1}^n f_i(x) \big\}$,   (1.1)

where each $f_i(x)$ is differentiable, possibly non-convex, and has an L-Lipschitz continuous gradient
(a.k.a. L-smooth) for some parameter L > 0.² Many machine learning and image processing problems fall into Problem (1.1), including training neural nets, ERM (empirical risk minimization) with non-convex losses, and many others. Following the classical benchmark for non-convex optimization (see for instance (Ghadimi & Lan, 2015)), we focus on algorithms that can efficiently find an approximate stationary point x satisfying $\|\nabla f(x)\|^2 \le \varepsilon$.

² Even if each $f_i(x)$ is not smooth but only Lipschitz continuous, standard smoothing techniques such as Chapter 2.3 of (Hazan, 2015) usually turn each $f_i(x)$ into a smooth function without sacrificing too much accuracy.

Unlike in convex optimization, a point with small gradient may only be close to a saddle point or a local minimum, rather than the global minimum. Therefore, such an algorithm is usually combined with saddle-point or local-minima escaping schemes, such as genetic algorithms or simulated annealing. More recently, Ge et al. (Ge et al., 2015) also demonstrated that a simple noise-addition scheme is sufficient for stochastic gradient descent to escape from saddle points.

However, for the general problem (1.1), where smoothness is the only assumption and finding an approximate stationary point is the only goal, the only known theoretical convergence results remain those for gradient descent (GD) and stochastic gradient descent (SGD).

Given a starting point $x_0$, GD applies the update $x \leftarrow x - \frac{1}{L}\nabla f(x)$ with a fixed step length 1/L per iteration. In order to produce an output x that is an ε-approximate stationary point, GD needs $T = O\big(\frac{L(f(x_0) - f(x^*))}{\varepsilon}\big)$ iterations, where $x^*$ is the global minimizer of $f(\cdot)$. This is a folklore result in optimization and is included for instance in (Ghadimi & Lan, 2015).

SGD applies the update $x \leftarrow x - \eta \nabla f_i(x)$ per iteration, where i is chosen uniformly at random from $[n] \stackrel{\mathrm{def}}{=} \{1, 2, \ldots, n\}$. If η is properly tuned, one can obtain an ε-approximate stationary point in $T = O\big(\frac{L \sigma^2 (f(x_0) - f(x^*))}{\varepsilon^2}\big)$ iterations, where $\sigma^2$ is the variance of the stochastic gradients. This result was perhaps first formalized by Ghadimi and Lan (Ghadimi & Lan, 2015). Since computing the full gradient $\nabla f(\cdot)$ is usually n times slower than computing $\nabla f_i(x)$, each iteration of SGD is usually n times faster than that of GD, but SGD's total number of iterations is very poor.

Before our work, it was an open question to design a first-order method that is faster than both GD and SGD.

1.1 Our Result

We prove that variance-reduction techniques, based on the SVRG method (Johnson & Zhang, 2013), produce an ε-approximate stationary point in only $O\big(\frac{n^{2/3} L (f(x_0) - f(x^*))}{\varepsilon}\big)$ iterations. Since each iteration of SVRG is as fast as one of SGD and n times faster than one of GD, SVRG is guaranteed to be at least Ω(n^{1/3}) times faster than GD. Among first-order methods, this is the first time GD has been outperformed in theory for problem (1.1) without any additional assumption, and also the first time that stochastic-gradient-based methods are shown to have a non-trivial³ 1/ε convergence rate independent of the variance σ².

Our proposed algorithm is very analogous to the SVRG method of (Johnson & Zhang, 2013).

³ Note, however, that designing a stochastic-gradient method with a trivial 1/ε rate is obvious. For instance, it is straightforward to design such a method that converges in $O\big(\frac{n L (f(x_0) - f(x^*))}{\varepsilon}\big)$ iterations. However, this is never faster than GD.
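To make the Ω(n^{1/3}) claim explicit, one can compare total stochastic-gradient computations under a simplified accounting (using the facts stated above: one GD iteration costs n stochastic gradients, while one SVRG iteration costs O(1) amortized stochastic gradients):

$\frac{\text{GD cost}}{\text{SVRG cost}} \;=\; \frac{n \cdot O\big(\tfrac{L(f(x_0)-f(x^*))}{\varepsilon}\big)}{O\big(\tfrac{n^{2/3} L(f(x_0)-f(x^*))}{\varepsilon}\big)} \;=\; \Omega(n^{1/3}).$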
Recall that SVRG has an outer loop of epochs, where at the beginning of each epoch SVRG defines a snapshot vector $\tilde{x}$ to be the average vector of the previous epoch, and computes its full gradient $\nabla f(\tilde{x})$. Each epoch of SVRG consists of m inner iterations, where the choice of m usually depends on the objective's strong convexity. In each inner iteration inside an epoch, SVRG picks a random $i \in [n]$, defines the gradient estimator

$\tilde{\nabla}_k \stackrel{\mathrm{def}}{=} \tfrac{1}{n} \sum_{j=1}^n \nabla f_j(\tilde{x}) + \nabla f_i(x_k) - \nabla f_i(\tilde{x})$,   (1.2)

and performs the update $x \leftarrow x - \eta \tilde{\nabla}_k$ for some fixed step length η > 0 across all iterations and epochs.

In order to prove our theoretical result in this paper, we make the following changes to SVRG. First, we set the number of inner iterations m to a constant factor times n. Second, we pick the snapshot point to be a non-uniform average of the last m^{2/3} iterates of the previous epoch. Finally, we prove that the average norm $\|\nabla f(x_k)\|^2$ of the encountered vectors $x_k$ across all iterations is small, so it suffices to output $x_k$ for a random k.

Our Technique. To prove our result, we need techniques different from all known results on variance reduction. The key idea used by previous authors is to show that the variance of the gradient estimator $\tilde{\nabla}_k$ is upper bounded by either $O(f(x_k) - f(x^*))$ or $O(\|x_k - x^*\|^2)$, and therefore converges to zero for convex functions. This analysis fails to apply in the non-convex setting, because gradient-based methods do not converge to the global minimum.

We observe in this paper that the variance is upper bounded by $O(\|x_k - \tilde{x}\|^2)$, the squared distance between the current point and the most recent snapshot. By dividing an epoch into m^{1/3} sub-epochs of length m^{2/3}, and performing a mirror-descent analysis for each sub-epoch, we further show that this squared distance is related to the objective decrease $f(\tilde{x}) - f(x_k)$. This suffices to prove our theorem: whenever this squared distance is small, the objective decreases a lot due to the small variance; otherwise, if this squared distance is large, we still obtain a large objective decrease because the distance is related to $f(\tilde{x}) - f(x_k)$.

Applications. There are many machine learning problems that fall into category (1.1). To mention just two:

NON-CONVEX LOSS IN ERM. Empirical risk minimization (ERM) problems naturally fall into category (1.1) if the loss functions are non-convex. For instance, for binary classification problems, the sigmoid function, or more broadly any natural smoothed variant of the 0-1 loss function, is not only a more natural choice than artificial ones such as hinge loss, logistic loss, or squared loss, but also generalizes better in terms of testing accuracy, especially when there are outliers (Shalev-Shwartz et al., 2011). However, since sigmoid loss is non-convex, it was previously considered hard to train an ERM problem with it. Shalev-Shwartz, Shamir and Sridharan (Shalev-Shwartz et al., 2011) showed that this minimization problem is still solvable in the improper learning sense, with help from kernel methods and gradient descent. However, their theoretical convergence has a poor polynomial dependence on 1/ε and an exponential dependence on the smoothness parameter of the loss function.

Our result in this paper applies to ERM problems with non-convex loss. Suppose we are given n training examples $\{(a_1, \ell_1), \ldots, (a_n, \ell_n)\}$, where each $a_i \in \mathbb{R}^d$ is the feature vector of example i and each $\ell_i \in \{-1, +1\}$ is the binary label of example i.
By setting $\phi(t) \stackrel{\mathrm{def}}{=} \frac{1}{1 + e^{t}}$ to be the sigmoid loss function and setting $f_i(x) \stackrel{\mathrm{def}}{=} \phi(\ell_i \langle a_i, x\rangle) + \frac{\lambda}{2}\|x\|^2$, problem (1.1) becomes ℓ2-regularized ERM with sigmoid loss. We shall demonstrate in our experiment section that, when using SVRG to train ERM with sigmoid loss, the running time is as good as using SVRG to train ERM with convex loss functions, but the testing accuracy can be significantly better.

NEURAL NETWORKS. Training neural nets can also be formalized as problem (1.1). For instance, as long as the activation function of each neural node is smooth, say the sigmoid function or a smooth version of the rectified linear unit (ReLU) function (for instance, the softplus alternative), we can define $f_i(x)$ to be the training loss with respect to the i-th data input. In this language, computing the stochastic gradient $\nabla f_i(x)$ for some random $i \in [n]$ corresponds to performing one forward-backward propagation on the neural net with respect to sample i. We shall demonstrate in our experiments that using SVRG to train neural nets can enjoy a much faster running time compared to SGD or AdaGrad.

1.2 Extensions

Mini-Batch. Our result in this paper trivially extends to the mini-batch setting: if in each iteration we select $f_i(\cdot)$ for more than one random index i, then we can accordingly define the gradient estimator and the result of this paper still holds. Note that the speed-up we obtain in this case compared to gradient descent is O((n/b)^{1/3}), where b is the mini-batch size. Therefore, the smaller b is, the better sequential running time we expect to see (which is also observed in our experiments).

Other Smoothness Assumptions. Our result generalizes to the setting where each $f_i(\cdot)$ enjoys a different smoothness parameter. In this setting one needs to select the random index $i \in [n]$ from a non-uniform distribution in order to obtain a faster running time. Our result also generalizes to the upper-lower smoothness setting. Instead of requiring each $f_i(\cdot)$ to be L-smooth, one can assume it is L-upper smooth and ℓ-lower smooth, a notion introduced by (Allen-Zhu & Yuan, 2016); in such a case, faster results can also be obtained using the same proof techniques.

Sum-of-Non-Convex Objectives. Our analogous proof also applies to the sum-of-non-convex setting, which is the same as Problem (1.1) except that $f(\cdot)$ is guaranteed to be σ-strongly convex. Our obtained running time is $\widetilde{O}(n + n^{2/3} L/\sigma)$ for SVRG. This is faster than the previous running time for SVRG, which is $\widetilde{O}(n + L^2/\sigma^2)$; however, it is not faster than using SVRG+Catalyst, which gives $\widetilde{O}(n + n^{3/4}\sqrt{L/\sigma})$, see the discussion in (Allen-Zhu & Yuan, 2016). We do not include the details of this proof because it does not outperform SVRG+Catalyst.

Other Variance-Reduction Methods. Our proof generalizes to all variance-reduction methods. However, for brevity we demonstrate it only for the SVRG algorithm.

1.3 Other Related Works

For convex objectives, finding stationary points (or equivalently the global minimum) of problem (1.1) has received a lot of attention across the machine learning and optimization communities; many first-order methods (Johnson & Zhang, 2013; Shalev-Shwartz & Zhang, 2013; Schmidt et al., 2013; Defazio et al., 2014) as well as their accelerations (Shalev-Shwartz & Zhang, 2014; Lin et al., 2014; Zhang & Xiao, 2015; Allen-Zhu et al., 2016; Allen-Zhu, 2016) have been proposed in the past few years.
Even in the case when $f(\cdot)$ is convex but each $f_i(\cdot)$ is non-convex, problem (1.1) can be solved easily (Shalev-Shwartz, 2015; Garber & Hazan, 2015; Allen-Zhu & Yuan, 2016).

The results of Li and Lin (Li & Lin, 2015) and Ghadimi and Lan (Ghadimi & Lan, 2015) unify the theory of non-convex and convex optimization in the following sense. They provide general first-order schemes such that, if the parameters are tuned properly, the schemes converge (1) as fast as gradient descent in terms of finding an approximate stationary point, and (2) as fast as accelerated gradient descent (Nesterov, 2004) in terms of minimizing the objective if the function is convex. For the class (1.1), their methods are only as fast as GD; in contrast, in this paper we prove a theoretical convergence rate that is strictly faster than GD, which was previously unknown.

A few days after the first version of this paper appeared on arXiv, we became aware of another group of authors who had independently obtained essentially the same result (Reddi et al., 2016a;b).⁴

⁴ These results also address gradient-dominated functions, for which our main theorem also applies as follows. A non-convex function $f(\cdot)$ is τ-gradient dominated if $f(x) - f(x^*) \le \tau \|\nabla f(x)\|^2$ for every x. Since our main theorem implies one can obtain x satisfying $\|\nabla f(x)\|^2 \le \frac{1}{2\tau}(f(x_0) - f(x^*))$ using $O(n + n^{2/3} L \tau)$ stochastic gradients, by repeating this $\log_2(1/\varepsilon)$ times we obtain an ε-minimizer of $f(\cdot)$ using $O\big((n + n^{2/3} L \tau) \log(1/\varepsilon)\big)$ stochastic gradients.

Algorithm 1  Simplified SVRG method in the non-convex setting

Input: a starting vector $x_\phi$, the number of epochs S, the number of iterations per epoch m, and a step length η.
 1: $x^1_0 \leftarrow x_\phi$
 2: for $s \leftarrow 1$ to $S$ do
 3:   $\tilde{\mu} \leftarrow \nabla f(x^s_0)$
 4:   for $k \leftarrow 0$ to $m - 1$ do
 5:     Pick i uniformly at random in $\{1, \ldots, n\}$.
 6:     $\tilde{\nabla} \leftarrow \nabla f_i(x^s_k) - \nabla f_i(x^s_0) + \tilde{\mu}$
 7:     $x^s_{k+1} \leftarrow x^s_k - \eta \tilde{\nabla}$
 8:   end for
 9:   $x^{s+1}_0 \leftarrow x^s_m$
10: end for
11: return $x^s_k$ for a random $s \in \{1, 2, \ldots, S\}$ and a random $k \in \{1, 2, \ldots, m\}$.

2 Notations and Algorithm

A differentiable function $f : \mathbb{R}^n \to \mathbb{R}$ is L-smooth if for all pairs $x, y \in \mathbb{R}^n$ it satisfies $\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\|$. An equivalent definition says that for all $x, y \in \mathbb{R}^n$:

$-\frac{L}{2}\|x - y\|^2 \le f(x) - f(y) - \langle \nabla f(y), x - y \rangle \le \frac{L}{2}\|x - y\|^2$.   (2.1)

The main body of this paper proves our result under three simplifying assumptions (Simplifications 4.2, 4.3 and 4.4) that are not literally true, for the sake of sketching the high-level intuition and highlighting the differences between our proof and known results. Our formal convergence proof is quite technical and is included in the full paper.

In this high-level proof, we consider Algorithm 1, a simplified version of our final SVRG method for the non-convex setting. Note that in Algorithm 1 both the snapshot point and the starting iterate $x^s_0$ of the s-th epoch are chosen as the last iterate of the previous epoch.

Remark 2.1. In our final proof, we instead choose $x^s_0$ to be a weighted average of the last m^{2/3} iterates from the previous epoch. See Algorithm 2 in the full paper. We demonstrate in Section 6 that this is also a better choice in practice.

Throughout this paper we denote by $x^s_k$ the k-th iterate of the s-th epoch, by $\nabla^s_k = \nabla f(x^s_k)$ the full gradient at this iterate, and by $\tilde{\nabla}^s_k = \nabla f(x^s_0) + \nabla f_i(x^s_k) - \nabla f_i(x^s_0)$ the gradient estimator, which clearly satisfies $\mathbb{E}_i[\tilde{\nabla}^s_k] = \nabla^s_k$. We denote by $i^s_k$ the random index i chosen at iteration k of epoch s. We also denote by $(\sigma^s_k)^2 \stackrel{\mathrm{def}}{=} \|\nabla^s_k - \tilde{\nabla}^s_k\|^2$ the variance of the gradient estimator $\tilde{\nabla}^s_k$.
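As a concrete companion to the pseudocode of Algorithm 1 above, here is a minimal Python/NumPy sketch of the same procedure. The function name and the per-example gradient oracle grad_fi are illustrative assumptions, not part of the paper.

```python
import numpy as np

def svrg_nonconvex(grad_fi, x0, n, S, m, eta, rng=None):
    """Minimal sketch of Algorithm 1 (simplified SVRG, non-convex setting).

    grad_fi(x, i) is assumed to return the gradient of f_i at x; this oracle
    and all variable names here are illustrative, not from the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float).copy()
    iterates = []                                   # keep iterates so we can return a random one
    for _ in range(S):                              # outer loop over epochs
        snapshot = x.copy()                         # x^s_0, also used as the snapshot point
        mu = sum(grad_fi(snapshot, i) for i in range(n)) / n   # full gradient at the snapshot
        for _ in range(m):                          # inner loop
            i = rng.integers(n)                     # pick i uniformly at random
            g = grad_fi(x, i) - grad_fi(snapshot, i) + mu      # gradient estimator (1.2)
            x = x - eta * g                         # update with fixed step length eta
            iterates.append(x)
    return iterates[rng.integers(len(iterates))]    # output a uniformly random iterate
```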
Under these notations, our simplified SVRG algorithm in Algorithm 1 simply performs the update $x^s_{k+1} \leftarrow x^s_k - \eta \tilde{\nabla}^s_k$ for a fixed step length η > 0 that shall be specified later. Since we focus mostly on analyzing a single epoch, when it is clear from the context we drop the superscript s and write $x_k, i_k, \nabla_k, \tilde{\nabla}_k, \sigma^2_k$ for $x^s_k, i^s_k, \nabla^s_k, \tilde{\nabla}^s_k, (\sigma^s_k)^2$ respectively. We also write $\nabla^2_k \stackrel{\mathrm{def}}{=} \|\nabla_k\|_2^2$, $\sigma^2_{i:j} \stackrel{\mathrm{def}}{=} \sum_{k=i}^{j} \sigma^2_k$ and $\nabla^2_{i:j} \stackrel{\mathrm{def}}{=} \sum_{k=i}^{j} \nabla^2_k$ for notational simplicity.

3 Two Useful Lemmas

We first observe two simple lemmas. The proofs of both can be found in the full paper. The first one describes the expected objective decrease between two consecutive iterations. This is a standard step used in analyzing gradient descent for smooth functions, and we additionally take into account the variance of the gradient estimator.

Lemma 3.1 (gradient descent). If $x_{k+1} = x_k - \eta \tilde{\nabla}_k$ for some gradient estimator $\tilde{\nabla}_k$ satisfying $\mathbb{E}[\tilde{\nabla}_k] = \nabla_k = \nabla f(x_k)$, and if the step length satisfies $\eta \le \frac{1}{L}$, then

$f(x_k) - \mathbb{E}[f(x_{k+1})] \ge \frac{\eta}{2} \nabla^2_k - \frac{\eta^2 L}{2} \mathbb{E}[\sigma^2_k]$.

The next lemma follows from the classical analysis of mirror-descent methods. However, we make novel use of it on top of a non-convex but smooth function.

Lemma 3.2 (mirror descent). If $x_{k+1} = x_k - \eta \tilde{\nabla}_k$ for some gradient estimator $\tilde{\nabla}_k$ satisfying $\mathbb{E}[\tilde{\nabla}_k] = \nabla_k = \nabla f(x_k)$, then for every $u \in \mathbb{R}^d$ it satisfies

$f(x_k) - f(u) \le \frac{\eta}{2}\big(\nabla^2_k + \mathbb{E}[\sigma^2_k]\big) + \big(\frac{1}{2\eta} + \frac{L}{2}\big)\|x_k - u\|^2 - \frac{1}{2\eta}\mathbb{E}\|x_{k+1} - u\|^2$.

Our main theorem is motivated by the linear-coupling framework (Allen-Zhu & Orecchia, 2014). In particular, we linearly couple the above gradient-descent and mirror-descent lemmas, together with a variance upper-bound lemma described in the next section.

4 Upper Bounding the Variance

High-Level Ideas. The key idea behind all variance-reduction literature (such as SVRG (Johnson & Zhang, 2013), SAGA (Defazio et al., 2014), and SAG (Schmidt et al., 2013)) is to prove that the variance $\mathbb{E}[(\sigma^s_k)^2]$ decreases as s or k increases. However, the only known technique to achieve this is to upper bound $\mathbb{E}[(\sigma^s_k)^2]$ essentially by $O(f(x^s_k) - f(x^*))$, the objective distance to the minimum. Perhaps the only exception is the work on sum-of-non-convex but strongly convex objectives (Shalev-Shwartz, 2015; Allen-Zhu & Yuan, 2016), where the authors upper bound $\mathbb{E}[(\sigma^s_k)^2]$ by $O(\|x^s_k - x^*\|^2)$, the squared vector distance to the minimum. Such techniques fail to apply in our non-convex setting, because gradient-descent based methods do not necessarily converge to the global minimum.

We take a different path in this paper. We upper bound $\mathbb{E}[(\sigma^s_k)^2]$ by $O(\|x^s_k - x^s_0\|^2)$, the squared vector distance between the current vector $x^s_k$ and the first vector (i.e., the snapshot) $x^s_0$ of the current epoch s. This is certainly tighter than the $O(\|x^s_k - x^*\|^2)$ bound from prior work.⁵ Moreover, the less we move away from the snapshot, the better upper bound we obtain on the variance. This is conceptually different from all existing literature.

Furthermore, we in turn argue that $\|x^s_k - x^s_0\|^2$ is at most some constant times the objective decrease $f(x^s_0) - f(x^s_k)$. To prove this, we wish to select $u = x^s_0$ in Lemma 3.2 and telescope it over multiple iterations k, ideally over all iterations k within the same epoch. This is possible for convex objectives but impossible for non-convex ones. More specifically, the ratio between $\big(\frac{1}{2\eta} + \frac{L}{2}\big)$ and $\frac{1}{2\eta}$ can be much larger than 1, preventing us from telescoping more than O(1/ηL) iterations in any meaningful manner (see (4.1)). In contrast, this ratio would be identical to 1 in the convex setting, or even smaller than 1 in the strongly convex setting.
⁵ This new technique has also been applied to convex settings recently (Allen-Zhu, 2016).

For this reason, we define $\eta = \frac{1}{m^{2/3} L}$ and divide each epoch into O(m^{1/3}) sub-epochs, each consisting of O(m^{2/3}) consecutive iterations. Now we can telescope Lemma 3.2 over all iterations within a sub-epoch, because $m^{2/3} \le O(1/\eta L)$. Finally, we use vector inequalities (see (4.5)) to combine these sub-epoch upper bounds into an upper bound over the entire epoch.

Technical Details. We choose $\eta = \frac{1}{m_0 L}$ for some parameter $m_0$ that divides m. We will set $m_0$ to be $m^{2/3}$, and the reason will become clear at the end of this section. Define $d = m / m_0$, so an epoch is divided into d sub-epochs. We make the following parameter choices.

Definition 4.1. Define $\beta_0 = 1$ and $\beta_t \stackrel{\mathrm{def}}{=} (1 + \eta L)^{-t} = (1 + 1/m_0)^{-t}$ for $t = 1, \ldots, m_0 - 1$. We have $1 \ge \beta_t \ge 1/e > 1/3$.

For each $k = 0, 1, \ldots, m - m_0$, we sum up Lemma 3.2 over iterations $k, k+1, \ldots, k+m_0-1$ with multiplicative weights $\beta_0, \ldots, \beta_{m_0-1}$ respectively. The norm-square terms telescope in this summation, and we arrive at

$\sum_{t=0}^{m_0-1} \beta_t \big(f(x_{k+t}) - f(u)\big) \le \frac{\eta}{2} \sum_{t=0}^{m_0-1} \beta_t \big(\nabla^2_{k+t} + \sigma^2_{k+t}\big) + \big(\frac{1}{2\eta} + \frac{L}{2}\big)\|x_k - u\|^2 - \frac{\beta_{m_0-1}}{2\eta}\|x_{k+m_0} - u\|^2$.   (4.1)

Simplification 4.2. Since the weights $\beta_0, \ldots, \beta_{m_0-1}$ are within a constant factor of each other, let us assume for simplicity that they are all equal to 1.

If we choose $u = x_k$ and assume $\beta_t = 1$ for all t, we can rewrite (4.1) as

$\frac{1}{m_0} \sum_{t=0}^{m_0-1} f(x_{k+t}) - f(x_k) \le \frac{\eta}{2} \cdot \frac{1}{m_0}\big(\nabla^2_{k:k+m_0-1} + \sigma^2_{k:k+m_0-1}\big) - \frac{1}{6\eta m_0}\|x_{k+m_0} - x_k\|^2$.   (4.2)

Simplification 4.3. Since the left-hand side of (4.2) describes the average of the objective values $f(x_k), f(x_{k+1}), \ldots, f(x_{k+m_0-1})$, which is hard to analyze, let us assume for simplicity that it can be replaced with the last iterate of this sub-epoch, that is,

$f(x_{k+m_0}) - f(x_k) \le \frac{\eta}{2} \cdot \frac{1}{m_0}\big(\nabla^2_{k:k+m_0-1} + \sigma^2_{k:k+m_0-1}\big) - \frac{1}{6\eta m_0}\|x_{k+m_0} - x_k\|^2$.   (4.3)

Using the above inequality, we provide a novel upper bound on the variance of the gradient estimator:

$\mathbb{E}[\sigma^2_t] = \mathbb{E}_{i_t}\big\|\big(\nabla f_{i_t}(x_t) - \nabla f_{i_t}(x_0)\big) - \big(\nabla f(x_t) - \nabla f(x_0)\big)\big\|^2 \le \mathbb{E}_{i_t}\big\|\nabla f_{i_t}(x_t) - \nabla f_{i_t}(x_0)\big\|^2 \le L^2 \|x_t - x_0\|^2$.   (4.4)

Above, the first inequality is because for any random vector $\zeta \in \mathbb{R}^d$ it holds that $\mathbb{E}\|\zeta - \mathbb{E}\zeta\|^2 = \mathbb{E}\|\zeta\|^2 - \|\mathbb{E}\zeta\|^2$, and the second inequality is by the smoothness of each $f_i(\cdot)$.

In particular, for $t = m$, we can upper bound $\sigma^2_m$ using (4.4) and multiple applications of (4.3):

$\mathbb{E}[\sigma^2_m] \le L^2 \mathbb{E}\|x_m - x_0\|^2 \le L^2 d \, \mathbb{E}\big[\|x_m - x_{m-m_0}\|^2 + \|x_{m-m_0} - x_{m-2m_0}\|^2 + \cdots + \|x_{m_0} - x_0\|^2\big] \le L^2 d \, \mathbb{E}\big[6\eta m_0 \big(f(x_0) - f(x_m)\big) + 3\eta^2\big(\nabla^2_{0:m-1} + \sigma^2_{0:m-1}\big)\big]$.   (4.5)

Above, the first inequality follows from (4.4), the second from the vector inequality $\|v_1 + \cdots + v_d\|^2 \le d\big(\|v_1\|^2 + \cdots + \|v_d\|^2\big)$, and the third from (4.3).

Simplification 4.4. Suppose that (4.5) holds not only for $\sigma^2_m$ but actually for all of $\sigma^2_0, \ldots, \sigma^2_{m-1}$; then

$\frac{1}{m} \mathbb{E}[\sigma^2_{0:m-1}] \le L^2 d \, \mathbb{E}\big[6\eta m_0 \big(f(x_0) - f(x_m)\big) + 3\eta^2\big(\nabla^2_{0:m-1} + \sigma^2_{0:m-1}\big)\big]$.   (4.6)

As long as $3\eta^2 L^2 d \le \frac{1}{2m}$, (4.6) further implies

$\frac{1}{m} \mathbb{E}[\sigma^2_{0:m-1}] \le 12\eta m_0 L^2 d \, \mathbb{E}\big[f(x_0) - f(x_m) + \frac{\eta}{2m_0}\nabla^2_{0:m-1}\big]$.   (4.7)

This achieves the goal of this section, which is to provide an upper bound (4.7) on the (average) variance by a constant times the objective difference $f(x_0) - f(x_m)$.

5 Final Theorem

By applying the gradient-descent guarantee of Lemma 3.1 to the entire epoch, we obtain

$f(x_0) - \mathbb{E}[f(x_m)] \ge \mathbb{E}\big[\frac{\eta}{2}\nabla^2_{0:m-1} - \frac{\eta^2 L}{2}\sigma^2_{0:m-1}\big]$.

Combining this with the variance upper bound (4.7), we immediately have

$f(x_0) - \mathbb{E}[f(x_m)] \ge \frac{\eta}{2}\mathbb{E}[\nabla^2_{0:m-1}] - 6\eta^3 m_0 m L^3 d \, \mathbb{E}\big[f(x_0) - f(x_m) + \frac{\eta}{2m_0}\nabla^2_{0:m-1}\big]$.   (5.1)

In other words, as long as $6\eta^3 m_0 m L^3 d \le \frac{1}{2}$, we arrive at

$f(x_0) - \mathbb{E}[f(x_m)] \ge \frac{\eta}{6}\mathbb{E}[\nabla^2_{0:m-1}]$.   (5.2)
Note that (5.2) is only for a single epoch and, in the general notation, can be written as $f(x^s_0) - \mathbb{E}[f(x^s_m)] \ge \frac{\eta}{6}\mathbb{E}\big[\sum_{t=0}^{m-1}\|\nabla f(x^s_t)\|^2\big]$. Therefore, we can telescope it over all epochs $s = 1, 2, \ldots, S$. Since we have chosen $x^s_0$, the initial vector of epoch s, to be $x^{s-1}_m$, the last vector of the previous epoch, we obtain

$\frac{1}{Sm}\sum_{s=1}^{S}\sum_{t=0}^{m-1}\mathbb{E}\|\nabla f(x^s_t)\|^2 \le \frac{6}{\eta S m}\big(f(x^1_0) - \mathbb{E}[f(x^S_m)]\big) \le \frac{6\big(f(x_\phi) - \min_x f(x)\big)}{\eta S m}$.

At this point, if we randomly select $s \in [S]$ and $t \in [m]$ at the end of the algorithm, we conclude that

$\mathbb{E}\big[\|\nabla f(x^s_t)\|^2\big] \le \frac{6\big(f(x_\phi) - \min_x f(x)\big)}{\eta S m}$.

(We remark that outputting a randomly selected iterate is a common step also used in GD or SGD analyses for non-convex optimization. This step is often unnecessary in practice.)

Finally, let us choose the parameters properly. We simply let m = n be the epoch length. Since we required $3\eta^2 L^2 d \le \frac{1}{2m}$ and $6\eta^3 m_0 m L^3 d \le \frac{1}{2}$ in the previous section, and both requirements are satisfied when $m_0^3 \ge 12 m^2$, we set $m_0 = \Theta(m^{2/3}) = \Theta(n^{2/3})$ and accordingly $\eta = \frac{1}{m_0 L} = \Theta\big(\frac{1}{n^{2/3} L}\big)$. In sum:

Theorem 5.1. Under Simplifications 4.2, 4.3 and 4.4, by choosing m = n and $\eta = \Theta\big(\frac{1}{n^{2/3} L}\big)$, the output x of Algorithm 1 satisfies⁶

$\mathbb{E}\big[\|\nabla f(x)\|^2\big] \le O\Big(\frac{L\big(f(x_\phi) - \min_x f(x)\big)}{S\, n^{1/3}}\Big)$.

In other words, to obtain a point x satisfying $\|\nabla f(x)\|^2 \le \varepsilon$, the total number of iterations needed by Algorithm 1 is

$Sn = O\Big(\frac{n^{2/3} L \big(f(x_\phi) - \min_x f(x)\big)}{\varepsilon}\Big)$.

⁶ As with SGD, one can easily apply a Markov inequality to conclude that with probability at least 2/3 we have the same asymptotic upper bound on the deterministic quantity $\|\nabla f(x)\|^2$.

The amortized per-iteration complexity of SVRG is at most twice that of SGD. Therefore, this is a factor of Ω(n^{1/3}) faster than the full gradient descent method for solving (1.1).

6 Experiments

6.1 Empirical Risk Minimization with Non-Convex Loss

We consider binary classification on four standard datasets that can be found on the LibSVM website (Fan & Lin):

- the adult (a9a) dataset (32,561 training samples, 16,281 testing samples, and 123 features);
- the web (w8a) dataset (49,749 training samples, 14,951 testing samples, and 300 features);
- the rcv1 (rcv1.binary) dataset (20,242 training samples, 677,399 testing samples, and 47,236 features);
- the mnist (class 1) dataset (60,000 training samples, 10,000 testing samples, and 780 features).

Accuracy Experiment. In the first experiment we apply SVRG to training the ℓ2-regularized ERM problem with six loss functions: logistic loss, squared loss, smoothed hinge loss (with smoothing parameters 0.01, 0.1 and 1 respectively), and smoothed zero-one loss (also known as sigmoid loss).⁷ We wish to see how non-convex losses compare to convex ones in terms of testing accuracy (and thus in terms of generalization error). For each of the four datasets, we also randomly flip a 1/4 fraction, a 1/8 fraction, or none of the training example labels. The purpose of this manipulation is to introduce outliers to the training set. We therefore have 4 × 3 = 12 datasets in total.

⁷ For the sigmoid loss, we scale it properly so that its smoothness parameter is exactly 1. Unlike for hinge loss, it is unnecessary to consider sigmoid loss with different smoothness parameters: one can carefully verify that scaling the weight of the ℓ2 regularizer up or down is equivalent to changing the smoothness of the sigmoid loss.

We choose epoch length m = 2n, as suggested by the SVRG paper for ERM experiments, and use the simple Algorithm 1 for both convex and non-convex loss functions. We present the accuracy results partially in Figure 1 (the complete plots are in the full paper). The y-axis represents the classification testing accuracy, and the x-axis represents the number of passes over the dataset. (Each iteration of SVRG counts as a 1/n pass and the full-gradient computation of SVRG counts as 1 pass.)
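For concreteness, here is a minimal sketch of the per-example sigmoid-loss ERM objective $f_i$ from Section 1.1 and its gradient, which is the kind of stochastic-gradient oracle the SVRG updates consume in this experiment. The function and variable names are illustrative assumptions, and the rescaling of the loss to normalize its smoothness (as done in the experiments) is omitted.

```python
import numpy as np

def sigmoid_loss_erm_fi(x, a_i, l_i, lam):
    """f_i(x) = phi(l_i * <a_i, x>) + (lam/2) * ||x||^2, with phi(t) = 1 / (1 + e^t).

    Names are illustrative; loss rescaling used in the experiments is omitted.
    """
    t = l_i * np.dot(a_i, x)
    phi = 1.0 / (1.0 + np.exp(t))
    return phi + 0.5 * lam * np.dot(x, x)

def sigmoid_loss_erm_grad_fi(x, a_i, l_i, lam):
    """Gradient of f_i: phi'(t) * l_i * a_i + lam * x, where phi'(t) = -phi(t) * (1 - phi(t))."""
    t = l_i * np.dot(a_i, x)
    phi = 1.0 / (1.0 + np.exp(t))
    dphi = -phi * (1.0 - phi)          # derivative of 1/(1 + e^t)
    return dphi * l_i * a_i + lam * x
```

A per-example gradient oracle of this form is exactly what the gradient estimator (1.2) consumes, e.g. it can be plugged into the SVRG sketch given after Section 2.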
These plots are produced using a fair and careful parameter-tuning and parameter-validation procedure, described in the following four steps.

Step I: for each of the 12 datasets, we randomly partition the training samples into a training set of size 4/5 and a validation set of size 1/5.

Step II: for each of the 12 datasets and each loss function, we enumerate over 10 choices of λ, the regularization parameter. For each λ, we tune SVRG on the training set with different step lengths η and choose the best η, i.e., the one that gives the fastest training speed.

Step III: for each of the 12 datasets and each loss function, we tune the best λ using the validation set. That is, we use the selected η from Step II to train the linear predictor and apply it to the validation set to obtain the testing accuracy. We then select the λ that gives the best testing accuracy on the validation set.

Step IV: for each of the 12 datasets and each loss function, we apply the validated linear predictor to the testing set and present the result in Figure 1 and Figure 4.

We make the following observations from this experiment.

- Although sigmoid loss is only comparable to hinge loss or logistic loss on the "no flip" datasets, when the input has a lot of outliers (see "flip 1/8" and "flip 1/4"), sigmoid loss is undoubtedly the best choice.
- Squared loss is almost always dominated, because it is not necessarily a good choice for binary classification.
- The running times needed by SVRG on these datasets are quite comparable, regardless of whether the loss function is convex or not.

[Figure 1. Testing accuracy comparison of SVRG on ℓ2-regularized ERM with different loss functions; panels: (a) web, no flip; (b) web, flip 1/8 fraction; (c) web, flip 1/4 fraction. The full plots for all 4 datasets can be found in the full paper. Black solid curves represent sigmoid loss, blue dashed curves represent squared loss, green dash-dotted curves represent logistic loss, and the three red dotted curves represent hinge loss with 3 different smoothing parameters.]

Running-Time Experiment. In this second experiment, we fix the regularization parameter λ and compare the training objective convergence of SGD and SVRG for sigmoid loss only.⁸ We choose four different values of λ per dataset and present our plots partially in Figure 2 (the complete plots can be found in the full paper). In these plots, the y-axis represents the training objective value, and the x-axis represents the number of passes over the dataset.

⁸ This experiment uses all training samples for the minimization, since there is no need to perform validation.

For a fair comparison we need to tune the step length η for each dataset and each choice of λ. For SGD, we enumerate over polynomial learning rates $\eta_k = \alpha \cdot (1 + k/n)^{-\beta}$, where k is the number of iterations passed; we made 10 choices of α, considered β = 0, 0.1, ..., 1.0, and selected the learning rate that gives the fastest convergence. For SVRG, we first consider the vanilla SVRG using a constant η throughout all iterations, and select the best η, i.e., the one that gives the fastest convergence. This curve is presented in dashed blue in Figure 5.
We also implement SVRG with polynomial learning rates $\eta_k = \alpha \cdot (1 + k/n)^{-\beta}$, tune the best α, β parameters, and present the results as dashed black curves in Figure 5.

We make the following observations from this experiment.

- Consistent with the theory, SVRG is not necessarily a better choice than SGD for large training error ε. However, SVRG enjoys very fast convergence, especially for small ε.
- The smaller λ is, the more non-convex the objective function becomes. We see that SGD performs more poorly than SVRG in these cases.
- With only one exception (dataset web with λ = 10⁻⁶), choosing a polynomial learning rate does not seem necessary for improving the running time when training ERM problems with non-convex loss. Although not presented in Figure 5, the best-tuned polynomial learning rates for SVRG have smaller exponents β than those for SGD in each of the 16 plots.

[Figure 2. Training error comparison between SGD and SVRG on ℓ2-regularized ERM with sigmoid loss; panels: (a) web, λ = 10⁻³; (b) web, λ = 10⁻⁴; (c) web, λ = 10⁻⁵; (d) web, λ = 10⁻⁶. The full plots can be found in the full paper. The best-tuned SGD is shown in solid green, the best-tuned SVRG with constant step length in dashed blue, and the best-tuned SVRG in dotted black.]

6.2 Neural Network

We consider the multi-class (in fact, 10-class) classification problem on CIFAR-10 (60,000 training samples) and MNIST (10,000 training samples), two standard image datasets for neural-net studies. We construct a toy two-layered neural net with (1) 64 hidden nodes in the first layer, each connecting to a uniformly distributed 4x4 or 5x5 pixel block of the input image and having a smoothed ReLU (also known as softplus) activation function; and (2) 10 output nodes in the second layer, fully connected to the first layer, each representing one of the ten classification outputs. We train such neural networks with the multi-class logistic loss, which is a function of the 10 outputs and the correct label. For each of the two datasets, we consider both the unregularized version and the ℓ2-regularized version with weight 10⁻³ for CIFAR-10 and 10⁻⁴ for MNIST, two parameters suggested by (Johnson & Zhang, 2013).

We implement two classical algorithms: stochastic gradient descent (SGD) with the best-tuned polynomial learning rate, and the adaptive subgradient method (AdaGrad) of (McMahan & Streeter, 2010; Duchi et al., 2011), which is essentially SGD with an adaptive learning rate. We choose a mini-batch size of 100 for both methods.

We consider four variants of SVRG, all of which use epoch length m = 5n/b, where b is the mini-batch size:

- SVRG-1, the simple Algorithm 1 with a best-tuned polynomial learning rate and b = 100.
- SVRG-2, our full Algorithm 2 with a best-tuned polynomial learning rate and b = 100.⁹
- SVRG-3, using an adaptive learning rate (similar to AdaGrad) on top of SVRG-2, with b = 100.
- SVRG-4, same as SVRG-3 but with b = 16.

⁹ That is, we set the initial vector of each epoch to be a weighted average of the last (m/b)^{2/3} vectors from the previous epoch.

Our training-error performance is presented in Figure 3. We also include the testing accuracy in Figure 6 in the appendix. In these plots the y-axis represents the training objective value, and the x-axis represents the number of passes over the dataset. Each iteration of SGD or SVRG counts as b/n passes over the dataset, and the snapshot full-gradient computation counts as 1 pass.¹⁰ From the plots we clearly see a performance advantage of the SVRG-based algorithms as compared to SGD or AdaGrad.

[Figure 3. Training error comparison on neural nets; panels: (a) cifar-10, λ = 0; (b) cifar-10, λ = 10⁻³; (c) mnist, λ = 0; (d) mnist, λ = 10⁻⁴. Each panel compares SVRG-1, SVRG-2, SVRG-3, SVRG-4, SGD and AdaGrad.]

¹⁰ The number of passes over the dataset is a traditional unit for comparing stochastic methods. For ERM problems, it is natural to count each iteration of SVRG as b/n passes of the data rather than 2b/n, because the computation of $\nabla f_i(\tilde{x})$ is free if one efficiently stores $\nabla f_i(\tilde{x})$ when the full gradient is computed at $\tilde{x}$. However, after our paper appeared online, we noticed that this measurement may not be fair to SGD when training neural networks, because it is memory-inefficient to store $\nabla f_i(\tilde{x})$ when $f_i$ comes from a large-scale neural network. For this reason, the sequential per-iteration cost of SVRG can be a factor (2 + 1/5)/(1 + 1/5) = 11/6 greater than that of SGD. Nevertheless, the extra cost of computing $\nabla f_i(\tilde{x})$ is totally parallelizable (and can be viewed as doubling the mini-batch size), so this may not affect the GPU-based running time of SVRG by much. We leave running SVRG on large-scale neural networks as future work, because it is beyond the scope of this paper.

Furthermore, we observe that the following three features on top of SVRG could further improve its running time:

1. Comparing SVRG-2 with SVRG-1, we see that setting the epoch initial vector to be a weighted average of the last few iterates of the previous epoch is recommended.
2. Comparing SVRG-3 with SVRG-2, we see that using adaptive learning rates, rather than tuning the best polynomial learning rate, is recommended.
3. Comparing SVRG-4 with SVRG-3, we see that a smaller mini-batch size is recommended in terms of total complexity. In contrast, reducing the mini-batch size is discouraged for SGD or AdaGrad, because the variance could blow up and the performance would degrade (this is also observed in our experiments but not included in the plots).

We hope that the above observations provide new insights for experimentalists working on deep learning.

Acknowledgements

E. Hazan acknowledges support from the National Science Foundation grant IIS-1523815 and a Google research award. Z. Allen-Zhu acknowledges support from a Microsoft research award, no. 0518584.

References

Allen-Zhu, Zeyuan. Katyusha: The First Truly Accelerated Stochastic Gradient Method. arXiv e-prints, abs/1603.05953, March 2016.

Allen-Zhu, Zeyuan and Orecchia, Lorenzo. Linear coupling: An ultimate unification of gradient and mirror descent. arXiv e-prints, abs/1407.1537, July 2014.

Allen-Zhu, Zeyuan and Yuan, Yang. Improved SVRG for Non-Strongly-Convex or Sum-of-Non-Convex Objectives. In ICML, 2016.

Allen-Zhu, Zeyuan, Richtárik, Peter, Qu, Zheng, and Yuan, Yang. Even Faster Accelerated Coordinate Descent Using Non-Uniform Sampling. In ICML, 2016.

Arora, Sanjeev, Ge, Rong, Halpern, Yonatan, Mimno, David M., Moitra, Ankur, Sontag, David, Wu, Yichen, and Zhu, Michael. A practical algorithm for topic modeling with provable guarantees. In ICML, pp. 280–288, 2013.
Arora, Sanjeev, Ge, Rong, and Moitra, Ankur. New algorithms for learning incoherent and overcomplete dictionaries. In COLT, pp. 779–806, 2014.

Defazio, Aaron, Bach, Francis, and Lacoste-Julien, Simon. SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives. In NIPS, 2014.

Duchi, John, Hazan, Elad, and Singer, Yoram. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121–2159, 2011.

Fan, Rong-En and Lin, Chih-Jen. LIBSVM Data: Classification, Regression and Multi-label. Accessed: 2015-06.

Garber, Dan and Hazan, Elad. Fast and simple PCA via convex optimization. arXiv e-prints, September 2015.

Ge, Rong, Huang, Furong, Jin, Chi, and Yuan, Yang. Escaping from saddle points: online stochastic gradient for tensor decomposition. In COLT, 2015.

Ghadimi, Saeed and Lan, Guanghui. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathematical Programming, pp. 1–26, February 2015. ISSN 0025-5610.

Hazan, Elad. DRAFT: Introduction to online convex optimization. Foundations and Trends in Machine Learning, XX(XX):1–168, 2015.

Hsu, Daniel, Kakade, Sham M., and Zhang, Tong. A spectral algorithm for learning hidden Markov models. J. Comput. Syst. Sci., 78(5):1460–1480, 2012.

Johnson, Rie and Zhang, Tong. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, pp. 315–323, 2013.

Li, Huan and Lin, Zhouchen. Accelerated Proximal Gradient Methods for Nonconvex Programming. In NIPS, 2015.

Lin, Qihang, Lu, Zhaosong, and Xiao, Lin. An Accelerated Proximal Coordinate Gradient Method and its Application to Regularized Empirical Risk Minimization. In NIPS, pp. 3059–3067, 2014.

McMahan, H. Brendan and Streeter, Matthew. Adaptive Bound Optimization for Online Convex Optimization. In Proceedings of the 23rd Annual Conference on Learning Theory (COLT 2010), February 2010.

Nesterov, Yurii. Introductory Lectures on Convex Programming, Volume I: A Basic Course. Kluwer Academic Publishers, 2004. ISBN 1402075537.

Reddi, Sashank J., Hefny, Ahmed, Sra, Suvrit, Poczos, Barnabas, and Smola, Alex. Stochastic variance reduction for nonconvex optimization. arXiv e-prints, abs/1603.06160, March 2016a.

Reddi, Sashank J., Sra, Suvrit, Poczos, Barnabas, and Smola, Alex. Fast incremental method for nonconvex optimization. arXiv e-prints, abs/1603.06159, March 2016b.

Schmidt, Mark, Le Roux, Nicolas, and Bach, Francis. Minimizing finite sums with the stochastic average gradient. arXiv preprint arXiv:1309.2388, pp. 1–45, 2013. Preliminary version appeared in NIPS 2012.

Shalev-Shwartz, Shai. SDCA without Duality. arXiv preprint arXiv:1502.06177, pp. 1–7, 2015.

Shalev-Shwartz, Shai and Zhang, Tong. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14:567–599, 2013.

Shalev-Shwartz, Shai and Zhang, Tong. Accelerated Proximal Stochastic Dual Coordinate Ascent for Regularized Loss Minimization. In ICML, pp. 64–72, 2014.

Shalev-Shwartz, Shai, Shamir, Ohad, and Sridharan, Karthik. Learning kernel-based halfspaces with the 0-1 loss. SIAM Journal on Computing, 40(6):1623–1646, 2011.

Zhang, Yuchen and Xiao, Lin. Stochastic Primal-Dual Coordinate Method for Regularized Empirical Risk Minimization. In ICML, 2015.