# Almost Tune-Free Variance Reduction

Bingcong Li¹, Lingda Wang², Georgios B. Giannakis¹

¹University of Minnesota, MN, USA. ²University of Illinois at Urbana-Champaign, IL, USA. Correspondence to: Bingcong Li.

*Proceedings of the 37th International Conference on Machine Learning*, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

**Abstract.** The variance reduction class of algorithms, including the representative SVRG and SARAH, has well-documented merits for empirical risk minimization problems. However, these methods require grid search to tune parameters (step size and the number of iterations per inner loop) for optimal performance. This work introduces almost tune-free SVRG and SARAH schemes equipped with: i) Barzilai-Borwein (BB) step sizes; ii) averaging; and iii) an inner loop length adjusted to the BB step sizes. In particular, SVRG, SARAH, and their BB variants are first reexamined through an estimate sequence lens to enable new averaging methods that tighten their convergence rates theoretically, and improve their performance empirically when the step size or the inner loop length is chosen large. Then, a simple yet effective means of adjusting the number of iterations per inner loop is developed to enhance the merits of the proposed averaging schemes and BB step sizes. Numerical tests corroborate the proposed methods.

## 1. Introduction

In this work, we deal with the frequently encountered empirical risk minimization (ERM) problem expressed as

$$\min_{x \in \mathbb{R}^d} \; f(x) := \frac{1}{n}\sum_{i \in [n]} f_i(x) \tag{1}$$

where $x \in \mathbb{R}^d$ is the parameter vector to be learned from data; the set $[n] := \{1, 2, \ldots, n\}$ collects data indices; and $f_i$ is the loss function associated with datum $i$. Suppose that $f$ is $\mu$-strongly convex and has an $L$-Lipschitz continuous gradient. The condition number of $f$ is denoted by $\kappa := L/\mu$. Throughout, $x^*$ denotes the optimal solution of (1).

The standard approach to solving (1) is gradient descent (GD) (Nesterov, 2004), which updates the decision variable via $x_{k+1} = x_k - \eta \nabla f(x_k)$, where $k$ is the iteration index and $\eta$ the step size (or learning rate). For a strongly convex $f$, GD converges linearly to $x^*$, that is, $\|x_k - x^*\|^2 \le (c_\kappa)^k \|x_0 - x^*\|^2$ for some $\kappa$-dependent constant $c_\kappa \in (0, 1)$ (Nesterov, 2004). In the big data regime, however, where $n$ is large, obtaining the full gradient per iteration can be computationally prohibitive. To cope with this, stochastic gradient descent (SGD) reduces the computational burden by drawing uniformly at random an index $i_k \in [n]$ per iteration $k$, and adopting $\nabla f_{i_k}(x_k)$ as an estimate of $\nabla f(x_k)$. Albeit computationally lightweight with the simple update $x_{k+1} = x_k - \eta_k \nabla f_{i_k}(x_k)$, the price paid is that SGD only converges sublinearly, hence slower than GD (Robbins and Monro, 1951; Bottou et al., 2016).

It has long been recognized that the variance $\mathbb{E}[\|\nabla f_{i_k}(x_k) - \nabla f(x_k)\|^2]$ of the gradient estimate is what critically slows down SGD's convergence. This naturally motivated gradient estimates with reduced variance compared with SGD's simple $\nabla f_{i_k}(x_k)$. A gradient estimate with reduced variance can be obtained by capitalizing on the finite-sum structure of (1). One idea is to judiciously evaluate a so-termed snapshot gradient $\nabla f(\tilde{x}^s)$, and use it as an anchor of the stochastic draws in subsequent iterations.
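To make the anchoring idea concrete, the following minimal sketch (ours, not the authors' code) compares the mean-square error of the plain SGD estimate with that of a snapshot-anchored estimate on a toy least-squares instance of (1); all names and data here are illustrative assumptions.

```python
import numpy as np

# Illustration of why anchoring stochastic draws at a snapshot gradient reduces
# variance, on a toy least-squares problem f_i(x) = 0.5 * (a_i^T x - b_i)^2.
rng = np.random.default_rng(0)
n, d = 200, 5
A, b = rng.standard_normal((n, d)), rng.standard_normal(n)

def grad_i(x, i):                 # gradient of a single summand f_i
    return (A[i] @ x - b[i]) * A[i]

def full_grad(x):                 # gradient of f(x) = (1/n) sum_i f_i(x)
    return A.T @ (A @ x - b) / n

x_snap = rng.standard_normal(d)   # snapshot point
g_snap = full_grad(x_snap)        # snapshot gradient
x = x_snap + 0.01 * rng.standard_normal(d)   # current iterate, close to the snapshot
g = full_grad(x)

sgd_err, anchored_err = [], []
for i in range(n):                # exact MSE over the uniform draw of i
    sgd_err.append(np.sum((grad_i(x, i) - g) ** 2))
    anchored_err.append(np.sum((grad_i(x, i) - grad_i(x_snap, i) + g_snap - g) ** 2))
print(np.mean(sgd_err), np.mean(anchored_err))   # the anchored estimate has far smaller MSE
```

Both estimates are unbiased for $\nabla f(x)$; the anchored one has small variance whenever the iterate stays close to the snapshot, which is the mechanism the algorithms below exploit.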
Members of the variance reduction family include SVRG (Johnson and Zhang, 2013), SAG (Roux et al., 2012), SAGA (Defazio et al., 2014), MISO (Mairal, 2013), SARAH (Nguyen et al., 2017), and their variants (Konečný and Richtárik, 2013; Lei et al., 2017; Kovalev et al., 2019; Li et al., 2020). Most of these algorithms rely on the update $x_{k+1} = x_k - \eta v_k$, where $\eta$ is a constant step size and $v_k$ is an algorithm-specific gradient estimate that takes advantage of the snapshot gradient. In this work, SVRG and SARAH are of central interest because they are memory efficient compared with SAGA, and have no need for the duality arguments that SDCA (Shalev-Shwartz and Zhang, 2013) entails.

Variance reduction methods converge linearly when $f$ is strongly convex. To fairly compare the complexity of (S)GD with that of variance reduction algorithms, which combine snapshot gradients with the stochastic ones, we will rely on the incremental first-order oracle (IFO) (Agarwal and Bottou, 2015).

**Definition 1.** An IFO takes $f_i$ and $x \in \mathbb{R}^d$ as input, and returns the (incremental) gradient $\nabla f_i(x)$.

For convenience, IFO complexity is abbreviated as complexity in this work. A desirable algorithm obtains an $\epsilon$-accurate solution satisfying $\mathbb{E}[\|\nabla f(x)\|^2] \le \epsilon$ or $\mathbb{E}[f(x) - f(x^*)] \le \epsilon$ with minimal complexity for a prescribed $\epsilon$. The complexity of variance reduction alternatives such as SVRG and SARAH is $\mathcal{O}\big((n+\kappa)\ln\frac{1}{\epsilon}\big)$, a clear improvement over GD's complexity $\mathcal{O}\big(n\kappa\ln\frac{1}{\epsilon}\big)$. And when high accuracy (small $\epsilon$) is desired, the complexity of variance reduction algorithms is also lower than SGD's complexity of $\mathcal{O}\big(\frac{1}{\epsilon}\big)$. The merits of gradient estimates with reduced variance go beyond convexity; see, e.g., (Reddi et al., 2016; Fang et al., 2018; Cutkosky and Orabona, 2019), but nonconvex ERM is beyond the present work's scope.

Though theoretically appealing, SVRG and SARAH entail grid search to tune the step size, which is often painstakingly hard and time consuming. An automatically tuned step size for SVRG, based on the Barzilai-Borwein (BB) method (Barzilai and Borwein, 1988), was introduced by Tan et al. (2016). However, since both SVRG and SARAH have a double-loop structure, the inner loop length also requires tuning in addition to the step size. Other works relying on BB step sizes introduce additional tunable parameters on top of the inner loop length (Liu et al.; Yang et al., 2019). In a nutshell, truly tune-free variance reduction algorithms still have desirable aspects to investigate and fulfill.

Along with the BB step sizes, this paper establishes that, in order to obtain tune-free SVRG and SARAH schemes, one must: i) develop novel types of averaging; and ii) adjust the inner loop length along with the step size. Averaging in double-loop algorithms refers to the means of choosing the starting point of the next outer loop (Johnson and Zhang, 2013; Tan et al., 2016; Nguyen et al., 2017). The types of averaging considered so far have been employed as tricks to simplify proofs, while in the algorithm itself the last iteration is the most prevalent choice for the starting point of the ensuing outer loop. However, we contend that different averaging methods result in different performance, and the best averaging depends on the choice of the other parameters. In addition to averaging, we argue that the choice of the inner loop length for BB-SVRG in (Tan et al., 2016) is too pessimistic. Addressing this with a simple modification leads to the desired almost tune-free SVRG and SARAH. Our detailed contributions can be summarized as follows.
- We empirically argue that averaging is not merely a proof trick; it is prudent to adjust averaging in accordance with the step size and the inner loop length.
- SVRG and SARAH are analyzed using the notion of estimate sequence (ES). This prompts a novel averaging that tightens the convergence rate of SVRG, and further improves SARAH's convergence over existing works under certain conditions. Besides tighter rates, our analysis broadens the analytical tool, ES, by endowing it with the ability to deal with SARAH's biased gradient estimate.
- Theoretical guarantees for BB-SVRG and BB-SARAH with different types of averaging are established and leveraged for performance improvement.
- Finally, we offer a principled design of the inner loop length to obtain almost tune-free BB-SVRG and BB-SARAH. The choice of the inner loop length is guided by the regime that the proposed averaging schemes favor. Numerical tests further corroborate the efficiency of the proposed algorithms.

**Notation.** Bold lowercase letters denote column vectors; $\mathbb{E}$ represents expectation; $\|x\|$ stands for the $\ell_2$-norm of $x$; and $\langle x, y \rangle$ denotes the inner product of vectors $x$ and $y$.

## 2. Preliminaries

We will first focus on the averaging techniques, whose generality goes beyond BB step sizes. To start with, this section briefly reviews the vanilla SVRG and SARAH, while their BB variants are postponed slightly.

### 2.1. Basic Assumptions

**Assumption 1.** Each $f_i : \mathbb{R}^d \to \mathbb{R}$ has an $L$-Lipschitz gradient, that is, $\|\nabla f_i(x) - \nabla f_i(y)\| \le L\|x - y\|$, $\forall x, y \in \mathbb{R}^d$.

**Assumption 2.** Each $f_i : \mathbb{R}^d \to \mathbb{R}$ is convex.

**Assumption 3.** The function $f : \mathbb{R}^d \to \mathbb{R}$ is $\mu$-strongly convex, that is, there exists $\mu > 0$ such that $f(x) - f(y) \ge \langle \nabla f(y), x - y \rangle + \frac{\mu}{2}\|x - y\|^2$, $\forall x, y \in \mathbb{R}^d$.

**Assumption 4.** Each $f_i : \mathbb{R}^d \to \mathbb{R}$ is $\mu$-strongly convex, meaning there exists $\mu > 0$ such that $f_i(x) - f_i(y) \ge \langle \nabla f_i(y), x - y \rangle + \frac{\mu}{2}\|x - y\|^2$, $\forall x, y \in \mathbb{R}^d$.

Assumption 1 requires each loss function to be sufficiently smooth. One can certainly refine Assumption 1 by requiring smoothness of each individual loss function, that is, $f_i$ has an $L_i$-Lipschitz gradient; clearly $L = \max_i L_i$. Combined with importance sampling (Xiao and Zhang, 2014; Kulunchakov and Mairal, 2019), such a refined assumption can slightly tighten the $\kappa$ dependence in the complexity bound. However, since the extension is straightforward, we keep the simpler Assumption 1 for clarity. Assumption 3 only requires $f$ to be strongly convex, which is weaker than Assumption 4. Assumptions 1-4 are all standard in variance reduction algorithms.

**Algorithm 1: SVRG**
1: Initialize: $\tilde{x}^0$, $\eta$, $m$, $S$
2: for $s = 1, 2, \ldots, S$ do
3:   $x_0^s = \tilde{x}^{s-1}$
4:   $g^s = \nabla f(x_0^s)$
5:   for $k = 0, 1, \ldots, m-1$ do
6:     uniformly draw $i_k \in [n]$
7:     $v_k^s = \nabla f_{i_k}(x_k^s) - \nabla f_{i_k}(x_0^s) + g^s$
8:     $x_{k+1}^s = x_k^s - \eta v_k^s$
9:   end for
10:  draw $\tilde{x}^s$ randomly from $\{x_k^s\}_{k=0}^m$ according to $p^s$
11: end for
12: Output: $\tilde{x}^S$

**Algorithm 2: SARAH**
1: Initialize: $\tilde{x}^0$, $\eta$, $m$, $S$
2: for $s = 1, 2, \ldots, S$ do
3:   $x_0^s = \tilde{x}^{s-1}$, and $v_0^s = \nabla f(x_0^s)$
4:   $x_1^s = x_0^s - \eta v_0^s$
5:   for $k = 1, 2, \ldots, m-1$ do
6:     uniformly draw $i_k \in [n]$
7:     $v_k^s = \nabla f_{i_k}(x_k^s) - \nabla f_{i_k}(x_{k-1}^s) + v_{k-1}^s$
8:     $x_{k+1}^s = x_k^s - \eta v_k^s$
9:   end for
10:  draw $\tilde{x}^s$ randomly from $\{x_k^s\}_{k=0}^m$ according to $p^s$
11: end for
12: Output: $\tilde{x}^S$

### 2.2. Recap of SVRG and SARAH

The steps of SVRG and SARAH are listed in Algorithms 1 and 2, respectively. Each employs a fine-grained reduced-variance gradient estimate per iteration. For SVRG, $v_k^s$ is an unbiased estimate since $\mathbb{E}[v_k^s \mid \mathcal{F}_{k-1}^s] = \nabla f(x_k^s)$, where $\mathcal{F}_{k-1}^s := \sigma(\tilde{x}^{s-1}, i_0, i_1, \ldots, i_{k-1})$ is the $\sigma$-algebra generated by $\tilde{x}^{s-1}, i_0, i_1, \ldots, i_{k-1}$; SARAH, in contrast, adopts a biased $v_k^s$, that is, $\mathbb{E}[v_k^s \mid \mathcal{F}_{k-1}^s] = \nabla f(x_k^s) - \nabla f(x_{k-1}^s) + v_{k-1}^s \ne \nabla f(x_k^s)$.
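For concreteness, the following is a minimal, self-contained sketch of one inner loop of Algorithm 1 and of Algorithm 2 on a toy least-squares instance of (1); it is an illustration under our own naming, not the authors' implementation.

```python
import numpy as np

# One outer loop of SVRG (Alg. 1) and of SARAH (Alg. 2) on a toy least-squares
# problem; all names and constants are illustrative.
rng = np.random.default_rng(1)
n, d, eta, m = 200, 10, 0.1, 50
A, b = rng.standard_normal((n, d)), rng.standard_normal(n)

def grad_i(x, i):                        # ∇f_i(x)
    return (A[i] @ x - b[i]) * A[i]

def full_grad(x):                        # ∇f(x)
    return A.T @ (A @ x - b) / n

def svrg_inner(x0):
    g = full_grad(x0)                    # snapshot gradient g^s
    xs, x = [x0], x0.copy()
    for _ in range(m):
        i = rng.integers(n)
        v = grad_i(x, i) - grad_i(x0, i) + g   # unbiased estimate (line 7, Alg. 1)
        x = x - eta * v
        xs.append(x)
    return xs                            # {x_k^s}, k = 0, ..., m

def sarah_inner(x0):
    v = full_grad(x0)                    # v_0^s = ∇f(x_0^s)
    xs, x = [x0], x0 - eta * v
    xs.append(x)
    x_prev = x0
    for _ in range(1, m):
        i = rng.integers(n)
        v = grad_i(x, i) - grad_i(x_prev, i) + v   # biased, recursive estimate (line 7, Alg. 2)
        x_prev, x = x, x - eta * v
        xs.append(x)
    return xs                            # {x_k^s}, k = 0, ..., m

x0 = np.zeros(d)
svrg_iterates = svrg_inner(x0)
sarah_iterates = sarah_inner(x0)
```

Line 10 of either algorithm then selects $\tilde{x}^s$ among the stored iterates according to the averaging vector $p^s$, which is the focus of what follows.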
The variance (mean-square error (MSE)) of $v_k^s$ in SVRG (SARAH) can be upper bounded by quantities that dictate the optimality gap (gradient norm square).

**Lemma 1** (Johnson and Zhang, 2013; Nguyen et al., 2017). The MSE of $v_k^s$ in SVRG is bounded as

$$\text{SVRG:}\quad \mathbb{E}\big[\|\nabla f(x_k^s) - v_k^s\|^2\big] \le \mathbb{E}\big[\|v_k^s\|^2\big] \le 4L\,\mathbb{E}\big[f(x_k^s) - f(x^*)\big] + 4L\,\mathbb{E}\big[f(x_0^s) - f(x^*)\big]. \tag{2a}$$

The MSE of $v_k^s$ in SARAH is also bounded as

$$\text{SARAH:}\quad \mathbb{E}\big[\|\nabla f(x_k^s) - v_k^s\|^2\big] \le \mathbb{E}\big[\|\nabla f(x_0^s)\|^2\big] - \mathbb{E}\big[\|v_k^s\|^2\big]. \tag{2b}$$

Another upper bound on SVRG's gradient estimate is available, see, e.g., (Kulunchakov and Mairal, 2019), but it is not suitable for our analysis. Intuitively, Lemma 1 suggests that if SVRG or SARAH converges, the MSE of their gradient estimates also approaches zero.

At the end of each inner loop, the starting point of the next outer loop is randomly selected among $\{x_k^s\}_{k=0}^m$ according to a pmf vector $p^s \in \Delta^{m+1}$, where $\Delta^{m+1} := \{p \in \mathbb{R}_+^{m+1} \mid \langle \mathbf{1}, p \rangle = 1\}$. We term $p^s$ the averaging weight vector, and let $p_j^s$ denote its $j$th entry. Leveraging the MSE bounds in Lemma 1 and choosing a proper averaging vector, SVRG and SARAH iterates can be proved to converge linearly for strongly convex problems. For SVRG, two types of averaging exist.

**U-Avg (SVRG)** (Johnson and Zhang, 2013): the vector $p^s$ is chosen as the pmf of an (almost) uniform distribution, that is, $p_m^s = 0$ and $p_k^s = 1/m$ for $k \in \{0, 1, \ldots, m-1\}$. Under Assumptions 1-3, the choice of $\eta = \mathcal{O}(1/L)$ and $m = \mathcal{O}(\kappa)$ ensures that SVRG iterates converge linearly.¹

**L-Avg (SVRG)** (Tan et al., 2016; Hu et al., 2018): only the last iteration is used, by setting $\tilde{x}^s = x_m^s$; or equivalently, by setting $p_m^s = 1$ and $p_k^s = 0$, $\forall k \ne m$. Under Assumptions 1-3, linear convergence is ensured by choosing $\eta = \mathcal{O}(1/(L\kappa))$ and $m = \mathcal{O}(\kappa^2)$.

To guarantee linear convergence, SVRG with L-Avg must adopt a much smaller $\eta$ and a larger $m$ compared with U-Avg. L-Avg with such a small step size leads to complexity $\mathcal{O}\big((n + \kappa^2)\ln\frac{1}{\epsilon}\big)$, which has a worse dependence on $\kappa$. For SARAH, there are also two averaging options.

**U-Avg (SARAH)** (Nguyen et al., 2017): here $p^s$ is selected to have entries $p_m^s = 0$ and $p_k^s = 1/m$ for $k \in \{0, 1, \ldots, m-1\}$. Linear convergence is guaranteed with complexity $\mathcal{O}\big((n + \kappa)\ln\frac{1}{\epsilon}\big)$ under Assumptions 1-3, so long as one selects $\eta = \mathcal{O}(1/L)$ and $m = \mathcal{O}(\kappa)$.

**L-Avg (SARAH)** (Li et al., 2020)²: here $p^s$ is chosen with entries $p_{m-1}^s = 1$ and $p_k^s = 0$, $\forall k \ne m-1$. Under Assumptions 1-3 and with $\eta = \mathcal{O}(1/L)$ as well as $m = \mathcal{O}(\kappa^2)$, linear convergence is guaranteed at a complexity of $\mathcal{O}\big((n + \kappa^2)\ln\frac{1}{\epsilon}\big)$. When both Assumptions 1 and 4 hold, setting $\eta = \mathcal{O}(1/L)$ and $m = \mathcal{O}(\kappa)$ results in linear convergence along with a reduced complexity of order $\mathcal{O}\big((n + \kappa)\ln\frac{1}{\epsilon}\big)$.

¹For simplicity and clarity of exposition, we only highlight the order of $\eta$ and $m$, and hide other constants in big-O notation. Detailed choices can be found in the corresponding references.

²There is another version of L-Avg for SARAH (Liu et al.), but its convergence claims require undesirably small step sizes $\eta = \mathcal{O}(\mu/L^2)$. This is why we focus on the L-Avg in (Li et al., 2020).

U-Avg (for both SVRG and SARAH) is usually employed as a proof trick to carry out the convergence analysis, while L-Avg is what is implemented most of the time. However, we will argue in the next section that with U-Avg adapted to the step size choice, it is possible to improve empirical performance.
Although U-Avg appears at first glance to waste updates, a simple trick in the implementation fixes this issue.

**Implementation of averaging.** Rather than updating $m$ times and then choosing $\tilde{x}^s$ according to Line 10 of SVRG or SARAH, one can generate a random integer $M^s \in \{0, 1, \ldots, m\}$ according to the averaging weight vector $p^s$ first. Having $x_{M^s}^s$ available, it is possible to start the next outer loop immediately.

## 3. Weighted Averaging for SVRG and SARAH

This section introduces weighted averaging for SVRG and SARAH, which serves as an intermediate step toward the ultimate tune-free variance reduction. For SVRG, such an averaging considerably tightens the analytical convergence rate; for SARAH, it improves the convergence rate when $m$ or $\eta$ is chosen sufficiently large. These analytical results are obtained by reexamining SVRG and SARAH through the estimate sequence (ES), a tool that has been used for analyzing momentum schemes (Nesterov, 2004); see also (Nitanda, 2014; Lin et al., 2015; Kulunchakov and Mairal, 2019). Different from existing ES analyses that rely heavily on the unbiasedness of $v_k^s$, our advances here endow the ES with the ability to deal with the biased gradient estimate of SARAH.

### 3.1. Estimate Sequence

Since this section focuses on a specific inner loop indexed by $s$, the superscript $s$ is dropped for brevity; for example, $x_k^s$ and $v_k^s$ are written as $x_k$ and $v_k$, respectively. Associated with the ERM objective $f$ and a particular point $x_0$, consider a series of quadratic functions $\{\Phi_k(x)\}_{k=0}^m$ that comprise what is termed an ES, with the first one given by

$$\Phi_0(x) = \Phi_0^* + \frac{\mu_0}{2}\|x - x_0\|^2 \tag{3a}$$

and the rest defined recursively as

$$\Phi_k(x) = (1 - \delta_k)\Phi_{k-1}(x) + \delta_k\Big[f(x_{k-1}) + \langle v_{k-1}, x - x_{k-1}\rangle + \frac{\mu}{2}\|x - x_{k-1}\|^2\Big] \tag{3b}$$

where $v_{k-1}$ is the gradient estimate in SVRG or SARAH, while $\Phi_0^*$, $\mu_0$, and $\delta_k$ are constants to be specified later. The design is similar to that of (Kulunchakov and Mairal, 2019), but the ES here is constructed per inner loop. In addition, here we will overcome the challenge of analyzing SARAH's biased gradient estimate $v_k$. Upon defining $\Phi_k^* := \min_x \Phi_k(x)$, the key properties of the sequence $\{\Phi_k(x)\}_{k=0}^m$ are collected in the next lemma.

**Lemma 2.** For $\{\Phi_k(x)\}_{k=0}^m$ as in (3), it holds that: i) $\Phi_0(x)$ is $\mu_0$-strongly convex, and $\Phi_k(x)$ is $\mu_k$-strongly convex with $\mu_k = (1 - \delta_k)\mu_{k-1} + \delta_k\mu$; ii) $x_k$ minimizes $\Phi_k(x)$ if $\delta_k = \eta\mu_k$; and iii) $\Phi_k^* = (1 - \delta_k)\Phi_{k-1}^* + \delta_k f(x_{k-1}) - \frac{\mu_k\eta^2}{2}\|v_{k-1}\|^2$.

Lemma 2 holds for both SVRG and SARAH. To better understand the role of the ES, it is instructive to use an example.

**Example.** With $\Phi_0^* = f(x_0)$, $\mu_0 = \mu$, and $\delta_k = \mu_k\eta$ for SVRG, it holds that $\mu_k = \mu$, $\forall k$, and $\delta_k = \mu\eta$, $\forall k$. If for convenience we let $\delta := \mu\eta$, we show in Appendix A.2 that

$$\mathbb{E}\big[\Phi_k(x)\big] \le (1 - \delta)^k\big(\Phi_0(x) - f(x)\big) + f(x). \tag{4}$$

As $k \to \infty$, one has $(1 - \delta)^k \to 0$, and hence $\Phi_k(x)$ approaches in expectation a lower bound of $f(x)$. Now, we are ready to view SVRG and SARAH through the lens of $\{\Phi_k(x)\}_{k=0}^m$ to obtain new averaging schemes.

### 3.2. Weighted Averaging for SVRG

The new averaging vector $p^s$ for SVRG, together with the improved convergence rate, is summarized in the following theorem.

**Theorem 1 (SVRG with W-Avg).** Under Assumptions 1-3, construct the ES as in (3) with $\mu_0 = \mu$, $\delta_k = \mu_k\eta$, and $\Phi_0^* = f(x_0)$. Choose $\eta < 1/(4L)$, and $m$ large enough such that

$$\lambda_{\rm SVRG} := \frac{1}{1 - (1-\mu\eta)^{m-1}}\cdot\frac{(1-\mu\eta)^{m-1} + 2\mu L\eta^2(1-\mu\eta)^{m-1}}{1 - 2L\eta} + \frac{2L\eta}{1 - 2L\eta} < 1.$$

Let $p_0^s = p_m^s = 0$, and $p_k^s = (1 - \mu\eta)^{m-k-1}/q$ for $k = 1, 2, \ldots, m-1$, where $q = \big[1 - (1-\mu\eta)^{m-1}\big]/(\mu\eta)$. It then holds for SVRG with this weighted averaging (W-Avg) that

$$\mathbb{E}\big[f(\tilde{x}^s) - f(x^*)\big] \le \lambda_{\rm SVRG}\,\mathbb{E}\big[f(\tilde{x}^{s-1}) - f(x^*)\big].$$
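As an illustration of how W-Avg combines with the implementation trick above, the sketch below (ours, with illustrative constants) builds the averaging vector of Theorem 1 and draws the index $M^s$ up front.

```python
import numpy as np

# Sketch: the W-Avg weights of Theorem 1 for SVRG, plus the "draw the index first,
# then run only that many inner iterations" trick from Section 2.2.
mu, L = 1e-3, 1.0
eta, m = 0.1 / L, 2000                  # illustrative choices, not tuned

def svrg_wavg_pmf(mu, eta, m):
    p = np.zeros(m + 1)                                     # p_0 = p_m = 0
    q = (1.0 - (1.0 - mu * eta) ** (m - 1)) / (mu * eta)    # normalizer q from Theorem 1
    k = np.arange(1, m)                                     # k = 1, ..., m-1
    p[k] = (1.0 - mu * eta) ** (m - k - 1) / q
    return p

rng = np.random.default_rng(2)
p = svrg_wavg_pmf(mu, eta, m)
assert abs(p.sum() - 1.0) < 1e-8        # p is a valid pmf by construction
M = rng.choice(m + 1, p=p)              # draw M^s up front ...
# ... then run only M inner iterations and use x_M^s as the next starting point.
```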
Comparing the W-Avg in Theorem 1 against the U-Avg and L-Avg of Section 2.2, the upshot of W-Avg is a much tighter convergence rate. When choosing $\eta = \mathcal{O}(1/L)$, the dominating terms of the convergence rate are $\mathcal{O}\big(\frac{(1 - 1/\kappa)^m}{1 - 2L\eta} + \frac{2L\eta}{1 - 2L\eta}\big)$ for W-Avg, and $\mathcal{O}\big(\frac{\kappa}{m(1 - 2L\eta)} + \frac{2L\eta}{1 - 2L\eta}\big)$ for U-Avg (Johnson and Zhang, 2013). Clearly, the factor $(1 - 1/\kappa)^m$ in W-Avg can be much smaller than $\kappa/m$ in U-Avg; see Fig. 1(a) for a comparison of the convergence rates of different averaging types. Since the convergence of SVRG with L-Avg requires $\eta$ and $m$ to be chosen differently from those in U-Avg and W-Avg, L-Avg is not plotted in Fig. 1(a).

Figure 1. A comparison of the analytical convergence rates: (a) SVRG, (b) SARAH. In both panels, κ = 10⁵ with L = 1 and μ = 10⁻⁵, and the step sizes are (a) η = 0.1/L for SVRG, and (b) η = 0.5/L for SARAH.

Next, we assess the complexity of SVRG with W-Avg.

**Corollary 1.** Choosing $m = \mathcal{O}(\kappa)$ and the other parameters as in Theorem 1, the complexity of SVRG with W-Avg to find $\tilde{x}^s$ satisfying $\mathbb{E}[f(\tilde{x}^s) - f(x^*)] \le \epsilon$ is $\mathcal{O}\big((n + \kappa)\ln\frac{1}{\epsilon}\big)$.

Note that, similar to U-Avg, W-Avg incurs a lower complexity compared with the L-Avg in (Tan et al., 2016; Hu et al., 2018).

### 3.3. Weighted Averaging for SARAH

SARAH is challenging to analyze due to the bias present in the estimate $v_k$, which makes the ES-based treatment of SARAH fundamentally different from that of SVRG. To see this, it is useful to start with the following lemma.

**Lemma 3.** For any deterministic $x$, it holds in SARAH that

$$\mathbb{E}\big[\langle v_k - \nabla f(x_k), x - x_k\rangle\big] \le \frac{\eta}{2}\sum_{\tau=0}^{k-1}\mathbb{E}\Big[\|v_\tau - \nabla f(x_\tau)\|^2 + \|v_\tau\|^2 - \|\nabla f(x_\tau)\|^2\Big].$$

Lemma 3 reveals the main difference in the ES-based argument for SARAH, namely that $\mathbb{E}[\langle v_k - \nabla f(x_k), x - x_k\rangle] \ne 0$, while the same inner product for SVRG equals $0$ in expectation. Reflecting back on (4), the consequence of having a nonzero $\mathbb{E}[\langle v_k - \nabla f(x_k), x - x_k\rangle]$ is that $\mathbb{E}[\Phi_k(x)]$ does not necessarily approach a lower bound of $f(x)$ as $k \to \infty$; instead,

$$\mathbb{E}\big[\Phi_k(x)\big] \le (1 - \delta)^k\big(\Phi_0(x) - f(x)\big) + f(x) + C \tag{5}$$

where $C$ is a nonzero term that is not present when (4) is applied to SVRG; see the detailed derivations in Appendix A.2. Interestingly, upon capitalizing on the properties of $v_k$, the ensuing theorem establishes linear convergence for SARAH with a proper W-Avg vector $p^s$.

**Theorem 2 (SARAH with W-Avg).** Under Assumptions 1 and 4, define the ES as in (3) with $\mu_0 = \mu$, $\delta_k = \mu_k\eta$, $\forall k$, and $\Phi_0^* = f(x_0)$. With $\delta := \mu\eta$, select $\eta < 1/L$ and $m$ large enough so that

$$\lambda_{\rm SARAH} := \frac{2\eta L(1-\delta)^{m-1}}{c\,\delta} + \frac{\eta L(m-1)}{c\,(2 - \eta L)} + \frac{\eta L}{2(2 - \eta L)} < 1$$

where $c = m - \frac{1 - (1-\delta)^m}{\delta}$ normalizes the weights. Setting $p_k = \big(1 - (1-\delta)^{m-k-1}\big)/c$ for $k = 0, 1, \ldots, m-2$, and $p_{m-1} = p_m = 0$, SARAH with this W-Avg satisfies

$$\mathbb{E}\big[\|\nabla f(\tilde{x}^s)\|^2\big] \le \lambda_{\rm SARAH}\,\mathbb{E}\big[\|\nabla f(\tilde{x}^{s-1})\|^2\big].$$

The expression of $\lambda_{\rm SARAH}$ is complicated because we want the upper bound on the convergence rate to be as tight as possible. To demonstrate with an example, choosing $\eta = 1/(2L)$ and $m = 5\kappa$ gives $\lambda_{\rm SARAH} \approx 0.8$. Fig. 1(b) compares SARAH with W-Avg versus SARAH with U-Avg and L-Avg; the advantage of W-Avg is more pronounced as $m$ is chosen larger. As for the complexity of SARAH with W-Avg, it is comparable with that of L-Avg or U-Avg, as asserted next.
**Corollary 2.** Choosing $m = \mathcal{O}(\kappa)$ and the other parameters as in Theorem 2, the complexity of SARAH with W-Avg to find $\tilde{x}^s$ satisfying $\mathbb{E}[\|\nabla f(\tilde{x}^s)\|^2] \le \epsilon$ is $\mathcal{O}\big((n + \kappa)\ln\frac{1}{\epsilon}\big)$.

A few remarks are now in order on our analytical findings: i) most existing ES-based proofs use $\mathbb{E}[f(\tilde{x}^s) - f(x^*)]$ as the optimality metric, while Theorem 2 and Corollary 2 rely on $\mathbb{E}[\|\nabla f(\tilde{x}^s)\|^2]$; ii) the analysis still holds when Assumption 4 is weakened to Assumption 3, at the price of a worse $\kappa$-dependence of the complexity, that is, $\mathcal{O}\big((n + \kappa^2)\ln\frac{1}{\epsilon}\big)$, which is of the same order as L-Avg under Assumptions 1-3 (Li et al., 2020; Liu et al.).

### 3.4. Averaging Is More Than a Proof Trick

Existing forms of averaging, such as U-Avg and W-Avg, are typically considered proof tricks for simplifying the theoretical analysis (Johnson and Zhang, 2013; Tan et al., 2016; Nguyen et al., 2017; Li et al., 2020). In this subsection, we contend that averaging can distinctly affect performance and should be adapted to the other parameters. We take SARAH with $\eta = \mathcal{O}(1/L)$ and $m = \mathcal{O}(\kappa)$ as an example rather than SVRG, since such parameter choices guarantee convergence regardless of the averaging employed. (For SVRG with L-Avg, on the other hand, the step size has to be chosen differently than with W-Avg or U-Avg.)

We first look at the convergence rate of SARAH across the different averaging options. Fixing $m = \mathcal{O}(\kappa)$ and changing $\eta$, the theoretical convergence rate is plotted in Fig. 2. It is observed that with smaller step sizes, L-Avg enjoys faster convergence, while larger step sizes tend to favor W-Avg and U-Avg instead.

Figure 2. SARAH's analytical convergence rate with different averaging options (κ = 10⁵, L = 1, μ = 10⁻⁵, and fixed m = 10κ).

Next, we demonstrate empirically that the type of averaging indeed matters. Consider binary classification with the regularized logistic loss

$$f(x) = \frac{1}{n}\sum_{i\in[n]} \ln\big(1 + \exp(-b_i\langle a_i, x\rangle)\big) + \frac{\mu}{2}\|x\|^2 \tag{6}$$

where $(a_i, b_i)$ is the (feature, label) pair of datum $i$. Clearly, (6) is an instance of the cost in (1) with $f_i(x) = \ln\big(1 + \exp(-b_i\langle a_i, x\rangle)\big) + \frac{\mu}{2}\|x\|^2$, and it can be readily verified that Assumptions 1 and 4 are satisfied in this case. SARAH with L-Avg, U-Avg, and W-Avg is tested with a fixed (moderate) $m = \mathcal{O}(\kappa)$ but different step size choices on the dataset w7a; see also Appendix D.1 for additional tests with the datasets a9a and diabetes.

Figure 3. Comparing SARAH with different types of averaging on dataset w7a (μ = 0.005 and m = 5κ in all tests): (a) η = 0.9/L, (b) η = 0.6/L, (c) η = 0.06/L.

Fig. 3(a) shows that for a large step size $\eta = 0.9/L$, W-Avg outperforms U-Avg as well as L-Avg by almost two orders of magnitude at the 30th sample pass. For a medium step size $\eta = 0.6/L$, W-Avg and L-Avg perform comparably, while both are outperformed by U-Avg. When $\eta$ is chosen small, L-Avg is clearly the winner. In short, the performance of the averaging options varies with the step size. This is intuitively reasonable because the MSE of $v_k$: i) scales with $\eta$ (cf. Lemma 1); and ii) tends to increase with $k$, since $\mathbb{E}[\|v_k\|^2]$ decreases linearly (see Lemma 5 in Appendix B.2 and the MSE bound in Lemma 1). As a result, when both $\eta$ and $k$ are large, the MSE of $v_k$ tends to be large too, and iterates relying on gradient estimates with high MSE can jeopardize convergence. This explains the inferior performance of L-Avg in Figs. 3(a) and 3(b). On the other hand, when $\eta$ is chosen small, the MSE tends to be small as well; hence, working with L-Avg does not compromise convergence, while in expectation W-Avg and U-Avg compute the full gradient more frequently than L-Avg. These two reasons explain the improved performance of L-Avg in Fig. 3(c).
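For completeness, here is a minimal sketch of the loss (6) and its component gradients of the form used in such tests; the synthetic data and the names `loss`, `grad_i`, and `full_grad` are illustrative assumptions rather than the authors' code.

```python
import numpy as np

# Regularized logistic loss (6): f_i(x) = ln(1 + exp(-b_i <a_i, x>)) + (mu/2)||x||^2.
# Assumptions 1 and 4 hold here with L = max_i ||a_i||^2 / 4 + mu.
rng = np.random.default_rng(3)
n, d, mu = 500, 20, 0.005
A = rng.standard_normal((n, d))            # rows are the features a_i
b = np.sign(rng.standard_normal(n))        # labels b_i in {-1, +1}

def loss(x):
    return np.mean(np.log1p(np.exp(-b * (A @ x)))) + 0.5 * mu * (x @ x)

def grad_i(x, i):                          # ∇f_i(x)
    z = -b[i] * (A[i] @ x)
    sigma = 1.0 / (1.0 + np.exp(-z))       # = exp(z) / (1 + exp(z))
    return -b[i] * sigma * A[i] + mu * x

def full_grad(x):                          # ∇f(x)
    z = -b * (A @ x)
    sigma = 1.0 / (1.0 + np.exp(-z))
    return -(A.T @ (b * sigma)) / n + mu * x
```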
When we fix $\eta$ and change $m$, as depicted in Fig. 1(b), the analytical convergence rate of W-Avg improves over that of U-Avg and L-Avg when $m$ is large. This is because the MSE of $v_k$ increases with $k$; W-Avg and U-Avg ensure better performance through early ending, that is, by reducing the number of updates that utilize a $v_k$ with large MSE. In sum, the choice of the averaging scheme should be adapted to $\eta$ and $m$ to optimize performance. For example, the proposed W-Avg for SARAH favors the regime where either $\eta$ or $m$ is chosen large, as dictated by the convergence rates and corroborated by the numerical tests.

## 4. Tune-Free Variance Reduction

This section copes with variance reduction without tuning. In particular, i) Barzilai-Borwein (BB) step sizes, ii) averaging schemes, and iii) a time-varying inner loop length are adopted for the best empirical performance.

### 4.1. Recap of BB Step Sizes

Aiming to develop tune-free SVRG and SARAH, we first adopt the BB scheme to obtain suitable step sizes automatically (Tan et al., 2016). In a nutshell, BB monitors the progress of previous outer loops, and chooses the step size of outer loop $s$ accordingly via

$$\eta_s = \frac{1}{\theta_\kappa}\cdot\frac{\|\tilde{x}^{s-1} - \tilde{x}^{s-2}\|^2}{\langle \tilde{x}^{s-1} - \tilde{x}^{s-2},\, \nabla f(\tilde{x}^{s-1}) - \nabla f(\tilde{x}^{s-2})\rangle} \tag{7}$$

where $\theta_\kappa$ is a $\kappa$-dependent parameter to be specified later. Note that $\nabla f(\tilde{x}^{s-1})$ and $\nabla f(\tilde{x}^{s-2})$ are computed at outer loops $s$ and $s-1$, respectively; hence, the implementation overhead of the BB step sizes amounts to the almost negligible memory needed to store $\tilde{x}^{s-2}$ and $\nabla f(\tilde{x}^{s-2})$.

BB step sizes for SVRG with L-Avg have relied on $\theta_\kappa = m = \mathcal{O}(\kappa^2)$ (Tan et al., 2016). Such a choice of parameters offers provable convergence at complexity $\mathcal{O}\big((n + \kappa^2)\ln\frac{1}{\epsilon}\big)$, but has not been effective in our simulations for two reasons: i) the step size $\eta_s$ depends on $m$, which means that tuning is still required for the step sizes; and ii) the optimal $m$ of $\mathcal{O}(\kappa)$ with the best empirical performance deviates significantly from the theoretically suggested $\mathcal{O}(\kappa^2)$; see also Fig. 4(a).

Figure 4. (a) Performance of BB-SVRG (Tan et al., 2016) under different choices of m. (b) BB-SARAH with different averaging schemes. Both tests use dataset a9a with κ = 1,388.

Other BB-based variance reduction methods introduce extra parameters to be tuned in addition to $m$; e.g., some scalars in (Liu et al.) and the mini-batch sizes in (Yang et al., 2019). This prompts us to design more practical BB methods; how to choose $m$ with minimal tuning is also of major practical value.
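In code, the BB rule (7) amounts to a few lines; the sketch below is ours, with `x_prev`, `x_prev2`, `g_prev`, `g_prev2` standing for $\tilde{x}^{s-1}$, $\tilde{x}^{s-2}$ and their full gradients, all assumed available from the previous two outer loops.

```python
# Sketch of the BB step size (7): eta_s = (1/theta_kappa) * ||dx||^2 / <dx, dg>,
# where dx and dg are differences of the last two outer-loop iterates/gradients.
# Assumes x_prev != x_prev2 (otherwise the ratio is undefined).
def bb_step_size(x_prev, x_prev2, g_prev, g_prev2, theta_kappa):
    dx = x_prev - x_prev2                 # x~^{s-1} - x~^{s-2}
    dg = g_prev - g_prev2                 # grad f(x~^{s-1}) - grad f(x~^{s-2})
    return (dx @ dx) / (theta_kappa * (dx @ dg))

# For a mu-strongly convex, L-smooth f, <dx, dg> >= mu * ||dx||^2 > 0, which yields
# the range (8): 1/(theta_kappa * L) <= eta_s <= 1/(theta_kappa * mu).
```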
### 4.2. Averaging for BB Step Sizes

We start with a fixed choice of $m$ to theoretically investigate the different types of averaging for the BB step sizes. The final tune-free implementation of SVRG and SARAH will rely on the analysis of this subsection.

**Proposition 1 (BB-SVRG).** Under Assumptions 1-3, if we choose $m = \mathcal{O}(\kappa^2)$ and $\theta_\kappa = \mathcal{O}(\kappa)$ (but with $\theta_\kappa > 4\kappa$), then BB-SVRG with U-Avg and W-Avg can find $\tilde{x}^s$ with $\mathbb{E}[f(\tilde{x}^s) - f(x^*)] \le \epsilon$ using $\mathcal{O}\big((n + \kappa^2)\ln\frac{1}{\epsilon}\big)$ IFO calls.

Similar to BB-SVRG, the ensuing result asserts that for BB-SARAH, W-Avg, U-Avg, and L-Avg have the same order of complexity.

**Proposition 2 (BB-SARAH).** Under Assumptions 1 and 4, if we choose $m = \mathcal{O}(\kappa^2)$ and $\theta_\kappa = \mathcal{O}(\kappa)$, then BB-SARAH finds a solution with $\mathbb{E}[\|\nabla f(\tilde{x}^s)\|^2] \le \epsilon$ using $\mathcal{O}\big((n + \kappa^2)\ln\frac{1}{\epsilon}\big)$ IFO calls, when one of the following conditions holds: i) U-Avg with $\theta_\kappa > \kappa$; ii) L-Avg with $\theta_\kappa > 3\kappa/2$; or iii) W-Avg with $\theta_\kappa > \kappa$.

The price paid for automatically tuned step sizes is a worse dependence of the complexity on $\kappa$, compared with the bounds in Corollaries 1 and 2. The cause of this worse dependence is that one has to choose a large $m$, at the order of $\mathcal{O}(\kappa^2)$. However, such an automatic tuning of the step size comes almost as a free lunch when problem (1) is well conditioned, or in the big data regime, e.g., $\kappa^2 \ll n$ or $\kappa^2 \approx n$, since the dominant term in the complexity is $\mathcal{O}(n\ln\frac{1}{\epsilon})$ for both SVRG and BB-SVRG. On the other hand, it is prudent to stress that with $\kappa^2 \gg n$, the BB step sizes slow down convergence.

Given the same order of complexity, the empirical performance of BB-SARAH with the different averaging types is showcased in Fig. 4(b), with the parameters chosen as in Proposition 2. It is observed that W-Avg converges most rapidly, while U-Avg outperforms L-Avg. This confirms our theoretical insight that W-Avg and U-Avg are more suitable when $m$ is chosen large enough.

### 4.3. Tune-Free Variance Reduction

Next, the ultimate form of the almost tune-free variance reduction is presented using SARAH as an example. We discuss how to choose the number of inner loop iterations and the averaging scheme for the BB step sizes.

**Adaptive inner loop length.** The BB step size can vary over a wide range of values (see Appendix C for the derivations),

$$\frac{1}{\theta_\kappa L} \le \eta_s \le \frac{1}{\theta_\kappa \mu}. \tag{8}$$

Given $\theta_\kappa = \mathcal{O}(\kappa)$, $\eta_s$ can vary from $\mathcal{O}(\mu/L^2)$ to $\mathcal{O}(1/L)$. Such a wide range of $\eta_s$ blocks the possibility of finding a single $m$ suitable for both small and large $\eta_s$ at the same time. From a theoretical perspective, choosing $m = \mathcal{O}(\kappa^2)$ in both Propositions 1 and 2 is mainly for coping with the small step sizes $\eta_s = \mathcal{O}(1/(L\theta_\kappa))$, but such a choice is too pessimistic for the large ones $\eta_s = \mathcal{O}(1/(\mu\theta_\kappa))$. In fact, choosing $m = \mathcal{O}(\kappa)$ for $\eta_s = \mathcal{O}(1/L)$ is good enough, as suggested by Corollaries 1 and 2. These observations motivate us to design an $m_s$ that changes dynamically per outer loop $s$.

Reflecting on the convergence of SARAH, it is sufficient to set the inner loop length $m_s$ according to the $\eta_s$ used. To highlight the rationale behind our choice of $m_s$, consider BB-SARAH with U-Avg as an example, which features the convergence rate $\lambda_s = \frac{1}{\mu\eta_s m_s} + \frac{\eta_s L}{2 - \eta_s L}$ (Nguyen et al., 2017). Set $\theta_\kappa > \kappa$ as in Proposition 2, so that the second term of $\lambda_s$ is always less than $1$. With a large step size $\eta_s = \mathcal{O}(1/L)$, simply choosing $m_s = \mathcal{O}\big(1/(\mu\eta_s)\big)$ ensures a convergent iteration with, e.g., $\lambda_s < 1$. With a small step size $\eta_s = \mathcal{O}\big(1/(\kappa L)\big)$, choosing $m_s = \mathcal{O}\big(1/(\mu\eta_s)\big)$ also leads to $\lambda_s < 1$. These considerations prompt us to adopt a time-varying inner loop length adjusted by the $\eta_s$ in (7) as

$$m_s = \frac{c}{\mu\eta_s}. \tag{9}$$
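Combining (7) and (9), the outer loop of an almost tune-free SARAH can be sketched as follows. This is our own skeleton, not the authors' implementation: `full_grad`, `sarah_inner`, and `sample_average` are hypothetical placeholders for the full gradient, one inner loop of Algorithm 2, and the W-Avg draw, while `eta0` is a user-supplied step size for the first outer loop; the defaults $\theta_\kappa = \kappa$ and $c = 1$ anticipate the choices discussed next.

```python
import math

# Skeleton of the almost tune-free outer loop: the BB rule (7) sets eta_s, and (9)
# sets the inner loop length m_s = c / (mu * eta_s). In Algorithms 1-2 the full
# gradient of the new starting point is computed at the beginning of the next inner
# loop anyway; here it is reused as bookkeeping for the BB rule.
def tune_free_sarah(x0, full_grad, sarah_inner, sample_average,
                    mu, kappa, eta0, c=1.0, outer_loops=20):
    theta_kappa = kappa
    x_prev2, x_prev = None, x0
    g_prev2, g_prev = None, full_grad(x0)
    for s in range(outer_loops):
        if x_prev2 is None:
            eta_s = eta0                                   # no BB step until two snapshots exist
        else:
            dx, dg = x_prev - x_prev2, g_prev - g_prev2
            eta_s = (dx @ dx) / (theta_kappa * (dx @ dg))  # BB step size (7)
        m_s = max(1, math.ceil(c / (mu * eta_s)))          # adaptive inner loop length (9)
        iterates = sarah_inner(x_prev, eta_s, m_s)         # one inner loop of Algorithm 2
        x_next = sample_average(iterates, mu, eta_s, m_s)  # W-Avg draw of the next start
        x_prev2, g_prev2 = x_prev, g_prev
        x_prev, g_prev = x_next, full_grad(x_next)
    return x_prev
```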
Such choices of $\eta_s$ and $m_s$ at first glance do not lead to a tune-free algorithm directly, because one has to find an optimal $\theta_\kappa$ and $c$ through tuning. Fortunately, there are simple choices for both $c$ and $\theta_\kappa$. In Propositions 1 and 2, the smallest admissible $\theta_\kappa$ for SVRG and SARAH with the different types of averaging turns out to be a reliable choice, while choosing $c = 1$ has been good enough throughout our numerical experiments. Although this parameter selection slightly violates the theoretical guarantees, its merit lies in its simplicity, and no divergence has been observed with these selections in our experiments.

**Averaging schemes.** As discussed in Section 3.4, W-Avg gains in performance when either $m_s$ or $\eta_s$ is large. Since $m_s$ and $\eta_s$ are inversely proportional (cf. (9)), at least one of the two is necessarily large; for this reason, we rely on W-Avg for BB-SARAH. Extensions of almost tune-free variance reduction to (non)convex problems can be found in our technical note (Li and Giannakis, 2019).

## 5. Numerical Tests

To assess performance, the proposed tune-free BB-SVRG and BB-SARAH are applied to binary classification via regularized logistic regression (cf. (6)) using the datasets a9a, rcv1.binary, and real-sim from LIBSVM.³ Details regarding the datasets, the $\mu$ values used, and the implementation are deferred to Appendix D.2. For comparison, the selected benchmarks are SGD, SVRG with U-Avg, and SARAH with U-Avg. The step size for SGD is $\eta = 0.05/(L(n_e + 1))$, where $n_e$ is the epoch index. For SVRG and SARAH, we fix $m = 5\kappa$ and tune for the best step sizes. For BB-SVRG, we choose $\eta_s$ and $m_s$ as in (7) and (9) with $\theta_\kappa = 4\kappa$ (as in Proposition 1) and $c = 1$, while we choose $\theta_\kappa = \kappa$ (as in Proposition 2) and $c = 1$ for BB-SARAH. W-Avg is adopted for both BB-SVRG and BB-SARAH.

³Available online at https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html.

The results are showcased in Fig. 5. We also tested BB-SVRG with the parameters chosen as in (Tan et al., 2016, Thm. 3.8); however, it only slightly outperforms SGD and hence is not plotted here (see the blue line in Fig. 4(a) as a reference). On dataset a9a, BB-SARAH outperforms tuned SARAH, while BB-SVRG is worse than SVRG initially but reaches similar performance around the 40th sample pass on the x-axis. On dataset rcv1, however, BB-SARAH, BB-SVRG, and SARAH have similar performance, improving over SVRG. On dataset real-sim, BB-SARAH performs almost identically to SARAH, and BB-SVRG exhibits performance comparable to SVRG.

Figure 5. Tests of BB-SVRG and BB-SARAH on the datasets (a) a9a, (b) rcv1, and (c) real-sim.

## 6. Conclusions

Almost tune-free SVRG and SARAH were developed in this work. Besides the BB step size for eliminating step size tuning, the key insights are that both i) the averaging and ii) the number of inner loop iterations should be adjusted according to the BB step size. Specific major findings include: i) estimate-sequence-based, provably linear convergence of SVRG and SARAH, which enabled new types of averaging for efficient variance reduction; ii) theoretical guarantees for BB-SVRG and BB-SARAH with different types of averaging; and iii) implementable tune-free variance reduction algorithms. The efficacy of the tune-free BB-SVRG and BB-SARAH was corroborated numerically.

**Acknowledgements.** The authors would like to thank the anonymous reviewers for their constructive feedback. We also gratefully acknowledge the support from NSF grants 1711471 and 1901134.

## References

Alekh Agarwal and Léon Bottou. A lower bound for the optimization of finite sums. In *Proc. Intl. Conf. on Machine Learning*, pages 78-86, Lille, France, 2015.

Jonathan Barzilai and Jonathan M. Borwein. Two-point step size gradient methods. *IMA Journal of Numerical Analysis*, 8(1):141-148, 1988.
Léon Bottou, Frank E. Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. arXiv preprint arXiv:1606.04838, 2016.

Ashok Cutkosky and Francesco Orabona. Momentum-based variance reduction in non-convex SGD. In *Proc. Advances in Neural Info. Process. Syst.*, pages 15210-15219, Vancouver, Canada, 2019.

Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In *Proc. Advances in Neural Info. Process. Syst.*, pages 1646-1654, Montreal, Canada, 2014.

Cong Fang, Chris Junchi Li, Zhouchen Lin, and Tong Zhang. SPIDER: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. In *Proc. Advances in Neural Info. Process. Syst.*, pages 687-697, Montreal, Canada, 2018.

Bin Hu, Stephen Wright, and Laurent Lessard. Dissipativity theory for accelerating stochastic variance reduction: A unified analysis of SVRG and Katyusha using semidefinite programs. arXiv preprint arXiv:1806.03677, 2018.

Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In *Proc. Advances in Neural Info. Process. Syst.*, pages 315-323, Lake Tahoe, Nevada, 2013.

Jakub Konečný and Peter Richtárik. Semi-stochastic gradient descent methods. arXiv preprint arXiv:1312.1666, 2013.

Dmitry Kovalev, Samuel Horvath, and Peter Richtárik. Don't jump through hoops and remove those loops: SVRG and Katyusha are better without the outer loop. arXiv preprint arXiv:1901.08689, 2019.

Andrei Kulunchakov and Julien Mairal. Estimate sequences for variance-reduced stochastic composite optimization. In *Proc. Intl. Conf. Machine Learning*, pages 3541-3550, Long Beach, CA, 2019.

Lihua Lei, Cheng Ju, Jianbo Chen, and Michael I. Jordan. Non-convex finite-sum optimization via SCSG methods. In *Proc. Advances in Neural Info. Process. Syst.*, pages 2348-2358, Long Beach, CA, 2017.

Bingcong Li and Georgios B. Giannakis. Adaptive step sizes in variance reduction via regularization. arXiv preprint arXiv:1910.06532, 2019.

Bingcong Li, Meng Ma, and Georgios B. Giannakis. On the convergence of SARAH and beyond. In *Proc. Intl. Conf. on Artificial Intelligence and Statistics*, Palermo, Italy, 2020.

Hongzhou Lin, Julien Mairal, and Zaid Harchaoui. A universal catalyst for first-order optimization. In *Proc. Advances in Neural Info. Process. Syst.*, pages 3384-3392, Montreal, Canada, 2015.

Yan Liu, Congying Han, and Tiande Guo. A class of stochastic variance reduced methods with an adaptive stepsize. URL http://www.optimization-online.org/DB_FILE/2019/04/7170.pdf.

Julien Mairal. Optimization with first-order surrogate functions. In *Proc. Intl. Conf. on Machine Learning*, pages 783-791, Atlanta, 2013.

Yurii Nesterov. *Introductory Lectures on Convex Optimization: A Basic Course*, volume 87. Springer Science & Business Media, 2004.

Lam M. Nguyen, Jie Liu, Katya Scheinberg, and Martin Takáč. SARAH: A novel method for machine learning problems using stochastic recursive gradient. In *Proc. Intl. Conf. Machine Learning*, Sydney, Australia, 2017.

Atsushi Nitanda. Stochastic proximal gradient descent with acceleration techniques. In *Proc. Advances in Neural Info. Process. Syst.*, pages 1574-1582, Montreal, Canada, 2014.

Sashank J. Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poczos, and Alex Smola. Stochastic variance reduction for nonconvex optimization. In *Proc. Intl. Conf. on Machine Learning*, pages 314-323, New York City, NY, 2016.

Herbert Robbins and Sutton Monro. A stochastic approximation method. *The Annals of Mathematical Statistics*, pages 400-407, 1951.
Nicolas L. Roux, Mark Schmidt, and Francis R. Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In *Proc. Advances in Neural Info. Process. Syst.*, pages 2663-2671, Lake Tahoe, Nevada, 2012.

Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. *Journal of Machine Learning Research*, 14:567-599, Feb. 2013.

Conghui Tan, Shiqian Ma, Yu-Hong Dai, and Yuqiu Qian. Barzilai-Borwein step size for stochastic gradient descent. In *Proc. Advances in Neural Info. Process. Syst.*, pages 685-693, Barcelona, Spain, 2016.

Lin Xiao and Tong Zhang. A proximal stochastic gradient method with progressive variance reduction. *SIAM Journal on Optimization*, 24(4):2057-2075, 2014.

Zhuang Yang, Zengping Chen, and Cheng Wang. Accelerating mini-batch SARAH by step size rules. arXiv preprint arXiv:1906.08496, 2019.