# On Biased Stochastic Gradient Estimation

Journal of Machine Learning Research 23 (2022) 1-43. Submitted 4/20; Revised 7/21; Published 2/22.

**Derek Driggs** (d.driggs@maths.cam.ac.uk), Department of Applied Mathematics and Theoretical Physics, University of Cambridge, Cambridge, CB3 0WA, UK

**Jingwei Liang** (jingwei.liang@sjtu.edu.cn), Institute of Natural Sciences and School of Mathematical Sciences, Shanghai Jiao Tong University, Shanghai, 200240, China

**Carola-Bibiane Schönlieb** (cbs31@cam.ac.uk), Department of Applied Mathematics and Theoretical Physics, University of Cambridge, Cambridge, CB3 0WA, UK

**Editor:** Julien Mairal

**Abstract.** We present a uniform analysis of biased stochastic gradient methods for minimizing convex, strongly convex, and non-convex composite objectives, and identify settings where bias is useful in stochastic gradient estimation. The framework we present allows us to extend proximal support to biased algorithms, including SAG and SARAH, for the first time in the convex setting. We also use our framework to develop a new algorithm, Stochastic Average Recursive GradiEnt (SARGE), that achieves the oracle complexity lower bound for non-convex, finite-sum objectives and requires strictly fewer calls to a stochastic gradient oracle per iteration than SVRG and SARAH. We support our theoretical results with numerical experiments that demonstrate the benefits of certain biased gradient estimators.

**Keywords:** stochastic gradient descent, variance reduction, biased gradient estimation

## 1. Introduction

In this paper, we focus on the following composite minimization problem:

$$\min_{x \in \mathbb{R}^p} F(x) \stackrel{\mathrm{def}}{=} f(x) + g(x). \qquad (1)$$

Throughout, we assume:

- $g$ is proper and closed, so that its proximity operator (see (3) in Section 2) is well-defined;
- $f$ admits the finite-sum structure $f(x) \stackrel{\mathrm{def}}{=} \frac{1}{n}\sum_{i=1}^{n} f_i(x)$; and
- for all $i \in \{1, 2, \ldots, n\}$, $\nabla f_i$ is $L$-Lipschitz continuous for some $L > 0$.
We consider three settings: the *convex* setting, where all of $\{f_i\}_{i=1}^n$ and $g$ are convex; the *strongly convex* setting, where additionally $g$ is strongly convex; and the *non-convex* setting, where $\{f_i\}_{i=1}^n$ and $g$ are not necessarily convex.

©2022 Derek Driggs, Jingwei Liang, Carola-Bibiane Schönlieb. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v23/20-316.html.

Problems of this form arise frequently in many areas of science and engineering, such as machine learning, statistics, operations research, and imaging. For instance, in machine learning, these problems often arise as empirical risk minimization problems from classification and regression tasks. Examples include ridge regression, logistic regression, Lasso, and $\ell_1$-regularized logistic regression (Bishop, 2006). Principal component analysis (PCA) can also be formulated as a problem with this structure, where the functions $f_i$ are non-convex (Garber and Hazan, 2015; Allen-Zhu and Yuan, 2018). In imaging, $\ell_1$ or total variation regularization is often combined with differentiable data discrepancy terms that appear in both convex and non-convex instances (Bredies and Lorenz, 2018).

### 1.1 Stochastic gradient methods

Two classical approaches to solving (1) are the proximal gradient descent method (PGD) (Lions and Mercier, 1979) and its accelerated variants, including inertial PGD (Liang et al., 2017) and FISTA (Beck and Teboulle, 2009). For these deterministic approaches, the full gradient of $f$ must be evaluated at each iteration, which often requires enormous computational resources when $n$ is large. This drawback makes these schemes unsuitable for large-scale machine learning tasks. By exploiting the finite-sum structure of $f$, stochastic gradient methods enjoy low per-iteration complexity while achieving comparable convergence rates.
These qualities make stochastic gradient methods the standard approach to many problems in machine learning, and they are gaining popularity in other areas such as image processing (Chambolle et al., 2018). Stochastic gradient descent (SGD) was first proposed in the 1950s (Robbins and Monro, 1951) and has experienced a renaissance in the past decade, with numerous variants of SGD proposed in the literature (see, for instance, (Schmidt et al., 2017; Johnson and Zhang, 2013; Defazio et al., 2014a) and references therein). Most of these algorithms can be summarized in one general form, described below in Algorithm 1.

**Algorithm 1** (Stochastic gradient descent framework)

**Input:** starting point $x_0 \in \mathbb{R}^p$, gradient estimator $\widetilde{\nabla}$.

1. **for** $k = 0, 1, \ldots, T-1$ **do**
2. Compute the stochastic gradient estimate $\widetilde{\nabla}_k$ at the current iterate $x_k$.
3. Choose the step size/learning rate $\eta_k$.
4. Update $x_{k+1}$:
$$x_{k+1} = \mathrm{prox}_{\eta_k g}\big(x_k - \eta_k \widetilde{\nabla}_k\big). \qquad (2)$$

Below we summarize several popular stochastic gradient estimators $\widetilde{\nabla}_k$.

**SGD.** Classic stochastic gradient descent (Robbins and Monro, 1951) uses the following gradient estimator at iteration $k$: sample $j_k$ uniformly at random from $\{1, \ldots, n\}$ and set

$$\widetilde{\nabla}^{\mathrm{SGD}}_k = \nabla f_{j_k}(x_k).$$

At each step, SGD uses the gradient of the sampled function, $\nabla f_{j_k}(x_k)$, as a stochastic approximation of the full gradient $\nabla f(x_k)$. It is an unbiased estimate, as $\mathbb{E}_k[\nabla f_{j_k}(x_k)] = \nabla f(x_k)$. It is also memoryless: every update of $x_{k+1}$ depends only on $x_k$ and the random variable $j_k$. The variance of the SGD estimator does not vanish as the algorithm converges.

**SAG.** To deal with the non-vanishing variance of SGD, the authors of (Roux et al., 2012; Schmidt et al., 2017) introduce the SAG gradient estimator, which uses the gradient history to decrease its variance.
With $\nabla f_i(\phi^i_0) = 0$ for $i = 1, \ldots, n$, the SAG gradient estimator is computed using the following procedure: sample $j_k$ uniformly at random from $\{1, \ldots, n\}$, set

$$\widetilde{\nabla}^{\mathrm{SAG}}_k = \tfrac{1}{n}\big(\nabla f_{j_k}(x_k) - \nabla f_{j_k}(\phi^{j_k}_k)\big) + \tfrac{1}{n}\sum_{i=1}^{n} \nabla f_i(\phi^i_k),$$

and update the gradient history:

$$\nabla f_i(\phi^i_{k+1}) = \begin{cases} \nabla f_i(x_k) & \text{if } i = j_k, \\ \nabla f_i(\phi^i_k) & \text{otherwise.} \end{cases}$$

Here, for each $i \in \{1, \ldots, n\}$, $\nabla f_i(\phi^i_k)$ is a stored gradient of $f_i$ from a previous iteration. With the help of memory, the variance of the SAG gradient estimator diminishes as the algorithm converges. Estimators that satisfy this property are known as *variance-reduced* estimators. In contrast to the SGD estimator, $\widetilde{\nabla}^{\mathrm{SAG}}_k$ is a biased estimate of $\nabla f(x_k)$.

**SAGA.** Building on (Roux et al., 2012; Schmidt et al., 2017), Defazio et al. (2014a) propose the unbiased gradient estimator SAGA, which is computed using the procedure below. Sample $j_k$ uniformly at random from $\{1, \ldots, n\}$, set

$$\widetilde{\nabla}^{\mathrm{SAGA}}_k = \nabla f_{j_k}(x_k) - \nabla f_{j_k}(\phi^{j_k}_k) + \tfrac{1}{n}\sum_{i=1}^{n} \nabla f_i(\phi^i_k),$$

and update the gradient history:

$$\nabla f_i(\phi^i_{k+1}) = \begin{cases} \nabla f_i(x_k) & \text{if } i = j_k, \\ \nabla f_i(\phi^i_k) & \text{otherwise.} \end{cases}$$

Compared to $\widetilde{\nabla}^{\mathrm{SAG}}$, the SAGA estimator gives less weight to stored gradients. With this adjustment, $\widetilde{\nabla}^{\mathrm{SAGA}}$ is unbiased while maintaining the variance-reduction property. Similar gradient estimators can be found in Point-SAGA (Defazio, 2016), Finito (Defazio et al., 2014b), MISO (Mairal, 2014), SDCA (Shalev-Shwartz and Zhang, 2013), and those in (Hofmann et al., 2015).

**SVRG.** Another popular variance-reduced estimator is SVRG (Johnson and Zhang, 2013). The SVRG gradient estimator is computed as follows: for $s = 0, \ldots, S$, compute the full gradient at the snapshot point, $\nabla f(\phi_s) = \tfrac{1}{n}\sum_{i=1}^{n} \nabla f_i(\phi_s)$; then, for $k = 1, \ldots, m$, sample $j_k$ uniformly at random from $\{1, \ldots, n\}$ and set

$$\widetilde{\nabla}^{\mathrm{SVRG}}_k = \nabla f_{j_k}(x_k) - \nabla f_{j_k}(\phi_s) + \nabla f(\phi_s),$$

where $\phi_s$ is a snapshot point updated every $m$ steps. The algorithms prox-SVRG (Xiao and Zhang, 2014), Natasha (Allen-Zhu, 2017), Katyusha (Allen-Zhu, 2018a), KatyushaX (Allen-Zhu, 2018b), Natasha2 (Allen-Zhu, 2018c), MiG (Zhou et al., 2018), ASVRG (Shang et al., 2018), and VARAG (Lan et al., 2019) use the SVRG gradient estimator.
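To fix ideas, the following is a minimal Python sketch of Algorithm 1 instantiated with the plain SGD estimator and $g = \lambda\|\cdot\|_1$, whose proximity operator is soft-thresholding. This is our own toy code, not the implementation accompanying this paper; the helper names `prox_l1` and `prox_sgd` and the least-squares test problem are our choices.

```python
import numpy as np

def prox_l1(y, t):
    """Proximal map of t*||.||_1, i.e. componentwise soft-thresholding."""
    return np.sign(y) * np.maximum(np.abs(y) - t, 0.0)

def prox_sgd(grads, x0, eta, lam, T, rng):
    """Algorithm 1 with the plain SGD estimator and g = lam*||.||_1."""
    n, x = len(grads), x0.copy()
    for _ in range(T):
        j = rng.integers(n)                    # sample j_k uniformly
        e = grads[j](x)                        # stochastic gradient estimate
        x = prox_l1(x - eta * e, eta * lam)    # proximal update (2)
    return x

# toy problem: f_i(x) = 0.5*(a_i @ x - b_i)^2
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 5))
b = A @ np.array([1.0, 0.0, 0.0, -1.0, 0.0])
grads = [lambda x, a=A[i], bi=b[i]: a * (a @ x - bi) for i in range(50)]
x = prox_sgd(grads, np.zeros(5), eta=1e-3, lam=0.1, T=20000, rng=rng)
```

Swapping the body of the loop for any of the estimators above (SAG, SAGA, SVRG, with the appropriate stored-gradient state) yields the corresponding variance-reduced method in the same framework.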
**SARAH.** In (Nguyen et al., 2017), the authors propose a recursive modification of SVRG. For $s = 0, \ldots, S$, initialize

$$\widetilde{\nabla}^{\mathrm{SARAH}}_0 = \tfrac{1}{n}\sum_{i=1}^{n} \nabla f_i(\phi_s),$$

and for $k = 1, \ldots, m$, sample $j_k$ uniformly at random from $\{1, \ldots, n\}$ and set

$$\widetilde{\nabla}^{\mathrm{SARAH}}_k = \nabla f_{j_k}(x_k) - \nabla f_{j_k}(x_{k-1}) + \widetilde{\nabla}^{\mathrm{SARAH}}_{k-1}.$$

Like SAG, SARAH is a biased gradient estimator. It is also used in prox-SARAH (Pham et al., 2019), SPIDER (Fang et al., 2018), SPIDERBoost (Wang et al., 2018), and SPIDER-M (Wang et al., 2019).

We refer to algorithms employing (un)biased gradient estimators as (un)biased stochastic algorithms, respectively. The body of work on biased algorithms is stunted compared to the enormous literature on unbiased algorithms, leaving several gaps in the development of biased stochastic gradient methods. We list a few below.

**Complex convergence proofs.** It is often difficult to analyze biased stochastic algorithms. The convergence proof of the biased algorithm SAG is especially complex, requiring computational verification (Roux et al., 2012; Schmidt et al., 2017).

**Sub-optimal convergence rates.** In the convex setting with $g \equiv 0$, SARAH achieves a complexity bound of $O(\log(1/\epsilon)/\epsilon)$ (Nguyen et al., 2017) for finding a point $\bar{x}_k$ such that $\mathbb{E}[F(\bar{x}_k) - F(x^\star)] \le \epsilon$. In comparison, SAGA and SVRG achieve a complexity bound of $O(1/\epsilon)$, which is the same as deterministic proximal gradient descent.

**Lack of proximal support.** Bias also makes it difficult to handle non-smooth functions. To the best of our knowledge, there are no theoretical guarantees for biased algorithms to solve (1) with $g \not\equiv 0$ that take advantage of convexity when it is present.

Despite the above shortcomings, there are notable exceptions that suggest biased algorithms are worth further consideration. Recently, (Pham et al., 2019; Fang et al., 2018; Wang et al., 2018, 2019) proved that algorithms using the SARAH gradient estimator require $O(\sqrt{n}/\epsilon^2)$ stochastic gradient evaluations to find an $\epsilon$-first-order stationary point.
This matches the complexity lower bound for non-convex, finite-sum optimization for smooth functions $f_i$ and $n \le O(\epsilon^{-4})$ (Fang et al., 2018). For comparison, the best complexity bound obtained for SAGA and SVRG in this setting is $O(n^{2/3}/\epsilon^2)$ (Reddi et al., 2016a; Allen-Zhu and Hazan, 2016), and this performance requires large mini-batches of size $O(n^{2/3})$.

A detailed summary of existing complexity bounds for the variance-reduced gradient estimators mentioned above is provided in Tables 1 and 2 for convex and non-convex objectives, respectively. The complexity bound for SAGA (Defazio et al., 2014a), SVRG (Johnson and Zhang, 2013), and SAG (Schmidt et al., 2017) on convex objectives is $O\big(\tfrac{n+L}{\epsilon}\big)$, which can be improved to $O((n+\kappa)\log(1/\epsilon))$ when strong convexity is present. For smooth objectives (where $g \equiv 0$), this complexity is $O\big(\tfrac{\sqrt{n}L}{\epsilon}\big)$ for SAGA and SVRG (Reddi et al., 2016a), while SAG requires $g \equiv 0$ for any convergence results. Convergence results for SARAH on convex objectives also require $g \equiv 0$, and the proven complexity is worse than similar results for SAGA, SVRG, and SAG by a logarithmic factor (Nguyen et al., 2017).

| | Convex | Convex and $g \equiv 0$ | Strongly Convex | Prox-Support? |
|---|---|---|---|---|
| SAGA | $O\big(\tfrac{n+L}{\epsilon}\big)$ | $O\big(\tfrac{\sqrt{n}L}{\epsilon}\big)^\star$ | $O((n+\kappa)\log(1/\epsilon))$ | Yes |
| SVRG | $O\big(\tfrac{n+L}{\epsilon}\big)$ | $O\big(\tfrac{\sqrt{n}L}{\epsilon}\big)^\star$ | $O((n+\kappa)\log(1/\epsilon))$ | Yes |
| SAG | None | $O\big(\tfrac{n+L}{\epsilon}\big)$ | $O((n+\kappa)\log(1/\epsilon))$ | No |
| SARAH | None | $O\big(\tfrac{n+L\log(1/\epsilon)}{\epsilon}\big)$ | $O((n+\kappa)\log(1/\epsilon))$ | No |

**Table 1:** Existing complexity bounds for stochastic gradient estimators for convex objectives. Complexities with a $\star$ represent the number of stochastic gradient oracle calls required to find an $\epsilon$-approximate stationary point (as in Definition 7). The other complexities are for finding a point satisfying $\mathbb{E}[F(\bar{x}_k) - F(x^\star)] \le \epsilon$ in the convex case and $\mathbb{E}[\|\bar{x}_k - x^\star\|^2] \le \epsilon$ in the strongly convex case. The parameter $\mu$ is the strong convexity constant, and $\kappa = L/\mu$ is the condition number.

There are several accelerated algorithms that achieve better convergence rates than those in Table 1.
SVRG++ achieves a complexity of $O(n\log(1/\epsilon) + L/\epsilon)$ on convex objectives using an epoch-doubling procedure (Allen-Zhu and Yuan, 2018). Katyusha, an accelerated variant of SVRG, has complexities of $O\big(n\log(1/\epsilon) + \sqrt{nL/\epsilon}\big)$ on convex objectives and $O\big((n + \sqrt{n\kappa})\log(1/\epsilon)\big)$ with strong convexity. Combining a variance-reduced algorithm with the Catalyst acceleration scheme produces algorithms with the same convergence rates up to logarithmic factors (Lin et al., 2015). These accelerated algorithms are not directly comparable to the non-accelerated algorithms in this paper, so we leave these rates out of Table 1. The algorithms considered in this work can be accelerated using momentum schemes as well; this is the subject of a related work (Driggs et al., 2020).

On non-convex objectives, SAGA and SVRG achieve complexities of $O\big(\tfrac{nL}{\epsilon^2}\big)$, and this rate can be improved to $O\big(\tfrac{n^{2/3}L}{\epsilon^2}\big)$ using large mini-batches of size $O(n^{2/3})$ (Reddi et al., 2016b). Although we do not consider mini-batching in this work, using large mini-batch sizes could similarly improve the presented complexities for B-SAGA and B-SVRG. SAG has not been previously analyzed in the non-convex setting, so this work presents the first convergence results for SAG in this setting as a special case of our results for B-SAGA. SARAH achieves the oracle complexity lower bound of $O\big(\tfrac{\sqrt{n}L}{\epsilon^2}\big)$ (Pham et al., 2019).

Table 3 summarizes the convergence rates provided in this work. Our results provide proximal support to biased algorithms such as SARAH and SAG for the first time, prove state-of-the-art convergence rates for all algorithms in the non-convex setting, and improve the best-known convergence rates for SARAH on convex objectives. These strong results for non-smooth, non-convex problems and biased estimators come at the cost of recovering suboptimal convergence rates for SAGA and SVRG on convex problems.
### 1.2 Contributions

This work provides three main contributions:

1. We introduce a framework for the systematic analysis of a large class of stochastic gradient methods and investigate a bias-variance tradeoff arising from our analysis. Our analysis provides proximal support to biased algorithms for the first time in the convex setting.
2. We apply our framework to derive convergence rates for SARAH and biased versions of SAGA and SVRG on convex, strongly convex, and non-convex problems.
3. We design a new recursive gradient estimator, Stochastic Average Recursive GradiEnt (SARGE), that achieves the same convergence rates as SARAH but never computes a full gradient, giving it a strictly smaller per-iteration complexity than SARAH. In particular, we show that SARGE achieves the oracle complexity lower bound for non-convex, finite-sum optimization.

| | Non-Convex, No Mini-Batching | With Mini-Batching | Prox-Support? |
|---|---|---|---|
| SAGA | $O\big(\tfrac{nL}{\epsilon^2}\big)$ | $O\big(\tfrac{n^{2/3}L}{\epsilon^2}\big)$ | Yes |
| SVRG | $O\big(\tfrac{nL}{\epsilon^2}\big)$ | $O\big(\tfrac{n^{2/3}L}{\epsilon^2}\big)$ | Yes |
| SAG | None | None | No |
| SARAH | $O\big(\tfrac{\sqrt{n}L}{\epsilon^2}\big)$ | $O\big(\tfrac{\sqrt{n}L}{\epsilon^2}\big)$ | Yes |

**Table 2:** Existing complexity bounds for stochastic gradient estimators for non-convex optimization. These complexities represent the number of stochastic gradient oracle calls required to find an $\epsilon$-approximate stationary point (as in Definition 7). Using mini-batches of size $n^{2/3}$ optimizes the complexity of SAGA and SVRG in this setting. While we do not consider mini-batching in this work, this improvement likely extends to B-SAGA and B-SVRG as well.

To study the effects of bias on the SAGA and SVRG estimators, we introduce Biased SAGA (B-SAGA) and Biased SVRG (B-SVRG). For $\theta > 0$, these two gradient estimators read

$$\text{B-SAGA:}\quad \widetilde{\nabla}^{\text{B-SAGA}}_k \stackrel{\mathrm{def}}{=} \tfrac{1}{\theta}\big(\nabla f_{j_k}(x_k) - \nabla f_{j_k}(\phi^{j_k}_k)\big) + \tfrac{1}{n}\sum_{i=1}^{n} \nabla f_i(\phi^i_k),$$

$$\text{B-SVRG:}\quad \widetilde{\nabla}^{\text{B-SVRG}}_k \stackrel{\mathrm{def}}{=} \tfrac{1}{\theta}\big(\nabla f_{j_k}(x_k) - \nabla f_{j_k}(\phi_s)\big) + \nabla f(\phi_s).$$

In both B-SAGA and B-SVRG, the bias parameter $\theta$ adjusts how much weight is given to stored gradient information. When $\theta = n$, $\widetilde{\nabla}^{\text{B-SAGA}}_k$ recovers the SAG gradient estimator, and when $\theta = 1$, it recovers the SAGA estimator.
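The role of $\theta$ can be checked numerically. The snippet below is our own toy verification (random least-squares components, exact conditional expectations computed by enumerating all $n$ choices of $j_k$): the B-SAGA estimator's bias equals $(1 - 1/\theta)$ times the gap between the full gradient and the average stored gradient, so $\theta = 1$ (SAGA) is unbiased while larger $\theta$ is not.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 10, 3
A = rng.standard_normal((n, p)); b = rng.standard_normal(n)
grad_i = lambda i, x: A[i] * (A[i] @ x - b[i])   # f_i(x) = 0.5*(a_i @ x - b_i)^2
grad_f = lambda x: np.mean([grad_i(i, x) for i in range(n)], axis=0)

x = rng.standard_normal(p)
phi = rng.standard_normal((n, p))                 # stale points phi_k^i
table = np.array([grad_i(i, phi[i]) for i in range(n)])
avg = table.mean(axis=0)

def b_saga_expectation(theta):
    """Exact E_k of the B-SAGA estimator over the uniform choice of j_k."""
    ests = [(grad_i(j, x) - table[j]) / theta + avg for j in range(n)]
    return np.mean(ests, axis=0)

for theta in (1.0, 2.0, float(n)):
    bias = grad_f(x) - b_saga_expectation(theta)
    # bias = (1 - 1/theta) * (grad f(x_k) - average of stored gradients)
    assert np.allclose(bias, (1 - 1 / theta) * (grad_f(x) - avg))
assert np.allclose(b_saga_expectation(1.0), grad_f(x))   # theta = 1: SAGA, unbiased
```

The same enumeration with the factor $1/\theta = 1/n$ reproduces the SAG bias stated in Section 3.3.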
Motivated by the desirable properties of SARAH, we propose a new gradient estimator, Stochastic Average Recursive GradiEnt (SARGE), defined as

$$\widetilde{\nabla}^{\mathrm{SARGE}}_k \stackrel{\mathrm{def}}{=} \nabla f_{j_k}(x_k) - \psi^{j_k}_k + \tfrac{1}{n}\sum_{i=1}^{n} \psi^i_k - \big(1 - \tfrac{1}{n}\big)\big(\nabla f_{j_k}(x_{k-1}) - \widetilde{\nabla}^{\mathrm{SARGE}}_{k-1}\big),$$

where the variables $\psi^i_k$ follow the update rule

$$\psi^{j_k}_{k+1} = \nabla f_{j_k}(x_k) - \big(1 - \tfrac{1}{n}\big)\nabla f_{j_k}(x_{k-1}),$$

with $\psi^i_{k+1} = \psi^i_k$ for $i \neq j_k$. Similar to SAGA, SARGE uses stored gradient information to avoid having to compute the full gradient, a computational burden that SVRG and SARAH require for variance reduction. A summary of the complexity results obtained from our analysis for SAG/B-SAGA, B-SVRG, SARAH, and SARGE is provided in Table 3. Note that the result for SAG is included in B-SAGA.

| | Convex | Strongly Convex | Non-Convex | Prox-Support? |
|---|---|---|---|---|
| B-SAGA | $O\big(\tfrac{nL}{\epsilon}\big)$ | $O(n\kappa\log(1/\epsilon))$ | $O\big(\tfrac{nL}{\epsilon^2}\big)$ | Yes |
| B-SVRG | $O\big(\tfrac{nL}{\epsilon}\big)$ | $O(n\kappa\log(1/\epsilon))$ | $O\big(\tfrac{nL}{\epsilon^2}\big)$ | Yes |
| SARAH | $O\big(\tfrac{\sqrt{n}L + \sqrt{L}\,n^{3/4}}{\epsilon}\big)$ | $O(\max\{\sqrt{n\kappa}, n\}\log(1/\epsilon))$ | $O\big(\tfrac{\sqrt{n}L}{\epsilon^2}\big)$ | Yes |
| SARGE | $O\big(\tfrac{\sqrt{n}L + \sqrt{L}\,n^{3/4}}{\epsilon}\big)$ | $O(\max\{\sqrt{n\kappa}, n\}\log(1/\epsilon))$ | $O\big(\tfrac{\sqrt{n}L}{\epsilon^2}\big)$ | Yes |

**Table 3:** Complexity bounds obtained from our analytical framework. These complexities represent the number of stochastic gradient oracle calls required to find a point $\bar{x}_k$ satisfying $\mathbb{E}[F(\bar{x}_k) - F(x^\star)] \le \epsilon$ in the convex case, $\mathbb{E}[\|\bar{x}_k - x^\star\|^2] \le \epsilon$ in the strongly convex case, and an $\epsilon$-approximate stationary point in the non-convex case. While we do not recover the best-known rates for (unbiased) SAGA and SVRG in the convex setting, our rates for B-SAGA, B-SVRG, SARAH, and SARGE are the first known for this problem class, and our rates for SARAH and SARGE are better than the best-known rates for SAGA and SVRG in the convex setting.

**Paper organization.** Preliminary results and notation are provided in Section 2. A discussion of the tradeoff between bias and variance in stochastic optimization is included in Section 3. Our main convergence results are presented in Section 4. In Section 5, we substantiate our theoretical results using numerical experiments involving several classic regression tasks arising from machine learning.
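To illustrate how SARGE avoids full gradient computations, here is a minimal Python sketch of Algorithm 1 with the SARGE estimator. This is our own illustrative code (smooth case $g = 0$, least-squares components, ad hoc step size), not the implementation used in our experiments.

```python
import numpy as np

def sarge_run(grad_i, n, p, x0, eta, T, rng):
    """Minimal SARGE loop (g = 0): no full gradient is ever computed."""
    x_prev, x = x0.copy(), x0.copy()
    psi = np.zeros((n, p))            # stored vectors psi_k^i
    e = np.zeros(p)                   # previous estimate e_{k-1}
    for _ in range(T):
        j = rng.integers(n)
        gj, gj_prev = grad_i(j, x), grad_i(j, x_prev)
        # recursive estimate built from the psi table and the previous estimate
        e = gj - psi[j] + psi.mean(axis=0) - (1 - 1 / n) * (gj_prev - e)
        psi[j] = gj - (1 - 1 / n) * gj_prev    # update only the sampled entry
        x_prev, x = x, x - eta * e
    return x

rng = np.random.default_rng(4)
n, p = 40, 4
A = rng.standard_normal((n, p)); b = A @ np.ones(p)
grad_i = lambda i, x: A[i] * (A[i] @ x - b[i])
x = sarge_run(grad_i, n, p, np.zeros(p), eta=0.01, T=6000, rng=rng)
```

Note that each iteration touches only one component gradient at two points, matching SARAH's recursion, while the $\psi$ table plays the role of SAGA's gradient memory.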
All the proofs of the main results are collected in the appendix.

## 2. Preliminaries and notations

Throughout the paper, $\mathbb{R}^p$ is a $p$-dimensional Euclidean space equipped with the scalar inner product $\langle \cdot, \cdot \rangle$ and associated norm $\|\cdot\|$. The sub-differential of a proper closed convex function $g$ is the set-valued operator defined by

$$\partial g(x) \stackrel{\mathrm{def}}{=} \big\{ v \in \mathbb{R}^p \,\big|\, g(x') \ge g(x) + \langle v, x' - x \rangle, \ \forall x' \in \mathbb{R}^p \big\},$$

and the proximity, or proximal, map of $g$ is defined as

$$\mathrm{prox}_{\eta g}(y) \stackrel{\mathrm{def}}{=} \arg\min_{x \in \mathbb{R}^p} \ \eta g(x) + \tfrac{1}{2}\|x - y\|^2, \qquad (3)$$

where $\eta > 0$ and $y \in \mathbb{R}^p$. Below we summarize some useful results in convex analysis.

**Lemma 1** (Nesterov (2004, Thm 2.1.5)) *Suppose $f$ is convex with an $L$-Lipschitz continuous gradient. For every $x, u \in \mathbb{R}^p$,*

$$\|\nabla f(x) - \nabla f(u)\|^2 \le 2L\big(f(x) - f(u) - \langle \nabla f(u), x - u \rangle\big).$$

When $f$ is a finite sum as in (1), Lemma 1 is equivalent to the following result.

**Lemma 2** *Let $f(x) = \frac{1}{n}\sum_{i=1}^n f_i(x)$, where each $f_i$ is convex with an $L$-Lipschitz gradient. Then for every $x, u \in \mathbb{R}^p$,*

$$\tfrac{1}{2Ln}\sum_{i=1}^{n} \|\nabla f_i(x) - \nabla f_i(u)\|^2 \le f(x) - f(u) - \langle \nabla f(u), x - u \rangle.$$

Lemma 2 is obtained by applying Lemma 1 to each $f_i$ and averaging.

**Lemma 3** *Suppose $g$ is $\mu$-strongly convex with $\mu \ge 0$, and let $z = \mathrm{prox}_{\eta g}(x - \eta d)$ for some $x, d \in \mathbb{R}^p$ and $\eta > 0$. Then, for any $y \in \mathbb{R}^p$,*

$$\langle \eta d, z - y \rangle \le \tfrac{1}{2}\|x - y\|^2 - \tfrac{1}{2}\|z - x\|^2 - \tfrac{1+\mu\eta}{2}\|z - y\|^2 - \eta g(z) + \eta g(y).$$

**Proof** By the strong convexity of $g$, we have

$$g(z) - g(y) \le \langle \xi, z - y \rangle - \tfrac{\mu}{2}\|z - y\|^2, \qquad \xi \in \partial g(z).$$

From the definition of the proximal operator, we have $\tfrac{1}{\eta}(x - z) - d \in \partial g(z)$. Therefore,

$$g(z) - g(y) \le \big\langle \tfrac{1}{\eta}(x - z) - d, z - y \big\rangle - \tfrac{\mu}{2}\|z - y\|^2 = -\langle d, z - y \rangle + \tfrac{1}{\eta}\langle x - z, z - y \rangle - \tfrac{\mu}{2}\|z - y\|^2.$$

Using $\langle x - z, z - y \rangle = \tfrac{1}{2}\big(\|x - y\|^2 - \|x - z\|^2 - \|z - y\|^2\big)$, multiplying by $\eta$, and rearranging yields the assertion. ∎

The next lemma is an analogue of the descent lemma for gradient descent when the gradient is replaced with an arbitrary vector $d$.

**Lemma 4** *Suppose $g$ is $\mu$-strongly convex with $\mu \ge 0$, and let $z = \mathrm{prox}_{\eta g}(x - \eta d)$. The following inequality holds for any $\lambda > 0$:*

$$0 \le \eta\big(F(x) - F(z)\big) + \tfrac{\eta}{2L\lambda}\|d - \nabla f(x)\|^2 + \Big(\tfrac{\eta L(\lambda+1) - 2 - \mu\eta}{2}\Big)\|z - x\|^2.$$

**Proof** This follows immediately from Lemma 3.
$$\begin{aligned} 0 &= \eta\langle d, x - z \rangle + \eta\langle d, z - x \rangle \\ &\overset{1}{\le} \eta\langle d, x - z \rangle - \tfrac{2+\mu\eta}{2}\|z - x\|^2 + \eta\big(g(x) - g(z)\big) \\ &= \eta\langle \nabla f(x), x - z \rangle + \eta\langle d - \nabla f(x), x - z \rangle - \tfrac{2+\mu\eta}{2}\|z - x\|^2 + \eta\big(g(x) - g(z)\big) \\ &\overset{2}{\le} \eta\big(F(x) - F(z)\big) + \eta\langle d - \nabla f(x), x - z \rangle + \Big(\tfrac{\eta L - 2 - \mu\eta}{2}\Big)\|z - x\|^2 \\ &\overset{3}{\le} \eta\big(F(x) - F(z)\big) + \tfrac{\eta}{2L\lambda}\|d - \nabla f(x)\|^2 + \Big(\tfrac{\eta L(\lambda+1) - 2 - \mu\eta}{2}\Big)\|z - x\|^2. \end{aligned}$$

Inequality 1 is due to Lemma 3 with $y = x$, inequality 2 is due to the Lipschitz continuity of $\nabla f$, and inequality 3 is Young's inequality. ∎

The previous two lemmas require $g$ to be convex. Similar results hold in the non-convex case as well.

**Lemma 5** *Let $z \in \mathrm{prox}_{\eta g}(x - \eta d)$ for some $x, d \in \mathbb{R}^p$ and $\eta > 0$. Then, for any $y \in \mathbb{R}^p$,*

$$\langle \eta d, z - y \rangle \le \tfrac{1}{2}\|x - y\|^2 - \tfrac{1}{2}\|z - x\|^2 - \eta g(z) + \eta g(y).$$

**Proof** By the definition of $z$,

$$z \in \arg\min_{v} \ \tfrac{1}{2\eta}\|v - x + \eta d\|^2 + g(v).$$

Comparing the objective values at $z$ and at $v = y$ gives

$$g(z) - g(y) \le \langle d, y - z \rangle + \tfrac{1}{2\eta}\big(\|x - y\|^2 - \|x - z\|^2\big).$$

Multiplying by $\eta$ completes the proof. ∎

**Lemma 6** *Let $z \in \mathrm{prox}_{\eta g}(x - \eta d)$. Then*

$$F(z) \le F(y) + \langle \nabla f(x) - d, z - y \rangle + \Big(\tfrac{L}{2} - \tfrac{1}{2\eta}\Big)\|x - z\|^2 + \Big(\tfrac{L}{2} + \tfrac{1}{2\eta}\Big)\|x - y\|^2.$$

**Proof** By the Lipschitz continuity of $\nabla f$, we have the inequalities

$$f(x) - f(y) \le \langle \nabla f(x), x - y \rangle + \tfrac{L}{2}\|x - y\|^2, \qquad f(z) - f(x) \le \langle \nabla f(x), z - x \rangle + \tfrac{L}{2}\|z - x\|^2.$$

Furthermore, by Lemma 5,

$$g(z) - g(y) \le \langle d, y - z \rangle + \tfrac{1}{2\eta}\big(\|x - y\|^2 - \|x - z\|^2\big).$$

Adding these inequalities together completes the proof. ∎

In the non-convex setting, to measure convergence of the sequence to a first-order stationary point, we use the notion of a generalized gradient (Nesterov, 2004).

**Definition 7** (generalized gradient map) *For any $\eta > 0$, the generalized gradient map is defined as*

$$\mathcal{G}_\eta(x_k) \stackrel{\mathrm{def}}{=} \tfrac{1}{\eta}\big(x_k - \mathrm{prox}_{\eta g}(x_k - \eta \nabla f(x_k))\big).$$

A point $x$ satisfying $\|\mathcal{G}_\eta(x)\| = 0$ is a first-order stationary point of $f + g$, and an $\epsilon$-first-order stationary point is a point satisfying $\|\mathcal{G}_\eta(x)\| \le \epsilon$. When $g \equiv 0$, we have $\mathcal{G}_{\eta_k}(x_k) = \nabla f(x_k) \to 0$ if the sequence $\{x_k\}$ converges to some $x^\star \in \mathbb{R}^p$ such that $\nabla f(x^\star) = 0$. For nontrivial $g$, suppose $\inf_k \eta_k > 0$ and $x_k$ converges to some $x^\star$ such that $x^\star \in \mathrm{prox}_{\eta g}(x^\star - \eta \nabla f(x^\star))$; then $\mathcal{G}_{\eta_k}(x_k) \to 0$.

## 3. A bias-variance tradeoff in stochastic gradient methods

In this section, we discuss the effect of the bias and variance of a stochastic gradient estimator on the performance of Algorithm 1.
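The generalized gradient map of Definition 7 is straightforward to evaluate in code. The sketch below is our own illustration, using $g = \lambda\|\cdot\|_1$ (whose proximal map is soft-thresholding) and a one-dimensional composite objective whose minimizer is known in closed form, so we can check that $\mathcal{G}_\eta$ vanishes exactly there.

```python
import numpy as np

def prox_l1(y, t):
    """Proximal map of t*||.||_1 (soft-thresholding)."""
    return np.sign(y) * np.maximum(np.abs(y) - t, 0.0)

def gen_grad(x, grad_f, eta, lam):
    """G_eta(x) = (x - prox_{eta g}(x - eta * grad f(x))) / eta, g = lam*||.||_1."""
    return (x - prox_l1(x - eta * grad_f(x), eta * lam)) / eta

# F(x) = 0.5*(x - 3)^2 + lam*|x| in 1-D: the minimizer is x* = 3 - lam for lam < 3
lam, eta = 0.5, 0.7
grad_f = lambda x: x - 3.0
x_star = np.array([3.0 - lam])
assert np.allclose(gen_grad(x_star, grad_f, eta, lam), 0.0)   # stationary point
assert np.linalg.norm(gen_grad(np.array([0.0]), grad_f, eta, lam)) > 0
```

The same two-line `gen_grad` works verbatim in $\mathbb{R}^p$, since the soft-thresholding prox acts componentwise.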
It is elementary that the mean-squared error (MSE) of a stochastic estimator can be decomposed into the sum of its variance and squared bias. In our setting,

$$\mathbb{E}_k\big[\|\widetilde{\nabla}_k - \nabla f(x_k)\|^2\big] = \big\|\mathbb{E}_k[\widetilde{\nabla}_k] - \nabla f(x_k)\big\|^2 + \mathbb{E}_k\big[\|\widetilde{\nabla}_k - \mathbb{E}_k[\widetilde{\nabla}_k]\|^2\big].$$

This decomposition shows that a biased estimator might have a smaller MSE than an unbiased estimator as long as the bias sufficiently diminishes the variance. This is the bias-variance tradeoff. As we see below, a bias-variance tradeoff exists in our analysis of stochastic gradient methods, but with a slightly different form. In what follows, we first discuss the bias-variance tradeoff in the convex settings and then the non-convex setting.

### 3.1 The convex case

Let $x^\star$ be a global minimizer of problem (1). From the update (2), let $w_{k+1} \in \partial g(x_{k+1})$. We have the following bound on the suboptimality at $x_{k+1}$:

$$\begin{aligned} \mathbb{E}_k[F(x_{k+1}) - F(x^\star)] &= \mathbb{E}_k\big[f(x_{k+1}) - f(x_k) + f(x_k) - f(x^\star) + g(x_{k+1}) - g(x^\star)\big] \\ &\overset{1}{\le} \tfrac{L}{2}\mathbb{E}_k[\|x_{k+1} - x_k\|^2] + \mathbb{E}_k[\langle \nabla f(x_k), x_{k+1} - x_k \rangle] + \langle \nabla f(x_k), x_k - x^\star \rangle + \mathbb{E}_k[g(x_{k+1}) - g(x^\star)] \\ &= \tfrac{L}{2}\mathbb{E}_k[\|x_{k+1} - x_k\|^2] + \mathbb{E}_k[\langle \nabla f(x_k) - \widetilde{\nabla}_k, x_{k+1} - x_k \rangle] + \mathbb{E}_k[\langle \nabla f(x_k) - \widetilde{\nabla}_k, x_k - x^\star \rangle] \\ &\qquad + \mathbb{E}_k[\langle \widetilde{\nabla}_k, x_{k+1} - x^\star \rangle] + \mathbb{E}_k[g(x_{k+1}) - g(x^\star)] \\ &\overset{2}{\le} \tfrac{\eta}{2}\mathbb{E}_k[\|\nabla f(x_k) - \widetilde{\nabla}_k\|^2] + \big(\tfrac{L}{2} + \tfrac{1}{2\eta}\big)\mathbb{E}_k[\|x_{k+1} - x_k\|^2] + \mathbb{E}_k[\langle \nabla f(x_k) - \widetilde{\nabla}_k, x_k - x^\star \rangle] \\ &\qquad + \mathbb{E}_k\big[\langle \widetilde{\nabla}_k + w_{k+1}, x_{k+1} - x^\star \rangle - \tfrac{\mu}{2}\|x_{k+1} - x^\star\|^2\big] \\ &\overset{3}{\le} \tfrac{\eta}{2}\mathbb{E}_k[\|\nabla f(x_k) - \widetilde{\nabla}_k\|^2] + \tfrac{L}{2}\mathbb{E}_k[\|x_{k+1} - x_k\|^2] + \mathbb{E}_k[\langle \nabla f(x_k) - \widetilde{\nabla}_k, x_k - x^\star \rangle] \\ &\qquad - \tfrac{1+\mu\eta}{2\eta}\mathbb{E}_k[\|x_{k+1} - x^\star\|^2] + \tfrac{1}{2\eta}\|x_k - x^\star\|^2. \qquad (4) \end{aligned}$$

Inequality 1 follows from the convexity of $f$ and the Lipschitz continuity of $\nabla f$, inequality 2 follows from the (strong) convexity of $g$ together with Young's inequality, and inequality 3 comes from the implicit definition of the proximal operator (3). For the last line of (4), we observe that the inner product term $\mathbb{E}_k[\langle \nabla f(x_k) - \widetilde{\nabla}_k, x_k - x^\star \rangle]$ vanishes when $\widetilde{\nabla}_k$ is an unbiased estimate of $\nabla f(x_k)$. When the estimator is biased, we must develop a new way to control this term, together with $\mathbb{E}_k[\|\nabla f(x_k) - \widetilde{\nabla}_k\|^2]$. Hence, the following terms arise in our convergence analysis from the bias and the variance of the gradient estimator:

$$\text{Bias:}\ \ \mathbb{E}_k[\langle \nabla f(x_k) - \widetilde{\nabla}_k, x_k - x^\star \rangle] \ \text{ and } \ \|\mathbb{E}_k[\widetilde{\nabla}_k] - \nabla f(x_k)\|^2, \qquad \text{Variance:}\ \ \mathbb{E}_k[\|\widetilde{\nabla}_k - \mathbb{E}_k[\widetilde{\nabla}_k]\|^2]. \qquad (5)$$
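The MSE decomposition above can be verified numerically to machine precision. The script below is our own toy check: for a B-SAGA-type estimator with random data, it computes the exact conditional mean, variance, and MSE by enumerating all $n$ equally likely outcomes of $j_k$.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 12, 3
A = rng.standard_normal((n, p)); b = rng.standard_normal(n)
grad_i = lambda i, x: A[i] * (A[i] @ x - b[i])
grad_f = lambda x: np.mean([grad_i(i, x) for i in range(n)], axis=0)

x = rng.standard_normal(p)
phi = rng.standard_normal((n, p))
table = np.array([grad_i(i, phi[i]) for i in range(n)])
avg = table.mean(axis=0)
theta = float(n)                      # SAG-like bias parameter

# all n possible realizations of the estimator, each with probability 1/n
ests = np.array([(grad_i(j, x) - table[j]) / theta + avg for j in range(n)])
mean = ests.mean(axis=0)
mse = np.mean(np.sum((ests - grad_f(x)) ** 2, axis=1))
bias_sq = np.sum((mean - grad_f(x)) ** 2)
var = np.mean(np.sum((ests - mean) ** 2, axis=1))
assert np.isclose(mse, bias_sq + var)   # MSE = squared bias + variance
```

Shrinking the stochastic correction by $1/\theta$ shrinks the variance term while introducing a nonzero `bias_sq`, which is precisely the tradeoff discussed in this section.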
An effective gradient estimator minimizes all three of these terms simultaneously. As we later show in our MSE and bias-term bounds, SARAH and SARGE minimize these terms more effectively than the biased SAGA and SVRG estimators, leading to better convergence rates. We provide an explicit comparison between SARAH and SVRG in Appendix G.

**Remark 8** (Non-composite case $g \equiv 0$) When $g \equiv 0$, for gradient descent, the descent property of $f$ yields

$$f(x_{k+1}) - f(x^\star) \le \big(\tfrac{L}{2} - \tfrac{1}{\eta}\big)\|x_{k+1} - x_k\|^2 + f(x_k) - f(x^\star),$$

where $\eta \le 2/L$. For stochastic gradient descent, we obtain the following relationship:

$$\begin{aligned} \mathbb{E}_k[f(x_{k+1}) - f(x^\star)] &= \mathbb{E}_k[f(x_{k+1}) - f(x_k) + f(x_k) - f(x^\star)] \\ &\le \mathbb{E}_k[\langle \nabla f(x_k) - \widetilde{\nabla}_k, x_{k+1} - x_k \rangle] + \big(\tfrac{L}{2} - \tfrac{1}{\eta}\big)\mathbb{E}_k[\|x_{k+1} - x_k\|^2] + f(x_k) - f(x^\star) \\ &\le \tfrac{\eta}{2}\mathbb{E}_k[\|\nabla f(x_k) - \widetilde{\nabla}_k\|^2] + \big(\tfrac{L}{2} - \tfrac{1}{2\eta}\big)\mathbb{E}_k[\|x_{k+1} - x_k\|^2] + f(x_k) - f(x^\star). \qquad (6) \end{aligned}$$

Compared to (4), there is no inner product term in (6), which makes the analysis of the non-composite case much simpler. This is one reason why biased algorithms have been successfully studied in the non-composite setting, but not in the composite setting.

### 3.2 The non-convex case

The influence of bias is simpler in the non-convex setting and independent of $g$, which explains why biased algorithms have recently found success on these problems. To begin, let $\hat{x}_{k+1} \in \mathrm{prox}_{\eta g/2}\big(x_k - \tfrac{\eta}{2}\nabla f(x_k)\big)$. Applying Lemma 6 with $z = \hat{x}_{k+1}$, $y = x = x_k$, and $d = \nabla f(x_k)$, we have

$$F(\hat{x}_{k+1}) \le F(x_k) + \big(\tfrac{L}{2} - \tfrac{1}{\eta}\big)\|\hat{x}_{k+1} - x_k\|^2.$$

Again applying Lemma 6, now with $z = x_{k+1}$, $y = \hat{x}_{k+1}$, $x = x_k$, and $d = \widetilde{\nabla}_k$, we obtain

$$F(x_{k+1}) \le F(\hat{x}_{k+1}) + \langle \nabla f(x_k) - \widetilde{\nabla}_k, x_{k+1} - \hat{x}_{k+1} \rangle + \big(\tfrac{L}{2} - \tfrac{1}{2\eta}\big)\|x_{k+1} - x_k\|^2 + \big(\tfrac{L}{2} + \tfrac{1}{2\eta}\big)\|\hat{x}_{k+1} - x_k\|^2.$$

Adding these two inequalities together gives

$$\begin{aligned} F(x_{k+1}) &\le F(x_k) + \big(L - \tfrac{1}{2\eta}\big)\|\hat{x}_{k+1} - x_k\|^2 + \big(\tfrac{L}{2} - \tfrac{1}{2\eta}\big)\|x_{k+1} - x_k\|^2 + \langle \nabla f(x_k) - \widetilde{\nabla}_k, x_{k+1} - \hat{x}_{k+1} \rangle \\ &\overset{1}{\le} F(x_k) + \big(L - \tfrac{1}{2\eta}\big)\|\hat{x}_{k+1} - x_k\|^2 + \big(\tfrac{L}{2} - \tfrac{1}{2\eta}\big)\|x_{k+1} - x_k\|^2 + 2\eta\|\nabla f(x_k) - \widetilde{\nabla}_k\|^2 + \tfrac{1}{8\eta}\|\hat{x}_{k+1} - x_{k+1}\|^2 \\ &\overset{2}{\le} F(x_k) + \big(L - \tfrac{1}{4\eta}\big)\|\hat{x}_{k+1} - x_k\|^2 + \big(\tfrac{L}{2} - \tfrac{1}{4\eta}\big)\|x_{k+1} - x_k\|^2 + 2\eta\|\nabla f(x_k) - \widetilde{\nabla}_k\|^2. \qquad (7) \end{aligned}$$

Inequality 1 is Young's, and inequality 2 is the standard inequality $\|a - c\|^2 \le 2\|a - b\|^2 + 2\|b - c\|^2$.
In the non-convex case, the inner-product bias term does not appear, so the bias-variance tradeoff is the classical one.

### 3.3 General bounds on bias and variance

To ensure convergence of Algorithm 1 using a particular gradient estimator, we must bound the inner-product bias term, $\mathbb{E}_k[\langle \nabla f(x_k) - \widetilde{\nabla}_k, x_k - x^\star \rangle]$, and the MSE, $\mathbb{E}_k[\|\nabla f(x_k) - \widetilde{\nabla}_k\|^2]$. Below we introduce general bounds on these terms that allow us to establish convergence rates for a variety of gradient estimators. The first of these is a bound on the MSE term.

**Definition 9** (Bounded MSE) *The stochastic gradient estimator $\widetilde{\nabla}$ is said to satisfy the $\mathrm{BMSE}(M_1, M_2, \rho_M, \rho_F, m)$ property with parameters $M_1, M_2 \ge 0$, $\rho_M, \rho_F \in (0, 1]$, and $m \ge 1$ if there exist sequences $\mathcal{M}_k$ and $\mathcal{F}_k$ such that*

$$\sum_{k=ms}^{m(s+1)-1} \mathbb{E}\big[\|\widetilde{\nabla}_k - \nabla f(x_k)\|^2\big] \le \mathcal{M}_{ms},$$

*and the following bounds hold:*

$$\mathcal{M}_{ms} \le (1 - \rho_M)^m \mathcal{M}_{m(s-1)} + \mathcal{F}_{ms} + \frac{M_1}{n} \sum_{k=ms}^{m(s+1)-1} \sum_{i=1}^{n} \mathbb{E}\big[\|\nabla f_i(x_{k+1}) - \nabla f_i(x_k)\|^2\big],$$

$$\mathcal{F}_{ms} \le \sum_{\ell=0}^{s} M_2 (1 - \rho_F)^{m(s-\ell)} \frac{1}{n} \sum_{k=m\ell}^{m(\ell+1)-1} \sum_{i=1}^{n} \mathbb{E}\big[\|\nabla f_i(x_{k+1}) - \nabla f_i(x_k)\|^2\big].$$

The constant $m$ is the epoch length of the gradient estimator, hence it is usually set to be $O(n)$. This property is useful in convergence analyses because it bounds the MSE by a geometrically decaying sequence $\{\mathcal{M}_{mk}\}_{k \in \mathbb{N}}$ and a component that is proportional to the one-iteration progress of gradient descent ($\tfrac{1}{n}\sum_{i=1}^n \|\nabla f_i(x_{k+1}) - \nabla f_i(x_k)\|^2$). Most variance-reduced stochastic gradient estimators satisfy the BMSE property, including SAG, SAGA, SVRG, SARAH, and all the estimators considered by Hofmann et al. (2015). SGD does not satisfy this property, as its variance does not decay along the iterations.

Most existing work on the analysis of general stochastic gradient algorithms enforces bounds of this form on either the MSE or the moments of the stochastic estimator, with the crucial difference that existing works require the bounds to be Markovian (that is, dependent only on the previous iteration) (Bottou et al., 2018). In contrast, the BMSE property allows non-Markovian MSE bounds through the sequence $\mathcal{F}_k$.
This relaxation is crucial for the analysis of our new gradient estimator, SARGE.

In order to bound the inner-product bias term, we require the gradient estimator to admit a certain structure in its bias. In biased estimators such as SAG, the bias depends on the stored gradient values:

$$\nabla f(x_k) - \mathbb{E}_k\big[\widetilde{\nabla}^{\mathrm{SAG}}_k\big] = \big(1 - \tfrac{1}{n}\big)\Big(\nabla f(x_k) - \tfrac{1}{n}\sum_{i=1}^{n} \nabla f_i(\phi^i_k)\Big).$$

We call estimators whose bias admits the above structure *memory-biased* gradient estimators. These include SAG, and more generally B-SAGA and B-SVRG.

**Definition 11** (Memory-biased gradient estimator) *The stochastic gradient estimator $\widetilde{\nabla}$ is memory-biased with parameters $\theta > 0$, $B_1 \ge 0$, and $m \ge 1$ if*

$$\nabla f(x_k) - \mathbb{E}_k[\widetilde{\nabla}_k] = \big(1 - \tfrac{1}{\theta}\big)\Big(\nabla f(x_k) - \tfrac{1}{n}\sum_{i=1}^{n} \nabla f_i(\phi^i_k)\Big)$$

*for some $\{\phi^i_k\}_{i=1}^n \subset \{x_\ell\}_{\ell=0}^{k-1}$, and for any $s \in \mathbb{N}_0$,*

$$\sum_{k=ms}^{m(s+1)-1} \frac{1}{n}\sum_{i=1}^{n} \mathbb{E}\big[\|x_k - \phi^i_k\|^2\big] \le B_1 \sum_{k=ms}^{m(s+1)-1} \mathbb{E}\big[\|x_k - x_{k-1}\|^2\big]. \qquad (8)$$

B-SAGA is clearly a memory-biased estimator, and so is B-SVRG, where $\phi^i_k = \phi^i_{ms}$ for all $k$ in epoch $s$. The parameter $\theta$ controls the amount of bias in the estimator, and $B_1$, in a sense, measures how stale the stored gradient information is. For memory-biased gradient estimators, the bias term can be bounded by terms of the form $\|x_k - \phi^i_k\|^2$.

**Lemma 12** *Suppose $\widetilde{\nabla}$ is memory-biased with parameter $\theta \ge 1$ and that $F$ is $\mu$-strongly convex with $\mu \ge 0$. For any $\lambda > 0$, the following inequality holds:*

$$\begin{aligned} \eta\,\mathbb{E}_k[F(x_{k+1}) - F(x^\star)] &\le \tfrac{\eta}{2L\lambda}\mathbb{E}_k\big[\|\widetilde{\nabla}_k - \nabla f(x_k)\|^2\big] - \tfrac{1+\mu\eta}{2}\mathbb{E}_k\big[\|x_{k+1} - x^\star\|^2\big] + \tfrac{1}{2}\|x_k - x^\star\|^2 \\ &\quad + \Big(\tfrac{\eta L(\lambda+1) - 1}{2}\Big)\mathbb{E}_k\big[\|x_{k+1} - x_k\|^2\big] + \eta L\big(1 - \tfrac{1}{\theta}\big)\tfrac{1}{n}\sum_{i=1}^{n} \|x_k - \phi^i_k\|^2. \end{aligned}$$

The proof of Lemma 12 can be found in Appendix A. The bound of Lemma 12 is analogous to the bound in (4), but the inner-product bias term is replaced with $\eta L (1 - \tfrac{1}{\theta}) \tfrac{1}{n}\sum_{i=1}^n \|x_k - \phi^i_k\|^2$. This term is proportional to the progress of gradient descent (by (8)), so this provides the necessary control over the inner-product bias term.

**Remark 13** Lemma 12 requires $\theta \ge 1$, so the rates we derive for the convex setting hold only for $\theta \ge 1$. However, the convergence rates we prove in Section 4.1.2 for the non-convex setting, which allow $\theta \in (0, 1)$, also hold for convex problems.
In summary, our results guarantee convergence for all $\theta > 0$, but they suggest different rates for the parameter settings $\theta < 1$ and $\theta \ge 1$.

For estimators such as SARAH, the bias depends on the error in the previous gradient estimate rather than on previous stochastic gradients:

$$\nabla f(x_k) - \mathbb{E}_k\big[\widetilde{\nabla}^{\mathrm{SARAH}}_k\big] = \nabla f(x_{k-1}) - \widetilde{\nabla}^{\mathrm{SARAH}}_{k-1}.$$

We refer to estimators of this type as *recursively biased*.

**Definition 14** (Recursively biased gradient estimator) *For any sequence $\{x_k\}$, let $\widetilde{\nabla}_k$ be a stochastic gradient estimator generated from the points $\{x_\ell\}_{\ell=0}^{k}$. This estimator is recursively biased with parameters $\rho_B \in (0, 1]$ and $\nu \ge 1$ if*

$$\nabla f(x_k) - \mathbb{E}_k[\widetilde{\nabla}_k] = \begin{cases} 0 & \text{for } k \in \nu\mathbb{N}_0, \\ (1 - \rho_B)\big(\nabla f(x_{k-1}) - \widetilde{\nabla}_{k-1}\big) & \text{otherwise.} \end{cases}$$

The parameter $\nu$ represents how many steps occur between full gradient evaluations. For SARGE, $\nu = \infty$ because the full gradient is never computed. Recursively biased estimators admit a bound on the inner-product bias term that involves the estimator's MSE.

**Lemma 15** *Suppose $\widetilde{\nabla}$ is a recursively biased gradient estimator with parameters $\nu \ge 1$ and $\rho_B \in (0, 1]$. Then, for any epoch $s \in \mathbb{N} \cup \{0\}$ and any $\epsilon > 0$,*

$$\sum_{k=\nu s+1}^{\nu(s+1)-1} \big|\mathbb{E}\langle \nabla f(x_{k-1}) - \widetilde{\nabla}_{k-1}, x_k - x^\star \rangle\big| \le \min\Big\{\nu, \tfrac{1}{\rho_B}\Big\} \sum_{k=\nu s}^{\nu(s+1)-1} \mathbb{E}\Big[\tfrac{\epsilon}{2}\|\nabla f(x_k) - \widetilde{\nabla}_k\|^2 + \tfrac{1}{2\epsilon}\|x_{k+1} - x_k\|^2\Big].$$

The proof of Lemma 15 is in Appendix B. Lemma 15 shows that, for recursively biased estimators, the inner-product bias term $\langle \nabla f(x_{k-1}) - \widetilde{\nabla}_{k-1}, x_k - x^\star \rangle$ is bounded from above by the MSE, implying that introducing bias to decrease the MSE is a reasonable approach to designing improved gradient estimators.

**Remark 16** When $\nu = \infty$, which is the case for SARGE, there is only a single epoch, $s = 0$. In this case, we adopt the convention $\infty \cdot 0 = 0$, so that the sums appearing in Lemma 15 are well-defined.

## 4. Convergence rates

In this section, we analyze the convergence rates of the stochastic gradient methods. We first provide very general convergence rates based on the bounds from the last section. Then, we specialize the results to specific gradient estimators, including the memory-biased B-SAGA and B-SVRG, and the recursively biased SARAH and SARGE.
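The recursive bias structure of SARAH can be verified exactly for a single inner step: conditioning on the history $(x_{k-1}, x_k, \widetilde{\nabla}_{k-1})$ and averaging over all $n$ choices of $j_k$, the bias at $x_k$ equals $\nabla f(x_{k-1}) - \widetilde{\nabla}_{k-1}$. The snippet below is our own toy verification of this identity on random data.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 12, 3
A = rng.standard_normal((n, p)); b = rng.standard_normal(n)
grad_i = lambda i, x: A[i] * (A[i] @ x - b[i])   # f_i(x) = 0.5*(a_i @ x - b_i)^2
grad_f = lambda x: np.mean([grad_i(i, x) for i in range(n)], axis=0)

# one SARAH inner step from an arbitrary history (x_prev, e_prev)
x_prev, x = rng.standard_normal(p), rng.standard_normal(p)
e_prev = rng.standard_normal(p)
ests = np.array([grad_i(j, x) - grad_i(j, x_prev) + e_prev for j in range(n)])
bias = grad_f(x) - ests.mean(axis=0)
# recursive bias: grad f(x_{k-1}) - estimate_{k-1}, i.e. rho_B = 0 within an epoch
assert np.allclose(bias, grad_f(x_prev) - e_prev)
```

At the start of an epoch, $\widetilde{\nabla}_0$ is the exact full gradient, so the bias is zero there, matching the $k \in \nu\mathbb{N}_0$ case of Definition 14.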
### 4.1 General convergence rates

For Algorithm 1, we consider a constant step size $\eta_k \equiv \eta > 0$. Given $T$ iterations of Algorithm 1, define the average iterate $\bar{x}_T \stackrel{\mathrm{def}}{=} \tfrac{1}{T}\sum_{k=1}^{T} x_k$.

#### 4.1.1 Convex and strongly convex cases

The following theorem establishes convergence rates for memory-biased estimators in the convex regime.

**Theorem 17** (Memory-biased estimators) *Let $\widetilde{\nabla}$ be a memory-biased gradient estimator parameterized by $\theta \ge 1$ and $B_1 \ge 0$, which satisfies the $\mathrm{BMSE}(M_1, M_2, \rho_M, \rho_F, m)$ property. Let $\Theta = \tfrac{M_1\rho_F + 2M_2}{\rho_M\rho_F}$ and $\rho = \min\{\rho_M, \rho_F\}$. In the convex setting, let $\eta = \tfrac{1}{L(1 + 3\sqrt{2\Theta})}$. Then*

$$\mathbb{E}[F(\bar{x}_T) - F(x^\star)] \le \frac{1}{T}\Big(\tfrac{1}{2\eta}\|x_0 - x^\star\|^2 + \max\big\{B_1(1 - 1/\theta)\sqrt{2\Theta} - 1, 0\big\}\big(F(x_0) - F(x^\star)\big)\Big).$$

*When $g$ is additionally $\mu$-strongly convex with $\mu > 0$, let $\eta = \min\Big\{\tfrac{1}{3L(1 + 3\sqrt{2\Theta\,B_1\mu(1 - 1/\theta)})}, \tfrac{\rho}{2\mu}\Big\}$. The iterate $\bar{x}_T$ satisfies*

$$\mathbb{E}[\|\bar{x}_T - x^\star\|^2] \le (1 + \mu\eta)^{-T}\Big(\tfrac{2}{\mu}\big(F(x_0) - F(x^\star)\big) + \|x_0 - x^\star\|^2\Big).$$

The proof of Theorem 17 is provided in Appendix A. The next result establishes convergence rates for recursively biased gradient estimators; its proof is in Appendix B.

**Theorem 18** (Recursively biased estimators) *Let $\widetilde{\nabla}$ be a recursively biased gradient estimator parameterized by $\rho_B \in (0, 1)$ and $\nu \ge 1$, which satisfies the $\mathrm{BMSE}(M_1, M_2, \rho_M, \rho_F, m)$ property. Let $B_2 \stackrel{\mathrm{def}}{=} \min\{\nu, 1/\rho_B\}$, $\Theta = \tfrac{M_1\rho_F + 2M_2}{\rho_M\rho_F}$, and $\rho = \min\{\rho_M, \rho_F\}$. In the convex setting, let*

$$\eta = \frac{1}{2\Theta L\big(1 + \tfrac{(1-\rho_B)B_2}{1+\delta}\big) + L}, \qquad \delta = \max\Big\{\sqrt{L\,\Theta^{1/2}(1 - \rho_B)B_2} - 1, 0\Big\}.$$

*Then*

$$\mathbb{E}[F(\bar{x}_T) - F(x^\star)] \le \frac{1}{T}\Big(\tfrac{1}{2\eta}\|x_0 - x^\star\|^2 + \delta\big(F(x_0) - F(x^\star)\big)\Big).$$

*When $g$ is additionally $\mu$-strongly convex with $\mu > 0$, let $\eta = \min\Big\{\tfrac{1}{3L(4\sqrt{\Theta} + 1)}, \tfrac{1}{\mu(1 - \rho_B)B_2}, \tfrac{\rho}{2\mu}\Big\}$. Then*

$$\mathbb{E}[\|\bar{x}_T - x^\star\|^2] \le (1 + \mu\eta)^{-T}\Big(\tfrac{2}{\mu}\big(F(x_0) - F(x^\star)\big) + \|x_0 - x^\star\|^2\Big).$$

Both theorems hold true for smaller $\eta$ as well; the choices in the theorems are the largest ones allowed by our analysis. For B-SAGA and B-SVRG, $\Theta = O(n^2)$, while for SARAH and SARGE, $\Theta = O(n)$. This gives these recursive gradient estimators improved convergence rates and suggests that the bias in these estimators is more effective than the bias in SAGA and SVRG.
4.1.2 Non-convex case

The analysis of biased gradient estimators is simpler in the non-convex setting than in the convex ones due to the absence of the inner-product bias term in (7). Below we provide a uniform convergence guarantee for all gradient estimators satisfying the BMSE property, regardless of their bias. This suggests that in the non-convex setting, a large-bias, small-MSE gradient estimator is preferable to an estimator with small bias and large MSE.

Theorem 20 Let e be a gradient estimator that satisfies the BMSE$(M_1, M_2, \rho_M, \rho_F, m)$ property, let $\Theta = \frac{M_1\rho_F + 2M_2}{\rho_M\rho_F}$, and let α be chosen uniformly at random from the set {0, 1, ..., T − 1}. If F is non-convex, set $\eta = \frac{\sqrt{16\Theta+1}-1}{16L\Theta}$ in Algorithm 1; the point $x_\alpha$ then satisfies the following bound on its generalized gradient:
$$\mathbb{E}[\|\mathcal{G}_{\eta/2}(x_\alpha)\|^2] \le \frac{16\big(F(x_0) - F(x^*)\big)}{T\eta(1 - 4\eta L)}.$$

The proof of this result is provided in Appendix C.

Remark 21 The convergence result of Theorem 20 depends on the bias only through the MSE of the gradient estimator, which implies that incorporating arbitrary amounts of bias in exchange for a smaller MSE improves the convergence rate. This fact is what allows the recursively biased estimators SARAH and SARGE to achieve the oracle complexity lower bound for non-convex optimization when they are used in Algorithm 1.

4.2 Convergence rates for specific gradient estimators

In this section, we specialize the general convergence rates to analyze the performance of B-SAGA, B-SVRG, SARAH, and SARGE.

4.2.1 Biased SAGA and SVRG

B-SAGA and B-SVRG are examples of memory-biased gradient estimators, as their biases take the form
$$\nabla f(x_k) - \mathbb{E}_k[e_k] = \Big(1 - \frac{1}{\theta}\Big)\Big(\nabla f(x_k) - \frac{1}{n}\sum_{i=1}^n \nabla f_i(\phi^k_i)\Big),$$
for some previous iterates $\phi^k_i$. To establish convergence rates for B-SAGA and B-SVRG, we only need to show that these estimators satisfy the BMSE property with suitable constants.
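As an illustration (toy data and helper names of our choosing, not the paper's experiments), the sketch below forms the B-SAGA estimator $e_k = \frac{1}{\theta}(\nabla f_{j_k}(x_k) - \nabla f_{j_k}(\phi^k_{j_k})) + \frac{1}{n}\sum_i \nabla f_i(\phi^k_i)$ and verifies the memory-bias identity $\nabla f(x_k) - \mathbb{E}_k[e_k] = (1 - 1/\theta)(\nabla f(x_k) - \frac{1}{n}\sum_i \nabla f_i(\phi^k_i))$ numerically by averaging over the sampled index:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 8, 3
H = rng.standard_normal((n, p))
y = rng.standard_normal(n)

def grad_i(i, x):
    # gradient of f_i(x) = 0.5 * (h_i^T x - y_i)^2
    return (H[i] @ x - y[i]) * H[i]

def full_grad(x):
    return H.T @ (H @ x - y) / n

theta = 10.0  # bias parameter; theta = 1 recovers plain (unbiased) SAGA
x = rng.standard_normal(p)
# stored gradients evaluated at past points phi_i (here: arbitrary points)
table = np.stack([grad_i(i, rng.standard_normal(p)) for i in range(n)])
avg = table.mean(axis=0)

# expectation of the B-SAGA estimator over the uniformly random index j
exp_e = np.mean([(grad_i(j, x) - table[j]) / theta for j in range(n)], axis=0) + avg

# memory-bias identity: grad f(x) - E[e] = (1 - 1/theta)(grad f(x) - avg)
bias = full_grad(x) - exp_e
assert np.allclose(bias, (1 - 1 / theta) * (full_grad(x) - avg))
```

Dividing the gradient difference by θ > 1 shrinks the variance of the random part at the price of the bias shown in the assertion, which is exactly the trade-off the BMSE analysis quantifies.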
Lemma 22 The B-SAGA gradient estimator is memory-biased with $B_1 = 2n(2n+1)$, and it satisfies the BMSE property with parameters $\rho_M = \frac{1}{2n}$, $m = 1$, $M_2 = 0$, $\rho_F = 1$, and
$$M_1 = \begin{cases} \frac{2n+1}{\theta^2} & \theta \in (0, 2], \\ (2n+1)\big(1 - \frac{1}{\theta}\big)^2 & \theta > 2. \end{cases}$$

The proof of Lemma 22 uses a slight modification of existing variance bounds for the SAGA estimator, appearing in (Defazio et al., 2014a) for example. We include the proof in Appendix D. The B-SVRG gradient estimator satisfies the BMSE property with similar constants.

Lemma 23 The B-SVRG gradient estimator is memory-biased with $B_1 = 3m(m+1)$, and it satisfies the BMSE property with parameters $\rho_M = 1$, $M_2 = 0$, $\rho_F = 1$, and
$$M_1 = \begin{cases} \frac{3m(m+1)}{\theta^2} & \theta \in (0, 2], \\ 3m(m+1)\big(1 - \frac{1}{\theta}\big)^2 & \theta > 2. \end{cases}$$

With these constants established, Theorem 17 provides rates of convergence.

Corollary 24 (Convergence rates for B-SAGA) Algorithm 1 achieves the following convergence guarantees using the B-SAGA gradient estimator:

In the convex setting, depending on the choice of θ, set the step size to
$$\eta = \eta_\theta \overset{\mathrm{def}}{=} \begin{cases} \frac{1}{L(1 + \frac{6}{\theta}\sqrt{n(2n+1)})} & \theta \in [1, 2], \\ \frac{1}{L(1 + 6(1-\frac{1}{\theta})\sqrt{n(2n+1)})} & \theta > 2, \end{cases}$$
and $\bar{x}_T$ satisfies $\mathbb{E}[F(\bar{x}_T) - F(x^*)] = O(Ln/T)$.

If additionally g is µ-strongly convex, set $\eta = \min\{\eta_\theta, \frac{1}{4\mu n}\}$. Then $x_T$ satisfies $\mathbb{E}[\|x_T - x^*\|^2] = O((1+\mu\eta)^{-T})$.

In the non-convex setting, after T iterations, the generalized gradient at $x_\alpha$ satisfies
$$\mathbb{E}[\|\mathcal{G}_{\eta/2}(x_\alpha)\|^2] = \begin{cases} O\big(\frac{Ln}{T\theta}\big) & \eta = \frac{\theta}{2L\sqrt{n(2n+1)}},\ \theta \in (0, 2], \\ O\big(\frac{Ln(1-1/\theta)}{T}\big) & \eta = \frac{1}{2L(1-\frac{1}{\theta})\sqrt{n(2n+1)}},\ \theta > 2. \end{cases}$$

Corollary 25 (Convergence rates for B-SVRG) Algorithm 1 achieves the following convergence guarantees using the B-SVRG gradient estimator:

In the convex setting, depending on the choice of θ, set the step size to
$$\eta = \eta_\theta \overset{\mathrm{def}}{=} \begin{cases} \frac{1}{L(1 + \frac{3}{\theta}\sqrt{6m(m+1)})} & \theta \in [1, 2], \\ \frac{1}{L(1 + 3(1-\frac{1}{\theta})\sqrt{6m(m+1)})} & \theta > 2. \end{cases}$$
After S epochs, the point $\bar{x}_{mS}$ satisfies $\mathbb{E}[F(\bar{x}_{mS}) - F(x^*)] = O(L/S)$.

If additionally g is µ-strongly convex, let $\eta = \min\{\eta_\theta, \frac{1}{2\mu}\}$. After S epochs, $x_{mS}$ satisfies $\mathbb{E}[\|x_{mS} - x^*\|^2] = O((1+\mu\eta)^{-mS})$.

In the non-convex setting, after S epochs, the generalized gradient at $x_\alpha$ satisfies
$$\mathbb{E}[\|\mathcal{G}_{\eta/2}(x_\alpha)\|^2] = \begin{cases} O\big(\frac{L}{S\theta}\big) & \eta = \frac{\theta}{2L\sqrt{6m(m+1)}},\ \theta \in (0, 2], \\ O\big(\frac{L(1-1/\theta)}{S}\big) & \eta = \frac{1}{2L(1-\frac{1}{\theta})\sqrt{6m(m+1)}},\ \theta > 2. \end{cases}$$
Our MSE bounds and convergence rates are optimized when θ = 2. Numerical experiments (including those in Section 5) suggest that setting θ in the range 1 < θ < n gives the best performance for convex problems, and that B-SAGA prefers larger values of θ than B-SVRG. The convergence guarantees for θ ∈ (0, 1) still hold for convex objectives, but in this setting, the rates we obtain for θ ≥ 1 are superior, suggesting that for convex problems, θ should be at least one for best performance. Our numerical experiments in Section 5 support this: setting θ < 1 can be beneficial for non-convex problems, but we do not observe this for convex problems.

In the special case θ = 1, Corollaries 24 and 25 recover the state-of-the-art rates for SAGA and SVRG in the non-convex regime. For strongly convex problems, these rates are worse than the existing convergence rates of $O((1+\min\{\mu/L, 1/n\})^{-T})$ proven for SAGA and SVRG (Defazio et al., 2014a; Xiao and Zhang, 2014). This difference is due to the generality of Theorem 17, as some memory-biased estimators, including B-SVRG, exhibit poor performance on strongly convex problems when the bias is large. Corollaries 24 and 25 also require step sizes that decrease with n, while existing results for SAG, SAGA, and SVRG allow step sizes that are independent of n. This too is due to the generality of Theorem 17. In practice, we find that B-SAGA converges with step sizes that are independent of n, but B-SVRG requires smaller step sizes when the epoch length is larger.

4.2.2 SARAH and SARGE

The SARAH and SARGE gradient estimators are recursively biased, with
$$\nabla f(x_k) - \mathbb{E}_k[e^{\mathrm{SARAH}}_k] = \nabla f(x_{k-1}) - e^{\mathrm{SARAH}}_{k-1}$$
and
$$\nabla f(x_k) - \mathbb{E}_k[e^{\mathrm{SARGE}}_k] = \Big(1 - \frac{1}{n}\Big)\big(\nabla f(x_{k-1}) - e^{\mathrm{SARGE}}_{k-1}\big).$$
As we shall see, these biased estimators admit smaller MSE bounds than unbiased and memory-biased estimators, and this is reflected in their improved convergence rates.
The following two lemmas establish the constants appearing in Theorem 18 for these estimators.

Lemma 27 The SARAH gradient estimator is recursively biased with parameters $\rho_B = 0$ and ν = m, and it satisfies the BMSE property with $M_1 = m$, $\rho_M = 1$, $\rho_F = 1$, and $M_2 = 0$.

Lemma 28 The SARGE gradient estimator is recursively biased with parameters $\rho_B = \frac{1}{n}$ and ν = ∞, and it satisfies the BMSE property with $M_1 = 12$, $M_2 = \frac{39}{n}$, $\rho_M = \frac{1}{4n}$, $\rho_F = \frac{1}{2n}$, and $m = 1$.

Proofs of these results are included in Appendices E and F, respectively. It is enlightening to compare these BMSE constants to those of B-SVRG and B-SAGA. $M_1$ is a factor of n smaller for the SARAH and SARGE estimators than for the B-SVRG and B-SAGA estimators (as long as m = O(n) in SARAH and B-SVRG). This translates to an O(√n) improvement in the convergence rates for SARAH and SARGE when L is O(√n) or larger.

Corollary 29 (Convergence rates for SARAH) When using the SARAH gradient estimator in Algorithm 1,

If F is convex, set $\eta = \frac{1}{L(2\sqrt{2m}+1) + 2\sqrt{2L}\,m^{3/4}}$. After T iterations, $\bar{x}_T$ satisfies $\mathbb{E}[F(\bar{x}_T) - F(x^*)] = O\big((L\sqrt{m} + \sqrt{L}\,m^{3/4})/T\big)$.

If F is µ-strongly convex, set $\eta = \min\{\frac{1}{3L(4\sqrt{2m}+1)}, \frac{1}{\mu m}\}$; then $\mathbb{E}[\|x_T - x^*\|^2] = O((1+\mu\eta)^{-T})$.

If F is non-convex, set $\eta = \frac{1}{L\sqrt{2m}}$; then $\mathbb{E}[\|\mathcal{G}_{\eta/2}(x_\alpha)\|^2] \le O(L\sqrt{m}/T)$.

Corollary 30 (Convergence rates for SARGE) When using the SARGE gradient estimator in Algorithm 1,

If F is convex, set $\eta = \frac{1}{L(74\sqrt{n} + 15\sqrt{L}\,n^{3/4} + 1)}$; then $\mathbb{E}[F(\bar{x}_T) - F(x^*)] = O\big((L\sqrt{n} + \sqrt{L}\,n^{3/4})/T\big)$.

If F is µ-strongly convex, set $\eta = \min\{\frac{1}{3L(16\sqrt{3(n+13)}+1)}, \frac{1}{4\mu n}\}$; then $\mathbb{E}[\|x_T - x^*\|^2] = O((1+\mu\eta)^{-T})$.

If F is non-convex, set $\eta = \frac{1}{4L\sqrt{3(n+13)}}$; then $\mathbb{E}[\|\mathcal{G}_{\eta/2}(x_\alpha)\|^2] \le O(L\sqrt{n}/T)$.

Remark 31 Our theoretical results suggest that the step sizes for SARAH and SARGE should decrease with n, and such step sizes lead to optimal convergence guarantees for non-convex problems. In practice, however, this is not necessary: we find that larger step sizes that are independent of n lead to better performance. We provide examples in Section 5.
These convergence rates for convex objectives represent a significant improvement over the performance of SAGA, SVRG, and full-gradient methods. Each of these algorithms requires $O(nL/\epsilon)$ stochastic gradient evaluations to find a point satisfying $F(\bar{x}_T) - F(x^*) \le \epsilon$, while SARAH and SARGE require only $O(\sqrt{n}\,L/\epsilon)$. These rates do not require the epoch-doubling procedure of (Allen-Zhu and Yuan, 2018), although epoch-doubling can potentially be used to improve the performance of SARAH just as it improves the performance of SVRG on non-strongly convex objectives.

This square-root dependence on n is present in the convergence rates for strongly convex and non-convex objectives as well, a significant improvement over the dependence on n in the convergence rates of B-SAGA and B-SVRG. The better dependence on n is most significant in the non-convex regime, where these convergence rates imply that the SARAH and SARGE estimators require only $O(n + L\sqrt{n}/\epsilon^2)$ stochastic gradient evaluations to find an ϵ-approximate stationary point, which matches the oracle-complexity lower bound (Fang et al., 2018). Similar results already exist for algorithms using the SARAH estimator (Fang et al., 2018; Wang et al., 2019, 2018; Pham et al., 2019). Our results for SARGE show that achieving this complexity is possible without ever computing the full gradient.

5. Numerical Experiments

In this section, we present numerical experiments testing B-SAGA, B-SVRG, SARAH, and SARGE for minimizing convex, strongly convex, and non-convex objectives. We include one set of experiments comparing different values of θ in B-SAGA and B-SVRG with a fixed step size, and one set comparing SARAH and SARGE to B-SAGA and B-SVRG with the best values of θ.²

5.1 Convex and strongly convex objectives

Let $(h_i, l_i) \in \mathbb{R}^p \times \{\pm 1\}$, $i = 1, \ldots, n$, be the training set, where $h_i \in \mathbb{R}^p$ is the feature vector of each data sample and $l_i$ is the binary label. Let β > 0 be a tuning parameter.
The ridge regression problem takes the form
$$\min_{x\in\mathbb{R}^p}\ \frac{1}{n}\sum_{i=1}^n (h_i^\top x - l_i)^2 + \frac{\beta}{2}\|x\|_2^2.$$
LASSO is similar, but with the regularizer $\beta\|x\|_1$ replacing $\frac{\beta}{2}\|x\|_2^2$. These problems are of the form (1), where we set $f_i = (h_i^\top x - l_i)^2$ and g equal to the regularizer. In ridge regression, g is strongly convex; in LASSO, g is only convex.

We consider four binary classification data sets: australian, mushrooms, phishing, and ijcnn1 from LIBSVM³. We rescale the data values to [−1, 1], set β = 1/n, and set the step size to η = 1/(5L). To compare performance, we report the objective suboptimality $F(x_k) - F(x^*)$.

Comparison of B-SAGA We first compare the performance of B-SAGA under different choices of θ for solving ridge regression and LASSO problems. Four choices of θ are considered: θ ∈ {1, 10, 100, n}. The results are provided in Figure 1, from which we observe that B-SAGA consistently performs better with moderate amounts of bias (that is, θ ∈ (1, n)). For the considered data sets, θ = 10 provides the best overall performance.

Comparison of B-SVRG We also consider four choices of θ for B-SVRG: θ ∈ {0.5, 0.8, 1, 1.5}. The results are shown in Figure 2. We observe that B-SVRG is more sensitive to the choice of θ; only small amounts of bias (that is, θ ∈ [0.8, 1.5]) can occasionally improve performance.

Comparison of different gradient estimators Finally, we compare SAGA, B-SAGA with θ = 10, SVRG, SARAH, and SARGE. The results are provided in Figure 3, from which we observe that SARAH performs similarly to SVRG but is occasionally slower in early epochs. SARGE consistently outperforms all other methods except for B-SAGA with θ = 10.

The above observations indicate that, depending on the data, biased schemes can benefit from their biased gradient estimates. The free parameter θ reduces the MSE of the B-SAGA and B-SVRG gradient estimators, leading to better performance, and the bias in SARAH and SARGE has a similar effect.
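For reference, the proximal steps that Algorithm 1 needs for these two regularizers have simple closed forms. A minimal sketch (the function names are ours):

```python
import numpy as np

def prox_l1(z, t):
    # prox of t * ||x||_1: componentwise soft-thresholding
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def prox_l2sq(z, t):
    # prox of (t/2) * ||x||_2^2: shrinkage toward the origin
    return z / (1.0 + t)

# With g(x) = beta * ||x||_1 and step size eta, the LASSO update is
# x_next = prox_l1(x - eta * e_k, eta * beta); for ridge regression,
# use prox_l2sq with parameter eta * beta instead.
z = np.array([3.0, -0.5, 1.2])
x_lasso = prox_l1(z, 1.0)    # soft-thresholds to [2., -0., 0.2]
x_ridge = prox_l2sq(z, 1.0)  # shrinks to [1.5, -0.25, 0.6]
```

Because both operators are separable and cost O(p), the per-iteration cost of the proximal stochastic gradient step is dominated by the gradient evaluation.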
5.2 Non-convex objectives

To test the effect of bias in the non-convex setting, we consider the non-negative principal component analysis (NN-PCA) problem, which can be formulated as (Reddi et al., 2016b):
$$\min_{x\in\mathbb{R}^p}\ F(x) = -\frac{1}{2n}\sum_{i=1}^n (h_i^\top x)^2 + \iota_C(x),$$
where $C \overset{\mathrm{def}}{=} \{x \in \mathbb{R}^p : \|x\| \le 1,\ x \ge 0\}$ is a convex set and $\iota_C$ is its indicator function, with $\iota_C(x) = +\infty$ for $x \notin C$ and $\iota_C(x) = 0$ otherwise. Letting $g = \iota_C$, the operator $\mathrm{prox}_{\eta g}$ is the projection onto C, which can be computed efficiently.

2. See https://github.com/derekdriggs/StochOpt for MATLAB scripts reproducing these experiments.
3. https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/

Figure 1: First row, performance comparison fitting a LASSO model for different choices of θ in B-SAGA. Second row, performance comparison fitting a ridge regression model for different choices of θ in B-SAGA. The step size for each case is set to η = 1/(5L).

Figure 2: First row, performance comparison fitting a ridge regression model for different choices of θ in B-SVRG. Second row, performance comparison fitting a LASSO model for different choices of θ in B-SVRG. The step size for each case is set to η = 1/(5L).
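The projection onto C has a simple closed form: clip the negative entries to zero, then rescale onto the unit ball if necessary. A minimal sketch (the helper name is ours):

```python
import numpy as np

def proj_C(x):
    # projection onto C = {x : ||x||_2 <= 1, x >= 0}:
    # zero out negative entries, then scale into the unit ball if needed
    y = np.maximum(x, 0.0)
    nrm = np.linalg.norm(y)
    return y / nrm if nrm > 1.0 else y

v = proj_C(np.array([2.0, -1.0, 0.0]))  # -> array([1., 0., 0.])
```

The two steps commute into an exact projection here because zeroing a negative coordinate only decreases the distance to any feasible point, and rescaling a nonnegative vector keeps it nonnegative.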
Figure 3: First row, performance comparison for solving ridge regression among different algorithms. Second row, performance comparison for solving LASSO among different algorithms. For both examples, step sizes are tuned automatically to minimize the number of iterations required to reach a suboptimality of 10⁻¹⁵.

As the problem is non-convex, we cannot measure convergence with respect to the global optimum, so we use many iterations of proximal gradient descent with a small step size (η = 1/(10Ln)) to find a reference point x*. Every test is initialized using a random vector with normally distributed i.i.d. entries, and the same starting point is used for testing each value of θ. We found that small step sizes generally lead to stationary points with smaller objective values, so we set η = 1/(5Ln) for all our experiments. We report F(x_k) − F(x*) averaged over every n iterations.

These experiments show that the performance of B-SAGA and B-SVRG varies significantly with θ, with smaller values leading to better performance. SARAH and SARGE perform similarly to SAGA and SVRG in these experiments; see Figure 4. Comparing all algorithms, B-SAGA and B-SVRG provide the best performance, with B-SVRG being slightly faster.

6. Conclusion

The complicated convergence proofs of biased stochastic gradient methods have restricted researchers to studying unbiased estimators almost exclusively. Our simple framework for proving convergence rates for biased algorithms overcomes this limitation.
Our analysis allows for the study of biased algorithms with proximal support for minimizing convex, strongly convex, and non-convex objectives for the first time. We also show that biased gradient estimators can offer improvements over unbiased estimators in theory and in practice. Most notably, we find that biased recursive gradient estimators, such as SARAH and SARGE, admit smaller bounds on their MSEs and faster convergence rates than SAGA and SVRG.

Figure 4: First row, performance comparison for solving NN-PCA with different choices of θ in B-SAGA. Second row, performance comparison for solving NN-PCA with different choices of θ in B-SVRG. The step size for each case is set to η = 1/(5Ln). The point x* is found by solving the problem using proximal gradient descent.

Figure 5: Performance comparison for solving NN-PCA among different algorithms. All step sizes are set to 1/(5Ln). Objective values are averaged over each epoch (n steps).

Acknowledgements

CBS acknowledges support from the Leverhulme Trust projects on Breaking the Non-Convexity Barrier and on Unveiling the Invisible, the Philip Leverhulme Prize, the EPSRC grant No. EP/M00483X/1, the EPSRC Centre No. EP/N014588/1, the European Union Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 691070, the Cantab Capital Institute for the Mathematics of Information, and the Alan Turing Institute. JL acknowledges support from the Leverhulme Trust and the Newton Trust.
On Biased Stochastic Gradient Estimation The organization of the appendix is as follows: we prove Theorems 17 and 18 in Appendices A and B, respectively, and we prove Theorem 20 in Appendix C. We provide convergence rates for B-SAGA and B-SVRG as special cases of Theorem 17 in Appendix D, and we provide convergence rates for SARAH and SARGE as special cases of Theorem 18 in Appendices E and F, respectively. Appendix A. Proof of Theorem 17 To prove Theorem 17, we begin by showing that the BMSE property (Definition 9) implies the MSE of the gradient estimator over T iterations is proportional to PT 1 k=0 E xk+1 xk 2. Lemma 32 (MSE bound) Suppose that the stochastic gradient estimator e satisfies the BMSE(M1, M2, ρM, ρF , m) property, let ρ = min{ρM, ρF }, and let σs be any sequence satisfying σs(1 ρ)ms σs 1(1 ρ 2)ms. For convenience, define Θ = M1ρF +2M2 ρMρF . The MSE of the gradient estimator is bounded as k=ms E[ f(xk) e k 2] 2ΘL2 PS k=ms E[ xk+1 xk 2]. Proof First, we derive a bound on the sequence Fms arising in the BMSE property. Summing this sequence from s = 0 to s = S, s=0σs Fms PS ℓ=0 M2σs(1 ρF )m(s ℓ) i=1E[ fi(xk+1) fi(xk) 2] ℓ=0 M2σℓ(1 ρF i=1E[ fi(xk+1) fi(xk) 2] i=1E[ fi(xk+1) fi(xk) 2] i=1E[ fi(xk+1) fi(xk) 2]. Inequality 1 uses the fact that σs(1 ρF )ms σs 1(1 ρF 2 )ms. With this bound on Fms, we proceed to bound Mms similarly. s=0 σs Mms 1 PS s=0σs Fms + M1 i=1E[ fi(xk+1) fi(xk) 2] + (1 ρM)m PS s=1 σs Mm(s 1) 2 PS s=0σs M1ρF +2M2 i=1E[ fi(xk+1) fi(xk) 2] s=1 σs 1Mm(s 1) s=0σs M1ρF +2M2 i=1E[ fi(xk+1) fi(xk) 2] s=1 σs 1 M1ρF +2M2 i=1E[ fi(xk+1) fi(xk) 2] + s=0 σs M1ρF +2M2 i=1E[ fi(xk+1) fi(xk) 2] i=1E[ fi(xk+1) fi(xk) 2] k=ms E[ xk+1 xk 2]. Driggs, Liang, Sch onlieb Inequality 1 uses the fact that Mm (1 ρM)m Mm(s 1). Inequality 2 uses σs(1 ρM)ms σs 1(1 ρM 2 )ms, 3 uses the same estimate we applied in (9), and 4 uses the Lipschitz continuity of fi. 
Proof of Lemma 12 By assumption, 1 1 θ 0, so we can apply convexity to obtain η θ (f(xk) f(x )) + η i=1fi(ϕi k) fi(x ) θ f(xk), xk x + η i=1 fi(ϕi k), ϕi k x θ f(xk), xk x + η i=1 fi(ϕi k), xk x + η i=1 fi(ϕi k), ϕi k xk . Because e k is memory-biased, hence 1 θ)Pn i=1 fi(ϕi k) = Ek[e k]. Therefore, η θ f(xk), xk x + η i=1 fi(ϕi k), xk x = Ek η e k, xk x = Ek[η e k, xk xk+1 + η e k, xk+1 x ] Ek η e k, xk xk+1 1 2 xk+1 xk 2 + 1 2 xk x 2 1+µη 2 xk+1 x 2 ηg(xk+1) + ηg(x ) . The inequality is due to Lemma 3 with z = xk+1, x = xk, d = e k, and y = x . Combining these two inequalities, we have shown η θ (f(xk) f(x )) + η i=1 fi(ϕi k) fi(x ) Ek η e k, xk xk+1 1 2 xk+1 xk 2 ηg(xk+1) + ηg(x ) 2 xk x 2 1+µη 2 xk+1 x 2 + η i=1 fi(ϕi k), ϕi k xk . We bound the first three terms on the right further. η e k, xk xk+1 1 2 xk+1 xk 2 ηg(xk+1) = η( f(xk), xk xk+1 g(xk+1)) + η e k f(xk), xk xk+1 1 2 xk+1 xk 2 1 η(f(xk) F(xk+1)) + η e k f(xk), xk xk+1 + ( ηL 2) xk+1 xk 2 2 η(f(xk) F(xk+1)) + η 2Lλ e k f(xk) 2 + ( ηL(λ+1) 2) xk+1 xk 2. Inequality 1 is due to the Lipschitz continuity of f, and inequality 2 is Young s. Combining this bound with (10) and rearranging terms, we have shown that 0 ηEk[F(xk+1) F(x )] + η 2LλEk[ e k f(xk) 2] 2 Ek[ xk+1 x 2] + 1 2 xk x 2 + ( ηL(λ+1) 2)Ek[ xk+1 xk 2] i=1 fi(ϕi k) + 1 i=1 fi(ϕi k), ϕi k xk . We use Lemma 1 to bound the final term, yielding the desired inequality. Proof of Theorem 17 (Convex Case) We begin with the inequality of Lemma 12 with µ = 0. Multiplying the inequality of Lemma 4 with z = xk+1, x = xk, and d = e k by a On Biased Stochastic Gradient Estimation non-negative constant δ and adding it to the inequality of Lemma 12, we obtain ηEk[F(xk+1) F(x ) + δ(F(xk+1) F(xk))] 2Lλ Ek[ e k f(xk) 2] 1 2Ek[ xk+1 x 2] + 1 + ( ηL(1+δ)(λ+1) 2 )Ek[ xk+1 xk 2] + ηL i=1 xk ϕi k 2. 
Applying the full expectation operator and summing from k = 0 to k = T 1, we have k=0 ηE[F(xk+1) F(x )] + ηδ(E[F(x T )] F(x0)) 2E[ x T x 2] + 1 2 x0 x 2 + PT 1 k=0 E η(1+δ) 2Lλ e k f(xk) 2 + ( ηL(1+δ)(λ+1) 2 ) xk+1 xk 2 + ηL i=1 xk ϕi k 2 . We use Lemma 32 with σs = 1 to bound the MSE, and we use the fact that the gradient estimator is memory-biased to bound the term 1/n Pn i=1 xk ϕi k 2. This leaves k=0 E[F(xk+1) F(x )] 2E[ x T x 2] + 1 2 x0 x 2 + ηδ(F(x0) E[F(x T )]) + ( ηL(1+δ)(λ+1) 2 + ΘηL(1+δ) k=0 E[ xk+1 xk 2]. Setting λ = 2Θ minimizes the coefficient of the term on the final line. With 2Θ+ B1(1 1/θ) the final term in (12) is non-positive, so we can drop it from the inequality along with the term 1/2E x T x 2. Using the fact that F(x T ) F(x ), this leaves PT 1 k=0 E[F(xk+1) F(x )] 1 2η x0 x 2 + δ(F(x0) F(x )). We use the convexity of F to rewrite this inequality as a bound on the suboptimality of the average iterate E[F( x T ) F(x )] 1 PT 1 k=0 E[F(xk+1) F(x )] 1 2ηT x0 x 2 + δ T (F(x0) F(x )). Setting δ = max{B1(1 1/θ)/ 2Θ 1, 0} approximately minimizes the right side of this inequality, completing the proof. Proof of Theorem 17 (Strongly Convex Case) As in the proof of the convex case, we begin with the inequality of Lemma 12, multiply the inequality of Lemma 4 with z = xk+1, x = xk, and d = e k by a non-negative constant δ, and add the two inequalities. ηEk[F(xk+1) F(x ) + δ(F(xk+1) F(xk))] 2 Ek[ xk+1 x 2] + 1 2 xk x 2 + Ek h η(1+δ) 2Lλ e k f(xk) 2 + ( ηL(1+δ)(λ+1) 2 1+δ(2+µη) 2 ) xk+1 xk 2 + ηL i=1 xk ϕi k 2i . Driggs, Liang, Sch onlieb Applying the full expectation operator, multiplying by (1 + µη)k, and summing over the epoch k = ms to k = m(s + 1) 1 for some s N0, we have η Pm(s+1) 1 k=ms (1 + µη)k E[F(xk+1) F(x ) + δ(F(xk+1) F(xk))] (1+µη)m(s+1) 2 E xm(s+1) x 2 + (1+µη)ms 2 E xms x 2 + Pm(s+1) 1 k=ms (1 + µη)k E h η(1+δ) 2Lλ e k f(xk) 2 + ( ηL(1+δ)(λ+1) 2 1+δ(2+µη) 2 ) xk+1 xk 2 i=1 xk ϕi k 2i . 
Using the fact that η 1 µm, (1+µη)k (1+µη)m(s+1) (1+µη)ms lim m (1+ 1 m)m = e(1+µη)ms 3(1+µη)ms, (13) where e is Euler s number. Therefore, η Pm(s+1) 1 k=ms (1 + µη)k E[F(xk+1) F(x ) + δ(F(xk+1) F(xk))] (1+µη)m(s+1) 2 E[ xm(s+1) x 2] + (1+µη)ms + (1 + µη)ms Pm(s+1) 1 k=ms E h ( 3ηL(1+δ)(λ+1) 2 1+δ(2+µη) 2 ) xk+1 xk 2 2Lλ e k f(xk) 2 + 3ηL i=1 xk ϕi k 2i . Summing the inequality from epoch s = 0 to s = S 1, k=0 (1 + µη)k E[F(xk+1) F(x ) + δ(F(xk+1) F(xk))] s=0 (1 + µη)ms m(s+1) 1 X k=ms E h 3η(1+δ) 2Lλ e k f(xk) 2 + ( 3ηL(1+δ)(λ+1) 2 1+δ(2+µη) 2 ) xk+1 xk 2 i=1 xk ϕi k 2i (1+µη)m S 2 E xms x 2 + 1 We use Lemma 32 with σs = (1 + µη)ms to bound the MSE. Recall ρ = min{ρM, ρF } and η ρ 2µ. This choice for σs satisfies the conditions of Lemma 32 because (1+µη)ms(1 ρ)ms (1 + µη)m(s 1)(1 ρ/2)ms. We use the fact that the gradient estimator is memory-biased to bound the term 1/n Pn i=1 xk ϕi k 2. This leaves k=0 (1 + µη)k E[F(xk+1) F(x ) + δ(F(xk+1) F(xk))] 2 E[ xm S x 2] + 1 2 x0 x 2 + C PS 1 s=0 (1 + µη)ms Pm(s+1) 1 k=ms E[ xk+1 xk 2], (15) where C = 3ηL(1+δ)(λ+1) 2 + 3ΘηL(1+δ) θ) 1+δ(2+µη) 2 . We must choose η, λ, and δ so that C 0. Setting λ = 2Θ minimizes C over λ. Using the approximation δ(2 + µη) δ, we see that C is non-positive if 2Θ+ B1(1 1/θ) Setting δ = max{B1(1 1/θ)/ 2Θ 1, 0}, we are guaranteed that 2Θ) 1 3L(1+2 2Θ+ B1(1 1/θ) On Biased Stochastic Gradient Estimation so the step size in the theorem statement ensures C 0, and the final term in (15) is non-positive. Dropping this non-positive term from the inequality, we have η(1 + δ) Pm S 1 k=0 (1 + µη)k E[F(xk+1) F(x )] + δη Pm S 1 k=0 (1 + µη)k E[F(xk) F(x )] 2 E[ xm S x 2] + 1 2 x0 x 2. (16) We would like to show that 1 + δ (1 + µη)δ so that the terms on the first line telescope. We use the fact that η 2Θ B1µ(1 1/θ) to say 1 µη B1(1 1/θ) δ 1 + µη and inequality (16) simplifies to (1 + µη)m SE[ηδ(F(xm S) F(x )) + 1 2 xm S x 2] ηδ(F(x0) F(x )) + 1 which implies the result. Appendix B. 
Proof of Theorem 18 The following two lemmas establish an analogue of Lemma 12 for recursively biased estimators. Lemma 33 Suppose e is recursively biased with parameters ρB and ν. Suppose g is µstrongly convex with µ 0, and let λ > 0 be a constant whose value we determine later. The following inequality holds: 0 ηEk[F(xk+1) F(x )] + η 2LλEk[ e k f(xk) 2] 2 Ek[ xk+1 x 2] + 1 + ( ηL(λ+1) 2)Ek[ xk+1 xk 2] + η(1 ρB) f(xk 1) e k 1, xk x . Proof Applying the convexity of f yields η(f(xk) f(x )) η f(xk), xk x = η f(xk) (1 ρB)( f(xk 1) e k 1), xk x + η(1 ρB) f(xk 1) e k 1, xk x . Because the estimator is recursively biased, Ek[e k] = f(xk) (1 ρB)( f(xk 1) e k 1). η f(xk) (1 ρB)( f(xk 1) e k 1), xk x = Ek[η e k, xk x ] = Ek[η e k, xk xk+1 + η e k, xk+1 x ] Ek[η e k, xk xk+1 1 2 xk+1 xk 2 + 1 2 xk+1 x 2 + ηg(xk+1) ηg(x )], Driggs, Liang, Sch onlieb The inequality is due to Lemma 3. The rest of the proof follows the proof of Lemma 12. Proof of Lemma 15 We begin with a technical lemma. Lemma 34 Let {Eℓ} ℓ=0 be a sequence of non-negative real numbers. For any ρ (0, 1] and c 1, the following inequality holds: Pc(s+1) 1 k=cs+2 Pk 1 ℓ=cs+1 (1 ρ)k ℓ 1Eℓ min {c, 1/ρ} Pc(s+1) 1 k=cs+1 Ek. Proof This inequality can be seen by expanding the sums. ℓ=cs+1(1 ρ)k ℓ 1Eℓ = [Ecs+1] + [(1 ρ)Ecs+1 + Ecs+2] + (1 ρ)2Ecs+1 + (1 ρ)Ecs+2 + Ecs+3 + + (1 ρ)c 2Ecs+1 + + Ec(s+1) 2 = 1 + (1 ρ) + + (1 ρ)c 2 Ecs+1 + 1 + (1 ρ) + + (1 ρ)c 3 Ecs+2 + . The coefficient of each term of the sequence Eℓis less than c 2. Therefore, replacing each coefficient with c, we obtain the bound Pc(s+1) 1 k=cs+2 Pk 1 ℓ=cs+1 (1 ρ)k ℓ 1Eℓ c Pc(s+1) 1 k=cs+1 Ek. This proves the first bound. For the second bound, notice that each coefficient is less than P k=0(1 ρ)k = 1/ρ, proving the second bound. First, we use the fact that e k is recursively biased. 
E f(xk 1) e k 1, xk x 1 = E[ f(xk 1) e k 1, xk xk 1 + f(xk 1) Ek 1 e k 1, xk 1 x ] 2 f(xk 1) e k 1 2 + 1 2ϵ xk xk 1 2 + f(xk 1) Ek 1 e k 1, xk 1 x 2 f(xk 1) e k 1 2 + 1 2ϵ xk xk 1 2 + (1 ρB) f(xk 2) e k 2, xk 1 x . We can pass the conditional expectation Ek 1 into the second inner-product in 1 because xk 1 is independent of jk 1. Inequality 2 is Young s, and 3 uses the definition of a recursively biased gradient estimator. This is a recursive inequality, and expanding the recursion gives E f(xk 1) e k 1, xk x Pk 1 ℓ=νs+1(1 ρB)k ℓ 1E ϵ 2 f(xℓ) e ℓ 2 + 1 2ϵ xℓ+1 xℓ 2 + (1 ρB) f(xνs) e νs, xνs+1 x ℓ=νs+1(1 ρB)k ℓ 1E ϵ 2 f(xℓ) e ℓ 2 + 1 2ϵ xℓ+1 xℓ 2 . On Biased Stochastic Gradient Estimation Equality 1 is due to the fact that e νs = f(xνs). Taking the absolute value and summing this from k = νs + 1 to k = ν(s + 1) 1, k=νs+1 |E f(xk 1) e k 1, xk x | ℓ=νs+1 (1 ρB)k ℓ 1E ϵ 2 f(xℓ) e ℓ 2 + 1 2ϵ xℓ+1 xℓ 2 2 min n ν, 1 o Pν(s+1) 1 2 f(xk) e k 2 + 1 2ϵ xk+1 xk 2 . Inequality 2 follows from the technical Lemma 34. Summing this inequality from s = 0 to s = S completes the proof. Proof of Theorem 18 (Convex Case) To begin, we sum the inequality of Lemma 33 and the inequality of Lemma 4 scaled by δ > 0 with z = xk+1, x = xk, and d = e k. ηEk[F(xk+1) F(x ) + δ(F(xk+1) F(xk))] 2Ek[ xk+1 x 2] + 1 2 xk x 2 + Ek h η(1+δ) 2Lλ e k f(xk) 2 + (1 + δ)( ηL(λ+1) 2) xk+1 xk 2 + η(1 ρB) f(xk 1) e k 1, xk x i . Applying the full expectation operator, setting µ = 0, and summing from k = 0 to k = T 1 where T = m S for some S N, we have k=0 E[F(xk+1) F(x )] + ηδE[F(x T ) F(x0)] 2E[ x T x 2] + 1 2 x0 x 2 + PT 1 k=0 E h η(1+δ) 2Lλ e k f(xk) 2 + (1 + δ)( ηL(λ+1) 2) xk+1 xk 2 + η(1 ρB) f(xk 1) e k 1, xk x i . We use Lemma 15 to bound the inner-product bias term. k=0 E[F(xk+1) F(x )] + ηδE[F(x T ) F(x0)] 2E[ x T x 2] + 1 2 x0 x 2 + PT 1 k=0 E h ( η(1+δ) 2Lλ + B2η(1 ρB)ϵ 2 ) e k f(xk) 2 + (1 + δ)( ηL(λ+1) 2 + B2η(1 ρB) 2) xk+1 xk 2i . To bound the MSE, we use Lemma 32 with σs = 1. 
This leaves k=0 E[F(xk+1) F(x )] + ηδE[F(x T ) F(x0)] 2E[ x T x 2] + 1 2 x0 x 2 + w PT 1 k=0 E[ xk+1 xk 2], (18) where w = ηL(λ+1)(1+δ) 2 + B2η(1 ρB) 2ϵ + ΘηL(1+δ) λ + B2ηL2(1 ρB)ϵΘ 1+δ 2 . To minimize the coefficient of the final term, we set λ = 2Θ and ϵ = (2L2Θ) 1/2. This coefficient is then equal to 2ΘηL(1 + δ) + ηL(1+δ) 2(1 ρB)ηLB2 Driggs, Liang, Sch onlieb which is non-positive when η 1 2 2ΘL(1+ (1 ρB)B2 1+δ )+L. This ensures that the final term in (18) is non-positive, so we can drop it from the inequality along with the term 1/2E x T x 2. This leaves PT 1 k=0 E[F(xk+1) F(x )] 1 2η x0 x 2 + δE[F(x0) F(x T )]. By the convexity of F and the fact that F(x T ) F(x ) E[F( x T ) F(x )] 1 PT 1 k=0 E[F(xk+1) F(x )] 1 2ηT x0 x 2 + δ T (F(x0) F(x )). We now choose δ so that this upper-bound is approximately minimized. Assuming x0 x 2 and F(x0) F(x ) are approximately equal to K > 0, the right side becomes 2ΘL(1 ρB)B2 1 + δ + δ + 2 Choosing δ = max{ q Θ(1 ρB)B2 1, 0} approximately minimizes the right side of this inequality, completing the proof. Proof of Theorem 18 (Strongly Convex Case) We begin with inequality (17), but without setting µ = 0. η(1 + δ)Ek[F(xk+1) F(x )] + 1+µη 2 Ek xk+1 x 2 ηδ(F(xk) F(x )) + 1 2 xk x 2 + Ek h η(1+δ) 2Lλ e k f(xk) 2 + (1 + δ)( ηL(λ+1) 2) xk+1 xk 2 + η(1 ρB) f(xk 1) e k 1, xk x i . Applying the full expectation operator, multiplying by (1 + µη)k, and summing over the epoch k = ms to k = m(s + 1) 1 for some s N0, we have η(1 + δ) Pm(s+1) 1 k=ms (1 + µη)k E[F(xk+1) F(x )] + (1+µη)m(s+1) 2 E[ xm(s+1) x 2] ηδ Pm(s+1) 1 k=ms (1 + µη)k E[F(xk) F(x )] + (1+µη)ms 2 E[ xms x 2] + Pm(s+1) 1 k=ms (1 + µη)k E h η(1+δ) 2Lλ e k f(xk) 2 + (1 + δ)( ηL(λ+1) 2) xk+1 xk 2 + η(1 ρB) f(xk 1) e k 1, xk x i . 
Choosing δ so that 1+δ (1+µη)δ, the terms involving F(xk+1) F(x ) telescope, giving the inequality (1 + µη)m(s+1)E[δη(F(xm(s+1)) F(x )) + 1 2 xm(s+1) x 2] δη(1 + µη)ms E[F(xms) F(x )] + (1+µη)ms 2 E xms x 2 + Pm(s+1) 1 k=ms (1 + µη)k E h η(1+δ) 2Lλ e k f(xk) 2 + (1 + δ)( ηL(λ+1) 2) xk+1 xk 2 + η(1 ρB) f(xk 1) e k 1, xk x i . On Biased Stochastic Gradient Estimation We would like to bound the inner-product bias term using Lemma 15, and we can do this after some manipulation. Because η 1 µm, we have (1 + µη)k (1 + 1 m)k ms(1 + µη)ms 3(1 + µη)ms. Using the same estimate as in equations (13) and (14), we can say Pm(s+1) 1 k=ms (1 + µη)k E[ f(xk 1) e k 1, xk x ] 3(1 + µη)ms Pm(s+1) 1 k=ms |E[ f(xk 1) e k 1, xk x ]|, which produces the inequality (1 + µη)m(s+1)E[δη(F(xm(s+1)) F(x )) + 1 2 xm(s+1) x 2] δη(1 + µη)ms E[F(xms) F(x )] + (1+µη)ms 2 E xms x 2 + (1 + µη)ms Pm(s+1) 1 k=ms E h 3η(1+δ) 2Lλ e k f(xk) 2 + (1 + δ)( 3ηL(λ+1) 2) xk+1 xk 2i + 3η(1 ρB)|E f(xk 1) e k 1, xk x | . Summing this inequality from s = 0 to s = S 1, (1 + µη)m SE[δη(F(xm S) F(x )) + 1 2 xm S x 2] δη(F(x0) F(x )) + 1 s=0 (1 + µη)ms Pm(s+1) 1 k=ms E h (1 + δ)( 3ηL(λ+1) 2) xk+1 xk 2i 2Lλ e k f(xk) 2 + 3η(1 ρB)|E f(xk 1) e k 1, xk x | . We use Lemma 32 with σs = (1 + µη)ms to bound the MSE and Lemma 15 to bound the inner-product bias term. (1 + µη)m SE[δη(F(xm S) F(x )) + 1 2 xm S x 2] δη(F(x0) F(x )) + 1 2 x0 x 2 + w PS 1 k=ms (1 + µη)ms E xk+1 xk 2, (19) where w = 3ηL(λ+1)(1+δ) 2 + 3B2η(1 ρB) 2ϵ + 3ΘηL(1+δ) λ + 3B2ηL2(1 ρB)ϵΘ 1+δ 2 . To minimize the coefficient of the final term, we set λ = 2Θ and ϵ = (2L2Θ) 1/2. This coefficient is then equal to 2ΘηL(1 + δ) + 3ηL(1+δ) 2(1 ρB)ηLB2 2ΘL(1+ (1 ρB)B2 this term is non-positive. Setting δ = max{(1 ρB)B2 1, 0}, we are assured that 2ΘL(1+ (1 ρB)B2 so the final term in (19) is non-positive, and we can drop it from the inequality. 
The resulting inequality is
$$(1+\mu\eta)^{T}\,\mathbb{E}\Big[\delta\eta\big(F(x_T) - F(x^\star)\big) + \frac{1}{2}\|x_T - x^\star\|^2\Big] \le \delta\eta\big(F(x_0) - F(x^\star)\big) + \frac{1}{2}\|x_0 - x^\star\|^2.$$
All that remains is to show that our choice for $\delta$ satisfies $(1+\delta) \ge (1+\mu\eta)\delta$. Using the fact that $\eta \le \frac{1}{(1-\rho_B)B^2\mu}$, we can say
$$\mu\eta \le \frac{1}{(1-\rho_B)B^2} \le \frac{1}{\delta},$$
which ensures that $(1+\delta) \ge (1+\mu\eta)\delta$ and concludes the proof.

Appendix C. Proof of Theorem 20

Theorem 20 follows immediately from inequality (7) and the MSE bound of Lemma 32.

Proof of Theorem 20 Summing inequality (7) from $k = 0$ to $k = T-1$ and applying the full expectation operator, we obtain
$$0 \le -\mathbb{E}[F(x_T)] + F(x_0) + \Big(L - \frac{1}{4\eta}\Big)\sum_{k=0}^{T-1}\mathbb{E}[\|\hat{x}_{k+1} - x_k\|^2] + \Big(\frac{L}{2} - \frac{1}{4\eta}\Big)\sum_{k=0}^{T-1}\mathbb{E}[\|x_{k+1} - x_k\|^2] + 2\eta\sum_{k=0}^{T-1}\mathbb{E}[\|\nabla f(x_k) - e_k\|^2].$$
We bound the MSE using Lemma 32 with $\sigma_s = 1$:
$$0 \le -\mathbb{E}[F(x_T)] + F(x_0) + \Big(L - \frac{1}{4\eta}\Big)\sum_{k=0}^{T-1}\mathbb{E}\|\hat{x}_{k+1} - x_k\|^2 + \Big(\frac{L}{2} + 4\Theta\eta L^2 - \frac{1}{4\eta}\Big)\sum_{k=0}^{T-1}\mathbb{E}\|x_{k+1} - x_k\|^2.$$
With $\eta \le \frac{\sqrt{16\Theta+1}-1}{16L\Theta}$, the final term is non-positive, so we can drop it from the inequality. Using the fact that $F(x_T) \ge F(x^\star)$, our inequality simplifies to
$$\Big(\frac{1}{4\eta} - L\Big)\sum_{k=0}^{T-1}\mathbb{E}[\|\hat{x}_{k+1} - x_k\|^2] \le F(x_0) - F(x^\star).$$
Writing the left side in terms of the generalized gradient, we have the bound
$$\frac{1}{T}\sum_{k=0}^{T-1}\mathbb{E}[\|\mathcal{G}_{\eta/2}(x_k)\|^2] \le \frac{16\big(F(x_0) - F(x^\star)\big)}{\eta(1-4\eta L)T}.$$
With $x_\alpha$ chosen uniformly at random from the set $\{x_k\}_{k=0}^{T-1}$, this is equivalent to
$$\mathbb{E}[\|\mathcal{G}_{\eta/2}(x_\alpha)\|^2] \le \frac{16\big(F(x_0) - F(x^\star)\big)}{\eta(1-4\eta L)T}.$$
This completes the proof.

Appendix D. Proofs of convergence rates for B-SAGA and B-SVRG

The following lemma establishes an MSE bound on the B-SAGA and B-SVRG gradient estimators. For the unbiased case $\theta = 1$, this result was essentially first proved in (Defazio et al., 2014a), but the authors ultimately use a looser variance bound.

Lemma 35 The MSEs of the B-SAGA and B-SVRG gradient estimators satisfy
$$\mathbb{E}_k[\|e_k - \nabla f(x_k)\|^2] \le \frac{1}{n\theta^2}\sum_{i=1}^{n}\|\nabla f_i(x_k) - \nabla f_i(\phi^i_k)\|^2 + \Big(1 - \frac{2}{\theta}\Big)\Big\|\nabla f(x_k) - \frac{1}{n}\sum_{i=1}^{n}\nabla f_i(\phi^i_k)\Big\|^2. \qquad (20)$$
Proof Let $e_k$ denote either $e^{\text{B-SAGA}}_k$ or $e^{\text{B-SVRG}}_k$. The proof amounts to computing the expectation of the estimator and applying the Lipschitz continuity of $\nabla f_i$.
$$\mathbb{E}_k[\|e_k - \nabla f(x_k)\|^2] = \mathbb{E}_k\Big[\Big\|\frac{1}{\theta}\big(\nabla f_{j_k}(x_k) - \nabla f_{j_k}(\phi^{j_k}_k)\big) - \Big(\nabla f(x_k) - \frac{1}{n}\sum_{i=1}^{n}\nabla f_i(\phi^i_k)\Big)\Big\|^2\Big]$$
$$= \frac{1}{\theta^2}\mathbb{E}_k[\|\nabla f_{j_k}(x_k) - \nabla f_{j_k}(\phi^{j_k}_k)\|^2] + \Big\|\nabla f(x_k) - \frac{1}{n}\sum_{i=1}^{n}\nabla f_i(\phi^i_k)\Big\|^2 - \frac{2}{\theta}\mathbb{E}_k\Big[\Big\langle \nabla f_{j_k}(x_k) - \nabla f_{j_k}(\phi^{j_k}_k),\, \nabla f(x_k) - \frac{1}{n}\sum_{i=1}^{n}\nabla f_i(\phi^i_k)\Big\rangle\Big]$$
$$= \frac{1}{n\theta^2}\sum_{i=1}^{n}\|\nabla f_i(x_k) - \nabla f_i(\phi^i_k)\|^2 + \Big(1 - \frac{2}{\theta}\Big)\Big\|\nabla f(x_k) - \frac{1}{n}\sum_{i=1}^{n}\nabla f_i(\phi^i_k)\Big\|^2,$$
which is the desired result.

The following two lemmas establish the constants in the BMSE property for the B-SAGA and B-SVRG estimators.

Proof of Lemma 22 We begin with the inequality of Lemma 35 and consider two cases.

Case 1. Suppose $\theta \in [1,2]$. In this case, the second term in (20) is non-positive, so we drop it from the inequality. For the remaining term, we use the following bound:
$$\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}[\|\nabla f_i(x_k) - \nabla f_i(\phi^i_k)\|^2] \overset{1}{\le} \frac{2n+1}{n}\sum_{i=1}^{n}\mathbb{E}[\|\nabla f_i(x_k) - \nabla f_i(x_{k-1})\|^2] + \Big(1+\frac{1}{2n}\Big)\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}[\|\nabla f_i(x_{k-1}) - \nabla f_i(\phi^i_k)\|^2]$$
$$\overset{2}{=} \frac{2n+1}{n}\sum_{i=1}^{n}\mathbb{E}[\|\nabla f_i(x_k) - \nabla f_i(x_{k-1})\|^2] + \Big(1+\frac{1}{2n}\Big)\Big(1-\frac{1}{n}\Big)\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}[\|\nabla f_i(x_{k-1}) - \nabla f_i(\phi^i_{k-1})\|^2]$$
$$\overset{3}{\le} \frac{2n+1}{n}\sum_{i=1}^{n}\mathbb{E}[\|\nabla f_i(x_k) - \nabla f_i(x_{k-1})\|^2] + \Big(1-\frac{1}{2n}\Big)\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}[\|\nabla f_i(x_{k-1}) - \nabla f_i(\phi^i_{k-1})\|^2]. \qquad (21)$$
Inequality 1 is the standard inequality $\|a-c\|^2 \le (1+\delta)\|a-b\|^2 + (1+\delta^{-1})\|b-c\|^2$ (where we let $\delta = \frac{1}{2n}$). Equality 2 follows from the definition of $\phi^i_k$ and computing the expectation over $j_{k-1}$, and 3 uses the fact that $(1+\frac{1}{2n})(1-\frac{1}{n}) \le 1-\frac{1}{2n}$. Altogether, this gives
$$\mathbb{E}[\|e^{\text{B-SAGA}}_k - \nabla f(x_k)\|^2] \le \frac{1}{n\theta^2}\sum_{i=1}^{n}\mathbb{E}[\|\nabla f_i(x_k) - \nabla f_i(\phi^i_k)\|^2] \le \frac{2n+1}{n\theta^2}\sum_{i=1}^{n}\mathbb{E}[\|\nabla f_i(x_k) - \nabla f_i(x_{k-1})\|^2] + \frac{1}{n\theta^2}\Big(1-\frac{1}{2n}\Big)\sum_{i=1}^{n}\mathbb{E}[\|\nabla f_i(x_{k-1}) - \nabla f_i(\phi^i_{k-1})\|^2].$$
With $M_k = \frac{1}{n\theta^2}\sum_{i=1}^{n}\mathbb{E}[\|\nabla f_i(x_k) - \nabla f_i(\phi^i_k)\|^2]$, it is clear that the SAGA estimator satisfies the BMSE property with $M_1 = \frac{2n+1}{\theta^2}$, $\rho_M = \frac{1}{2n}$, $M_2 = 0$, $\rho_F = 1$, and $m = 1$.

Case 2. Suppose $\theta > 2$, so that the second term in (20) is non-negative. Jensen's inequality gives
$$\mathbb{E}_k[\|e_k - \nabla f(x_k)\|^2] \le \frac{1}{n}\Big(1-\frac{1}{\theta}\Big)^2\sum_{i=1}^{n}\|\nabla f_i(x_k) - \nabla f_i(\phi^i_k)\|^2.$$
Following the argument of Case 1, it is easy to see that the B-SAGA gradient estimator satisfies the BMSE property with $M_k = \frac{1}{n}(1-\frac{1}{\theta})^2\sum_{i=1}^{n}\|\nabla f_i(x_k) - \nabla f_i(\phi^i_k)\|^2$, $M_1 = (2n+1)(1-\frac{1}{\theta})^2$, $\rho_M = \frac{1}{2n}$, $M_2 = 0$, $\rho_F = 1$, and $m = 1$.

To prove that B-SAGA is memory-biased, we compute its expectation:
$$\nabla f(x_k) - \mathbb{E}_k[e^{\text{B-SAGA}}_k] = \nabla f(x_k) - \frac{1}{\theta}\mathbb{E}_k[\nabla f_{j_k}(x_k) - \nabla f_{j_k}(\phi^{j_k}_k)] - \frac{1}{n}\sum_{i=1}^{n}\nabla f_i(\phi^i_k) = \Big(1-\frac{1}{\theta}\Big)\Big(\nabla f(x_k) - \frac{1}{n}\sum_{i=1}^{n}\nabla f_i(\phi^i_k)\Big).$$
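The expectation computation above is easy to check numerically. The following is a minimal sketch in Python (all names here are ours, not from the paper's code) of the B-SAGA estimator for a least-squares finite sum; it verifies that the bias of the estimator over the sampled index is exactly $(1-\frac{1}{\theta})\big(\nabla f(x_k) - \frac{1}{n}\sum_i \nabla f_i(\phi^i_k)\big)$, which vanishes in the unbiased case $\theta = 1$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 8, 3
A = rng.standard_normal((n, p))
b = rng.standard_normal(n)

def grad_i(i, x):
    # gradient of f_i(x) = 0.5 * (a_i^T x - b_i)^2
    return A[i] * (A[i] @ x - b[i])

def full_grad(x):
    # gradient of f(x) = (1/n) * sum_i f_i(x)
    return np.mean([grad_i(i, x) for i in range(n)], axis=0)

def b_saga(j, x, phi, theta):
    # B-SAGA estimator: (1/theta) * (grad f_j(x) - grad f_j(phi_j)) + table average
    table_avg = np.mean([grad_i(i, phi[i]) for i in range(n)], axis=0)
    return (grad_i(j, x) - grad_i(j, phi[j])) / theta + table_avg

x = rng.standard_normal(p)
phi = rng.standard_normal((n, p))  # stored points phi^i_k
theta = 2.0

# Expectation over the uniformly random index j_k:
est_mean = np.mean([b_saga(j, x, phi, theta) for j in range(n)], axis=0)
table_avg = np.mean([grad_i(i, phi[i]) for i in range(n)], axis=0)

# Bias predicted by the computation above: (1 - 1/theta)(grad f(x) - table average)
bias = full_grad(x) - est_mean
predicted = (1 - 1 / theta) * (full_grad(x) - table_avg)
assert np.allclose(bias, predicted)
```

For $\theta = 1$ the same check confirms `est_mean` equals `full_grad(x)`, recovering the unbiasedness of standard SAGA.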
To compute a value for $B_1$, we follow (21) to obtain
$$\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}[\|x_k - \phi^i_k\|^2] \le (2n+1)\,\mathbb{E}\|x_k - x_{k-1}\|^2 + \Big(1-\frac{1}{2n}\Big)\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}[\|x_{k-1} - \phi^i_{k-1}\|^2] \le (2n+1)\sum_{\ell=1}^{k}\Big(1-\frac{1}{2n}\Big)^{k-\ell}\mathbb{E}\|x_\ell - x_{\ell-1}\|^2.$$
Summing this inequality from $k = 0$ to $k = T-1$, we obtain
$$\sum_{k=0}^{T-1}\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}[\|x_k - \phi^i_k\|^2] \le (2n+1)\sum_{k=0}^{T-1}\sum_{\ell=1}^{k}\Big(1-\frac{1}{2n}\Big)^{k-\ell}\mathbb{E}\|x_\ell - x_{\ell-1}\|^2 \le (2n+1)\sum_{\ell=0}^{\infty}\Big(1-\frac{1}{2n}\Big)^{\ell}\sum_{k=0}^{T-1}\mathbb{E}\|x_{k+1} - x_k\|^2 = 2n(2n+1)\sum_{k=0}^{T-1}\mathbb{E}\|x_{k+1} - x_k\|^2,$$
which completes the proof.

Proof of Lemma 23 Suppose $k \in \{ms, ms+1, \dots, m(s+1)-1\}$ for some $s \in \mathbb{N}_0$. As in the proof of Lemma 22, we begin with the inequality of Lemma 35 and consider two cases.

Case 1. Suppose $\theta \in [1,2]$, so that we may drop the second term in (20). We can bound the remaining term as follows:
$$\frac{1}{n}\sum_{i=1}^{n}\|\nabla f_i(x_k) - \nabla f_i(\phi_s)\|^2 \overset{1}{\le} \frac{1+m}{n}\sum_{i=1}^{n}\|\nabla f_i(x_k) - \nabla f_i(x_{k-1})\|^2 + \frac{1+1/m}{n}\sum_{i=1}^{n}\|\nabla f_i(x_{k-1}) - \nabla f_i(\phi_s)\|^2 \overset{2}{\le} \frac{1+m}{n}\sum_{\ell=ms}^{k-1}\Big(1+\frac{1}{m}\Big)^{k-1-\ell}\sum_{i=1}^{n}\|\nabla f_i(x_{\ell+1}) - \nabla f_i(x_\ell)\|^2.$$
Inequality 1 uses the inequality $\|u-w\|^2 \le (1+1/m)\|u-v\|^2 + (1+m)\|v-w\|^2$, and 2 follows from the fact that $x_{ms} = \phi_s$. Summing this inequality from $k = ms$ to $k = m(s+1)-1$ gives
$$\sum_{k=ms}^{m(s+1)-1}\frac{1}{n}\sum_{i=1}^{n}\|\nabla f_i(x_k) - \nabla f_i(\phi_s)\|^2 \le \frac{m(m+1)}{n}\Big(1+\frac{1}{m}\Big)^m\sum_{k=ms}^{m(s+1)-1}\sum_{i=1}^{n}\|\nabla f_i(x_{k+1}) - \nabla f_i(x_k)\|^2 \le \frac{3m(m+1)}{n}\sum_{k=ms}^{m(s+1)-1}\sum_{i=1}^{n}\|\nabla f_i(x_{k+1}) - \nabla f_i(x_k)\|^2.$$
The final inequality uses the fact that $(1+\frac{1}{m})^m < \lim_{m\to\infty}(1+\frac{1}{m})^m = e < 3$. From this inequality, it is clear that the B-SVRG gradient estimator satisfies the BMSE property with $M_1 = \frac{3m(m+1)}{\theta^2}$, $\rho_M = 1$, $M_2 = 0$, and $\rho_F = 1$.

Case 2. If $\theta > 2$, then applying Jensen's inequality to (20) produces
$$\mathbb{E}_k[\|e^{\text{B-SVRG}}_k - \nabla f(x_k)\|^2] \le \frac{1}{n}\Big(1-\frac{1}{\theta}\Big)^2\sum_{i=1}^{n}\|\nabla f_i(x_k) - \nabla f_i(\phi_s)\|^2.$$
A similar argument to the one in Case 1 shows that $M_1 = 3m(m+1)(1-\frac{1}{\theta})^2$, $\rho_M = 1$, $M_2 = 0$, and $\rho_F = 1$.

All that is left is to prove the stated value for $B_1$. Following the proof in Case 1,
$$\sum_{k=ms}^{m(s+1)-1}\mathbb{E}\|x_k - \phi_s\|^2 \le m\sum_{\ell=ms}^{m(s+1)-1}(1+m)\Big(1+\frac{1}{m}\Big)^m\mathbb{E}\|x_{\ell+1} - x_\ell\|^2 \le 3m(m+1)\sum_{k=ms}^{m(s+1)-1}\mathbb{E}\|x_{k+1} - x_k\|^2.$$
Summing over the epochs $s = 0$ to $s = S$ shows $B_1 = 3m(m+1)$. Combining Lemmas 22 and 23 with Theorems 17 and 20 proves convergence rates for B-SAGA and B-SVRG.

Appendix E. Proof of convergence rates for SARAH

Lemma 27 establishes the BMSE constants for the SARAH estimator.
The convergence rates of Corollary 29 then follow immediately from Theorem 18.

Proof of Lemma 27 Let $k \in \{ms+1, ms+2, \dots, m(s+1)-1\}$. The claim follows immediately from the well-known bound on the MSE of the SARAH gradient estimator,
$$\mathbb{E}\|e^{\text{SARAH}}_k - \nabla f(x_k)\|^2 \le \frac{1}{n}\sum_{\ell=ms}^{k-1}\sum_{i=1}^{n}\mathbb{E}\|\nabla f_i(x_{\ell+1}) - \nabla f_i(x_\ell)\|^2.$$
We refer to Fang et al. (2018) for a proof of this inequality. Summing over an epoch and applying the estimate
$$\sum_{k=ms}^{m(s+1)-1}\frac{1}{n}\sum_{\ell=ms}^{k-1}\sum_{i=1}^{n}\mathbb{E}\|\nabla f_i(x_{\ell+1}) - \nabla f_i(x_\ell)\|^2 \le \frac{m}{n}\sum_{k=ms}^{m(s+1)-1}\sum_{i=1}^{n}\mathbb{E}\|\nabla f_i(x_{k+1}) - \nabla f_i(x_k)\|^2$$
completes the proof.

Appendix F. Proof of convergence rates for SARGE

For our analysis, we write the SARGE gradient estimator in terms of the SAGA estimator. Define the estimator
$$e^{\xi\text{-SAGA}}_k \overset{\text{def}}{=} \nabla f_{j_k}(x_{k-1}) - \nabla f_{j_k}(\xi^{j_k}_k) + \frac{1}{n}\sum_{i=1}^{n}\nabla f_i(\xi^i_k),$$
where the variables $\{\xi^i_k\}_{i=1}^n$ follow the update rules $\xi^{j_k}_{k+1} = x_{k-1}$ and $\xi^i_{k+1} = \xi^i_k$ for all $i \ne j_k$. The SARGE estimator is equal to
$$e^{\text{SARGE}}_k = e^{\text{SAGA}}_k - \Big(1-\frac{1}{n}\Big)\big(e^{\xi\text{-SAGA}}_k - e^{\text{SARGE}}_{k-1}\big).$$
Before we prove Lemma 28, we require a bound on the MSE of the $\xi$-SAGA gradient estimator that follows immediately from Lemma 35.

Lemma 36 The MSE of the $\xi$-SAGA gradient estimator satisfies the following bound:
$$\mathbb{E}[\|e^{\xi\text{-SAGA}}_k - \nabla f(x_{k-1})\|^2] \le \frac{3}{n}\sum_{\ell=1}^{k-1}\Big(1-\frac{1}{2n}\Big)^{k-\ell-1}\sum_{i=1}^{n}\mathbb{E}[\|\nabla f_i(x_\ell) - \nabla f_i(x_{\ell-1})\|^2].$$
Proof Following the proof of Lemma 35,
$$\mathbb{E}_k[\|e^{\xi\text{-SAGA}}_k - \nabla f(x_{k-1})\|^2] = \mathbb{E}_k\Big[\Big\|\nabla f_{j_k}(x_{k-1}) - \nabla f_{j_k}(\xi^{j_k}_k) - \nabla f(x_{k-1}) + \frac{1}{n}\sum_{i=1}^{n}\nabla f_i(\xi^i_k)\Big\|^2\Big] \overset{1}{=} \frac{1}{n}\sum_{i=1}^{n}\|\nabla f_i(x_{k-1}) - \nabla f_i(\xi^i_k)\|^2 - \Big\|\nabla f(x_{k-1}) - \frac{1}{n}\sum_{i=1}^{n}\nabla f_i(\xi^i_k)\Big\|^2 \le \frac{1}{n}\sum_{i=1}^{n}\|\nabla f_i(x_{k-1}) - \nabla f_i(\xi^i_k)\|^2.$$
Equality 1 is the standard variance decomposition. To continue, we follow the proof of Lemma 35:
$$\mathbb{E}[\|e^{\xi\text{-SAGA}}_k - \nabla f(x_{k-1})\|^2] \le \frac{1}{n}\sum_{i=1}^{n}\mathbb{E}[\|\nabla f_i(x_{k-1}) - \nabla f_i(\xi^i_k)\|^2]$$
$$\le \frac{2n+1}{n}\sum_{i=1}^{n}\mathbb{E}[\|\nabla f_i(x_{k-1}) - \nabla f_i(x_{k-2})\|^2] + \Big(1+\frac{1}{2n}\Big)\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}[\|\nabla f_i(x_{k-2}) - \nabla f_i(\xi^i_k)\|^2]$$
$$\overset{2}{=} \frac{2n+1}{n}\sum_{i=1}^{n}\mathbb{E}[\|\nabla f_i(x_{k-1}) - \nabla f_i(x_{k-2})\|^2] + \Big(1+\frac{1}{2n}\Big)\Big(1-\frac{1}{n}\Big)\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}[\|\nabla f_i(x_{k-2}) - \nabla f_i(\xi^i_{k-1})\|^2]$$
$$\overset{3}{\le} \frac{2n+1}{n}\sum_{i=1}^{n}\mathbb{E}[\|\nabla f_i(x_{k-1}) - \nabla f_i(x_{k-2})\|^2] + \Big(1-\frac{1}{2n}\Big)\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}[\|\nabla f_i(x_{k-2}) - \nabla f_i(\xi^i_{k-1})\|^2] \le \frac{3}{n}\sum_{\ell=1}^{k-1}\Big(1-\frac{1}{2n}\Big)^{k-\ell-1}\sum_{i=1}^{n}\mathbb{E}[\|\nabla f_i(x_\ell) - \nabla f_i(x_{\ell-1})\|^2].$$
Equality 2 follows from computing expectations, and inequality 3 uses the estimate $(1+\frac{1}{2n})(1-\frac{1}{n}) \le 1-\frac{1}{2n}$.

Due to the recursive nature of the SARGE gradient estimator, its MSE depends on the difference between the current estimate and the estimate from the previous iteration. The next lemma provides a bound on $\mathbb{E}\|e^{\text{SARGE}}_k - e^{\text{SARGE}}_{k-1}\|^2$.

Lemma 37 The SARGE gradient estimator satisfies the following bound:
$$\mathbb{E}[\|e^{\text{SARGE}}_k - e^{\text{SARGE}}_{k-1}\|^2] \le \frac{12}{n}\sum_{i=1}^{n}\mathbb{E}[\|\nabla f_i(x_k) - \nabla f_i(x_{k-1})\|^2] + \frac{3}{2n^2}\mathbb{E}\|\nabla f(x_{k-1}) - e^{\text{SARGE}}_{k-1}\|^2 + \frac{39}{n^2}\sum_{\ell=1}^{k-1}\Big(1-\frac{1}{2n}\Big)^{k-\ell-1}\sum_{i=1}^{n}\mathbb{E}\|\nabla f_i(x_\ell) - \nabla f_i(x_{\ell-1})\|^2.$$
Proof To begin, we use the standard inequality $\|a+c\|^2 \le (1+\delta)\|a\|^2 + (1+\delta^{-1})\|c\|^2$ for $\delta > 0$ twice. For simplicity, we set $\delta = \sqrt{3/2} - 1$ and use the fact that $1+\frac{1}{\delta} \le 6$ for both applications of this inequality:
$$\mathbb{E}[\|e^{\text{SARGE}}_k - e^{\text{SARGE}}_{k-1}\|^2] = \mathbb{E}\Big[\Big\|e^{\text{SAGA}}_k - \Big(1-\frac{1}{n}\Big)\big(e^{\xi\text{-SAGA}}_k - e^{\text{SARGE}}_{k-1}\big) - e^{\text{SARGE}}_{k-1}\Big\|^2\Big] \le 6\,\mathbb{E}[\|e^{\text{SAGA}}_k - e^{\xi\text{-SAGA}}_k\|^2] + \frac{\sqrt{3/2}}{n^2}\mathbb{E}[\|e^{\xi\text{-SAGA}}_k - e^{\text{SARGE}}_{k-1}\|^2]$$
$$\le 6\,\mathbb{E}[\|e^{\text{SAGA}}_k - e^{\xi\text{-SAGA}}_k\|^2] + \frac{6\sqrt{3/2}}{n^2}\mathbb{E}[\|e^{\xi\text{-SAGA}}_k - \nabla f(x_{k-1})\|^2] + \frac{3}{2n^2}\mathbb{E}[\|\nabla f(x_{k-1}) - e^{\text{SARGE}}_{k-1}\|^2]. \qquad (22)$$
We use $6\sqrt{3/2} \le 9$ to simplify the coefficient of the second term. We now bound the first two of these three terms separately. Consider the first term:
$$6\,\mathbb{E}[\|e^{\text{SAGA}}_k - e^{\xi\text{-SAGA}}_k\|^2] = 6\,\mathbb{E}\Big[\Big\|\nabla f_{j_k}(x_k) - \nabla f_{j_k}(\phi^{j_k}_k) + \frac{1}{n}\sum_{i=1}^{n}\nabla f_i(\phi^i_k) - \nabla f_{j_k}(x_{k-1}) + \nabla f_{j_k}(\xi^{j_k}_k) - \frac{1}{n}\sum_{i=1}^{n}\nabla f_i(\xi^i_k)\Big\|^2\Big]$$
$$\le 12\,\mathbb{E}[\|\nabla f_{j_k}(x_k) - \nabla f_{j_k}(x_{k-1})\|^2] + 12\,\mathbb{E}\Big[\Big\|\nabla f_{j_k}(\phi^{j_k}_k) - \nabla f_{j_k}(\xi^{j_k}_k) - \frac{1}{n}\sum_{i=1}^{n}\nabla f_i(\phi^i_k) + \frac{1}{n}\sum_{i=1}^{n}\nabla f_i(\xi^i_k)\Big\|^2\Big]$$
$$\overset{1}{=} 12\,\mathbb{E}[\|\nabla f_{j_k}(x_k) - \nabla f_{j_k}(x_{k-1})\|^2] + 12\,\mathbb{E}[\|\nabla f_{j_k}(\phi^{j_k}_k) - \nabla f_{j_k}(\xi^{j_k}_k)\|^2] - 12\,\mathbb{E}\Big[\Big\|\frac{1}{n}\sum_{i=1}^{n}\nabla f_i(\phi^i_k) - \frac{1}{n}\sum_{i=1}^{n}\nabla f_i(\xi^i_k)\Big\|^2\Big]$$
$$\le \frac{12}{n}\sum_{i=1}^{n}\mathbb{E}[\|\nabla f_i(x_k) - \nabla f_i(x_{k-1})\|^2] + 12\,\mathbb{E}[\|\nabla f_{j_k}(\phi^{j_k}_k) - \nabla f_{j_k}(\xi^{j_k}_k)\|^2].$$
Equality 1 is the standard variance decomposition, which states that for any random variable $X$, $\mathbb{E}[\|X - \mathbb{E}X\|^2] = \mathbb{E}[\|X\|^2] - \|\mathbb{E}[X]\|^2$. The second term can be reduced further by computing the expectation. The probability that $\phi^{j_k}_k = x_{k-1}$ is equal to the probability that $j_k = j_{k-1}$, which is $1/n$.
The probability that $\phi^{j_k}_k = x_{k-2}$ is equal to the probability that $j_k = j_{k-2}$ and $j_k \ne j_{k-1}$, which is $\frac{1}{n}(1-\frac{1}{n})$. Continuing in this way,
$$\mathbb{E}[\|\nabla f_{j_k}(\phi^{j_k}_k) - \nabla f_{j_k}(\xi^{j_k}_k)\|^2] = \frac{1}{n}\mathbb{E}[\|\nabla f_{j_{k-1}}(x_{k-1}) - \nabla f_{j_{k-1}}(x_{k-2})\|^2] + \frac{1}{n}\Big(1-\frac{1}{n}\Big)\mathbb{E}[\|\nabla f_{j_{k-2}}(x_{k-2}) - \nabla f_{j_{k-2}}(x_{k-3})\|^2] + \cdots = \frac{1}{n}\sum_{\ell=1}^{k-1}\Big(1-\frac{1}{n}\Big)^{k-\ell-1}\mathbb{E}[\|\nabla f_{j_\ell}(x_\ell) - \nabla f_{j_\ell}(x_{\ell-1})\|^2].$$
This implies that
$$12\,\mathbb{E}[\|\nabla f_{j_k}(\phi^{j_k}_k) - \nabla f_{j_k}(\xi^{j_k}_k)\|^2] \le \frac{12}{n^2}\sum_{\ell=1}^{k-1}\Big(1-\frac{1}{n}\Big)^{k-\ell-1}\sum_{i=1}^{n}\mathbb{E}[\|\nabla f_i(x_\ell) - \nabla f_i(x_{\ell-1})\|^2] \le \frac{12}{n^2}\sum_{\ell=1}^{k-1}\Big(1-\frac{1}{2n}\Big)^{k-\ell-1}\sum_{i=1}^{n}\mathbb{E}[\|\nabla f_i(x_\ell) - \nabla f_i(x_{\ell-1})\|^2].$$
We include the second inequality to simplify later arguments. This completes our bound for the first term of (22). For the second term of (22), we recall Lemma 36:
$$\mathbb{E}[\|e^{\xi\text{-SAGA}}_k - \nabla f(x_{k-1})\|^2] \le \frac{3}{n}\sum_{\ell=1}^{k-1}\Big(1-\frac{1}{2n}\Big)^{k-\ell-1}\sum_{i=1}^{n}\mathbb{E}[\|\nabla f_i(x_\ell) - \nabla f_i(x_{\ell-1})\|^2].$$
Combining all of these bounds, we obtain
$$\mathbb{E}[\|e^{\text{SARGE}}_k - e^{\text{SARGE}}_{k-1}\|^2] \le \frac{12}{n}\sum_{i=1}^{n}\mathbb{E}[\|\nabla f_i(x_k) - \nabla f_i(x_{k-1})\|^2] + \frac{3}{2n^2}\mathbb{E}[\|\nabla f(x_{k-1}) - e^{\text{SARGE}}_{k-1}\|^2] + \frac{39}{n^2}\sum_{\ell=1}^{k-1}\Big(1-\frac{1}{2n}\Big)^{k-\ell-1}\sum_{i=1}^{n}\mathbb{E}[\|\nabla f_i(x_\ell) - \nabla f_i(x_{\ell-1})\|^2],$$
which completes the proof.

Lemma 37 allows us to take advantage of the recursive structure of our gradient estimate. With this lemma established, we can prove a bound on the MSE.

Lemma 38 The SARGE gradient estimator satisfies the following recursive bound:
$$\mathbb{E}[\|e^{\text{SARGE}}_k - \nabla f(x_k)\|^2] \le \Big(1-\frac{1}{n}+\frac{3}{2n^2}\Big)\mathbb{E}[\|e^{\text{SARGE}}_{k-1} - \nabla f(x_{k-1})\|^2] + \frac{12}{n}\sum_{i=1}^{n}\mathbb{E}[\|\nabla f_i(x_k) - \nabla f_i(x_{k-1})\|^2] + \frac{39}{n^2}\sum_{\ell=1}^{k-1}\Big(1-\frac{1}{2n}\Big)^{k-\ell-1}\sum_{i=1}^{n}\mathbb{E}[\|\nabla f_i(x_\ell) - \nabla f_i(x_{\ell-1})\|^2].$$
Proof The beginning of our proof is similar to the proof of the variance bound for the SARAH gradient estimator in (Nguyen et al., 2017, Lem. 2):
$$\mathbb{E}_k[\|e^{\text{SARGE}}_k - \nabla f(x_k)\|^2] = \mathbb{E}_k[\|e^{\text{SARGE}}_{k-1} - \nabla f(x_{k-1}) + \nabla f(x_{k-1}) - \nabla f(x_k) + e^{\text{SARGE}}_k - e^{\text{SARGE}}_{k-1}\|^2]$$
$$= \|e^{\text{SARGE}}_{k-1} - \nabla f(x_{k-1})\|^2 + \|\nabla f(x_{k-1}) - \nabla f(x_k)\|^2 + \mathbb{E}_k[\|e^{\text{SARGE}}_k - e^{\text{SARGE}}_{k-1}\|^2] + 2\langle \nabla f(x_{k-1}) - e^{\text{SARGE}}_{k-1},\, \nabla f(x_k) - \nabla f(x_{k-1})\rangle - 2\langle \nabla f(x_{k-1}) - e^{\text{SARGE}}_{k-1},\, \mathbb{E}_k[e^{\text{SARGE}}_k - e^{\text{SARGE}}_{k-1}]\rangle - 2\langle \nabla f(x_k) - \nabla f(x_{k-1}),\, \mathbb{E}_k[e^{\text{SARGE}}_k - e^{\text{SARGE}}_{k-1}]\rangle.$$
We consider each inner product separately. The first inner product is equal to
$$2\langle \nabla f(x_{k-1}) - e^{\text{SARGE}}_{k-1},\, \nabla f(x_k) - \nabla f(x_{k-1})\rangle = -\|\nabla f(x_{k-1}) - e^{\text{SARGE}}_{k-1}\|^2 - \|\nabla f(x_k) - \nabla f(x_{k-1})\|^2 + \|\nabla f(x_k) - e^{\text{SARGE}}_{k-1}\|^2.$$
For the next two inner products, we use the fact that
$$\mathbb{E}_k[e^{\text{SARGE}}_k - e^{\text{SARGE}}_{k-1}] = \mathbb{E}_k\Big[e^{\text{SAGA}}_k - \Big(1-\frac{1}{n}\Big)e^{\xi\text{-SAGA}}_k + \Big(1-\frac{1}{n}\Big)e^{\text{SARGE}}_{k-1}\Big] - e^{\text{SARGE}}_{k-1} = \nabla f(x_k) - \Big(1-\frac{1}{n}\Big)\nabla f(x_{k-1}) - \frac{1}{n}e^{\text{SARGE}}_{k-1} = \nabla f(x_k) - \nabla f(x_{k-1}) + \frac{1}{n}\big(\nabla f(x_{k-1}) - e^{\text{SARGE}}_{k-1}\big).$$
With this equality established, we see that the second inner product is equal to
$$-2\langle \nabla f(x_{k-1}) - e^{\text{SARGE}}_{k-1},\, \mathbb{E}_k[e^{\text{SARGE}}_k - e^{\text{SARGE}}_{k-1}]\rangle = -2\langle \nabla f(x_{k-1}) - e^{\text{SARGE}}_{k-1},\, \nabla f(x_k) - \nabla f(x_{k-1})\rangle - \frac{2}{n}\|\nabla f(x_{k-1}) - e^{\text{SARGE}}_{k-1}\|^2$$
$$= \|\nabla f(x_{k-1}) - e^{\text{SARGE}}_{k-1}\|^2 + \|\nabla f(x_k) - \nabla f(x_{k-1})\|^2 - \|\nabla f(x_k) - e^{\text{SARGE}}_{k-1}\|^2 - \frac{2}{n}\|\nabla f(x_{k-1}) - e^{\text{SARGE}}_{k-1}\|^2 = \Big(1-\frac{2}{n}\Big)\|\nabla f(x_{k-1}) - e^{\text{SARGE}}_{k-1}\|^2 + \|\nabla f(x_k) - \nabla f(x_{k-1})\|^2 - \|\nabla f(x_k) - e^{\text{SARGE}}_{k-1}\|^2.$$
The third inner product can be bounded using a similar procedure:
$$-2\langle \nabla f(x_k) - \nabla f(x_{k-1}),\, \mathbb{E}_k[e^{\text{SARGE}}_k - e^{\text{SARGE}}_{k-1}]\rangle = -2\|\nabla f(x_k) - \nabla f(x_{k-1})\|^2 - \frac{2}{n}\langle \nabla f(x_k) - \nabla f(x_{k-1}),\, \nabla f(x_{k-1}) - e^{\text{SARGE}}_{k-1}\rangle$$
$$\le -2\|\nabla f(x_k) - \nabla f(x_{k-1})\|^2 + \frac{1}{n}\|\nabla f(x_k) - \nabla f(x_{k-1})\|^2 + \frac{1}{n}\|\nabla f(x_{k-1}) - e^{\text{SARGE}}_{k-1}\|^2 = \Big(\frac{1}{n}-2\Big)\|\nabla f(x_k) - \nabla f(x_{k-1})\|^2 + \frac{1}{n}\|\nabla f(x_{k-1}) - e^{\text{SARGE}}_{k-1}\|^2,$$
where the inequality is Young's. Altogether and after applying the full expectation operator, we have
$$\mathbb{E}[\|e^{\text{SARGE}}_k - \nabla f(x_k)\|^2] \le \Big(1-\frac{1}{n}\Big)\mathbb{E}[\|e^{\text{SARGE}}_{k-1} - \nabla f(x_{k-1})\|^2] - \Big(1-\frac{1}{n}\Big)\mathbb{E}[\|\nabla f(x_k) - \nabla f(x_{k-1})\|^2] + \mathbb{E}[\|e^{\text{SARGE}}_k - e^{\text{SARGE}}_{k-1}\|^2] \le \Big(1-\frac{1}{n}\Big)\mathbb{E}[\|e^{\text{SARGE}}_{k-1} - \nabla f(x_{k-1})\|^2] + \mathbb{E}[\|e^{\text{SARGE}}_k - e^{\text{SARGE}}_{k-1}\|^2].$$
Finally, we bound the last term on the right using Lemma 37:
$$\mathbb{E}[\|e^{\text{SARGE}}_k - \nabla f(x_k)\|^2] \le \Big(1-\frac{1}{n}+\frac{3}{2n^2}\Big)\mathbb{E}[\|e^{\text{SARGE}}_{k-1} - \nabla f(x_{k-1})\|^2] + \frac{12}{n}\sum_{i=1}^{n}\mathbb{E}[\|\nabla f_i(x_k) - \nabla f_i(x_{k-1})\|^2] + \frac{39}{n^2}\sum_{\ell=1}^{k-1}\Big(1-\frac{1}{2n}\Big)^{k-\ell-1}\sum_{i=1}^{n}\mathbb{E}[\|\nabla f_i(x_\ell) - \nabla f_i(x_{\ell-1})\|^2].$$
This completes the proof.

Proof of Lemma 28 It is easy to see that $\rho_B = 1/n$ by computing the expectation of the SARGE gradient estimator:
$$\nabla f(x_k) - \mathbb{E}_k[e^{\text{SARGE}}_k] = \nabla f(x_k) - \mathbb{E}_k\Big[e^{\text{SAGA}}_k - \Big(1-\frac{1}{n}\Big)\big(e^{\xi\text{-SAGA}}_k - e^{\text{SARGE}}_{k-1}\big)\Big] = \Big(1-\frac{1}{n}\Big)\big(\nabla f(x_{k-1}) - e^{\text{SARGE}}_{k-1}\big).$$
The result of Lemma 38 makes it clear that $M_1 = 12$. To determine $\rho_M$, we must first choose a suitable sequence $M_k$. Let $M_k = \mathbb{E}[\|e^{\text{SARGE}}_k - \nabla f(x_k)\|^2]$. If $n = 1$, then $M_k = 0$ for all $k$, so it holds trivially that $M_k \le (1-\rho_M)M_{k-1}$. If $n \ge 2$, then $1-\frac{1}{n}+\frac{3}{2n^2} \le 1-\frac{1}{4n}$, so Lemma 38 ensures that with $\rho_M = \frac{1}{4n}$, $M_k \le (1-\rho_M)M_{k-1}$.
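The expectation computations just performed can be checked numerically. The following is a minimal sketch in Python (the names and the least-squares test problem are ours, not from the paper's code) of a single SARGE step written in terms of the SAGA and $\xi$-SAGA estimators; it verifies that $\nabla f(x_k) - \mathbb{E}_k[e^{\text{SARGE}}_k] = (1-\frac{1}{n})(\nabla f(x_{k-1}) - e^{\text{SARGE}}_{k-1})$, i.e. $\rho_B = 1/n$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 6, 4
A = rng.standard_normal((n, p))
b = rng.standard_normal(n)

def grad_i(i, x):
    # gradient of f_i(x) = 0.5 * (a_i^T x - b_i)^2
    return A[i] * (A[i] @ x - b[i])

def full_grad(x):
    return np.mean([grad_i(i, x) for i in range(n)], axis=0)

x_prev, x_cur = rng.standard_normal(p), rng.standard_normal(p)
phi = rng.standard_normal((n, p))   # SAGA table points phi^i_k
xi = rng.standard_normal((n, p))    # lagged table points xi^i_k
e_prev = rng.standard_normal(p)     # previous SARGE estimate e_{k-1}

def sarge(j):
    # e^SARGE_k = e^SAGA_k - (1 - 1/n) * (e^{xi-SAGA}_k - e^SARGE_{k-1})
    table_phi = np.mean([grad_i(i, phi[i]) for i in range(n)], axis=0)
    table_xi = np.mean([grad_i(i, xi[i]) for i in range(n)], axis=0)
    e_saga = grad_i(j, x_cur) - grad_i(j, phi[j]) + table_phi
    e_xi_saga = grad_i(j, x_prev) - grad_i(j, xi[j]) + table_xi
    return e_saga - (1 - 1 / n) * (e_xi_saga - e_prev)

# Expectation over the uniformly random index j_k:
expected = np.mean([sarge(j) for j in range(n)], axis=0)

# Bias identity: rho_B = 1/n
bias = full_grad(x_cur) - expected
predicted = (1 - 1 / n) * (full_grad(x_prev) - e_prev)
assert np.allclose(bias, predicted)

# Increment identity used for the inner products above
increment = expected - e_prev
target = full_grad(x_cur) - full_grad(x_prev) + (full_grad(x_prev) - e_prev) / n
assert np.allclose(increment, target)
```

The second assertion checks the identity $\mathbb{E}_k[e^{\text{SARGE}}_k - e^{\text{SARGE}}_{k-1}] = \nabla f(x_k) - \nabla f(x_{k-1}) + \frac{1}{n}(\nabla f(x_{k-1}) - e^{\text{SARGE}}_{k-1})$ on which the second and third inner-product bounds rely.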
Finally, we must compute $M_2$ and $\rho_F$ with respect to some sequence $F_k$. Lemma 38 motivates the choice $F_k = \sum_{\ell=1}^{k-1}(1-\frac{1}{2n})^{k-\ell-1}\sum_{i=1}^{n}\mathbb{E}[\|\nabla f_i(x_\ell) - \nabla f_i(x_{\ell-1})\|^2]$, and the choices $M_2 = \frac{39}{n}$ and $\rho_F = \frac{1}{2n}$ are clear.

Appendix G. Incorporating Bias to Lower the MSE: An Example

From (5), we argue that biased stochastic gradient estimators can admit better convergence guarantees than unbiased estimators if the bias reduces the total effect of the estimator's MSE and inner-product bias term. In this section, we compare the effects of these terms for the SARAH and SVRG gradient estimators, revealing why SARAH admits better convergence rates.

Beginning with the SARAH estimator, equation (5) reduces to (17), which we reproduce:
$$\eta\,\mathbb{E}_k[F(x_{k+1}) - F(x^\star) + \delta(F(x_{k+1}) - F(x_k))] + \frac{1}{2}\mathbb{E}_k[\|x_{k+1} - x^\star\|^2] \le \frac{1}{2}\|x_k - x^\star\|^2 + \mathbb{E}_k\Big[\frac{\eta(1+\delta)}{2L\lambda}\|e_k - \nabla f(x_k)\|^2 + (1+\delta)\frac{\eta L(\lambda+1) - 1}{2}\|x_{k+1} - x_k\|^2 + \eta(1-\rho_B)\langle \nabla f(x_{k-1}) - e_{k-1},\, x_k - x^\star\rangle\Big].$$
Following the proof of Theorem 18, we set $\lambda = \sqrt{2\Theta}$, which optimizes the step size. The effect of the MSE and bias term over an epoch is then
$$\sum_{k=ms+1}^{m(s+1)-1}\mathbb{E}\Big[\frac{\eta(1+\delta)}{2L\sqrt{2\Theta}}\|\nabla f(x_k) - e_k\|^2 + \eta(1-\rho_B)\langle \nabla f(x_{k-1}) - e_{k-1},\, x_k - x^\star\rangle\Big]$$
$$\overset{1}{\le} \sum_{k=ms+1}^{m(s+1)-1}\mathbb{E}\Big[\Big(\frac{\eta(1+\delta)}{2L\sqrt{2\Theta}} + \frac{m\epsilon\eta(1-\rho_B)}{2}\Big)\|\nabla f(x_k) - e_k\|^2 + \frac{m\eta(1-\rho_B)}{2\epsilon}\|x_{k+1} - x_k\|^2\Big]$$
$$\overset{2}{\le} \sum_{k=ms+1}^{m(s+1)-1}\mathbb{E}\Big[\Big(\frac{\eta(1+\delta)}{2L\sqrt{2\Theta}} + \frac{m\epsilon\eta(1-\rho_B)}{2}\Big)\frac{m}{n}\sum_{i=1}^{n}\|\nabla f_i(x_{k+1}) - \nabla f_i(x_k)\|^2 + \frac{m\eta(1-\rho_B)}{2\epsilon}\|x_{k+1} - x_k\|^2\Big]$$
$$\overset{3}{\le} \sum_{k=ms+1}^{m(s+1)-1}\mathbb{E}\Big[L^2 m\Big(\frac{\eta(1+\delta)}{2L\sqrt{2\Theta}} + \frac{m\epsilon\eta(1-\rho_B)}{2}\Big)\|x_{k+1} - x_k\|^2 + \frac{m\eta(1-\rho_B)}{2\epsilon}\|x_{k+1} - x_k\|^2\Big].$$
Inequality 1 is an application of Lemma 15, 2 is Lemma 27, and 3 uses the Lipschitz continuity of $\nabla f_i$. Setting $\epsilon = (L^2 n)^{-1/2}$ to minimize this bound, setting $\delta = m - 1$, and choosing $m = O(n)$ gives a coefficient of $O(n^{3/2}L\eta)$.

In contrast, B-SVRG admits a larger bound on these terms. For memory-biased estimators, equation (5) reduces to (cf. (11))
$$\eta\,\mathbb{E}_k[F(x_{k+1}) - F(x^\star) + \delta(F(x_{k+1}) - F(x_k))] + \frac{1}{2}\mathbb{E}_k[\|x_{k+1} - x^\star\|^2] \le \frac{1}{2}\|x_k - x^\star\|^2 + \frac{\eta(1+\delta)}{2L\lambda}\mathbb{E}_k[\|e_k - \nabla f(x_k)\|^2] + (1+\delta)\frac{\eta L(\lambda+1) - 1}{2}\mathbb{E}_k[\|x_{k+1} - x_k\|^2] + \frac{\eta L}{n}\Big(1-\frac{1}{\theta}\Big)\sum_{i=1}^{n}\|x_k - \phi^i_k\|^2.$$
The bias term leads to terms of the form $\|x_k - \phi^i_k\|^2$.
Therefore, using $\lambda = \sqrt{2\Theta}$, our goal is to minimize
$$\sum_{k=ms}^{m(s+1)-1}\mathbb{E}\Big[\frac{\eta(1+\delta)}{2L\sqrt{2\Theta}}\|\nabla f(x_k) - e_k\|^2 + \frac{\eta L}{n}\Big(1-\frac{1}{\theta}\Big)\sum_{i=1}^{n}\|x_k - \phi^i_k\|^2\Big]$$
$$\overset{1}{\le} \frac{\eta(1+\delta)}{2L\sqrt{2\Theta}}M_{ms} + \sum_{k=ms}^{m(s+1)-1}\mathbb{E}\Big[\frac{\eta L}{n}\Big(1-\frac{1}{\theta}\Big)\sum_{i=1}^{n}\|x_k - \phi^i_k\|^2\Big]$$
$$\overset{2}{\le} \sum_{k=ms}^{m(s+1)-1}\mathbb{E}\Big[\frac{M_1\eta L(1+\delta)}{2\sqrt{2\Theta}}\|x_{k+1} - x_k\|^2 + \frac{\eta L}{n}\Big(1-\frac{1}{\theta}\Big)\sum_{i=1}^{n}\|x_k - \phi^i_k\|^2\Big] \overset{3}{\le} \sum_{k=ms}^{m(s+1)-1}\mathbb{E}\Big[\frac{M_1\eta L(1+\delta)}{2\sqrt{2\Theta}}\|x_{k+1} - x_k\|^2 + \eta L B_1\Big(1-\frac{1}{\theta}\Big)\|x_{k+1} - x_k\|^2\Big].$$
Inequalities 1 and 2 use the fact that the SVRG estimator satisfies the BMSE property with $\rho_M = 1$. Inequality 3 uses the definition of a memory-biased estimator. With $\delta = \max\{B_1(1-1/\theta)/\sqrt{2\Theta} - 1,\, 0\}$, $B_1 = 3m(m+1)$, $\Theta = M_1$, and $\theta \in (0, 2]$, this gives a coefficient of $O(Lm^2\eta)$.

This difference of $O(n^{1/2})$ between the two bounds is significant. It manifests as an improvement of $\sqrt{n}$ in the convergence rate of SARAH over the rate of B-SVRG.

References

Zeyuan Allen-Zhu. Natasha: Faster non-convex stochastic optimization via strongly non-convex parameter. In Proceedings of the 34th International Conference on Machine Learning, 2017.
Zeyuan Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. Journal of Machine Learning Research, pages 1–51, 2018a.
Zeyuan Allen-Zhu. Katyusha X: Practical momentum method for stochastic sum-of-nonconvex optimization. In Proceedings of the 35th International Conference on Machine Learning, 2018b.
Zeyuan Allen-Zhu. Natasha 2: Faster non-convex optimization than SGD. In Proceedings of the 32nd Annual Conference on Neural Information Processing Systems, 2018c.
Zeyuan Allen-Zhu and Elad Hazan. Variance reduction for faster non-convex optimization. In Proceedings of the 33rd International Conference on Machine Learning, 2016.
Zeyuan Allen-Zhu and Yang Yuan. Improved SVRG for non-strongly-convex or sum-of-non-convex objectives. In Proceedings of the 35th International Conference on Machine Learning, 2018.
Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
Christopher M. Bishop. Pattern recognition and machine learning.
Springer, 2006.
Léon Bottou, Frank E. Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60:223–311, 2018.
Kristian Bredies and Dirk Lorenz. Mathematical Image Processing. Springer, 2018.
Antonin Chambolle, Matthias J. Ehrhardt, Peter Richtárik, and Carola-Bibiane Schönlieb. Stochastic primal-dual hybrid gradient algorithm with arbitrary sampling and imaging applications. SIAM Journal on Optimization, 28(4):2783–2808, 2018.
Aaron Defazio. A simple practical accelerated method for finite sums. In Proceedings of the 30th Annual Conference on Neural Information Processing Systems, 2016.
Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646–1654, 2014a.
Aaron Defazio, Tiberio Caetano, and Justin Domke. Finito: A faster, permutable incremental gradient method for big data problems. In Proceedings of the 31st International Conference on Machine Learning, 2014b.
Derek Driggs, Matthias Ehrhardt, and Carola-Bibiane Schönlieb. Accelerating variance-reduced gradient methods. Mathematical Programming, 2020.
Cong Fang, Chris Junchi Li, Zhouchen Lin, and Tong Zhang. SPIDER: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. In Proceedings of the 32nd Annual Conference on Neural Information Processing Systems, 2018.
Dan Garber and Elad Hazan. Fast and simple PCA via convex optimization. arXiv:1509.05647v4, 2015.
Thomas Hofmann, Aurelien Lucchi, Simon Lacoste-Julien, and Brian McWilliams. Variance reduced stochastic gradient descent with neighbors. In Advances in Neural Information Processing Systems, volume 28, pages 2296–2304, 2015.
Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.
Guanghui Lan, Zhize Li, and Yi Zhou. A unified variance-reduced accelerated gradient method for convex optimization. arXiv:1905.12412, 2019.
Jingwei Liang, Jalal Fadili, and Gabriel Peyré. Activity identification and local linear convergence of Forward–Backward-type methods. SIAM Journal on Optimization, 27(1):408–437, 2017.
Hongzhou Lin, Julien Mairal, and Zaid Harchaoui. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems, 2015.
Pierre-Louis Lions and Bertrand Mercier. Splitting algorithms for the sum of two nonlinear operators. SIAM Journal on Numerical Analysis, 16(6):964–979, 1979.
Julien Mairal. Incremental majorization-minimization optimization with application to large-scale machine learning. Technical report, 2014.
Yurii Nesterov. Introductory lectures on convex optimization. Springer, 2004.
Lam M. Nguyen, Jie Liu, Katya Scheinberg, and Martin Takáč. SARAH: A novel method for machine learning problems using stochastic recursive gradient. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 2613–2621, 2017.
Nhan H. Pham, Lam M. Nguyen, Dzung T. Phan, and Quoc Tran-Dinh. ProxSARAH: An efficient algorithmic framework for stochastic composite nonconvex optimization. arXiv:1902.05679, 2019.
Sashank J. Reddi, Ahmed Hefny, Suvrit Sra, Barnabás Póczos, and Alex Smola. Stochastic variance reduction for nonconvex optimization. In Proceedings of the 33rd International Conference on Machine Learning, 2016a.
Sashank J. Reddi, Suvrit Sra, Barnabás Póczos, and Alex Smola. Fast stochastic methods for nonsmooth nonconvex optimization. In Proceedings of the 30th Annual Conference on Neural Information Processing Systems, 2016b.
Herbert Robbins and Sutton Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22(3):400–407, 1951.
Nicolas Le Roux, Mark Schmidt, and Francis Bach. A stochastic gradient method with an exponential convergence rate for finite training sets.
In Advances in Neural Information Processing Systems, pages 2663–2671, 2012.
Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162:83–112, 2017.
Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14:567–599, 2013.
Fanhua Shang, Licheng Jiao, Kaiwen Zhou, James Cheng, Yan Ren, and Yufei Jin. ASVRG: Accelerated proximal SVRG. In Asian Conference on Machine Learning, volume 95, pages 1–32, 2018.
Zhe Wang, Kaiyi Ji, Yi Zhou, Yingbin Liang, and Vahid Tarokh. SpiderBoost: A class of faster variance-reduced algorithms for nonconvex optimization. arXiv:1810.10690, 2018.
Zhe Wang, Kaiyi Ji, Yi Zhou, Yingbin Liang, and Vahid Tarokh. SpiderBoost and momentum: Faster variance reduction algorithms. In Proceedings of the 33rd Annual Conference on Neural Information Processing Systems, 2019.
Lin Xiao and Tong Zhang. A proximal stochastic gradient method with progressive variance reduction. Technical report, Microsoft Research, 2014.
Kaiwen Zhou, Fanhua Shang, and James Cheng. A simple stochastic variance reduced algorithm with fast convergence rates. In Proceedings of the 35th International Conference on Machine Learning, pages 5975–5984, 2018.