# On Biased Stochastic Gradient Estimation

Journal of Machine Learning Research 23 (2022) 1-43. Submitted 4/20; Revised 7/21; Published 2/22.

**Derek Driggs** (d.driggs@maths.cam.ac.uk), Department of Applied Mathematics and Theoretical Physics, University of Cambridge, Cambridge, CB3 0WA, UK

**Jingwei Liang** (jingwei.liang@sjtu.edu.cn), Institute of Natural Sciences and School of Mathematical Sciences, Shanghai Jiao Tong University, Shanghai, 200240, China

**Carola-Bibiane Schönlieb** (cbs31@cam.ac.uk), Department of Applied Mathematics and Theoretical Physics, University of Cambridge, Cambridge, CB3 0WA, UK

**Editor:** Julien Mairal

**Abstract.** We present a uniform analysis of biased stochastic gradient methods for minimizing convex, strongly convex, and non-convex composite objectives, and identify settings where bias is useful in stochastic gradient estimation. The framework we present allows us to extend proximal support to biased algorithms, including SAG and SARAH, for the first time in the convex setting. We also use our framework to develop a new algorithm, Stochastic Average Recursive GradiEnt (SARGE), that achieves the oracle complexity lower bound for non-convex, finite-sum objectives and requires strictly fewer calls to a stochastic gradient oracle per iteration than SVRG and SARAH. We support our theoretical results with numerical experiments that demonstrate the benefits of certain biased gradient estimators.

**Keywords:** stochastic gradient descent, variance reduction, biased gradient estimation

## 1. Introduction

In this paper, we focus on the following composite minimization problem:

$$\min_{x \in \mathbb{R}^p} F(x) \stackrel{\mathrm{def}}{=} f(x) + g(x). \qquad (1)$$

Throughout, we assume:

- $g$ is proper and closed, so that its proximity operator (see (3) in Section 2) is well-defined;
- $f$ admits the finite-sum structure $f(x) \stackrel{\mathrm{def}}{=} \frac{1}{n}\sum_{i=1}^{n} f_i(x)$; and
- for all $i \in \{1, 2, \ldots, n\}$, $\nabla f_i$ is $L$-Lipschitz continuous for some $L > 0$.
We consider three settings: the *convex* setting, where all of $\{f_i\}_{i=1}^n$ and $g$ are convex; the *strongly convex* setting, where additionally $g$ is strongly convex; and the *non-convex* setting, where $\{f_i\}_{i=1}^n$ and $g$ are not necessarily convex.

©2022 Derek Driggs, Jingwei Liang, Carola-Bibiane Schönlieb. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v23/20-316.html.

Problems of this form arise frequently in many areas of science and engineering, such as machine learning, statistics, operations research, and imaging. For instance, in machine learning, these problems often arise as empirical risk minimization problems from classification and regression tasks. Examples include ridge regression, logistic regression, Lasso, and $\ell_1$-regularized logistic regression (Bishop, 2006). Principal component analysis (PCA) can also be formulated as a problem with this structure, where the functions $f_i$ are non-convex (Garber and Hazan, 2015; Allen-Zhu and Yuan, 2018). In imaging, $\ell_1$ or total variation regularization is often combined with differentiable data discrepancy terms that appear in both convex and non-convex instances (Bredies and Lorenz, 2018).

### 1.1 Stochastic gradient methods

Two classical approaches to solving (1) are the proximal gradient descent method (PGD) (Lions and Mercier, 1979) and its accelerated variants, including inertial PGD (Liang et al., 2017) and FISTA (Beck and Teboulle, 2009). For these deterministic approaches, the full gradient of $f$ must be evaluated at each iteration, which often requires enormous computational resources when $n$ is large. This drawback makes these schemes unsuitable for large-scale machine learning tasks. By exploiting the finite-sum structure of $f$, stochastic gradient methods enjoy low per-iteration complexity while achieving comparable convergence rates.
These qualities make stochastic gradient methods the standard approach to many problems in machine learning, and they are gaining popularity in other areas such as image processing (Chambolle et al., 2018). Stochastic gradient descent (SGD) was first proposed in the 1950s (Robbins and Monro, 1951) and has experienced a renaissance in the past decade, with numerous variants of SGD proposed in the literature (see, for instance, (Schmidt et al., 2017; Johnson and Zhang, 2013; Defazio et al., 2014a) and references therein). Most of these algorithms can be summarized in one general form, described below in Algorithm 1.

**Algorithm 1** (Stochastic gradient descent framework)

**Input:** starting point $x_0 \in \mathbb{R}^p$, gradient estimator $\widetilde{\nabla}$.

1. **for** $k = 0, 1, \ldots, T-1$ **do**
2. Compute the stochastic gradient estimate $\widetilde{\nabla}_k$ at the current iterate $x_k$.
3. Choose the step size/learning rate $\eta_k$.
4. Update $x_{k+1}$:
$$x_{k+1} = \mathrm{prox}_{\eta_k g}\big(x_k - \eta_k \widetilde{\nabla}_k\big). \qquad (2)$$

Below we summarize several popular stochastic gradient estimators $\widetilde{\nabla}_k$.

**SGD.** Classic stochastic gradient descent (Robbins and Monro, 1951) uses the following gradient estimator at iteration $k$: sample $j_k$ uniformly at random from $\{1, \ldots, n\}$ and set

$$\widetilde{\nabla}^{\mathrm{SGD}}_k = \nabla f_{j_k}(x_k).$$

At each step, SGD uses the gradient of the sampled function, $\nabla f_{j_k}(x_k)$, as a stochastic approximation of the full gradient $\nabla f(x_k)$. It is an unbiased estimate, as $\mathbb{E}_k[\nabla f_{j_k}(x_k)] = \nabla f(x_k)$. It is also memoryless: every update of $x_{k+1}$ depends only on $x_k$ and the random variable $j_k$. The variance of the SGD estimator does not vanish as the algorithm converges.

**SAG.** To deal with the non-vanishing variance of SGD, the authors of (Roux et al., 2012; Schmidt et al., 2017) introduce the SAG gradient estimator, which uses the gradient history to decrease its variance.
With $\nabla f_i(\phi^i_0) = 0$ for $i = 1, \ldots, n$, the SAG gradient estimator is computed using the following procedure: sample $j_k$ uniformly at random from $\{1, \ldots, n\}$, set

$$\widetilde{\nabla}^{\mathrm{SAG}}_k = \tfrac{1}{n}\big(\nabla f_{j_k}(x_k) - \nabla f_{j_k}(\phi^{j_k}_k)\big) + \tfrac{1}{n}\sum_{i=1}^{n} \nabla f_i(\phi^i_k),$$

and update the gradient history:

$$\nabla f_i(\phi^i_{k+1}) = \begin{cases} \nabla f_i(x_k) & \text{if } i = j_k, \\ \nabla f_i(\phi^i_k) & \text{otherwise.} \end{cases}$$

Here, for each $i \in \{1, \ldots, n\}$, $\nabla f_i(\phi^i_k)$ is a stored gradient of $f_i$ from a previous iteration. With the help of memory, the variance of the SAG gradient estimator diminishes as the algorithm converges. Estimators that satisfy this property are known as *variance-reduced* estimators. In contrast to the SGD estimator, $\widetilde{\nabla}^{\mathrm{SAG}}_k$ is a biased estimate of $\nabla f(x_k)$.

**SAGA.** Building on (Roux et al., 2012; Schmidt et al., 2017), Defazio et al. (2014a) propose the unbiased gradient estimator SAGA, which is computed using the procedure below. Sample $j_k$ uniformly at random from $\{1, \ldots, n\}$, set

$$\widetilde{\nabla}^{\mathrm{SAGA}}_k = \nabla f_{j_k}(x_k) - \nabla f_{j_k}(\phi^{j_k}_k) + \tfrac{1}{n}\sum_{i=1}^{n} \nabla f_i(\phi^i_k),$$

and update the gradient history:

$$\nabla f_i(\phi^i_{k+1}) = \begin{cases} \nabla f_i(x_k) & \text{if } i = j_k, \\ \nabla f_i(\phi^i_k) & \text{otherwise.} \end{cases}$$

Compared to $\widetilde{\nabla}^{\mathrm{SAG}}$, the SAGA estimator gives less weight to stored gradients. With this adjustment, $\widetilde{\nabla}^{\mathrm{SAGA}}$ is unbiased while maintaining the variance-reduction property. Similar gradient estimators can be found in Point-SAGA (Defazio, 2016), Finito (Defazio et al., 2014b), MISO (Mairal, 2014), SDCA (Shalev-Shwartz and Zhang, 2013), and those in (Hofmann et al., 2015).

**SVRG.** Another popular variance-reduced estimator is SVRG (Johnson and Zhang, 2013). The SVRG gradient estimator is computed as follows: for $s = 0, \ldots, S$, compute the full gradient at the snapshot point, $\nabla f(\phi_s) = \tfrac{1}{n}\sum_{i=1}^{n} \nabla f_i(\phi_s)$; then, for $k = 1, \ldots, m$, sample $j_k$ uniformly at random from $\{1, \ldots, n\}$ and set

$$\widetilde{\nabla}^{\mathrm{SVRG}}_k = \nabla f_{j_k}(x_k) - \nabla f_{j_k}(\phi_s) + \nabla f(\phi_s),$$

where $\phi_s$ is a snapshot point updated every $m$ steps. The algorithms prox-SVRG (Xiao and Zhang, 2014), Natasha (Allen-Zhu, 2017), Katyusha (Allen-Zhu, 2018a), KatyushaX (Allen-Zhu, 2018b), Natasha2 (Allen-Zhu, 2018c), MiG (Zhou et al., 2018), ASVRG (Shang et al., 2018), and VARAG (Lan et al., 2019) use the SVRG gradient estimator.
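To fix ideas, the following is a minimal Python sketch of Algorithm 1 instantiated with the plain SGD estimator and $g = \lambda\|\cdot\|_1$, whose proximity operator is soft-thresholding. This is our own toy code, not the implementation accompanying this paper; the helper names `prox_l1` and `prox_sgd` and the least-squares test problem are our choices.

```python
import numpy as np

def prox_l1(y, t):
    """Proximal map of t*||.||_1, i.e. componentwise soft-thresholding."""
    return np.sign(y) * np.maximum(np.abs(y) - t, 0.0)

def prox_sgd(grads, x0, eta, lam, T, rng):
    """Algorithm 1 with the plain SGD estimator and g = lam*||.||_1."""
    n, x = len(grads), x0.copy()
    for _ in range(T):
        j = rng.integers(n)                    # sample j_k uniformly
        e = grads[j](x)                        # stochastic gradient estimate
        x = prox_l1(x - eta * e, eta * lam)    # proximal update (2)
    return x

# toy problem: f_i(x) = 0.5*(a_i @ x - b_i)^2
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 5))
b = A @ np.array([1.0, 0.0, 0.0, -1.0, 0.0])
grads = [lambda x, a=A[i], bi=b[i]: a * (a @ x - bi) for i in range(50)]
x = prox_sgd(grads, np.zeros(5), eta=1e-3, lam=0.1, T=20000, rng=rng)
```

Swapping the body of the loop for any of the estimators above (SAG, SAGA, SVRG, with the appropriate stored-gradient state) yields the corresponding variance-reduced method in the same framework.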
**SARAH.** In (Nguyen et al., 2017), the authors propose a recursive modification of SVRG. For $s = 0, \ldots, S$, initialize

$$\widetilde{\nabla}^{\mathrm{SARAH}}_0 = \tfrac{1}{n}\sum_{i=1}^{n} \nabla f_i(\phi_s),$$

and for $k = 1, \ldots, m$, sample $j_k$ uniformly at random from $\{1, \ldots, n\}$ and set

$$\widetilde{\nabla}^{\mathrm{SARAH}}_k = \nabla f_{j_k}(x_k) - \nabla f_{j_k}(x_{k-1}) + \widetilde{\nabla}^{\mathrm{SARAH}}_{k-1}.$$

Like SAG, SARAH is a biased gradient estimator. It is also used in prox-SARAH (Pham et al., 2019), SPIDER (Fang et al., 2018), SPIDERBoost (Wang et al., 2018), and SPIDER-M (Wang et al., 2019).

We refer to algorithms employing (un)biased gradient estimators as (un)biased stochastic algorithms, respectively. The body of work on biased algorithms is stunted compared to the enormous literature on unbiased algorithms, leaving several gaps in the development of biased stochastic gradient methods. We list a few below.

**Complex convergence proofs.** It is often difficult to analyze biased stochastic algorithms. The convergence proof of the biased algorithm SAG is especially complex, requiring computational verification (Roux et al., 2012; Schmidt et al., 2017).

**Sub-optimal convergence rates.** In the convex setting with $g \equiv 0$, SARAH achieves a complexity bound of $O(\log(1/\epsilon)/\epsilon)$ (Nguyen et al., 2017) for finding a point $\bar{x}_k$ such that $\mathbb{E}[F(\bar{x}_k) - F(x^\star)] \le \epsilon$. In comparison, SAGA and SVRG achieve a complexity bound of $O(1/\epsilon)$, which is the same as deterministic proximal gradient descent.

**Lack of proximal support.** Bias also makes it difficult to handle non-smooth functions. To the best of our knowledge, there are no theoretical guarantees for biased algorithms to solve (1) with $g \not\equiv 0$ that take advantage of convexity when it is present.

Despite the above shortcomings, there are notable exceptions that suggest biased algorithms are worth further consideration. Recently, (Pham et al., 2019; Fang et al., 2018; Wang et al., 2018, 2019) proved that algorithms using the SARAH gradient estimator require $O(\sqrt{n}/\epsilon^2)$ stochastic gradient evaluations to find an $\epsilon$-first-order stationary point.
This matches the complexity lower bound for non-convex, finite-sum optimization for smooth functions $f_i$ and $n \le O(\epsilon^{-4})$ (Fang et al., 2018). For comparison, the best complexity bound obtained for SAGA and SVRG in this setting is $O(n^{2/3}/\epsilon^2)$ (Reddi et al., 2016a; Allen-Zhu and Hazan, 2016), and this performance requires large mini-batches of size $O(n^{2/3})$.

A detailed summary of existing complexity bounds for the variance-reduced gradient estimators mentioned above is provided in Tables 1 and 2 for convex and non-convex objectives, respectively. The complexity bound for SAGA (Defazio et al., 2014a), SVRG (Johnson and Zhang, 2013), and SAG (Schmidt et al., 2017) on convex objectives is $O\big(\tfrac{n+L}{\epsilon}\big)$, which can be improved to $O((n+\kappa)\log(1/\epsilon))$ when strong convexity is present. For smooth objectives (where $g \equiv 0$), this complexity is $O\big(\tfrac{\sqrt{n}L}{\epsilon}\big)$ for SAGA and SVRG (Reddi et al., 2016a), while SAG requires $g \equiv 0$ for any convergence results. Convergence results for SARAH on convex objectives also require $g \equiv 0$, and the proven complexity is worse than similar results for SAGA, SVRG, and SAG by a logarithmic factor (Nguyen et al., 2017).

| | Convex | Convex and $g \equiv 0$ | Strongly Convex | Prox-Support? |
|---|---|---|---|---|
| SAGA | $O\big(\tfrac{n+L}{\epsilon}\big)$ | $O\big(\tfrac{\sqrt{n}L}{\epsilon}\big)^\star$ | $O((n+\kappa)\log(1/\epsilon))$ | Yes |
| SVRG | $O\big(\tfrac{n+L}{\epsilon}\big)$ | $O\big(\tfrac{\sqrt{n}L}{\epsilon}\big)^\star$ | $O((n+\kappa)\log(1/\epsilon))$ | Yes |
| SAG | None | $O\big(\tfrac{n+L}{\epsilon}\big)$ | $O((n+\kappa)\log(1/\epsilon))$ | No |
| SARAH | None | $O\big(\tfrac{n+L\log(1/\epsilon)}{\epsilon}\big)$ | $O((n+\kappa)\log(1/\epsilon))$ | No |

**Table 1:** Existing complexity bounds for stochastic gradient estimators for convex objectives. Complexities with a $\star$ represent the number of stochastic gradient oracle calls required to find an $\epsilon$-approximate stationary point (as in Definition 7). The other complexities are for finding a point satisfying $\mathbb{E}[F(\bar{x}_k) - F(x^\star)] \le \epsilon$ in the convex case and $\mathbb{E}[\|\bar{x}_k - x^\star\|^2] \le \epsilon$ in the strongly convex case. The parameter $\mu$ is the strong convexity constant, and $\kappa = L/\mu$ is the condition number.

There are several accelerated algorithms that achieve better convergence rates than those in Table 1.
SVRG++ achieves a complexity of $O(n\log(1/\epsilon) + L/\epsilon)$ on convex objectives using an epoch-doubling procedure (Allen-Zhu and Yuan, 2018). Katyusha, an accelerated variant of SVRG, has complexities of $O\big(n\log(1/\epsilon) + \sqrt{nL/\epsilon}\big)$ on convex objectives and $O\big((n + \sqrt{n\kappa})\log(1/\epsilon)\big)$ with strong convexity. Combining a variance-reduced algorithm with the Catalyst acceleration scheme produces algorithms with the same convergence rates up to logarithmic factors (Lin et al., 2015). These accelerated algorithms are not directly comparable to the non-accelerated algorithms in this paper, so we leave these rates out of Table 1. The algorithms considered in this work can be accelerated using momentum schemes as well; this is the subject of a related work (Driggs et al., 2020).

On non-convex objectives, SAGA and SVRG achieve complexities of $O\big(\tfrac{nL}{\epsilon^2}\big)$, and this rate can be improved to $O\big(\tfrac{n^{2/3}L}{\epsilon^2}\big)$ using large mini-batches of size $O(n^{2/3})$ (Reddi et al., 2016b). Although we do not consider mini-batching in this work, using large mini-batch sizes could similarly improve the presented complexities for B-SAGA and B-SVRG. SAG has not been previously analyzed in the non-convex setting, so this work presents the first convergence results for SAG in this setting as a special case of our results for B-SAGA. SARAH achieves the oracle complexity lower bound of $O\big(\tfrac{\sqrt{n}L}{\epsilon^2}\big)$ (Pham et al., 2019).

Table 3 summarizes the convergence rates provided in this work. Our results provide proximal support to biased algorithms such as SARAH and SAG for the first time, prove state-of-the-art convergence rates for all algorithms in the non-convex setting, and improve the best-known convergence rates for SARAH on convex objectives. These strong results for non-smooth, non-convex problems and biased estimators come at the cost of recovering suboptimal convergence rates for SAGA and SVRG on convex problems.
### 1.2 Contributions

This work provides three main contributions:

1. We introduce a framework for the systematic analysis of a large class of stochastic gradient methods and investigate a bias-variance tradeoff arising from our analysis. Our analysis provides proximal support to biased algorithms for the first time in the convex setting.
2. We apply our framework to derive convergence rates for SARAH and biased versions of SAGA and SVRG on convex, strongly convex, and non-convex problems.
3. We design a new recursive gradient estimator, Stochastic Average Recursive GradiEnt (SARGE), that achieves the same convergence rates as SARAH but never computes a full gradient, giving it a strictly smaller per-iteration complexity than SARAH. In particular, we show that SARGE achieves the oracle complexity lower bound for non-convex, finite-sum optimization.

| | Non-Convex, No Mini-Batching | With Mini-Batching | Prox-Support? |
|---|---|---|---|
| SAGA | $O\big(\tfrac{nL}{\epsilon^2}\big)$ | $O\big(\tfrac{n^{2/3}L}{\epsilon^2}\big)$ | Yes |
| SVRG | $O\big(\tfrac{nL}{\epsilon^2}\big)$ | $O\big(\tfrac{n^{2/3}L}{\epsilon^2}\big)$ | Yes |
| SAG | None | None | No |
| SARAH | $O\big(\tfrac{\sqrt{n}L}{\epsilon^2}\big)$ | $O\big(\tfrac{\sqrt{n}L}{\epsilon^2}\big)$ | Yes |

**Table 2:** Existing complexity bounds for stochastic gradient estimators for non-convex optimization. These complexities represent the number of stochastic gradient oracle calls required to find an $\epsilon$-approximate stationary point (as in Definition 7). Using mini-batches of size $n^{2/3}$ optimizes the complexity of SAGA and SVRG in this setting. While we do not consider mini-batching in this work, this improvement likely extends to B-SAGA and B-SVRG as well.

To study the effects of bias on the SAGA and SVRG estimators, we introduce Biased SAGA (B-SAGA) and Biased SVRG (B-SVRG). For $\theta > 0$, these two gradient estimators read

$$\text{B-SAGA:}\quad \widetilde{\nabla}^{\text{B-SAGA}}_k \stackrel{\mathrm{def}}{=} \tfrac{1}{\theta}\big(\nabla f_{j_k}(x_k) - \nabla f_{j_k}(\phi^{j_k}_k)\big) + \tfrac{1}{n}\sum_{i=1}^{n} \nabla f_i(\phi^i_k),$$

$$\text{B-SVRG:}\quad \widetilde{\nabla}^{\text{B-SVRG}}_k \stackrel{\mathrm{def}}{=} \tfrac{1}{\theta}\big(\nabla f_{j_k}(x_k) - \nabla f_{j_k}(\phi_s)\big) + \nabla f(\phi_s).$$

In both B-SAGA and B-SVRG, the bias parameter $\theta$ adjusts how much weight is given to stored gradient information. When $\theta = n$, $\widetilde{\nabla}^{\text{B-SAGA}}_k$ recovers the SAG gradient estimator, and when $\theta = 1$, it recovers the SAGA estimator.
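The role of $\theta$ can be checked numerically. The snippet below is our own toy verification (random least-squares components, exact conditional expectations computed by enumerating all $n$ choices of $j_k$): the B-SAGA estimator's bias equals $(1 - 1/\theta)$ times the gap between the full gradient and the average stored gradient, so $\theta = 1$ (SAGA) is unbiased while larger $\theta$ is not.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 10, 3
A = rng.standard_normal((n, p)); b = rng.standard_normal(n)
grad_i = lambda i, x: A[i] * (A[i] @ x - b[i])   # f_i(x) = 0.5*(a_i @ x - b_i)^2
grad_f = lambda x: np.mean([grad_i(i, x) for i in range(n)], axis=0)

x = rng.standard_normal(p)
phi = rng.standard_normal((n, p))                 # stale points phi_k^i
table = np.array([grad_i(i, phi[i]) for i in range(n)])
avg = table.mean(axis=0)

def b_saga_expectation(theta):
    """Exact E_k of the B-SAGA estimator over the uniform choice of j_k."""
    ests = [(grad_i(j, x) - table[j]) / theta + avg for j in range(n)]
    return np.mean(ests, axis=0)

for theta in (1.0, 2.0, float(n)):
    bias = grad_f(x) - b_saga_expectation(theta)
    # bias = (1 - 1/theta) * (grad f(x_k) - average of stored gradients)
    assert np.allclose(bias, (1 - 1 / theta) * (grad_f(x) - avg))
assert np.allclose(b_saga_expectation(1.0), grad_f(x))   # theta = 1: SAGA, unbiased
```

The same enumeration with the factor $1/\theta = 1/n$ reproduces the SAG bias stated in Section 3.3.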
Motivated by the desirable properties of SARAH, we propose a new gradient estimator, Stochastic Average Recursive GradiEnt (SARGE), defined as

$$\widetilde{\nabla}^{\mathrm{SARGE}}_k \stackrel{\mathrm{def}}{=} \nabla f_{j_k}(x_k) - \psi^{j_k}_k + \tfrac{1}{n}\sum_{i=1}^{n} \psi^i_k - \big(1 - \tfrac{1}{n}\big)\big(\nabla f_{j_k}(x_{k-1}) - \widetilde{\nabla}^{\mathrm{SARGE}}_{k-1}\big),$$

where the variables $\psi^i_k$ follow the update rule

$$\psi^{j_k}_{k+1} = \nabla f_{j_k}(x_k) - \big(1 - \tfrac{1}{n}\big)\nabla f_{j_k}(x_{k-1}),$$

with $\psi^i_{k+1} = \psi^i_k$ for $i \neq j_k$. Similar to SAGA, SARGE uses stored gradient information to avoid having to compute the full gradient, a computational burden that SVRG and SARAH require for variance reduction. A summary of the complexity results obtained from our analysis for SAG/B-SAGA, B-SVRG, SARAH, and SARGE is provided in Table 3. Note that the result for SAG is included in B-SAGA.

| | Convex | Strongly Convex | Non-Convex | Prox-Support? |
|---|---|---|---|---|
| B-SAGA | $O\big(\tfrac{nL}{\epsilon}\big)$ | $O(n\kappa\log(1/\epsilon))$ | $O\big(\tfrac{nL}{\epsilon^2}\big)$ | Yes |
| B-SVRG | $O\big(\tfrac{nL}{\epsilon}\big)$ | $O(n\kappa\log(1/\epsilon))$ | $O\big(\tfrac{nL}{\epsilon^2}\big)$ | Yes |
| SARAH | $O\big(\tfrac{\sqrt{n}L + \sqrt{L}\,n^{3/4}}{\epsilon}\big)$ | $O(\max\{\sqrt{n\kappa}, n\}\log(1/\epsilon))$ | $O\big(\tfrac{\sqrt{n}L}{\epsilon^2}\big)$ | Yes |
| SARGE | $O\big(\tfrac{\sqrt{n}L + \sqrt{L}\,n^{3/4}}{\epsilon}\big)$ | $O(\max\{\sqrt{n\kappa}, n\}\log(1/\epsilon))$ | $O\big(\tfrac{\sqrt{n}L}{\epsilon^2}\big)$ | Yes |

**Table 3:** Complexity bounds obtained from our analytical framework. These complexities represent the number of stochastic gradient oracle calls required to find a point $\bar{x}_k$ satisfying $\mathbb{E}[F(\bar{x}_k) - F(x^\star)] \le \epsilon$ in the convex case, $\mathbb{E}[\|\bar{x}_k - x^\star\|^2] \le \epsilon$ in the strongly convex case, and an $\epsilon$-approximate stationary point in the non-convex case. While we do not recover the best-known rates for (unbiased) SAGA and SVRG in the convex setting, our rates for B-SAGA, B-SVRG, SARAH, and SARGE are the first known for this problem class, and our rates for SARAH and SARGE are better than the best-known rates for SAGA and SVRG in the convex setting.

**Paper organization.** Preliminary results and notation are provided in Section 2. A discussion of the tradeoff between bias and variance in stochastic optimization is included in Section 3. Our main convergence results are presented in Section 4. In Section 5, we substantiate our theoretical results using numerical experiments involving several classic regression tasks arising from machine learning.
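To illustrate how SARGE avoids full gradient computations, here is a minimal Python sketch of Algorithm 1 with the SARGE estimator. This is our own illustrative code (smooth case $g = 0$, least-squares components, ad hoc step size), not the implementation used in our experiments.

```python
import numpy as np

def sarge_run(grad_i, n, p, x0, eta, T, rng):
    """Minimal SARGE loop (g = 0): no full gradient is ever computed."""
    x_prev, x = x0.copy(), x0.copy()
    psi = np.zeros((n, p))            # stored vectors psi_k^i
    e = np.zeros(p)                   # previous estimate e_{k-1}
    for _ in range(T):
        j = rng.integers(n)
        gj, gj_prev = grad_i(j, x), grad_i(j, x_prev)
        # recursive estimate built from the psi table and the previous estimate
        e = gj - psi[j] + psi.mean(axis=0) - (1 - 1 / n) * (gj_prev - e)
        psi[j] = gj - (1 - 1 / n) * gj_prev    # update only the sampled entry
        x_prev, x = x, x - eta * e
    return x

rng = np.random.default_rng(4)
n, p = 40, 4
A = rng.standard_normal((n, p)); b = A @ np.ones(p)
grad_i = lambda i, x: A[i] * (A[i] @ x - b[i])
x = sarge_run(grad_i, n, p, np.zeros(p), eta=0.01, T=6000, rng=rng)
```

Note that each iteration touches only one component gradient at two points, matching SARAH's recursion, while the $\psi$ table plays the role of SAGA's gradient memory.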
All the proofs of the main results are collected in the appendix.

## 2. Preliminaries and notations

Throughout the paper, $\mathbb{R}^p$ is a $p$-dimensional Euclidean space equipped with the scalar inner product $\langle \cdot, \cdot \rangle$ and associated norm $\|\cdot\|$. The sub-differential of a proper closed convex function $g$ is the set-valued operator defined by

$$\partial g(x) \stackrel{\mathrm{def}}{=} \big\{ v \in \mathbb{R}^p \,\big|\, g(x') \ge g(x) + \langle v, x' - x \rangle, \ \forall x' \in \mathbb{R}^p \big\},$$

and the proximity, or proximal, map of $g$ is defined as

$$\mathrm{prox}_{\eta g}(y) \stackrel{\mathrm{def}}{=} \arg\min_{x \in \mathbb{R}^p} \ \eta g(x) + \tfrac{1}{2}\|x - y\|^2, \qquad (3)$$

where $\eta > 0$ and $y \in \mathbb{R}^p$. Below we summarize some useful results in convex analysis.

**Lemma 1** (Nesterov (2004, Thm 2.1.5)) *Suppose $f$ is convex with an $L$-Lipschitz continuous gradient. For every $x, u \in \mathbb{R}^p$,*

$$\|\nabla f(x) - \nabla f(u)\|^2 \le 2L\big(f(x) - f(u) - \langle \nabla f(u), x - u \rangle\big).$$

When $f$ is a finite sum as in (1), Lemma 1 is equivalent to the following result.

**Lemma 2** *Let $f(x) = \frac{1}{n}\sum_{i=1}^n f_i(x)$, where each $f_i$ is convex with an $L$-Lipschitz gradient. Then for every $x, u \in \mathbb{R}^p$,*

$$\tfrac{1}{2Ln}\sum_{i=1}^{n} \|\nabla f_i(x) - \nabla f_i(u)\|^2 \le f(x) - f(u) - \langle \nabla f(u), x - u \rangle.$$

Lemma 2 is obtained by applying Lemma 1 to each $f_i$ and averaging.

**Lemma 3** *Suppose $g$ is $\mu$-strongly convex with $\mu \ge 0$, and let $z = \mathrm{prox}_{\eta g}(x - \eta d)$ for some $x, d \in \mathbb{R}^p$ and $\eta > 0$. Then, for any $y \in \mathbb{R}^p$,*

$$\langle \eta d, z - y \rangle \le \tfrac{1}{2}\|x - y\|^2 - \tfrac{1}{2}\|z - x\|^2 - \tfrac{1+\mu\eta}{2}\|z - y\|^2 - \eta g(z) + \eta g(y).$$

**Proof** By the strong convexity of $g$, we have

$$g(z) - g(y) \le \langle \xi, z - y \rangle - \tfrac{\mu}{2}\|z - y\|^2, \qquad \xi \in \partial g(z).$$

From the definition of the proximal operator, we have $\tfrac{1}{\eta}(x - z) - d \in \partial g(z)$. Therefore,

$$g(z) - g(y) \le \big\langle \tfrac{1}{\eta}(x - z) - d, z - y \big\rangle - \tfrac{\mu}{2}\|z - y\|^2 = -\langle d, z - y \rangle + \tfrac{1}{\eta}\langle x - z, z - y \rangle - \tfrac{\mu}{2}\|z - y\|^2.$$

Using $\langle x - z, z - y \rangle = \tfrac{1}{2}\big(\|x - y\|^2 - \|x - z\|^2 - \|z - y\|^2\big)$, multiplying by $\eta$, and rearranging yields the assertion. ∎

The next lemma is an analogue of the descent lemma for gradient descent when the gradient is replaced with an arbitrary vector $d$.

**Lemma 4** *Suppose $g$ is $\mu$-strongly convex with $\mu \ge 0$, and let $z = \mathrm{prox}_{\eta g}(x - \eta d)$. The following inequality holds for any $\lambda > 0$:*

$$0 \le \eta\big(F(x) - F(z)\big) + \tfrac{\eta}{2L\lambda}\|d - \nabla f(x)\|^2 + \Big(\tfrac{\eta L(\lambda+1) - 2 - \mu\eta}{2}\Big)\|z - x\|^2.$$

**Proof** This follows immediately from Lemma 3.
$$\begin{aligned} 0 &= \eta\langle d, x - z \rangle + \eta\langle d, z - x \rangle \\ &\overset{1}{\le} \eta\langle d, x - z \rangle - \tfrac{2+\mu\eta}{2}\|z - x\|^2 + \eta\big(g(x) - g(z)\big) \\ &= \eta\langle \nabla f(x), x - z \rangle + \eta\langle d - \nabla f(x), x - z \rangle - \tfrac{2+\mu\eta}{2}\|z - x\|^2 + \eta\big(g(x) - g(z)\big) \\ &\overset{2}{\le} \eta\big(F(x) - F(z)\big) + \eta\langle d - \nabla f(x), x - z \rangle + \Big(\tfrac{\eta L - 2 - \mu\eta}{2}\Big)\|z - x\|^2 \\ &\overset{3}{\le} \eta\big(F(x) - F(z)\big) + \tfrac{\eta}{2L\lambda}\|d - \nabla f(x)\|^2 + \Big(\tfrac{\eta L(\lambda+1) - 2 - \mu\eta}{2}\Big)\|z - x\|^2. \end{aligned}$$

Inequality 1 is due to Lemma 3 with $y = x$, inequality 2 is due to the Lipschitz continuity of $\nabla f$, and inequality 3 is Young's inequality. ∎

The previous two lemmas require $g$ to be convex. Similar results hold in the non-convex case as well.

**Lemma 5** *Let $z \in \mathrm{prox}_{\eta g}(x - \eta d)$ for some $x, d \in \mathbb{R}^p$ and $\eta > 0$. Then, for any $y \in \mathbb{R}^p$,*

$$\langle \eta d, z - y \rangle \le \tfrac{1}{2}\|x - y\|^2 - \tfrac{1}{2}\|z - x\|^2 - \eta g(z) + \eta g(y).$$

**Proof** By the definition of $z$,

$$z \in \arg\min_{v} \ \tfrac{1}{2\eta}\|v - x + \eta d\|^2 + g(v).$$

Comparing the objective values at $z$ and at $v = y$ gives

$$g(z) - g(y) \le \langle d, y - z \rangle + \tfrac{1}{2\eta}\big(\|x - y\|^2 - \|x - z\|^2\big).$$

Multiplying by $\eta$ completes the proof. ∎

**Lemma 6** *Let $z \in \mathrm{prox}_{\eta g}(x - \eta d)$. Then*

$$F(z) \le F(y) + \langle \nabla f(x) - d, z - y \rangle + \Big(\tfrac{L}{2} - \tfrac{1}{2\eta}\Big)\|x - z\|^2 + \Big(\tfrac{L}{2} + \tfrac{1}{2\eta}\Big)\|x - y\|^2.$$

**Proof** By the Lipschitz continuity of $\nabla f$, we have the inequalities

$$f(x) - f(y) \le \langle \nabla f(x), x - y \rangle + \tfrac{L}{2}\|x - y\|^2, \qquad f(z) - f(x) \le \langle \nabla f(x), z - x \rangle + \tfrac{L}{2}\|z - x\|^2.$$

Furthermore, by Lemma 5,

$$g(z) - g(y) \le \langle d, y - z \rangle + \tfrac{1}{2\eta}\big(\|x - y\|^2 - \|x - z\|^2\big).$$

Adding these inequalities together completes the proof. ∎

In the non-convex setting, to measure convergence of the sequence to a first-order stationary point, we use the notion of a generalized gradient (Nesterov, 2004).

**Definition 7** (generalized gradient map) *For any $\eta > 0$, the generalized gradient map is defined as*

$$\mathcal{G}_\eta(x_k) \stackrel{\mathrm{def}}{=} \tfrac{1}{\eta}\big(x_k - \mathrm{prox}_{\eta g}(x_k - \eta \nabla f(x_k))\big).$$

A point $x$ satisfying $\|\mathcal{G}_\eta(x)\| = 0$ is a first-order stationary point of $f + g$, and an $\epsilon$-first-order stationary point is a point satisfying $\|\mathcal{G}_\eta(x)\| \le \epsilon$. When $g \equiv 0$, we have $\mathcal{G}_{\eta_k}(x_k) = \nabla f(x_k) \to 0$ if the sequence $\{x_k\}$ converges to some $x^\star \in \mathbb{R}^p$ such that $\nabla f(x^\star) = 0$. For nontrivial $g$, suppose $\inf_k \eta_k > 0$ and $x_k$ converges to some $x^\star$ such that $x^\star \in \mathrm{prox}_{\eta g}(x^\star - \eta \nabla f(x^\star))$; then $\mathcal{G}_{\eta_k}(x_k) \to 0$.

## 3. A bias-variance tradeoff in stochastic gradient methods

In this section, we discuss the effect of the bias and variance of a stochastic gradient estimator on the performance of Algorithm 1.
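The generalized gradient map of Definition 7 is straightforward to evaluate in code. The sketch below is our own illustration, using $g = \lambda\|\cdot\|_1$ (whose proximal map is soft-thresholding) and a one-dimensional composite objective whose minimizer is known in closed form, so we can check that $\mathcal{G}_\eta$ vanishes exactly there.

```python
import numpy as np

def prox_l1(y, t):
    """Proximal map of t*||.||_1 (soft-thresholding)."""
    return np.sign(y) * np.maximum(np.abs(y) - t, 0.0)

def gen_grad(x, grad_f, eta, lam):
    """G_eta(x) = (x - prox_{eta g}(x - eta * grad f(x))) / eta, g = lam*||.||_1."""
    return (x - prox_l1(x - eta * grad_f(x), eta * lam)) / eta

# F(x) = 0.5*(x - 3)^2 + lam*|x| in 1-D: the minimizer is x* = 3 - lam for lam < 3
lam, eta = 0.5, 0.7
grad_f = lambda x: x - 3.0
x_star = np.array([3.0 - lam])
assert np.allclose(gen_grad(x_star, grad_f, eta, lam), 0.0)   # stationary point
assert np.linalg.norm(gen_grad(np.array([0.0]), grad_f, eta, lam)) > 0
```

The same two-line `gen_grad` works verbatim in $\mathbb{R}^p$, since the soft-thresholding prox acts componentwise.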
It is elementary that the mean-squared error (MSE) of a stochastic estimator can be decomposed into the sum of its variance and squared bias. In our setting,

$$\mathbb{E}_k\big[\|\widetilde{\nabla}_k - \nabla f(x_k)\|^2\big] = \big\|\mathbb{E}_k[\widetilde{\nabla}_k] - \nabla f(x_k)\big\|^2 + \mathbb{E}_k\big[\|\widetilde{\nabla}_k - \mathbb{E}_k[\widetilde{\nabla}_k]\|^2\big].$$

This decomposition shows that a biased estimator might have a smaller MSE than an unbiased estimator as long as the bias sufficiently diminishes the variance. This is the bias-variance tradeoff. As we see below, a bias-variance tradeoff exists in our analysis of stochastic gradient methods, but with a slightly different form. In what follows, we first discuss the bias-variance tradeoff in the convex settings and then the non-convex setting.

### 3.1 The convex case

Let $x^\star$ be a global minimizer of problem (1). From the update (2), let $w_{k+1} \in \partial g(x_{k+1})$. We have the following bound on the suboptimality at $x_{k+1}$:

$$\begin{aligned} \mathbb{E}_k[F(x_{k+1}) - F(x^\star)] &= \mathbb{E}_k\big[f(x_{k+1}) - f(x_k) + f(x_k) - f(x^\star) + g(x_{k+1}) - g(x^\star)\big] \\ &\overset{1}{\le} \tfrac{L}{2}\mathbb{E}_k[\|x_{k+1} - x_k\|^2] + \mathbb{E}_k[\langle \nabla f(x_k), x_{k+1} - x_k \rangle] + \langle \nabla f(x_k), x_k - x^\star \rangle + \mathbb{E}_k[g(x_{k+1}) - g(x^\star)] \\ &= \tfrac{L}{2}\mathbb{E}_k[\|x_{k+1} - x_k\|^2] + \mathbb{E}_k[\langle \nabla f(x_k) - \widetilde{\nabla}_k, x_{k+1} - x_k \rangle] + \mathbb{E}_k[\langle \nabla f(x_k) - \widetilde{\nabla}_k, x_k - x^\star \rangle] \\ &\qquad + \mathbb{E}_k[\langle \widetilde{\nabla}_k, x_{k+1} - x^\star \rangle] + \mathbb{E}_k[g(x_{k+1}) - g(x^\star)] \\ &\overset{2}{\le} \tfrac{\eta}{2}\mathbb{E}_k[\|\nabla f(x_k) - \widetilde{\nabla}_k\|^2] + \big(\tfrac{L}{2} + \tfrac{1}{2\eta}\big)\mathbb{E}_k[\|x_{k+1} - x_k\|^2] + \mathbb{E}_k[\langle \nabla f(x_k) - \widetilde{\nabla}_k, x_k - x^\star \rangle] \\ &\qquad + \mathbb{E}_k\big[\langle \widetilde{\nabla}_k + w_{k+1}, x_{k+1} - x^\star \rangle - \tfrac{\mu}{2}\|x_{k+1} - x^\star\|^2\big] \\ &\overset{3}{\le} \tfrac{\eta}{2}\mathbb{E}_k[\|\nabla f(x_k) - \widetilde{\nabla}_k\|^2] + \tfrac{L}{2}\mathbb{E}_k[\|x_{k+1} - x_k\|^2] + \mathbb{E}_k[\langle \nabla f(x_k) - \widetilde{\nabla}_k, x_k - x^\star \rangle] \\ &\qquad - \tfrac{1+\mu\eta}{2\eta}\mathbb{E}_k[\|x_{k+1} - x^\star\|^2] + \tfrac{1}{2\eta}\|x_k - x^\star\|^2. \qquad (4) \end{aligned}$$

Inequality 1 follows from the convexity of $f$ and the Lipschitz continuity of $\nabla f$, inequality 2 follows from the (strong) convexity of $g$ together with Young's inequality, and inequality 3 comes from the implicit definition of the proximal operator (3). For the last line of (4), we observe that the inner product term $\mathbb{E}_k[\langle \nabla f(x_k) - \widetilde{\nabla}_k, x_k - x^\star \rangle]$ vanishes when $\widetilde{\nabla}_k$ is an unbiased estimate of $\nabla f(x_k)$. When the estimator is biased, we must develop a new way to control this term, together with $\mathbb{E}_k[\|\nabla f(x_k) - \widetilde{\nabla}_k\|^2]$. Hence, the following terms arise in our convergence analysis from the bias and the variance of the gradient estimator:

$$\text{Bias:}\ \ \mathbb{E}_k[\langle \nabla f(x_k) - \widetilde{\nabla}_k, x_k - x^\star \rangle] \ \text{ and } \ \|\mathbb{E}_k[\widetilde{\nabla}_k] - \nabla f(x_k)\|^2, \qquad \text{Variance:}\ \ \mathbb{E}_k[\|\widetilde{\nabla}_k - \mathbb{E}_k[\widetilde{\nabla}_k]\|^2]. \qquad (5)$$
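The MSE decomposition above can be verified numerically to machine precision. The script below is our own toy check: for a B-SAGA-type estimator with random data, it computes the exact conditional mean, variance, and MSE by enumerating all $n$ equally likely outcomes of $j_k$.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 12, 3
A = rng.standard_normal((n, p)); b = rng.standard_normal(n)
grad_i = lambda i, x: A[i] * (A[i] @ x - b[i])
grad_f = lambda x: np.mean([grad_i(i, x) for i in range(n)], axis=0)

x = rng.standard_normal(p)
phi = rng.standard_normal((n, p))
table = np.array([grad_i(i, phi[i]) for i in range(n)])
avg = table.mean(axis=0)
theta = float(n)                      # SAG-like bias parameter

# all n possible realizations of the estimator, each with probability 1/n
ests = np.array([(grad_i(j, x) - table[j]) / theta + avg for j in range(n)])
mean = ests.mean(axis=0)
mse = np.mean(np.sum((ests - grad_f(x)) ** 2, axis=1))
bias_sq = np.sum((mean - grad_f(x)) ** 2)
var = np.mean(np.sum((ests - mean) ** 2, axis=1))
assert np.isclose(mse, bias_sq + var)   # MSE = squared bias + variance
```

Shrinking the stochastic correction by $1/\theta$ shrinks the variance term while introducing a nonzero `bias_sq`, which is precisely the tradeoff discussed in this section.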
An effective gradient estimator minimizes all three of these terms simultaneously. As we later show in our MSE and bias-term bounds, SARAH and SARGE minimize these terms more effectively than the biased SAGA and SVRG estimators, leading to better convergence rates. We provide an explicit comparison between SARAH and SVRG in Appendix G.

**Remark 8** (Non-composite case $g \equiv 0$) When $g \equiv 0$, for gradient descent, the descent property of $f$ yields

$$f(x_{k+1}) - f(x^\star) \le \big(\tfrac{L}{2} - \tfrac{1}{\eta}\big)\|x_{k+1} - x_k\|^2 + f(x_k) - f(x^\star),$$

where $\eta \le 2/L$. For stochastic gradient descent, we obtain the following relationship:

$$\begin{aligned} \mathbb{E}_k[f(x_{k+1}) - f(x^\star)] &= \mathbb{E}_k[f(x_{k+1}) - f(x_k) + f(x_k) - f(x^\star)] \\ &\le \mathbb{E}_k[\langle \nabla f(x_k) - \widetilde{\nabla}_k, x_{k+1} - x_k \rangle] + \big(\tfrac{L}{2} - \tfrac{1}{\eta}\big)\mathbb{E}_k[\|x_{k+1} - x_k\|^2] + f(x_k) - f(x^\star) \\ &\le \tfrac{\eta}{2}\mathbb{E}_k[\|\nabla f(x_k) - \widetilde{\nabla}_k\|^2] + \big(\tfrac{L}{2} - \tfrac{1}{2\eta}\big)\mathbb{E}_k[\|x_{k+1} - x_k\|^2] + f(x_k) - f(x^\star). \qquad (6) \end{aligned}$$

Compared to (4), there is no inner product term in (6), which makes the analysis of the non-composite case much simpler. This is one reason why biased algorithms have been successfully studied in the non-composite setting, but not in the composite setting.

### 3.2 The non-convex case

The influence of bias is simpler in the non-convex setting and independent of $g$, which explains why biased algorithms have recently found success on these problems. To begin, let $\hat{x}_{k+1} \in \mathrm{prox}_{\eta g/2}\big(x_k - \tfrac{\eta}{2}\nabla f(x_k)\big)$. Applying Lemma 6 with $z = \hat{x}_{k+1}$, $y = x = x_k$, and $d = \nabla f(x_k)$, we have

$$F(\hat{x}_{k+1}) \le F(x_k) + \big(\tfrac{L}{2} - \tfrac{1}{\eta}\big)\|\hat{x}_{k+1} - x_k\|^2.$$

Again applying Lemma 6, now with $z = x_{k+1}$, $y = \hat{x}_{k+1}$, $x = x_k$, and $d = \widetilde{\nabla}_k$, we obtain

$$F(x_{k+1}) \le F(\hat{x}_{k+1}) + \langle \nabla f(x_k) - \widetilde{\nabla}_k, x_{k+1} - \hat{x}_{k+1} \rangle + \big(\tfrac{L}{2} - \tfrac{1}{2\eta}\big)\|x_{k+1} - x_k\|^2 + \big(\tfrac{L}{2} + \tfrac{1}{2\eta}\big)\|\hat{x}_{k+1} - x_k\|^2.$$

Adding these two inequalities together gives

$$\begin{aligned} F(x_{k+1}) &\le F(x_k) + \big(L - \tfrac{1}{2\eta}\big)\|\hat{x}_{k+1} - x_k\|^2 + \big(\tfrac{L}{2} - \tfrac{1}{2\eta}\big)\|x_{k+1} - x_k\|^2 + \langle \nabla f(x_k) - \widetilde{\nabla}_k, x_{k+1} - \hat{x}_{k+1} \rangle \\ &\overset{1}{\le} F(x_k) + \big(L - \tfrac{1}{2\eta}\big)\|\hat{x}_{k+1} - x_k\|^2 + \big(\tfrac{L}{2} - \tfrac{1}{2\eta}\big)\|x_{k+1} - x_k\|^2 + 2\eta\|\nabla f(x_k) - \widetilde{\nabla}_k\|^2 + \tfrac{1}{8\eta}\|\hat{x}_{k+1} - x_{k+1}\|^2 \\ &\overset{2}{\le} F(x_k) + \big(L - \tfrac{1}{4\eta}\big)\|\hat{x}_{k+1} - x_k\|^2 + \big(\tfrac{L}{2} - \tfrac{1}{4\eta}\big)\|x_{k+1} - x_k\|^2 + 2\eta\|\nabla f(x_k) - \widetilde{\nabla}_k\|^2. \qquad (7) \end{aligned}$$

Inequality 1 is Young's, and inequality 2 is the standard inequality $\|a - c\|^2 \le 2\|a - b\|^2 + 2\|b - c\|^2$.
In the non-convex case, the inner-product bias term does not appear, so the bias-variance tradeoff is the classical one.

### 3.3 General bounds on bias and variance

To ensure convergence of Algorithm 1 using a particular gradient estimator, we must bound the inner-product bias term, $\mathbb{E}_k[\langle \nabla f(x_k) - \widetilde{\nabla}_k, x_k - x^\star \rangle]$, and the MSE, $\mathbb{E}_k[\|\nabla f(x_k) - \widetilde{\nabla}_k\|^2]$. Below we introduce general bounds on these terms that allow us to establish convergence rates for a variety of gradient estimators. The first of these is a bound on the MSE term.

**Definition 9** (Bounded MSE) *The stochastic gradient estimator $\widetilde{\nabla}$ is said to satisfy the $\mathrm{BMSE}(M_1, M_2, \rho_M, \rho_F, m)$ property with parameters $M_1, M_2 \ge 0$, $\rho_M, \rho_F \in (0, 1]$, and $m \ge 1$ if there exist sequences $\mathcal{M}_k$ and $\mathcal{F}_k$ such that*

$$\sum_{k=ms}^{m(s+1)-1} \mathbb{E}\big[\|\widetilde{\nabla}_k - \nabla f(x_k)\|^2\big] \le \mathcal{M}_{ms},$$

*and the following bounds hold:*

$$\mathcal{M}_{ms} \le (1 - \rho_M)^m \mathcal{M}_{m(s-1)} + \mathcal{F}_{ms} + \frac{M_1}{n} \sum_{k=ms}^{m(s+1)-1} \sum_{i=1}^{n} \mathbb{E}\big[\|\nabla f_i(x_{k+1}) - \nabla f_i(x_k)\|^2\big],$$

$$\mathcal{F}_{ms} \le \sum_{\ell=0}^{s} M_2 (1 - \rho_F)^{m(s-\ell)} \frac{1}{n} \sum_{k=m\ell}^{m(\ell+1)-1} \sum_{i=1}^{n} \mathbb{E}\big[\|\nabla f_i(x_{k+1}) - \nabla f_i(x_k)\|^2\big].$$

The constant $m$ is the epoch length of the gradient estimator, hence it is usually set to be $O(n)$. This property is useful in convergence analyses because it bounds the MSE by a geometrically decaying sequence $\{\mathcal{M}_{mk}\}_{k \in \mathbb{N}}$ and a component that is proportional to the one-iteration progress of gradient descent ($\tfrac{1}{n}\sum_{i=1}^n \|\nabla f_i(x_{k+1}) - \nabla f_i(x_k)\|^2$). Most variance-reduced stochastic gradient estimators satisfy the BMSE property, including SAG, SAGA, SVRG, SARAH, and all the estimators considered by Hofmann et al. (2015). SGD does not satisfy this property, as its variance does not decay along the iterations.

Most existing work on the analysis of general stochastic gradient algorithms enforces bounds of this form on either the MSE or the moments of the stochastic estimator, with the crucial difference that existing works require the bounds to be Markovian (that is, dependent only on the previous iteration) (Bottou et al., 2018). In contrast, the BMSE property allows non-Markovian MSE bounds through the sequence $\mathcal{F}_k$.
This relaxation is crucial for the analysis of our new gradient estimator, SARGE.

In order to bound the inner-product bias term, we require the gradient estimator to admit a certain structure in its bias. In biased estimators such as SAG, the bias depends on the stored gradient values:

$$\nabla f(x_k) - \mathbb{E}_k\big[\widetilde{\nabla}^{\mathrm{SAG}}_k\big] = \big(1 - \tfrac{1}{n}\big)\Big(\nabla f(x_k) - \tfrac{1}{n}\sum_{i=1}^{n} \nabla f_i(\phi^i_k)\Big).$$

We call estimators whose bias admits the above structure *memory-biased* gradient estimators. These include SAG, and more generally B-SAGA and B-SVRG.

**Definition 11** (Memory-biased gradient estimator) *The stochastic gradient estimator $\widetilde{\nabla}$ is memory-biased with parameters $\theta > 0$, $B_1 \ge 0$, and $m \ge 1$ if*

$$\nabla f(x_k) - \mathbb{E}_k[\widetilde{\nabla}_k] = \big(1 - \tfrac{1}{\theta}\big)\Big(\nabla f(x_k) - \tfrac{1}{n}\sum_{i=1}^{n} \nabla f_i(\phi^i_k)\Big)$$

*for some $\{\phi^i_k\}_{i=1}^n \subset \{x_\ell\}_{\ell=0}^{k-1}$, and for any $s \in \mathbb{N}_0$,*

$$\sum_{k=ms}^{m(s+1)-1} \frac{1}{n}\sum_{i=1}^{n} \mathbb{E}\big[\|x_k - \phi^i_k\|^2\big] \le B_1 \sum_{k=ms}^{m(s+1)-1} \mathbb{E}\big[\|x_k - x_{k-1}\|^2\big]. \qquad (8)$$

B-SAGA is clearly a memory-biased estimator, and so is B-SVRG, where $\phi^i_k = \phi^i_{ms}$ for all $k$ in epoch $s$. The parameter $\theta$ controls the amount of bias in the estimator, and $B_1$, in a sense, measures how stale the stored gradient information is. For memory-biased gradient estimators, the bias term can be bounded by terms of the form $\|x_k - \phi^i_k\|^2$.

**Lemma 12** *Suppose $\widetilde{\nabla}$ is memory-biased with parameter $\theta \ge 1$ and that $F$ is $\mu$-strongly convex with $\mu \ge 0$. For any $\lambda > 0$, the following inequality holds:*

$$\begin{aligned} \eta\,\mathbb{E}_k[F(x_{k+1}) - F(x^\star)] &\le \tfrac{\eta}{2L\lambda}\mathbb{E}_k\big[\|\widetilde{\nabla}_k - \nabla f(x_k)\|^2\big] - \tfrac{1+\mu\eta}{2}\mathbb{E}_k\big[\|x_{k+1} - x^\star\|^2\big] + \tfrac{1}{2}\|x_k - x^\star\|^2 \\ &\quad + \Big(\tfrac{\eta L(\lambda+1) - 1}{2}\Big)\mathbb{E}_k\big[\|x_{k+1} - x_k\|^2\big] + \eta L\big(1 - \tfrac{1}{\theta}\big)\tfrac{1}{n}\sum_{i=1}^{n} \|x_k - \phi^i_k\|^2. \end{aligned}$$

The proof of Lemma 12 can be found in Appendix A. The bound of Lemma 12 is analogous to the bound in (4), but the inner-product bias term is replaced with $\eta L (1 - \tfrac{1}{\theta}) \tfrac{1}{n}\sum_{i=1}^n \|x_k - \phi^i_k\|^2$. This term is proportional to the progress of gradient descent (by (8)), so this provides the necessary control over the inner-product bias term.

**Remark 13** Lemma 12 requires $\theta \ge 1$, so the rates we derive for the convex setting hold only for $\theta \ge 1$. However, the convergence rates we prove in Section 4.1.2 for the non-convex setting, which allow $\theta \in (0, 1)$, also hold for convex problems.
In summary, our results guarantee convergence for all $\theta > 0$, but they suggest different rates for the parameter settings $\theta < 1$ and $\theta \ge 1$.

For estimators such as SARAH, the bias depends on the error in the previous gradient estimate rather than on previous stochastic gradients:

$$\nabla f(x_k) - \mathbb{E}_k\big[\widetilde{\nabla}^{\mathrm{SARAH}}_k\big] = \nabla f(x_{k-1}) - \widetilde{\nabla}^{\mathrm{SARAH}}_{k-1}.$$

We refer to estimators of this type as *recursively biased*.

**Definition 14** (Recursively biased gradient estimator) *For any sequence $\{x_k\}$, let $\widetilde{\nabla}_k$ be a stochastic gradient estimator generated from the points $\{x_\ell\}_{\ell=0}^{k}$. This estimator is recursively biased with parameters $\rho_B \in (0, 1]$ and $\nu \ge 1$ if*

$$\nabla f(x_k) - \mathbb{E}_k[\widetilde{\nabla}_k] = \begin{cases} 0 & \text{for } k \in \nu\mathbb{N}_0, \\ (1 - \rho_B)\big(\nabla f(x_{k-1}) - \widetilde{\nabla}_{k-1}\big) & \text{otherwise.} \end{cases}$$

The parameter $\nu$ represents how many steps occur between full gradient evaluations. For SARGE, $\nu = \infty$ because the full gradient is never computed. Recursively biased estimators admit a bound on the inner-product bias term that involves the estimator's MSE.

**Lemma 15** *Suppose $\widetilde{\nabla}$ is a recursively biased gradient estimator with parameters $\nu \ge 1$ and $\rho_B \in (0, 1]$. Then, for any epoch $s \in \mathbb{N} \cup \{0\}$ and any $\epsilon > 0$,*

$$\sum_{k=\nu s+1}^{\nu(s+1)-1} \big|\mathbb{E}\langle \nabla f(x_{k-1}) - \widetilde{\nabla}_{k-1}, x_k - x^\star \rangle\big| \le \min\Big\{\nu, \tfrac{1}{\rho_B}\Big\} \sum_{k=\nu s}^{\nu(s+1)-1} \mathbb{E}\Big[\tfrac{\epsilon}{2}\|\nabla f(x_k) - \widetilde{\nabla}_k\|^2 + \tfrac{1}{2\epsilon}\|x_{k+1} - x_k\|^2\Big].$$

The proof of Lemma 15 is in Appendix B. Lemma 15 shows that, for recursively biased estimators, the inner-product bias term $\langle \nabla f(x_{k-1}) - \widetilde{\nabla}_{k-1}, x_k - x^\star \rangle$ is bounded from above by the MSE, implying that introducing bias to decrease the MSE is a reasonable approach to designing improved gradient estimators.

**Remark 16** When $\nu = \infty$, which is the case for SARGE, there is only a single epoch, $s = 0$. In this case, we adopt the convention $\infty \cdot 0 = 0$, so that the sums appearing in Lemma 15 are well-defined.

## 4. Convergence rates

In this section, we analyze the convergence rates of the stochastic gradient methods. We first provide very general convergence rates based on the bounds from the last section. Then, we specialize the results to specific gradient estimators, including the memory-biased B-SAGA and B-SVRG, and the recursively biased SARAH and SARGE.
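The recursive bias structure of SARAH can be verified exactly for a single inner step: conditioning on the history $(x_{k-1}, x_k, \widetilde{\nabla}_{k-1})$ and averaging over all $n$ choices of $j_k$, the bias at $x_k$ equals $\nabla f(x_{k-1}) - \widetilde{\nabla}_{k-1}$. The snippet below is our own toy verification of this identity on random data.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 12, 3
A = rng.standard_normal((n, p)); b = rng.standard_normal(n)
grad_i = lambda i, x: A[i] * (A[i] @ x - b[i])   # f_i(x) = 0.5*(a_i @ x - b_i)^2
grad_f = lambda x: np.mean([grad_i(i, x) for i in range(n)], axis=0)

# one SARAH inner step from an arbitrary history (x_prev, e_prev)
x_prev, x = rng.standard_normal(p), rng.standard_normal(p)
e_prev = rng.standard_normal(p)
ests = np.array([grad_i(j, x) - grad_i(j, x_prev) + e_prev for j in range(n)])
bias = grad_f(x) - ests.mean(axis=0)
# recursive bias: grad f(x_{k-1}) - estimate_{k-1}, i.e. rho_B = 0 within an epoch
assert np.allclose(bias, grad_f(x_prev) - e_prev)
```

At the start of an epoch, $\widetilde{\nabla}_0$ is the exact full gradient, so the bias is zero there, matching the $k \in \nu\mathbb{N}_0$ case of Definition 14.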
### 4.1 General convergence rates

For Algorithm 1, we consider a constant step size $\eta_k \equiv \eta > 0$. Given $T$ iterations of Algorithm 1, define the average iterate $\bar{x}_T \stackrel{\mathrm{def}}{=} \tfrac{1}{T}\sum_{k=1}^{T} x_k$.

#### 4.1.1 Convex and strongly convex cases

The following theorem establishes convergence rates for memory-biased estimators in the convex regime.

**Theorem 17** (Memory-biased estimators) *Let $\widetilde{\nabla}$ be a memory-biased gradient estimator parameterized by $\theta \ge 1$ and $B_1 \ge 0$, which satisfies the $\mathrm{BMSE}(M_1, M_2, \rho_M, \rho_F, m)$ property. Let $\Theta = \tfrac{M_1\rho_F + 2M_2}{\rho_M\rho_F}$ and $\rho = \min\{\rho_M, \rho_F\}$. In the convex setting, let $\eta = \tfrac{1}{L(1 + 3\sqrt{2\Theta})}$. Then*

$$\mathbb{E}[F(\bar{x}_T) - F(x^\star)] \le \frac{1}{T}\Big(\tfrac{1}{2\eta}\|x_0 - x^\star\|^2 + \max\big\{B_1(1 - 1/\theta)\sqrt{2\Theta} - 1, 0\big\}\big(F(x_0) - F(x^\star)\big)\Big).$$

*When $g$ is additionally $\mu$-strongly convex with $\mu > 0$, let $\eta = \min\Big\{\tfrac{1}{3L(1 + 3\sqrt{2\Theta\,B_1\mu(1 - 1/\theta)})}, \tfrac{\rho}{2\mu}\Big\}$. The iterate $\bar{x}_T$ satisfies*

$$\mathbb{E}[\|\bar{x}_T - x^\star\|^2] \le (1 + \mu\eta)^{-T}\Big(\tfrac{2}{\mu}\big(F(x_0) - F(x^\star)\big) + \|x_0 - x^\star\|^2\Big).$$

The proof of Theorem 17 is provided in Appendix A. The next result establishes convergence rates for recursively biased gradient estimators; its proof is in Appendix B.

**Theorem 18** (Recursively biased estimators) *Let $\widetilde{\nabla}$ be a recursively biased gradient estimator parameterized by $\rho_B \in (0, 1)$ and $\nu \ge 1$, which satisfies the $\mathrm{BMSE}(M_1, M_2, \rho_M, \rho_F, m)$ property. Let $B_2 \stackrel{\mathrm{def}}{=} \min\{\nu, 1/\rho_B\}$, $\Theta = \tfrac{M_1\rho_F + 2M_2}{\rho_M\rho_F}$, and $\rho = \min\{\rho_M, \rho_F\}$. In the convex setting, let*

$$\eta = \frac{1}{2\Theta L\big(1 + \tfrac{(1-\rho_B)B_2}{1+\delta}\big) + L}, \qquad \delta = \max\Big\{\sqrt{L\,\Theta^{1/2}(1 - \rho_B)B_2} - 1, 0\Big\}.$$

*Then*

$$\mathbb{E}[F(\bar{x}_T) - F(x^\star)] \le \frac{1}{T}\Big(\tfrac{1}{2\eta}\|x_0 - x^\star\|^2 + \delta\big(F(x_0) - F(x^\star)\big)\Big).$$

*When $g$ is additionally $\mu$-strongly convex with $\mu > 0$, let $\eta = \min\Big\{\tfrac{1}{3L(4\sqrt{\Theta} + 1)}, \tfrac{1}{\mu(1 - \rho_B)B_2}, \tfrac{\rho}{2\mu}\Big\}$. Then*

$$\mathbb{E}[\|\bar{x}_T - x^\star\|^2] \le (1 + \mu\eta)^{-T}\Big(\tfrac{2}{\mu}\big(F(x_0) - F(x^\star)\big) + \|x_0 - x^\star\|^2\Big).$$

Both theorems hold true for smaller $\eta$ as well; the choices in the theorems are the largest ones allowed by our analysis. For B-SAGA and B-SVRG, $\Theta = O(n^2)$, while for SARAH and SARGE, $\Theta = O(n)$. This gives these recursive gradient estimators improved convergence rates and suggests that the bias in these estimators is more effective than the bias in SAGA and SVRG.
4.1.2 Non-convex case

The analysis of biased gradient estimators is simpler in the non-convex setting than in the convex ones due to the absence of the inner-product bias term in (7). Below we provide a uniform convergence guarantee for all gradient estimators satisfying the BMSE property, regardless of their bias. This suggests that in the non-convex setting, a large-bias, small-MSE gradient estimator is preferable to an estimator with small bias and large MSE.

Theorem 20 Let e be a gradient estimator that satisfies the BMSE$(M_1, M_2, \rho_M, \rho_F, m)$ property, let $\Theta = \frac{M_1\rho_F + 2M_2}{\rho_M\rho_F}$, and let α be chosen uniformly at random from the set {0, 1, ..., T − 1}. If F is non-convex, set $\eta = \frac{\sqrt{16\Theta+1}-1}{16L\Theta}$ in Algorithm 1; the point $x_\alpha$ then satisfies the following bound on its generalized gradient:
$$\mathbb{E}[\|\mathcal{G}_{\eta/2}(x_\alpha)\|^2] \le \frac{16\big(F(x_0) - F(x^*)\big)}{T\eta(1 - 4\eta L)}.$$

The proof of this result is provided in Appendix C.

Remark 21 The convergence result of Theorem 20 depends on the bias only through the MSE of the gradient estimator, which implies that incorporating arbitrary amounts of bias in exchange for a smaller MSE improves the convergence rate. This fact is what allows the recursively biased estimators SARAH and SARGE to achieve the oracle complexity lower bound for non-convex optimization when they are used in Algorithm 1.

4.2 Convergence rates for specific gradient estimators

In this section, we specialize the general convergence rates to analyze the performance of B-SAGA, B-SVRG, SARAH, and SARGE.

4.2.1 Biased SAGA and SVRG

B-SAGA and B-SVRG are examples of memory-biased gradient estimators, as their biases take the form
$$\nabla f(x_k) - \mathbb{E}_k[e_k] = \Big(1 - \frac{1}{\theta}\Big)\Big(\nabla f(x_k) - \frac{1}{n}\sum_{i=1}^n \nabla f_i(\phi^k_i)\Big),$$
for some previous iterates $\phi^k_i$. To establish convergence rates for B-SAGA and B-SVRG, we only need to show that these estimators satisfy the BMSE property with suitable constants.
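As an illustration (toy data and helper names of our choosing, not the paper's experiments), the sketch below forms the B-SAGA estimator $e_k = \frac{1}{\theta}(\nabla f_{j_k}(x_k) - \nabla f_{j_k}(\phi^k_{j_k})) + \frac{1}{n}\sum_i \nabla f_i(\phi^k_i)$ and verifies the memory-bias identity $\nabla f(x_k) - \mathbb{E}_k[e_k] = (1 - 1/\theta)(\nabla f(x_k) - \frac{1}{n}\sum_i \nabla f_i(\phi^k_i))$ numerically by averaging over the sampled index:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 8, 3
H = rng.standard_normal((n, p))
y = rng.standard_normal(n)

def grad_i(i, x):
    # gradient of f_i(x) = 0.5 * (h_i^T x - y_i)^2
    return (H[i] @ x - y[i]) * H[i]

def full_grad(x):
    return H.T @ (H @ x - y) / n

theta = 10.0  # bias parameter; theta = 1 recovers plain (unbiased) SAGA
x = rng.standard_normal(p)
# stored gradients evaluated at past points phi_i (here: arbitrary points)
table = np.stack([grad_i(i, rng.standard_normal(p)) for i in range(n)])
avg = table.mean(axis=0)

# expectation of the B-SAGA estimator over the uniformly random index j
exp_e = np.mean([(grad_i(j, x) - table[j]) / theta for j in range(n)], axis=0) + avg

# memory-bias identity: grad f(x) - E[e] = (1 - 1/theta)(grad f(x) - avg)
bias = full_grad(x) - exp_e
assert np.allclose(bias, (1 - 1 / theta) * (full_grad(x) - avg))
```

Dividing the gradient difference by θ > 1 shrinks the variance of the random part at the price of the bias shown in the assertion, which is exactly the trade-off the BMSE analysis quantifies.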
Lemma 22 The B-SAGA gradient estimator is memory-biased with $B_1 = 2n(2n+1)$, and it satisfies the BMSE property with parameters $\rho_M = \frac{1}{2n}$, $m = 1$, $M_2 = 0$, $\rho_F = 1$, and
$$M_1 = \begin{cases} \frac{2n+1}{\theta^2} & \theta \in (0, 2], \\ (2n+1)\big(1 - \frac{1}{\theta}\big)^2 & \theta > 2. \end{cases}$$

The proof of Lemma 22 uses a slight modification of existing variance bounds for the SAGA estimator, appearing in (Defazio et al., 2014a) for example. We include the proof in Appendix D. The B-SVRG gradient estimator satisfies the BMSE property with similar constants.

Lemma 23 The B-SVRG gradient estimator is memory-biased with $B_1 = 3m(m+1)$, and it satisfies the BMSE property with parameters $\rho_M = 1$, $M_2 = 0$, $\rho_F = 1$, and
$$M_1 = \begin{cases} \frac{3m(m+1)}{\theta^2} & \theta \in (0, 2], \\ 3m(m+1)\big(1 - \frac{1}{\theta}\big)^2 & \theta > 2. \end{cases}$$

With these constants established, Theorem 17 provides rates of convergence.

Corollary 24 (Convergence rates for B-SAGA) Algorithm 1 achieves the following convergence guarantees using the B-SAGA gradient estimator:

In the convex setting, depending on the choice of θ, set the step size to
$$\eta = \eta_\theta \overset{\mathrm{def}}{=} \begin{cases} \frac{1}{L(1 + \frac{6}{\theta}\sqrt{n(2n+1)})} & \theta \in [1, 2], \\ \frac{1}{L(1 + 6(1-\frac{1}{\theta})\sqrt{n(2n+1)})} & \theta > 2, \end{cases}$$
and $\bar{x}_T$ satisfies $\mathbb{E}[F(\bar{x}_T) - F(x^*)] = O(Ln/T)$.

If additionally g is µ-strongly convex, set $\eta = \min\{\eta_\theta, \frac{1}{4\mu n}\}$. Then $x_T$ satisfies $\mathbb{E}[\|x_T - x^*\|^2] = O((1+\mu\eta)^{-T})$.

In the non-convex setting, after T iterations, the generalized gradient at $x_\alpha$ satisfies
$$\mathbb{E}[\|\mathcal{G}_{\eta/2}(x_\alpha)\|^2] = \begin{cases} O\big(\frac{Ln}{T\theta}\big) & \eta = \frac{\theta}{2L\sqrt{n(2n+1)}},\ \theta \in (0, 2], \\ O\big(\frac{Ln(1-1/\theta)}{T}\big) & \eta = \frac{1}{2L(1-\frac{1}{\theta})\sqrt{n(2n+1)}},\ \theta > 2. \end{cases}$$

Corollary 25 (Convergence rates for B-SVRG) Algorithm 1 achieves the following convergence guarantees using the B-SVRG gradient estimator:

In the convex setting, depending on the choice of θ, set the step size to
$$\eta = \eta_\theta \overset{\mathrm{def}}{=} \begin{cases} \frac{1}{L(1 + \frac{3}{\theta}\sqrt{6m(m+1)})} & \theta \in [1, 2], \\ \frac{1}{L(1 + 3(1-\frac{1}{\theta})\sqrt{6m(m+1)})} & \theta > 2. \end{cases}$$
After S epochs, the point $\bar{x}_{mS}$ satisfies $\mathbb{E}[F(\bar{x}_{mS}) - F(x^*)] = O(L/S)$.

If additionally g is µ-strongly convex, let $\eta = \min\{\eta_\theta, \frac{1}{2\mu}\}$. After S epochs, $x_{mS}$ satisfies $\mathbb{E}[\|x_{mS} - x^*\|^2] = O((1+\mu\eta)^{-mS})$.

In the non-convex setting, after S epochs, the generalized gradient at $x_\alpha$ satisfies
$$\mathbb{E}[\|\mathcal{G}_{\eta/2}(x_\alpha)\|^2] = \begin{cases} O\big(\frac{L}{S\theta}\big) & \eta = \frac{\theta}{2L\sqrt{6m(m+1)}},\ \theta \in (0, 2], \\ O\big(\frac{L(1-1/\theta)}{S}\big) & \eta = \frac{1}{2L(1-\frac{1}{\theta})\sqrt{6m(m+1)}},\ \theta > 2. \end{cases}$$
Our MSE bounds and convergence rates are optimized when θ = 2. Numerical experiments (including those in Section 5) suggest that setting θ in the range 1 < θ < n gives the best performance for convex problems, and that B-SAGA prefers larger values of θ than B-SVRG. The convergence guarantees for θ ∈ (0, 1) still hold for convex objectives, but in this setting, the rates we obtain for θ ≥ 1 are superior, suggesting that for convex problems, θ should be at least one for best performance. Our numerical experiments in Section 5 support this: setting θ < 1 can be beneficial for non-convex problems, but we do not observe this for convex problems.

In the special case θ = 1, Corollaries 24 and 25 recover the state-of-the-art rates for SAGA and SVRG in the non-convex regime. For strongly convex problems, these rates are worse than the existing convergence rates of $O((1+\min\{\mu/L, 1/n\})^{-T})$ proven for SAGA and SVRG (Defazio et al., 2014a; Xiao and Zhang, 2014). This difference is due to the generality of Theorem 17, as some memory-biased estimators, including B-SVRG, exhibit poor performance on strongly convex problems when the bias is large. Corollaries 24 and 25 also require step sizes that decrease with n, while existing results for SAG, SAGA, and SVRG allow step sizes that are independent of n. This too is due to the generality of Theorem 17. In practice, we find that B-SAGA converges with step sizes that are independent of n, but B-SVRG requires smaller step sizes when the epoch length is larger.

4.2.2 SARAH and SARGE

The SARAH and SARGE gradient estimators are recursively biased, with
$$\nabla f(x_k) - \mathbb{E}_k[e^{\mathrm{SARAH}}_k] = \nabla f(x_{k-1}) - e^{\mathrm{SARAH}}_{k-1}$$
and
$$\nabla f(x_k) - \mathbb{E}_k[e^{\mathrm{SARGE}}_k] = \Big(1 - \frac{1}{n}\Big)\big(\nabla f(x_{k-1}) - e^{\mathrm{SARGE}}_{k-1}\big).$$
As we shall see, these biased estimators admit smaller MSE bounds than unbiased and memory-biased estimators, and this is reflected in their improved convergence rates.
The following two lemmas establish the constants appearing in Theorem 18 for these estimators.

Lemma 27 The SARAH gradient estimator is recursively biased with parameters $\rho_B = 0$ and ν = m, and it satisfies the BMSE property with $M_1 = m$, $\rho_M = 1$, $\rho_F = 1$, and $M_2 = 0$.

Lemma 28 The SARGE gradient estimator is recursively biased with parameters $\rho_B = \frac{1}{n}$ and ν = ∞, and it satisfies the BMSE property with $M_1 = 12$, $M_2 = \frac{39}{n}$, $\rho_M = \frac{1}{4n}$, $\rho_F = \frac{1}{2n}$, and $m = 1$.

Proofs of these results are included in Appendices E and F, respectively. It is enlightening to compare these BMSE constants to those of B-SVRG and B-SAGA. $M_1$ is a factor of n smaller for the SARAH and SARGE estimators than for the B-SVRG and B-SAGA estimators (as long as m = O(n) in SARAH and B-SVRG). This translates to an O(√n) improvement in the convergence rates for SARAH and SARGE when L is O(√n) or larger.

Corollary 29 (Convergence rates for SARAH) When using the SARAH gradient estimator in Algorithm 1,

If F is convex, set $\eta = \frac{1}{L(2\sqrt{2m}+1) + 2\sqrt{2L}\,m^{3/4}}$. After T iterations, $\bar{x}_T$ satisfies $\mathbb{E}[F(\bar{x}_T) - F(x^*)] = O\big((L\sqrt{m} + \sqrt{L}\,m^{3/4})/T\big)$.

If F is µ-strongly convex, set $\eta = \min\{\frac{1}{3L(4\sqrt{2m}+1)}, \frac{1}{\mu m}\}$; then $\mathbb{E}[\|x_T - x^*\|^2] = O((1+\mu\eta)^{-T})$.

If F is non-convex, set $\eta = \frac{1}{L\sqrt{2m}}$; then $\mathbb{E}[\|\mathcal{G}_{\eta/2}(x_\alpha)\|^2] \le O(L\sqrt{m}/T)$.

Corollary 30 (Convergence rates for SARGE) When using the SARGE gradient estimator in Algorithm 1,

If F is convex, set $\eta = \frac{1}{L(74\sqrt{n} + 15\sqrt{L}\,n^{3/4} + 1)}$; then $\mathbb{E}[F(\bar{x}_T) - F(x^*)] = O\big((L\sqrt{n} + \sqrt{L}\,n^{3/4})/T\big)$.

If F is µ-strongly convex, set $\eta = \min\{\frac{1}{3L(16\sqrt{3(n+13)}+1)}, \frac{1}{4\mu n}\}$; then $\mathbb{E}[\|x_T - x^*\|^2] = O((1+\mu\eta)^{-T})$.

If F is non-convex, set $\eta = \frac{1}{4L\sqrt{3(n+13)}}$; then $\mathbb{E}[\|\mathcal{G}_{\eta/2}(x_\alpha)\|^2] \le O(L\sqrt{n}/T)$.

Remark 31 Our theoretical results suggest that the step sizes for SARAH and SARGE should decrease with n, and such step sizes lead to optimal convergence guarantees for non-convex problems. In practice, however, this is not necessary: we find that larger step sizes that are independent of n lead to better performance. We provide examples in Section 5.
These convergence rates for convex objectives represent a significant improvement over the performance of SAGA, SVRG, and full-gradient methods. Each of these algorithms requires $O(nL/\epsilon)$ stochastic gradient evaluations to find a point satisfying $F(\bar{x}_T) - F(x^*) \le \epsilon$, while SARAH and SARGE require only $O(\sqrt{n}\,L/\epsilon)$. These rates do not require the epoch-doubling procedure of (Allen-Zhu and Yuan, 2018), although epoch-doubling can potentially be used to improve the performance of SARAH just as it improves the performance of SVRG on non-strongly convex objectives.

This square-root dependence on n is present in the convergence rates for strongly convex and non-convex objectives as well, a significant improvement over the dependence on n in the convergence rates of B-SAGA and B-SVRG. The better dependence on n is most significant in the non-convex regime, where these convergence rates imply that the SARAH and SARGE estimators require only $O(n + L\sqrt{n}/\epsilon^2)$ stochastic gradient evaluations to find an ϵ-approximate stationary point, which matches the oracle-complexity lower bound (Fang et al., 2018). Similar results already exist for algorithms using the SARAH estimator (Fang et al., 2018; Wang et al., 2019, 2018; Pham et al., 2019). Our results for SARGE show that achieving this complexity is possible without ever computing the full gradient.

5. Numerical Experiments

In this section, we present numerical experiments testing B-SAGA, B-SVRG, SARAH, and SARGE for minimizing convex, strongly convex, and non-convex objectives. We include one set of experiments comparing different values of θ in B-SAGA and B-SVRG with a fixed step size, and one set comparing SARAH and SARGE to B-SAGA and B-SVRG with the best values of θ.²

5.1 Convex and strongly convex objectives

Let $(h_i, l_i) \in \mathbb{R}^p \times \{\pm 1\}$, $i = 1, \ldots, n$, be the training set, where $h_i \in \mathbb{R}^p$ is the feature vector of each data sample and $l_i$ is the binary label. Let β > 0 be a tuning parameter.
The ridge regression problem takes the form
$$\min_{x\in\mathbb{R}^p}\ \frac{1}{n}\sum_{i=1}^n (h_i^\top x - l_i)^2 + \frac{\beta}{2}\|x\|_2^2.$$
LASSO is similar, but with the regularizer $\beta\|x\|_1$ replacing $\frac{\beta}{2}\|x\|_2^2$. These problems are of the form (1), where we set $f_i = (h_i^\top x - l_i)^2$ and g equal to the regularizer. In ridge regression, g is strongly convex; in LASSO, g is only convex.

We consider four binary classification data sets: australian, mushrooms, phishing, and ijcnn1 from LIBSVM³. We rescale the data values to [−1, 1], set β = 1/n, and set the step size to η = 1/(5L). To compare performance, we report the objective suboptimality $F(x_k) - F(x^*)$.

Comparison of B-SAGA We first compare the performance of B-SAGA under different choices of θ for solving ridge regression and LASSO problems. Four choices of θ are considered: θ ∈ {1, 10, 100, n}. The results are provided in Figure 1, from which we observe that B-SAGA consistently performs better with moderate amounts of bias (that is, θ ∈ (1, n)). For the considered data sets, θ = 10 provides the best overall performance.

Comparison of B-SVRG We also consider four choices of θ for B-SVRG: θ ∈ {0.5, 0.8, 1, 1.5}. The results are shown in Figure 2. We observe that B-SVRG is more sensitive to the choice of θ; only small amounts of bias (that is, θ ∈ [0.8, 1.5]) can occasionally improve performance.

Comparison of different gradient estimators Finally, we compare SAGA, B-SAGA with θ = 10, SVRG, SARAH, and SARGE. The results are provided in Figure 3, from which we observe that SARAH performs similarly to SVRG but is occasionally slower in early epochs. SARGE consistently outperforms all other methods except for B-SAGA with θ = 10.

The above observations indicate that, depending on the data, biased schemes can benefit from their biased gradient estimates. The free parameter θ reduces the MSE of the B-SAGA and B-SVRG gradient estimators, leading to better performance, and the bias in SARAH and SARGE has a similar effect.
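For reference, the proximal steps that Algorithm 1 needs for these two regularizers have simple closed forms. A minimal sketch (the function names are ours):

```python
import numpy as np

def prox_l1(z, t):
    # prox of t * ||x||_1: componentwise soft-thresholding
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def prox_l2sq(z, t):
    # prox of (t/2) * ||x||_2^2: shrinkage toward the origin
    return z / (1.0 + t)

# With g(x) = beta * ||x||_1 and step size eta, the LASSO update is
# x_next = prox_l1(x - eta * e_k, eta * beta); for ridge regression,
# use prox_l2sq with parameter eta * beta instead.
z = np.array([3.0, -0.5, 1.2])
x_lasso = prox_l1(z, 1.0)    # soft-thresholds to [2., -0., 0.2]
x_ridge = prox_l2sq(z, 1.0)  # shrinks to [1.5, -0.25, 0.6]
```

Because both operators are separable and cost O(p), the per-iteration cost of the proximal stochastic gradient step is dominated by the gradient evaluation.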
5.2 Non-convex objectives

To test the effect of bias in the non-convex setting, we consider the non-negative principal component analysis (NN-PCA) problem, which can be formulated as (Reddi et al., 2016b):
$$\min_{x\in\mathbb{R}^p}\ F(x) = -\frac{1}{2n}\sum_{i=1}^n (h_i^\top x)^2 + \iota_C(x),$$
where $C \overset{\mathrm{def}}{=} \{x \in \mathbb{R}^p : \|x\| \le 1,\ x \ge 0\}$ is a convex set and $\iota_C$ is its indicator function, with $\iota_C(x) = +\infty$ for $x \notin C$ and $\iota_C(x) = 0$ otherwise. Letting $g = \iota_C$, the operator $\mathrm{prox}_{\eta g}$ is the projection onto C, which can be computed efficiently.

2. See https://github.com/derekdriggs/StochOpt for MATLAB scripts reproducing these experiments.
3. https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/

Figure 1: First row, performance comparison fitting a LASSO model for different choices of θ in B-SAGA. Second row, performance comparison fitting a ridge regression model for different choices of θ in B-SAGA. The step size for each case is set to η = 1/(5L).

Figure 2: First row, performance comparison fitting a ridge regression model for different choices of θ in B-SVRG. Second row, performance comparison fitting a LASSO model for different choices of θ in B-SVRG. The step size for each case is set to η = 1/(5L).
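The projection onto C has a simple closed form: clip the negative entries to zero, then rescale onto the unit ball if necessary. A minimal sketch (the helper name is ours):

```python
import numpy as np

def proj_C(x):
    # projection onto C = {x : ||x||_2 <= 1, x >= 0}:
    # zero out negative entries, then scale into the unit ball if needed
    y = np.maximum(x, 0.0)
    nrm = np.linalg.norm(y)
    return y / nrm if nrm > 1.0 else y

v = proj_C(np.array([2.0, -1.0, 0.0]))  # -> array([1., 0., 0.])
```

The two steps commute into an exact projection here because zeroing a negative coordinate only decreases the distance to any feasible point, and rescaling a nonnegative vector keeps it nonnegative.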
Figure 3: First row, performance comparison for solving ridge regression among different algorithms. Second row, performance comparison for solving LASSO among different algorithms. For both examples, step sizes are tuned automatically to minimize the number of iterations required to reach a suboptimality of 10⁻¹⁵.

As the problem is non-convex, we cannot measure convergence with respect to the global optimum, so we use many iterations of proximal gradient descent with a small step size (η = 1/(10Ln)) to find a reference point x*. Every test is initialized using a random vector with normally distributed i.i.d. entries, and the same starting point is used for testing each value of θ. We found that small step sizes generally lead to stationary points with smaller objective values, so we set η = 1/(5Ln) for all our experiments. We report F(x_k) − F(x*) averaged over every n iterations.

These experiments show that the performance of B-SAGA and B-SVRG varies significantly with θ, with smaller values leading to better performance. SARAH and SARGE perform similarly to SAGA and SVRG in these experiments; see Figure 4. Comparing all algorithms, B-SAGA and B-SVRG provide the best performance, with B-SVRG being slightly faster.

6. Conclusion

The complicated convergence proofs of biased stochastic gradient methods have restricted researchers to studying unbiased estimators almost exclusively. Our simple framework for proving convergence rates for biased algorithms overcomes this limitation.
Our analysis allows for the study of biased algorithms with proximal support for minimizing convex, strongly convex, and non-convex objectives for the first time. We also show that biased gradient estimators can offer improvements over unbiased estimators in theory and in practice. Most notably, we find that biased recursive gradient estimators, such as SARAH and SARGE, admit smaller bounds on their MSEs and faster convergence rates than SAGA and SVRG.

Figure 4: First row, performance comparison for solving NN-PCA with different choices of θ in B-SAGA. Second row, performance comparison for solving NN-PCA with different choices of θ in B-SVRG. The step size for each case is set to η = 1/(5Ln). The point x* is found by solving the problem using proximal gradient descent.

Figure 5: Performance comparison for solving NN-PCA among different algorithms. All step sizes are set to 1/(5Ln). Objective values are averaged over each epoch (n steps).

Acknowledgements

CBS acknowledges support from the Leverhulme Trust projects on Breaking the Non-Convexity Barrier and on Unveiling the Invisible, the Philip Leverhulme Prize, the EPSRC grant No. EP/M00483X/1, the EPSRC Centre No. EP/N014588/1, the European Union Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 691070, the Cantab Capital Institute for the Mathematics of Information, and the Alan Turing Institute. JL acknowledges support from the Leverhulme Trust and the Newton Trust.
On Biased Stochastic Gradient Estimation The organization of the appendix is as follows: we prove Theorems 17 and 18 in Appendices A and B, respectively, and we prove Theorem 20 in Appendix C. We provide convergence rates for B-SAGA and B-SVRG as special cases of Theorem 17 in Appendix D, and we provide convergence rates for SARAH and SARGE as special cases of Theorem 18 in Appendices E and F, respectively. Appendix A. Proof of Theorem 17 To prove Theorem 17, we begin by showing that the BMSE property (Definition 9) implies the MSE of the gradient estimator over T iterations is proportional to PT 1 k=0 E xk+1 xk 2. Lemma 32 (MSE bound) Suppose that the stochastic gradient estimator e satisfies the BMSE(M1, M2, ρM, ρF , m) property, let ρ = min{ρM, ρF }, and let σs be any sequence satisfying σs(1 ρ)ms σs 1(1 ρ 2)ms. For convenience, define Θ = M1ρF +2M2 ρMρF . The MSE of the gradient estimator is bounded as k=ms E[ f(xk) e k 2] 2ΘL2 PS k=ms E[ xk+1 xk 2]. Proof First, we derive a bound on the sequence Fms arising in the BMSE property. Summing this sequence from s = 0 to s = S, s=0σs Fms PS ℓ=0 M2σs(1 ρF )m(s ℓ) i=1E[ fi(xk+1) fi(xk) 2] ℓ=0 M2σℓ(1 ρF i=1E[ fi(xk+1) fi(xk) 2] i=1E[ fi(xk+1) fi(xk) 2] i=1E[ fi(xk+1) fi(xk) 2]. Inequality 1 uses the fact that σs(1 ρF )ms σs 1(1 ρF 2 )ms. With this bound on Fms, we proceed to bound Mms similarly. s=0 σs Mms 1 PS s=0σs Fms + M1 i=1E[ fi(xk+1) fi(xk) 2] + (1 ρM)m PS s=1 σs Mm(s 1) 2 PS s=0σs M1ρF +2M2 i=1E[ fi(xk+1) fi(xk) 2] s=1 σs 1Mm(s 1) s=0σs M1ρF +2M2 i=1E[ fi(xk+1) fi(xk) 2] s=1 σs 1 M1ρF +2M2 i=1E[ fi(xk+1) fi(xk) 2] + s=0 σs M1ρF +2M2 i=1E[ fi(xk+1) fi(xk) 2] i=1E[ fi(xk+1) fi(xk) 2] k=ms E[ xk+1 xk 2]. Driggs, Liang, Sch onlieb Inequality 1 uses the fact that Mm (1 ρM)m Mm(s 1). Inequality 2 uses σs(1 ρM)ms σs 1(1 ρM 2 )ms, 3 uses the same estimate we applied in (9), and 4 uses the Lipschitz continuity of fi. 
Proof of Lemma 12 By assumption, 1 1 θ 0, so we can apply convexity to obtain η θ (f(xk) f(x )) + η i=1fi(ϕi k) fi(x ) θ f(xk), xk x + η i=1 fi(ϕi k), ϕi k x θ f(xk), xk x + η i=1 fi(ϕi k), xk x + η i=1 fi(ϕi k), ϕi k xk . Because e k is memory-biased, hence 1 θ)Pn i=1 fi(ϕi k) = Ek[e k]. Therefore, η θ f(xk), xk x + η i=1 fi(ϕi k), xk x = Ek η e k, xk x = Ek[η e k, xk xk+1 + η e k, xk+1 x ] Ek η e k, xk xk+1 1 2 xk+1 xk 2 + 1 2 xk x 2 1+µη 2 xk+1 x 2 ηg(xk+1) + ηg(x ) . The inequality is due to Lemma 3 with z = xk+1, x = xk, d = e k, and y = x . Combining these two inequalities, we have shown η θ (f(xk) f(x )) + η i=1 fi(ϕi k) fi(x ) Ek η e k, xk xk+1 1 2 xk+1 xk 2 ηg(xk+1) + ηg(x ) 2 xk x 2 1+µη 2 xk+1 x 2 + η i=1 fi(ϕi k), ϕi k xk . We bound the first three terms on the right further. η e k, xk xk+1 1 2 xk+1 xk 2 ηg(xk+1) = η( f(xk), xk xk+1 g(xk+1)) + η e k f(xk), xk xk+1 1 2 xk+1 xk 2 1 η(f(xk) F(xk+1)) + η e k f(xk), xk xk+1 + ( ηL 2) xk+1 xk 2 2 η(f(xk) F(xk+1)) + η 2Lλ e k f(xk) 2 + ( ηL(λ+1) 2) xk+1 xk 2. Inequality 1 is due to the Lipschitz continuity of f, and inequality 2 is Young s. Combining this bound with (10) and rearranging terms, we have shown that 0 ηEk[F(xk+1) F(x )] + η 2LλEk[ e k f(xk) 2] 2 Ek[ xk+1 x 2] + 1 2 xk x 2 + ( ηL(λ+1) 2)Ek[ xk+1 xk 2] i=1 fi(ϕi k) + 1 i=1 fi(ϕi k), ϕi k xk . We use Lemma 1 to bound the final term, yielding the desired inequality. Proof of Theorem 17 (Convex Case) We begin with the inequality of Lemma 12 with µ = 0. Multiplying the inequality of Lemma 4 with z = xk+1, x = xk, and d = e k by a On Biased Stochastic Gradient Estimation non-negative constant δ and adding it to the inequality of Lemma 12, we obtain ηEk[F(xk+1) F(x ) + δ(F(xk+1) F(xk))] 2Lλ Ek[ e k f(xk) 2] 1 2Ek[ xk+1 x 2] + 1 + ( ηL(1+δ)(λ+1) 2 )Ek[ xk+1 xk 2] + ηL i=1 xk ϕi k 2. 
Applying the full expectation operator and summing from k = 0 to k = T 1, we have k=0 ηE[F(xk+1) F(x )] + ηδ(E[F(x T )] F(x0)) 2E[ x T x 2] + 1 2 x0 x 2 + PT 1 k=0 E η(1+δ) 2Lλ e k f(xk) 2 + ( ηL(1+δ)(λ+1) 2 ) xk+1 xk 2 + ηL i=1 xk ϕi k 2 . We use Lemma 32 with σs = 1 to bound the MSE, and we use the fact that the gradient estimator is memory-biased to bound the term 1/n Pn i=1 xk ϕi k 2. This leaves k=0 E[F(xk+1) F(x )] 2E[ x T x 2] + 1 2 x0 x 2 + ηδ(F(x0) E[F(x T )]) + ( ηL(1+δ)(λ+1) 2 + ΘηL(1+δ) k=0 E[ xk+1 xk 2]. Setting λ = 2Θ minimizes the coefficient of the term on the final line. With 2Θ+ B1(1 1/θ) the final term in (12) is non-positive, so we can drop it from the inequality along with the term 1/2E x T x 2. Using the fact that F(x T ) F(x ), this leaves PT 1 k=0 E[F(xk+1) F(x )] 1 2η x0 x 2 + δ(F(x0) F(x )). We use the convexity of F to rewrite this inequality as a bound on the suboptimality of the average iterate E[F( x T ) F(x )] 1 PT 1 k=0 E[F(xk+1) F(x )] 1 2ηT x0 x 2 + δ T (F(x0) F(x )). Setting δ = max{B1(1 1/θ)/ 2Θ 1, 0} approximately minimizes the right side of this inequality, completing the proof. Proof of Theorem 17 (Strongly Convex Case) As in the proof of the convex case, we begin with the inequality of Lemma 12, multiply the inequality of Lemma 4 with z = xk+1, x = xk, and d = e k by a non-negative constant δ, and add the two inequalities. ηEk[F(xk+1) F(x ) + δ(F(xk+1) F(xk))] 2 Ek[ xk+1 x 2] + 1 2 xk x 2 + Ek h η(1+δ) 2Lλ e k f(xk) 2 + ( ηL(1+δ)(λ+1) 2 1+δ(2+µη) 2 ) xk+1 xk 2 + ηL i=1 xk ϕi k 2i . Driggs, Liang, Sch onlieb Applying the full expectation operator, multiplying by (1 + µη)k, and summing over the epoch k = ms to k = m(s + 1) 1 for some s N0, we have η Pm(s+1) 1 k=ms (1 + µη)k E[F(xk+1) F(x ) + δ(F(xk+1) F(xk))] (1+µη)m(s+1) 2 E xm(s+1) x 2 + (1+µη)ms 2 E xms x 2 + Pm(s+1) 1 k=ms (1 + µη)k E h η(1+δ) 2Lλ e k f(xk) 2 + ( ηL(1+δ)(λ+1) 2 1+δ(2+µη) 2 ) xk+1 xk 2 i=1 xk ϕi k 2i . 
Using the fact that η 1 µm, (1+µη)k (1+µη)m(s+1) (1+µη)ms lim m (1+ 1 m)m = e(1+µη)ms 3(1+µη)ms, (13) where e is Euler s number. Therefore, η Pm(s+1) 1 k=ms (1 + µη)k E[F(xk+1) F(x ) + δ(F(xk+1) F(xk))] (1+µη)m(s+1) 2 E[ xm(s+1) x 2] + (1+µη)ms + (1 + µη)ms Pm(s+1) 1 k=ms E h ( 3ηL(1+δ)(λ+1) 2 1+δ(2+µη) 2 ) xk+1 xk 2 2Lλ e k f(xk) 2 + 3ηL i=1 xk ϕi k 2i . Summing the inequality from epoch s = 0 to s = S 1, k=0 (1 + µη)k E[F(xk+1) F(x ) + δ(F(xk+1) F(xk))] s=0 (1 + µη)ms m(s+1) 1 X k=ms E h 3η(1+δ) 2Lλ e k f(xk) 2 + ( 3ηL(1+δ)(λ+1) 2 1+δ(2+µη) 2 ) xk+1 xk 2 i=1 xk ϕi k 2i (1+µη)m S 2 E xms x 2 + 1 We use Lemma 32 with σs = (1 + µη)ms to bound the MSE. Recall ρ = min{ρM, ρF } and η ρ 2µ. This choice for σs satisfies the conditions of Lemma 32 because (1+µη)ms(1 ρ)ms (1 + µη)m(s 1)(1 ρ/2)ms. We use the fact that the gradient estimator is memory-biased to bound the term 1/n Pn i=1 xk ϕi k 2. This leaves k=0 (1 + µη)k E[F(xk+1) F(x ) + δ(F(xk+1) F(xk))] 2 E[ xm S x 2] + 1 2 x0 x 2 + C PS 1 s=0 (1 + µη)ms Pm(s+1) 1 k=ms E[ xk+1 xk 2], (15) where C = 3ηL(1+δ)(λ+1) 2 + 3ΘηL(1+δ) θ) 1+δ(2+µη) 2 . We must choose η, λ, and δ so that C 0. Setting λ = 2Θ minimizes C over λ. Using the approximation δ(2 + µη) δ, we see that C is non-positive if 2Θ+ B1(1 1/θ) Setting δ = max{B1(1 1/θ)/ 2Θ 1, 0}, we are guaranteed that 2Θ) 1 3L(1+2 2Θ+ B1(1 1/θ) On Biased Stochastic Gradient Estimation so the step size in the theorem statement ensures C 0, and the final term in (15) is non-positive. Dropping this non-positive term from the inequality, we have η(1 + δ) Pm S 1 k=0 (1 + µη)k E[F(xk+1) F(x )] + δη Pm S 1 k=0 (1 + µη)k E[F(xk) F(x )] 2 E[ xm S x 2] + 1 2 x0 x 2. (16) We would like to show that 1 + δ (1 + µη)δ so that the terms on the first line telescope. We use the fact that η 2Θ B1µ(1 1/θ) to say 1 µη B1(1 1/θ) δ 1 + µη and inequality (16) simplifies to (1 + µη)m SE[ηδ(F(xm S) F(x )) + 1 2 xm S x 2] ηδ(F(x0) F(x )) + 1 which implies the result. Appendix B. 
Proof of Theorem 18 The following two lemmas establish an analogue of Lemma 12 for recursively biased estimators. Lemma 33 Suppose e is recursively biased with parameters ρB and ν. Suppose g is µstrongly convex with µ 0, and let λ > 0 be a constant whose value we determine later. The following inequality holds: 0 ηEk[F(xk+1) F(x )] + η 2LλEk[ e k f(xk) 2] 2 Ek[ xk+1 x 2] + 1 + ( ηL(λ+1) 2)Ek[ xk+1 xk 2] + η(1 ρB) f(xk 1) e k 1, xk x . Proof Applying the convexity of f yields η(f(xk) f(x )) η f(xk), xk x = η f(xk) (1 ρB)( f(xk 1) e k 1), xk x + η(1 ρB) f(xk 1) e k 1, xk x . Because the estimator is recursively biased, Ek[e k] = f(xk) (1 ρB)( f(xk 1) e k 1). η f(xk) (1 ρB)( f(xk 1) e k 1), xk x = Ek[η e k, xk x ] = Ek[η e k, xk xk+1 + η e k, xk+1 x ] Ek[η e k, xk xk+1 1 2 xk+1 xk 2 + 1 2 xk+1 x 2 + ηg(xk+1) ηg(x )], Driggs, Liang, Sch onlieb The inequality is due to Lemma 3. The rest of the proof follows the proof of Lemma 12. Proof of Lemma 15 We begin with a technical lemma. Lemma 34 Let {Eℓ} ℓ=0 be a sequence of non-negative real numbers. For any ρ (0, 1] and c 1, the following inequality holds: Pc(s+1) 1 k=cs+2 Pk 1 ℓ=cs+1 (1 ρ)k ℓ 1Eℓ min {c, 1/ρ} Pc(s+1) 1 k=cs+1 Ek. Proof This inequality can be seen by expanding the sums. ℓ=cs+1(1 ρ)k ℓ 1Eℓ = [Ecs+1] + [(1 ρ)Ecs+1 + Ecs+2] + (1 ρ)2Ecs+1 + (1 ρ)Ecs+2 + Ecs+3 + + (1 ρ)c 2Ecs+1 + + Ec(s+1) 2 = 1 + (1 ρ) + + (1 ρ)c 2 Ecs+1 + 1 + (1 ρ) + + (1 ρ)c 3 Ecs+2 + . The coefficient of each term of the sequence Eℓis less than c 2. Therefore, replacing each coefficient with c, we obtain the bound Pc(s+1) 1 k=cs+2 Pk 1 ℓ=cs+1 (1 ρ)k ℓ 1Eℓ c Pc(s+1) 1 k=cs+1 Ek. This proves the first bound. For the second bound, notice that each coefficient is less than P k=0(1 ρ)k = 1/ρ, proving the second bound. First, we use the fact that e k is recursively biased. 
E f(xk 1) e k 1, xk x 1 = E[ f(xk 1) e k 1, xk xk 1 + f(xk 1) Ek 1 e k 1, xk 1 x ] 2 f(xk 1) e k 1 2 + 1 2ϵ xk xk 1 2 + f(xk 1) Ek 1 e k 1, xk 1 x 2 f(xk 1) e k 1 2 + 1 2ϵ xk xk 1 2 + (1 ρB) f(xk 2) e k 2, xk 1 x . We can pass the conditional expectation Ek 1 into the second inner-product in 1 because xk 1 is independent of jk 1. Inequality 2 is Young s, and 3 uses the definition of a recursively biased gradient estimator. This is a recursive inequality, and expanding the recursion gives E f(xk 1) e k 1, xk x Pk 1 ℓ=νs+1(1 ρB)k ℓ 1E ϵ 2 f(xℓ) e ℓ 2 + 1 2ϵ xℓ+1 xℓ 2 + (1 ρB) f(xνs) e νs, xνs+1 x ℓ=νs+1(1 ρB)k ℓ 1E ϵ 2 f(xℓ) e ℓ 2 + 1 2ϵ xℓ+1 xℓ 2 . On Biased Stochastic Gradient Estimation Equality 1 is due to the fact that e νs = f(xνs). Taking the absolute value and summing this from k = νs + 1 to k = ν(s + 1) 1, k=νs+1 |E f(xk 1) e k 1, xk x | ℓ=νs+1 (1 ρB)k ℓ 1E ϵ 2 f(xℓ) e ℓ 2 + 1 2ϵ xℓ+1 xℓ 2 2 min n ν, 1 o Pν(s+1) 1 2 f(xk) e k 2 + 1 2ϵ xk+1 xk 2 . Inequality 2 follows from the technical Lemma 34. Summing this inequality from s = 0 to s = S completes the proof. Proof of Theorem 18 (Convex Case) To begin, we sum the inequality of Lemma 33 and the inequality of Lemma 4 scaled by δ > 0 with z = xk+1, x = xk, and d = e k. ηEk[F(xk+1) F(x ) + δ(F(xk+1) F(xk))] 2Ek[ xk+1 x 2] + 1 2 xk x 2 + Ek h η(1+δ) 2Lλ e k f(xk) 2 + (1 + δ)( ηL(λ+1) 2) xk+1 xk 2 + η(1 ρB) f(xk 1) e k 1, xk x i . Applying the full expectation operator, setting µ = 0, and summing from k = 0 to k = T 1 where T = m S for some S N, we have k=0 E[F(xk+1) F(x )] + ηδE[F(x T ) F(x0)] 2E[ x T x 2] + 1 2 x0 x 2 + PT 1 k=0 E h η(1+δ) 2Lλ e k f(xk) 2 + (1 + δ)( ηL(λ+1) 2) xk+1 xk 2 + η(1 ρB) f(xk 1) e k 1, xk x i . We use Lemma 15 to bound the inner-product bias term. k=0 E[F(xk+1) F(x )] + ηδE[F(x T ) F(x0)] 2E[ x T x 2] + 1 2 x0 x 2 + PT 1 k=0 E h ( η(1+δ) 2Lλ + B2η(1 ρB)ϵ 2 ) e k f(xk) 2 + (1 + δ)( ηL(λ+1) 2 + B2η(1 ρB) 2) xk+1 xk 2i . To bound the MSE, we use Lemma 32 with σs = 1. 
This leaves k=0 E[F(xk+1) F(x )] + ηδE[F(x T ) F(x0)] 2E[ x T x 2] + 1 2 x0 x 2 + w PT 1 k=0 E[ xk+1 xk 2], (18) where w = ηL(λ+1)(1+δ) 2 + B2η(1 ρB) 2ϵ + ΘηL(1+δ) λ + B2ηL2(1 ρB)ϵΘ 1+δ 2 . To minimize the coefficient of the final term, we set λ = 2Θ and ϵ = (2L2Θ) 1/2. This coefficient is then equal to 2ΘηL(1 + δ) + ηL(1+δ) 2(1 ρB)ηLB2 Driggs, Liang, Sch onlieb which is non-positive when η 1 2 2ΘL(1+ (1 ρB)B2 1+δ )+L. This ensures that the final term in (18) is non-positive, so we can drop it from the inequality along with the term 1/2E x T x 2. This leaves PT 1 k=0 E[F(xk+1) F(x )] 1 2η x0 x 2 + δE[F(x0) F(x T )]. By the convexity of F and the fact that F(x T ) F(x ) E[F( x T ) F(x )] 1 PT 1 k=0 E[F(xk+1) F(x )] 1 2ηT x0 x 2 + δ T (F(x0) F(x )). We now choose δ so that this upper-bound is approximately minimized. Assuming x0 x 2 and F(x0) F(x ) are approximately equal to K > 0, the right side becomes 2ΘL(1 ρB)B2 1 + δ + δ + 2 Choosing δ = max{ q Θ(1 ρB)B2 1, 0} approximately minimizes the right side of this inequality, completing the proof. Proof of Theorem 18 (Strongly Convex Case) We begin with inequality (17), but without setting µ = 0. η(1 + δ)Ek[F(xk+1) F(x )] + 1+µη 2 Ek xk+1 x 2 ηδ(F(xk) F(x )) + 1 2 xk x 2 + Ek h η(1+δ) 2Lλ e k f(xk) 2 + (1 + δ)( ηL(λ+1) 2) xk+1 xk 2 + η(1 ρB) f(xk 1) e k 1, xk x i . Applying the full expectation operator, multiplying by (1 + µη)k, and summing over the epoch k = ms to k = m(s + 1) 1 for some s N0, we have η(1 + δ) Pm(s+1) 1 k=ms (1 + µη)k E[F(xk+1) F(x )] + (1+µη)m(s+1) 2 E[ xm(s+1) x 2] ηδ Pm(s+1) 1 k=ms (1 + µη)k E[F(xk) F(x )] + (1+µη)ms 2 E[ xms x 2] + Pm(s+1) 1 k=ms (1 + µη)k E h η(1+δ) 2Lλ e k f(xk) 2 + (1 + δ)( ηL(λ+1) 2) xk+1 xk 2 + η(1 ρB) f(xk 1) e k 1, xk x i . 
Choosing δ so that 1+δ (1+µη)δ, the terms involving F(xk+1) F(x ) telescope, giving the inequality (1 + µη)m(s+1)E[δη(F(xm(s+1)) F(x )) + 1 2 xm(s+1) x 2] δη(1 + µη)ms E[F(xms) F(x )] + (1+µη)ms 2 E xms x 2 + Pm(s+1) 1 k=ms (1 + µη)k E h η(1+δ) 2Lλ e k f(xk) 2 + (1 + δ)( ηL(λ+1) 2) xk+1 xk 2 + η(1 ρB) f(xk 1) e k 1, xk x i . On Biased Stochastic Gradient Estimation We would like to bound the inner-product bias term using Lemma 15, and we can do this after some manipulation. Because η 1 µm, we have (1 + µη)k (1 + 1 m)k ms(1 + µη)ms 3(1 + µη)ms. Using the same estimate as in equations (13) and (14), we can say Pm(s+1) 1 k=ms (1 + µη)k E[ f(xk 1) e k 1, xk x ] 3(1 + µη)ms Pm(s+1) 1 k=ms |E[ f(xk 1) e k 1, xk x ]|, which produces the inequality (1 + µη)m(s+1)E[δη(F(xm(s+1)) F(x )) + 1 2 xm(s+1) x 2] δη(1 + µη)ms E[F(xms) F(x )] + (1+µη)ms 2 E xms x 2 + (1 + µη)ms Pm(s+1) 1 k=ms E h 3η(1+δ) 2Lλ e k f(xk) 2 + (1 + δ)( 3ηL(λ+1) 2) xk+1 xk 2i + 3η(1 ρB)|E f(xk 1) e k 1, xk x | . Summing this inequality from s = 0 to s = S 1, (1 + µη)m SE[δη(F(xm S) F(x )) + 1 2 xm S x 2] δη(F(x0) F(x )) + 1 s=0 (1 + µη)ms Pm(s+1) 1 k=ms E h (1 + δ)( 3ηL(λ+1) 2) xk+1 xk 2i 2Lλ e k f(xk) 2 + 3η(1 ρB)|E f(xk 1) e k 1, xk x | . We use Lemma 32 with σs = (1 + µη)ms to bound the MSE and Lemma 15 to bound the inner-product bias term. (1 + µη)m SE[δη(F(xm S) F(x )) + 1 2 xm S x 2] δη(F(x0) F(x )) + 1 2 x0 x 2 + w PS 1 k=ms (1 + µη)ms E xk+1 xk 2, (19) where w = 3ηL(λ+1)(1+δ) 2 + 3B2η(1 ρB) 2ϵ + 3ΘηL(1+δ) λ + 3B2ηL2(1 ρB)ϵΘ 1+δ 2 . To minimize the coefficient of the final term, we set λ = 2Θ and ϵ = (2L2Θ) 1/2. This coefficient is then equal to 2ΘηL(1 + δ) + 3ηL(1+δ) 2(1 ρB)ηLB2 2ΘL(1+ (1 ρB)B2 this term is non-positive. Setting δ = max{(1 ρB)B2 1, 0}, we are assured that 2ΘL(1+ (1 ρB)B2 so the final term in (19) is non-positive, and we can drop it from the inequality. 
The resulting inequality is
$$(1+\mu\eta)^{T}\,\mathbb{E}\Big[\delta\eta\big(F(x_T) - F(x^\star)\big) + \frac{1}{2}\|x_T - x^\star\|^2\Big] \le \delta\eta\big(F(x_0) - F(x^\star)\big) + \frac{1}{2}\|x_0 - x^\star\|^2.$$
All that remains is to show that our choice for $\delta$ satisfies $(1+\delta) \ge (1+\mu\eta)\delta$. Using the fact that $\eta \le \frac{1}{(1-\rho_B)B^2\mu}$, we can say
$$\mu\eta \le \frac{1}{(1-\rho_B)B^2} \le \frac{1}{\delta},$$
which ensures that $(1+\delta) \ge (1+\mu\eta)\delta$ and concludes the proof.

Appendix C. Proof of Theorem 20

Theorem 20 follows immediately from inequality (7) and the MSE bound of Lemma 32.

Proof of Theorem 20 Summing inequality (7) from $k = 0$ to $k = T-1$ and applying the full expectation operator, we obtain
$$0 \le -\mathbb{E}[F(x_T)] + F(x_0) + \Big(L - \frac{1}{4\eta}\Big)\sum_{k=0}^{T-1}\mathbb{E}[\|\hat{x}_{k+1} - x_k\|^2] + \Big(\frac{L}{2} - \frac{1}{4\eta}\Big)\sum_{k=0}^{T-1}\mathbb{E}[\|x_{k+1} - x_k\|^2] + 2\eta\sum_{k=0}^{T-1}\mathbb{E}[\|\nabla f(x_k) - e_k\|^2].$$
We bound the MSE using Lemma 32 with $\sigma_s = 1$:
$$0 \le -\mathbb{E}[F(x_T)] + F(x_0) + \Big(L - \frac{1}{4\eta}\Big)\sum_{k=0}^{T-1}\mathbb{E}\|\hat{x}_{k+1} - x_k\|^2 + \Big(\frac{L}{2} + 4\Theta\eta L^2 - \frac{1}{4\eta}\Big)\sum_{k=0}^{T-1}\mathbb{E}\|x_{k+1} - x_k\|^2.$$
With $\eta \le \frac{\sqrt{16\Theta+1}-1}{16L\Theta}$, the final term is non-positive, so we can drop it from the inequality. Using the fact that $F(x_T) \ge F(x^\star)$, our inequality simplifies to
$$\Big(\frac{1}{4\eta} - L\Big)\sum_{k=0}^{T-1}\mathbb{E}[\|\hat{x}_{k+1} - x_k\|^2] \le F(x_0) - F(x^\star).$$
Writing the left side in terms of the generalized gradient, we have the bound
$$\frac{1}{T}\sum_{k=0}^{T-1}\mathbb{E}[\|\mathcal{G}_{\eta/2}(x_k)\|^2] \le \frac{16\big(F(x_0) - F(x^\star)\big)}{\eta(1-4\eta L)T}.$$
With $x_\alpha$ chosen uniformly at random from the set $\{x_k\}_{k=0}^{T-1}$, this is equivalent to
$$\mathbb{E}[\|\mathcal{G}_{\eta/2}(x_\alpha)\|^2] \le \frac{16\big(F(x_0) - F(x^\star)\big)}{\eta(1-4\eta L)T}.$$
This completes the proof.

Appendix D. Proofs of convergence rates for B-SAGA and B-SVRG

The following lemma establishes an MSE bound on the B-SAGA and B-SVRG gradient estimators. For the unbiased case $\theta = 1$, this result was essentially first proved in (Defazio et al., 2014a), but the authors ultimately use a looser variance bound.

Lemma 35 The MSEs of the B-SAGA and B-SVRG gradient estimators satisfy
$$\mathbb{E}_k[\|e_k - \nabla f(x_k)\|^2] \le \frac{1}{n\theta^2}\sum_{i=1}^{n}\|\nabla f_i(x_k) - \nabla f_i(\phi^i_k)\|^2 + \Big(1 - \frac{2}{\theta}\Big)\Big\|\nabla f(x_k) - \frac{1}{n}\sum_{i=1}^{n}\nabla f_i(\phi^i_k)\Big\|^2. \qquad (20)$$
Proof Let $e_k$ denote either $e^{\text{B-SAGA}}_k$ or $e^{\text{B-SVRG}}_k$. The proof amounts to computing the expectation of the estimator and applying the Lipschitz continuity of $\nabla f_i$.
$$\mathbb{E}_k[\|e_k - \nabla f(x_k)\|^2] = \mathbb{E}_k\Big[\Big\|\frac{1}{\theta}\big(\nabla f_{j_k}(x_k) - \nabla f_{j_k}(\phi^{j_k}_k)\big) - \Big(\nabla f(x_k) - \frac{1}{n}\sum_{i=1}^{n}\nabla f_i(\phi^i_k)\Big)\Big\|^2\Big]$$
$$= \frac{1}{\theta^2}\mathbb{E}_k[\|\nabla f_{j_k}(x_k) - \nabla f_{j_k}(\phi^{j_k}_k)\|^2] + \Big\|\nabla f(x_k) - \frac{1}{n}\sum_{i=1}^{n}\nabla f_i(\phi^i_k)\Big\|^2 - \frac{2}{\theta}\mathbb{E}_k\Big[\Big\langle \nabla f_{j_k}(x_k) - \nabla f_{j_k}(\phi^{j_k}_k),\, \nabla f(x_k) - \frac{1}{n}\sum_{i=1}^{n}\nabla f_i(\phi^i_k)\Big\rangle\Big]$$
$$= \frac{1}{n\theta^2}\sum_{i=1}^{n}\|\nabla f_i(x_k) - \nabla f_i(\phi^i_k)\|^2 + \Big(1 - \frac{2}{\theta}\Big)\Big\|\nabla f(x_k) - \frac{1}{n}\sum_{i=1}^{n}\nabla f_i(\phi^i_k)\Big\|^2,$$
which is the desired result.

The following two lemmas establish the constants in the BMSE property for the B-SAGA and B-SVRG estimators.

Proof of Lemma 22 We begin with the inequality of Lemma 35 and consider two cases.

Case 1. Suppose $\theta \in [1,2]$. In this case, the second term in (20) is non-positive, so we drop it from the inequality. For the remaining term, we use the following bound:
$$\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}[\|\nabla f_i(x_k) - \nabla f_i(\phi^i_k)\|^2] \overset{1}{\le} \frac{2n+1}{n}\sum_{i=1}^{n}\mathbb{E}[\|\nabla f_i(x_k) - \nabla f_i(x_{k-1})\|^2] + \Big(1+\frac{1}{2n}\Big)\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}[\|\nabla f_i(x_{k-1}) - \nabla f_i(\phi^i_k)\|^2]$$
$$\overset{2}{=} \frac{2n+1}{n}\sum_{i=1}^{n}\mathbb{E}[\|\nabla f_i(x_k) - \nabla f_i(x_{k-1})\|^2] + \Big(1+\frac{1}{2n}\Big)\Big(1-\frac{1}{n}\Big)\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}[\|\nabla f_i(x_{k-1}) - \nabla f_i(\phi^i_{k-1})\|^2]$$
$$\overset{3}{\le} \frac{2n+1}{n}\sum_{i=1}^{n}\mathbb{E}[\|\nabla f_i(x_k) - \nabla f_i(x_{k-1})\|^2] + \Big(1-\frac{1}{2n}\Big)\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}[\|\nabla f_i(x_{k-1}) - \nabla f_i(\phi^i_{k-1})\|^2]. \qquad (21)$$
Inequality 1 is the standard inequality $\|a-c\|^2 \le (1+\delta)\|a-b\|^2 + (1+\delta^{-1})\|b-c\|^2$ (where we let $\delta = \frac{1}{2n}$). Equality 2 follows from the definition of $\phi^i_k$ and computing the expectation over $j_{k-1}$, and 3 uses the fact that $(1+\frac{1}{2n})(1-\frac{1}{n}) \le 1-\frac{1}{2n}$. Altogether, this gives
$$\mathbb{E}[\|e^{\text{B-SAGA}}_k - \nabla f(x_k)\|^2] \le \frac{1}{n\theta^2}\sum_{i=1}^{n}\mathbb{E}[\|\nabla f_i(x_k) - \nabla f_i(\phi^i_k)\|^2] \le \frac{2n+1}{n\theta^2}\sum_{i=1}^{n}\mathbb{E}[\|\nabla f_i(x_k) - \nabla f_i(x_{k-1})\|^2] + \frac{1}{n\theta^2}\Big(1-\frac{1}{2n}\Big)\sum_{i=1}^{n}\mathbb{E}[\|\nabla f_i(x_{k-1}) - \nabla f_i(\phi^i_{k-1})\|^2].$$
With $M_k = \frac{1}{n\theta^2}\sum_{i=1}^{n}\mathbb{E}[\|\nabla f_i(x_k) - \nabla f_i(\phi^i_k)\|^2]$, it is clear that the SAGA estimator satisfies the BMSE property with $M_1 = \frac{2n+1}{\theta^2}$, $\rho_M = \frac{1}{2n}$, $M_2 = 0$, $\rho_F = 1$, and $m = 1$.

Case 2. Suppose $\theta > 2$, so that the second term in (20) is non-negative. Jensen's inequality gives
$$\mathbb{E}_k[\|e_k - \nabla f(x_k)\|^2] \le \frac{1}{n}\Big(1-\frac{1}{\theta}\Big)^2\sum_{i=1}^{n}\|\nabla f_i(x_k) - \nabla f_i(\phi^i_k)\|^2.$$
Following the argument of Case 1, it is easy to see that the B-SAGA gradient estimator satisfies the BMSE property with $M_k = \frac{1}{n}(1-\frac{1}{\theta})^2\sum_{i=1}^{n}\|\nabla f_i(x_k) - \nabla f_i(\phi^i_k)\|^2$, $M_1 = (2n+1)(1-\frac{1}{\theta})^2$, $\rho_M = \frac{1}{2n}$, $M_2 = 0$, $\rho_F = 1$, and $m = 1$.

To prove that B-SAGA is memory-biased, we compute its expectation:
$$\nabla f(x_k) - \mathbb{E}_k[e^{\text{B-SAGA}}_k] = \nabla f(x_k) - \frac{1}{\theta}\mathbb{E}_k[\nabla f_{j_k}(x_k) - \nabla f_{j_k}(\phi^{j_k}_k)] - \frac{1}{n}\sum_{i=1}^{n}\nabla f_i(\phi^i_k) = \Big(1-\frac{1}{\theta}\Big)\Big(\nabla f(x_k) - \frac{1}{n}\sum_{i=1}^{n}\nabla f_i(\phi^i_k)\Big).$$
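The expectation computation above is easy to check numerically. The following is a minimal sketch in Python (all names here are ours, not from the paper's code) of the B-SAGA estimator for a least-squares finite sum; it verifies that the bias of the estimator over the sampled index is exactly $(1-\frac{1}{\theta})\big(\nabla f(x_k) - \frac{1}{n}\sum_i \nabla f_i(\phi^i_k)\big)$, which vanishes in the unbiased case $\theta = 1$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 8, 3
A = rng.standard_normal((n, p))
b = rng.standard_normal(n)

def grad_i(i, x):
    # gradient of f_i(x) = 0.5 * (a_i^T x - b_i)^2
    return A[i] * (A[i] @ x - b[i])

def full_grad(x):
    # gradient of f(x) = (1/n) * sum_i f_i(x)
    return np.mean([grad_i(i, x) for i in range(n)], axis=0)

def b_saga(j, x, phi, theta):
    # B-SAGA estimator: (1/theta) * (grad f_j(x) - grad f_j(phi_j)) + table average
    table_avg = np.mean([grad_i(i, phi[i]) for i in range(n)], axis=0)
    return (grad_i(j, x) - grad_i(j, phi[j])) / theta + table_avg

x = rng.standard_normal(p)
phi = rng.standard_normal((n, p))  # stored points phi^i_k
theta = 2.0

# Expectation over the uniformly random index j_k:
est_mean = np.mean([b_saga(j, x, phi, theta) for j in range(n)], axis=0)
table_avg = np.mean([grad_i(i, phi[i]) for i in range(n)], axis=0)

# Bias predicted by the computation above: (1 - 1/theta)(grad f(x) - table average)
bias = full_grad(x) - est_mean
predicted = (1 - 1 / theta) * (full_grad(x) - table_avg)
assert np.allclose(bias, predicted)
```

For $\theta = 1$ the same check confirms `est_mean` equals `full_grad(x)`, recovering the unbiasedness of standard SAGA.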
To compute a value for $B_1$, we follow (21) to obtain
$$\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}[\|x_k - \phi^i_k\|^2] \le (2n+1)\,\mathbb{E}\|x_k - x_{k-1}\|^2 + \Big(1-\frac{1}{2n}\Big)\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}[\|x_{k-1} - \phi^i_{k-1}\|^2] \le (2n+1)\sum_{\ell=1}^{k}\Big(1-\frac{1}{2n}\Big)^{k-\ell}\mathbb{E}\|x_\ell - x_{\ell-1}\|^2.$$
Summing this inequality from $k = 0$ to $k = T-1$, we obtain
$$\sum_{k=0}^{T-1}\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}[\|x_k - \phi^i_k\|^2] \le (2n+1)\sum_{k=0}^{T-1}\sum_{\ell=1}^{k}\Big(1-\frac{1}{2n}\Big)^{k-\ell}\mathbb{E}\|x_\ell - x_{\ell-1}\|^2 \le (2n+1)\sum_{\ell=0}^{\infty}\Big(1-\frac{1}{2n}\Big)^{\ell}\sum_{k=0}^{T-1}\mathbb{E}\|x_{k+1} - x_k\|^2 = 2n(2n+1)\sum_{k=0}^{T-1}\mathbb{E}\|x_{k+1} - x_k\|^2,$$
which completes the proof.

Proof of Lemma 23 Suppose $k \in \{ms, ms+1, \dots, m(s+1)-1\}$ for some $s \in \mathbb{N}_0$. As in the proof of Lemma 22, we begin with the inequality of Lemma 35 and consider two cases.

Case 1. Suppose $\theta \in [1,2]$, so that we may drop the second term in (20). We can bound the remaining term as follows:
$$\frac{1}{n}\sum_{i=1}^{n}\|\nabla f_i(x_k) - \nabla f_i(\phi_s)\|^2 \overset{1}{\le} \frac{1+m}{n}\sum_{i=1}^{n}\|\nabla f_i(x_k) - \nabla f_i(x_{k-1})\|^2 + \frac{1+1/m}{n}\sum_{i=1}^{n}\|\nabla f_i(x_{k-1}) - \nabla f_i(\phi_s)\|^2 \overset{2}{\le} \frac{1+m}{n}\sum_{\ell=ms}^{k-1}\Big(1+\frac{1}{m}\Big)^{k-1-\ell}\sum_{i=1}^{n}\|\nabla f_i(x_{\ell+1}) - \nabla f_i(x_\ell)\|^2.$$
Inequality 1 uses the inequality $\|u-w\|^2 \le (1+1/m)\|u-v\|^2 + (1+m)\|v-w\|^2$, and 2 follows from the fact that $x_{ms} = \phi_s$. Summing this inequality from $k = ms$ to $k = m(s+1)-1$ gives
$$\sum_{k=ms}^{m(s+1)-1}\frac{1}{n}\sum_{i=1}^{n}\|\nabla f_i(x_k) - \nabla f_i(\phi_s)\|^2 \le \frac{m(m+1)}{n}\Big(1+\frac{1}{m}\Big)^m\sum_{k=ms}^{m(s+1)-1}\sum_{i=1}^{n}\|\nabla f_i(x_{k+1}) - \nabla f_i(x_k)\|^2 \le \frac{3m(m+1)}{n}\sum_{k=ms}^{m(s+1)-1}\sum_{i=1}^{n}\|\nabla f_i(x_{k+1}) - \nabla f_i(x_k)\|^2.$$
The final inequality uses the fact that $(1+\frac{1}{m})^m < \lim_{m\to\infty}(1+\frac{1}{m})^m = e < 3$. From this inequality, it is clear that the B-SVRG gradient estimator satisfies the BMSE property with $M_1 = \frac{3m(m+1)}{\theta^2}$, $\rho_M = 1$, $M_2 = 0$, and $\rho_F = 1$.

Case 2. If $\theta > 2$, then applying Jensen's inequality to (20) produces
$$\mathbb{E}_k[\|e^{\text{B-SVRG}}_k - \nabla f(x_k)\|^2] \le \frac{1}{n}\Big(1-\frac{1}{\theta}\Big)^2\sum_{i=1}^{n}\|\nabla f_i(x_k) - \nabla f_i(\phi_s)\|^2.$$
A similar argument to the one in Case 1 shows that $M_1 = 3m(m+1)(1-\frac{1}{\theta})^2$, $\rho_M = 1$, $M_2 = 0$, and $\rho_F = 1$.

All that is left is to prove the stated value for $B_1$. Following the proof in Case 1,
$$\sum_{k=ms}^{m(s+1)-1}\mathbb{E}\|x_k - \phi_s\|^2 \le m\sum_{\ell=ms}^{m(s+1)-1}(1+m)\Big(1+\frac{1}{m}\Big)^m\mathbb{E}\|x_{\ell+1} - x_\ell\|^2 \le 3m(m+1)\sum_{k=ms}^{m(s+1)-1}\mathbb{E}\|x_{k+1} - x_k\|^2.$$
Summing over the epochs $s = 0$ to $s = S$ shows $B_1 = 3m(m+1)$. Combining Lemmas 22 and 23 with Theorems 17 and 20 proves convergence rates for B-SAGA and B-SVRG.

Appendix E. Proof of convergence rates for SARAH

Lemma 27 establishes the BMSE constants for the SARAH estimator.
The convergence rates of Corollary 29 then follow immediately from Theorem 18.

Proof of Lemma 27 Let $k \in \{ms+1, ms+2, \dots, m(s+1)-1\}$. The claim follows immediately from the well-known bound on the MSE of the SARAH gradient estimator,
$$\mathbb{E}\|e^{\text{SARAH}}_k - \nabla f(x_k)\|^2 \le \frac{1}{n}\sum_{\ell=ms}^{k-1}\sum_{i=1}^{n}\mathbb{E}\|\nabla f_i(x_{\ell+1}) - \nabla f_i(x_\ell)\|^2.$$
We refer to Fang et al. (2018) for a proof of this inequality. Summing over an epoch and applying the estimate
$$\sum_{k=ms}^{m(s+1)-1}\frac{1}{n}\sum_{\ell=ms}^{k-1}\sum_{i=1}^{n}\mathbb{E}\|\nabla f_i(x_{\ell+1}) - \nabla f_i(x_\ell)\|^2 \le \frac{m}{n}\sum_{k=ms}^{m(s+1)-1}\sum_{i=1}^{n}\mathbb{E}\|\nabla f_i(x_{k+1}) - \nabla f_i(x_k)\|^2$$
completes the proof.

Appendix F. Proof of convergence rates for SARGE

For our analysis, we write the SARGE gradient estimator in terms of the SAGA estimator. Define the estimator
$$e^{\xi\text{-SAGA}}_k \overset{\text{def}}{=} \nabla f_{j_k}(x_{k-1}) - \nabla f_{j_k}(\xi^{j_k}_k) + \frac{1}{n}\sum_{i=1}^{n}\nabla f_i(\xi^i_k),$$
where the variables $\{\xi^i_k\}_{i=1}^n$ follow the update rules $\xi^{j_k}_{k+1} = x_{k-1}$ and $\xi^i_{k+1} = \xi^i_k$ for all $i \ne j_k$. The SARGE estimator is equal to
$$e^{\text{SARGE}}_k = e^{\text{SAGA}}_k - \Big(1-\frac{1}{n}\Big)\big(e^{\xi\text{-SAGA}}_k - e^{\text{SARGE}}_{k-1}\big).$$
Before we prove Lemma 28, we require a bound on the MSE of the $\xi$-SAGA gradient estimator that follows immediately from Lemma 35.

Lemma 36 The MSE of the $\xi$-SAGA gradient estimator satisfies the following bound:
$$\mathbb{E}[\|e^{\xi\text{-SAGA}}_k - \nabla f(x_{k-1})\|^2] \le \frac{3}{n}\sum_{\ell=1}^{k-1}\Big(1-\frac{1}{2n}\Big)^{k-\ell-1}\sum_{i=1}^{n}\mathbb{E}[\|\nabla f_i(x_\ell) - \nabla f_i(x_{\ell-1})\|^2].$$
Proof Following the proof of Lemma 35,
$$\mathbb{E}_k[\|e^{\xi\text{-SAGA}}_k - \nabla f(x_{k-1})\|^2] = \mathbb{E}_k\Big[\Big\|\nabla f_{j_k}(x_{k-1}) - \nabla f_{j_k}(\xi^{j_k}_k) - \nabla f(x_{k-1}) + \frac{1}{n}\sum_{i=1}^{n}\nabla f_i(\xi^i_k)\Big\|^2\Big] \overset{1}{=} \frac{1}{n}\sum_{i=1}^{n}\|\nabla f_i(x_{k-1}) - \nabla f_i(\xi^i_k)\|^2 - \Big\|\nabla f(x_{k-1}) - \frac{1}{n}\sum_{i=1}^{n}\nabla f_i(\xi^i_k)\Big\|^2 \le \frac{1}{n}\sum_{i=1}^{n}\|\nabla f_i(x_{k-1}) - \nabla f_i(\xi^i_k)\|^2.$$
Equality 1 is the standard variance decomposition. To continue, we follow the proof of Lemma 35:
$$\mathbb{E}[\|e^{\xi\text{-SAGA}}_k - \nabla f(x_{k-1})\|^2] \le \frac{1}{n}\sum_{i=1}^{n}\mathbb{E}[\|\nabla f_i(x_{k-1}) - \nabla f_i(\xi^i_k)\|^2]$$
$$\le \frac{2n+1}{n}\sum_{i=1}^{n}\mathbb{E}[\|\nabla f_i(x_{k-1}) - \nabla f_i(x_{k-2})\|^2] + \Big(1+\frac{1}{2n}\Big)\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}[\|\nabla f_i(x_{k-2}) - \nabla f_i(\xi^i_k)\|^2]$$
$$\overset{2}{=} \frac{2n+1}{n}\sum_{i=1}^{n}\mathbb{E}[\|\nabla f_i(x_{k-1}) - \nabla f_i(x_{k-2})\|^2] + \Big(1+\frac{1}{2n}\Big)\Big(1-\frac{1}{n}\Big)\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}[\|\nabla f_i(x_{k-2}) - \nabla f_i(\xi^i_{k-1})\|^2]$$
$$\overset{3}{\le} \frac{2n+1}{n}\sum_{i=1}^{n}\mathbb{E}[\|\nabla f_i(x_{k-1}) - \nabla f_i(x_{k-2})\|^2] + \Big(1-\frac{1}{2n}\Big)\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}[\|\nabla f_i(x_{k-2}) - \nabla f_i(\xi^i_{k-1})\|^2] \le \frac{3}{n}\sum_{\ell=1}^{k-1}\Big(1-\frac{1}{2n}\Big)^{k-\ell-1}\sum_{i=1}^{n}\mathbb{E}[\|\nabla f_i(x_\ell) - \nabla f_i(x_{\ell-1})\|^2].$$
Equality 2 follows from computing expectations, and inequality 3 uses the estimate $(1+\frac{1}{2n})(1-\frac{1}{n}) \le 1-\frac{1}{2n}$.

Due to the recursive nature of the SARGE gradient estimator, its MSE depends on the difference between the current estimate and the estimate from the previous iteration. The next lemma provides a bound on $\mathbb{E}\|e^{\text{SARGE}}_k - e^{\text{SARGE}}_{k-1}\|^2$.

Lemma 37 The SARGE gradient estimator satisfies the following bound:
$$\mathbb{E}[\|e^{\text{SARGE}}_k - e^{\text{SARGE}}_{k-1}\|^2] \le \frac{12}{n}\sum_{i=1}^{n}\mathbb{E}[\|\nabla f_i(x_k) - \nabla f_i(x_{k-1})\|^2] + \frac{3}{2n^2}\mathbb{E}\|\nabla f(x_{k-1}) - e^{\text{SARGE}}_{k-1}\|^2 + \frac{39}{n^2}\sum_{\ell=1}^{k-1}\Big(1-\frac{1}{2n}\Big)^{k-\ell-1}\sum_{i=1}^{n}\mathbb{E}\|\nabla f_i(x_\ell) - \nabla f_i(x_{\ell-1})\|^2.$$
Proof To begin, we use the standard inequality $\|a+c\|^2 \le (1+\delta)\|a\|^2 + (1+\delta^{-1})\|c\|^2$ for $\delta > 0$ twice. For simplicity, we set $\delta = \sqrt{3/2} - 1$ and use the fact that $1+\frac{1}{\delta} \le 6$ for both applications of this inequality:
$$\mathbb{E}[\|e^{\text{SARGE}}_k - e^{\text{SARGE}}_{k-1}\|^2] = \mathbb{E}\Big[\Big\|e^{\text{SAGA}}_k - \Big(1-\frac{1}{n}\Big)\big(e^{\xi\text{-SAGA}}_k - e^{\text{SARGE}}_{k-1}\big) - e^{\text{SARGE}}_{k-1}\Big\|^2\Big] \le 6\,\mathbb{E}[\|e^{\text{SAGA}}_k - e^{\xi\text{-SAGA}}_k\|^2] + \frac{\sqrt{3/2}}{n^2}\mathbb{E}[\|e^{\xi\text{-SAGA}}_k - e^{\text{SARGE}}_{k-1}\|^2]$$
$$\le 6\,\mathbb{E}[\|e^{\text{SAGA}}_k - e^{\xi\text{-SAGA}}_k\|^2] + \frac{6\sqrt{3/2}}{n^2}\mathbb{E}[\|e^{\xi\text{-SAGA}}_k - \nabla f(x_{k-1})\|^2] + \frac{3}{2n^2}\mathbb{E}[\|\nabla f(x_{k-1}) - e^{\text{SARGE}}_{k-1}\|^2]. \qquad (22)$$
We use $6\sqrt{3/2} \le 9$ to simplify the coefficient of the second term. We now bound the first two of these three terms separately. Consider the first term:
$$6\,\mathbb{E}[\|e^{\text{SAGA}}_k - e^{\xi\text{-SAGA}}_k\|^2] = 6\,\mathbb{E}\Big[\Big\|\nabla f_{j_k}(x_k) - \nabla f_{j_k}(\phi^{j_k}_k) + \frac{1}{n}\sum_{i=1}^{n}\nabla f_i(\phi^i_k) - \nabla f_{j_k}(x_{k-1}) + \nabla f_{j_k}(\xi^{j_k}_k) - \frac{1}{n}\sum_{i=1}^{n}\nabla f_i(\xi^i_k)\Big\|^2\Big]$$
$$\le 12\,\mathbb{E}[\|\nabla f_{j_k}(x_k) - \nabla f_{j_k}(x_{k-1})\|^2] + 12\,\mathbb{E}\Big[\Big\|\nabla f_{j_k}(\phi^{j_k}_k) - \nabla f_{j_k}(\xi^{j_k}_k) - \frac{1}{n}\sum_{i=1}^{n}\nabla f_i(\phi^i_k) + \frac{1}{n}\sum_{i=1}^{n}\nabla f_i(\xi^i_k)\Big\|^2\Big]$$
$$\overset{1}{=} 12\,\mathbb{E}[\|\nabla f_{j_k}(x_k) - \nabla f_{j_k}(x_{k-1})\|^2] + 12\,\mathbb{E}[\|\nabla f_{j_k}(\phi^{j_k}_k) - \nabla f_{j_k}(\xi^{j_k}_k)\|^2] - 12\,\mathbb{E}\Big[\Big\|\frac{1}{n}\sum_{i=1}^{n}\nabla f_i(\phi^i_k) - \frac{1}{n}\sum_{i=1}^{n}\nabla f_i(\xi^i_k)\Big\|^2\Big]$$
$$\le \frac{12}{n}\sum_{i=1}^{n}\mathbb{E}[\|\nabla f_i(x_k) - \nabla f_i(x_{k-1})\|^2] + 12\,\mathbb{E}[\|\nabla f_{j_k}(\phi^{j_k}_k) - \nabla f_{j_k}(\xi^{j_k}_k)\|^2].$$
Equality 1 is the standard variance decomposition, which states that for any random variable $X$, $\mathbb{E}[\|X - \mathbb{E}X\|^2] = \mathbb{E}[\|X\|^2] - \|\mathbb{E}[X]\|^2$. The second term can be reduced further by computing the expectation. The probability that $\phi^{j_k}_k = x_{k-1}$ is equal to the probability that $j_k = j_{k-1}$, which is $1/n$.
The probability that $\phi^{j_k}_k = x_{k-2}$ is equal to the probability that $j_k = j_{k-2}$ and $j_k \ne j_{k-1}$, which is $\frac{1}{n}(1-\frac{1}{n})$. Continuing in this way,
$$\mathbb{E}[\|\nabla f_{j_k}(\phi^{j_k}_k) - \nabla f_{j_k}(\xi^{j_k}_k)\|^2] = \frac{1}{n}\mathbb{E}[\|\nabla f_{j_{k-1}}(x_{k-1}) - \nabla f_{j_{k-1}}(x_{k-2})\|^2] + \frac{1}{n}\Big(1-\frac{1}{n}\Big)\mathbb{E}[\|\nabla f_{j_{k-2}}(x_{k-2}) - \nabla f_{j_{k-2}}(x_{k-3})\|^2] + \cdots = \frac{1}{n}\sum_{\ell=1}^{k-1}\Big(1-\frac{1}{n}\Big)^{k-\ell-1}\mathbb{E}[\|\nabla f_{j_\ell}(x_\ell) - \nabla f_{j_\ell}(x_{\ell-1})\|^2].$$
This implies that
$$12\,\mathbb{E}[\|\nabla f_{j_k}(\phi^{j_k}_k) - \nabla f_{j_k}(\xi^{j_k}_k)\|^2] \le \frac{12}{n^2}\sum_{\ell=1}^{k-1}\Big(1-\frac{1}{n}\Big)^{k-\ell-1}\sum_{i=1}^{n}\mathbb{E}[\|\nabla f_i(x_\ell) - \nabla f_i(x_{\ell-1})\|^2] \le \frac{12}{n^2}\sum_{\ell=1}^{k-1}\Big(1-\frac{1}{2n}\Big)^{k-\ell-1}\sum_{i=1}^{n}\mathbb{E}[\|\nabla f_i(x_\ell) - \nabla f_i(x_{\ell-1})\|^2].$$
We include the second inequality to simplify later arguments. This completes our bound for the first term of (22). For the second term of (22), we recall Lemma 36:
$$\mathbb{E}[\|e^{\xi\text{-SAGA}}_k - \nabla f(x_{k-1})\|^2] \le \frac{3}{n}\sum_{\ell=1}^{k-1}\Big(1-\frac{1}{2n}\Big)^{k-\ell-1}\sum_{i=1}^{n}\mathbb{E}[\|\nabla f_i(x_\ell) - \nabla f_i(x_{\ell-1})\|^2].$$
Combining all of these bounds, we obtain
$$\mathbb{E}[\|e^{\text{SARGE}}_k - e^{\text{SARGE}}_{k-1}\|^2] \le \frac{12}{n}\sum_{i=1}^{n}\mathbb{E}[\|\nabla f_i(x_k) - \nabla f_i(x_{k-1})\|^2] + \frac{3}{2n^2}\mathbb{E}[\|\nabla f(x_{k-1}) - e^{\text{SARGE}}_{k-1}\|^2] + \frac{39}{n^2}\sum_{\ell=1}^{k-1}\Big(1-\frac{1}{2n}\Big)^{k-\ell-1}\sum_{i=1}^{n}\mathbb{E}[\|\nabla f_i(x_\ell) - \nabla f_i(x_{\ell-1})\|^2],$$
which completes the proof.

Lemma 37 allows us to take advantage of the recursive structure of our gradient estimate. With this lemma established, we can prove a bound on the MSE.

Lemma 38 The SARGE gradient estimator satisfies the following recursive bound:
$$\mathbb{E}[\|e^{\text{SARGE}}_k - \nabla f(x_k)\|^2] \le \Big(1-\frac{1}{n}+\frac{3}{2n^2}\Big)\mathbb{E}[\|e^{\text{SARGE}}_{k-1} - \nabla f(x_{k-1})\|^2] + \frac{12}{n}\sum_{i=1}^{n}\mathbb{E}[\|\nabla f_i(x_k) - \nabla f_i(x_{k-1})\|^2] + \frac{39}{n^2}\sum_{\ell=1}^{k-1}\Big(1-\frac{1}{2n}\Big)^{k-\ell-1}\sum_{i=1}^{n}\mathbb{E}[\|\nabla f_i(x_\ell) - \nabla f_i(x_{\ell-1})\|^2].$$
Proof The beginning of our proof is similar to the proof of the variance bound for the SARAH gradient estimator in (Nguyen et al., 2017, Lem. 2):
$$\mathbb{E}_k[\|e^{\text{SARGE}}_k - \nabla f(x_k)\|^2] = \mathbb{E}_k[\|e^{\text{SARGE}}_{k-1} - \nabla f(x_{k-1}) + \nabla f(x_{k-1}) - \nabla f(x_k) + e^{\text{SARGE}}_k - e^{\text{SARGE}}_{k-1}\|^2]$$
$$= \|e^{\text{SARGE}}_{k-1} - \nabla f(x_{k-1})\|^2 + \|\nabla f(x_{k-1}) - \nabla f(x_k)\|^2 + \mathbb{E}_k[\|e^{\text{SARGE}}_k - e^{\text{SARGE}}_{k-1}\|^2] + 2\langle \nabla f(x_{k-1}) - e^{\text{SARGE}}_{k-1},\, \nabla f(x_k) - \nabla f(x_{k-1})\rangle - 2\langle \nabla f(x_{k-1}) - e^{\text{SARGE}}_{k-1},\, \mathbb{E}_k[e^{\text{SARGE}}_k - e^{\text{SARGE}}_{k-1}]\rangle - 2\langle \nabla f(x_k) - \nabla f(x_{k-1}),\, \mathbb{E}_k[e^{\text{SARGE}}_k - e^{\text{SARGE}}_{k-1}]\rangle.$$
We consider each inner product separately. The first inner product is equal to
$$2\langle \nabla f(x_{k-1}) - e^{\text{SARGE}}_{k-1},\, \nabla f(x_k) - \nabla f(x_{k-1})\rangle = -\|\nabla f(x_{k-1}) - e^{\text{SARGE}}_{k-1}\|^2 - \|\nabla f(x_k) - \nabla f(x_{k-1})\|^2 + \|\nabla f(x_k) - e^{\text{SARGE}}_{k-1}\|^2.$$
For the next two inner products, we use the fact that
$$\mathbb{E}_k[e^{\text{SARGE}}_k - e^{\text{SARGE}}_{k-1}] = \mathbb{E}_k\Big[e^{\text{SAGA}}_k - \Big(1-\frac{1}{n}\Big)e^{\xi\text{-SAGA}}_k + \Big(1-\frac{1}{n}\Big)e^{\text{SARGE}}_{k-1}\Big] - e^{\text{SARGE}}_{k-1} = \nabla f(x_k) - \Big(1-\frac{1}{n}\Big)\nabla f(x_{k-1}) - \frac{1}{n}e^{\text{SARGE}}_{k-1} = \nabla f(x_k) - \nabla f(x_{k-1}) + \frac{1}{n}\big(\nabla f(x_{k-1}) - e^{\text{SARGE}}_{k-1}\big).$$
With this equality established, we see that the second inner product is equal to
$$-2\langle \nabla f(x_{k-1}) - e^{\text{SARGE}}_{k-1},\, \mathbb{E}_k[e^{\text{SARGE}}_k - e^{\text{SARGE}}_{k-1}]\rangle = -2\langle \nabla f(x_{k-1}) - e^{\text{SARGE}}_{k-1},\, \nabla f(x_k) - \nabla f(x_{k-1})\rangle - \frac{2}{n}\|\nabla f(x_{k-1}) - e^{\text{SARGE}}_{k-1}\|^2$$
$$= \|\nabla f(x_{k-1}) - e^{\text{SARGE}}_{k-1}\|^2 + \|\nabla f(x_k) - \nabla f(x_{k-1})\|^2 - \|\nabla f(x_k) - e^{\text{SARGE}}_{k-1}\|^2 - \frac{2}{n}\|\nabla f(x_{k-1}) - e^{\text{SARGE}}_{k-1}\|^2 = \Big(1-\frac{2}{n}\Big)\|\nabla f(x_{k-1}) - e^{\text{SARGE}}_{k-1}\|^2 + \|\nabla f(x_k) - \nabla f(x_{k-1})\|^2 - \|\nabla f(x_k) - e^{\text{SARGE}}_{k-1}\|^2.$$
The third inner product can be bounded using a similar procedure:
$$-2\langle \nabla f(x_k) - \nabla f(x_{k-1}),\, \mathbb{E}_k[e^{\text{SARGE}}_k - e^{\text{SARGE}}_{k-1}]\rangle = -2\|\nabla f(x_k) - \nabla f(x_{k-1})\|^2 - \frac{2}{n}\langle \nabla f(x_k) - \nabla f(x_{k-1}),\, \nabla f(x_{k-1}) - e^{\text{SARGE}}_{k-1}\rangle$$
$$\le -2\|\nabla f(x_k) - \nabla f(x_{k-1})\|^2 + \frac{1}{n}\|\nabla f(x_k) - \nabla f(x_{k-1})\|^2 + \frac{1}{n}\|\nabla f(x_{k-1}) - e^{\text{SARGE}}_{k-1}\|^2 = \Big(\frac{1}{n}-2\Big)\|\nabla f(x_k) - \nabla f(x_{k-1})\|^2 + \frac{1}{n}\|\nabla f(x_{k-1}) - e^{\text{SARGE}}_{k-1}\|^2,$$
where the inequality is Young's. Altogether and after applying the full expectation operator, we have
$$\mathbb{E}[\|e^{\text{SARGE}}_k - \nabla f(x_k)\|^2] \le \Big(1-\frac{1}{n}\Big)\mathbb{E}[\|e^{\text{SARGE}}_{k-1} - \nabla f(x_{k-1})\|^2] - \Big(1-\frac{1}{n}\Big)\mathbb{E}[\|\nabla f(x_k) - \nabla f(x_{k-1})\|^2] + \mathbb{E}[\|e^{\text{SARGE}}_k - e^{\text{SARGE}}_{k-1}\|^2] \le \Big(1-\frac{1}{n}\Big)\mathbb{E}[\|e^{\text{SARGE}}_{k-1} - \nabla f(x_{k-1})\|^2] + \mathbb{E}[\|e^{\text{SARGE}}_k - e^{\text{SARGE}}_{k-1}\|^2].$$
Finally, we bound the last term on the right using Lemma 37:
$$\mathbb{E}[\|e^{\text{SARGE}}_k - \nabla f(x_k)\|^2] \le \Big(1-\frac{1}{n}+\frac{3}{2n^2}\Big)\mathbb{E}[\|e^{\text{SARGE}}_{k-1} - \nabla f(x_{k-1})\|^2] + \frac{12}{n}\sum_{i=1}^{n}\mathbb{E}[\|\nabla f_i(x_k) - \nabla f_i(x_{k-1})\|^2] + \frac{39}{n^2}\sum_{\ell=1}^{k-1}\Big(1-\frac{1}{2n}\Big)^{k-\ell-1}\sum_{i=1}^{n}\mathbb{E}[\|\nabla f_i(x_\ell) - \nabla f_i(x_{\ell-1})\|^2].$$
This completes the proof.

Proof of Lemma 28 It is easy to see that $\rho_B = 1/n$ by computing the expectation of the SARGE gradient estimator:
$$\nabla f(x_k) - \mathbb{E}_k[e^{\text{SARGE}}_k] = \nabla f(x_k) - \mathbb{E}_k\Big[e^{\text{SAGA}}_k - \Big(1-\frac{1}{n}\Big)\big(e^{\xi\text{-SAGA}}_k - e^{\text{SARGE}}_{k-1}\big)\Big] = \Big(1-\frac{1}{n}\Big)\big(\nabla f(x_{k-1}) - e^{\text{SARGE}}_{k-1}\big).$$
The result of Lemma 38 makes it clear that $M_1 = 12$. To determine $\rho_M$, we must first choose a suitable sequence $M_k$. Let $M_k = \mathbb{E}[\|e^{\text{SARGE}}_k - \nabla f(x_k)\|^2]$. If $n = 1$, then $M_k = 0$ for all $k$, so it holds trivially that $M_k \le (1-\rho_M)M_{k-1}$. If $n \ge 2$, then $1-\frac{1}{n}+\frac{3}{2n^2} \le 1-\frac{1}{4n}$, so Lemma 38 ensures that with $\rho_M = \frac{1}{4n}$, $M_k \le (1-\rho_M)M_{k-1}$.
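The expectation computations just performed can be checked numerically. The following is a minimal sketch in Python (the names and the least-squares test problem are ours, not from the paper's code) of a single SARGE step written in terms of the SAGA and $\xi$-SAGA estimators; it verifies that $\nabla f(x_k) - \mathbb{E}_k[e^{\text{SARGE}}_k] = (1-\frac{1}{n})(\nabla f(x_{k-1}) - e^{\text{SARGE}}_{k-1})$, i.e. $\rho_B = 1/n$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 6, 4
A = rng.standard_normal((n, p))
b = rng.standard_normal(n)

def grad_i(i, x):
    # gradient of f_i(x) = 0.5 * (a_i^T x - b_i)^2
    return A[i] * (A[i] @ x - b[i])

def full_grad(x):
    return np.mean([grad_i(i, x) for i in range(n)], axis=0)

x_prev, x_cur = rng.standard_normal(p), rng.standard_normal(p)
phi = rng.standard_normal((n, p))   # SAGA table points phi^i_k
xi = rng.standard_normal((n, p))    # lagged table points xi^i_k
e_prev = rng.standard_normal(p)     # previous SARGE estimate e_{k-1}

def sarge(j):
    # e^SARGE_k = e^SAGA_k - (1 - 1/n) * (e^{xi-SAGA}_k - e^SARGE_{k-1})
    table_phi = np.mean([grad_i(i, phi[i]) for i in range(n)], axis=0)
    table_xi = np.mean([grad_i(i, xi[i]) for i in range(n)], axis=0)
    e_saga = grad_i(j, x_cur) - grad_i(j, phi[j]) + table_phi
    e_xi_saga = grad_i(j, x_prev) - grad_i(j, xi[j]) + table_xi
    return e_saga - (1 - 1 / n) * (e_xi_saga - e_prev)

# Expectation over the uniformly random index j_k:
expected = np.mean([sarge(j) for j in range(n)], axis=0)

# Bias identity: rho_B = 1/n
bias = full_grad(x_cur) - expected
predicted = (1 - 1 / n) * (full_grad(x_prev) - e_prev)
assert np.allclose(bias, predicted)

# Increment identity used for the inner products above
increment = expected - e_prev
target = full_grad(x_cur) - full_grad(x_prev) + (full_grad(x_prev) - e_prev) / n
assert np.allclose(increment, target)
```

The second assertion checks the identity $\mathbb{E}_k[e^{\text{SARGE}}_k - e^{\text{SARGE}}_{k-1}] = \nabla f(x_k) - \nabla f(x_{k-1}) + \frac{1}{n}(\nabla f(x_{k-1}) - e^{\text{SARGE}}_{k-1})$ on which the second and third inner-product bounds rely.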
Finally, we must compute $M_2$ and $\rho_F$ with respect to some sequence $F_k$. Lemma 38 motivates the choice $F_k = \sum_{\ell=1}^{k-1}(1-\frac{1}{2n})^{k-\ell-1}\sum_{i=1}^{n}\mathbb{E}[\|\nabla f_i(x_\ell) - \nabla f_i(x_{\ell-1})\|^2]$, and the choices $M_2 = \frac{39}{n}$ and $\rho_F = \frac{1}{2n}$ are clear.

Appendix G. Incorporating Bias to Lower the MSE: An Example

From (5), we argue that biased stochastic gradient estimators can admit better convergence guarantees than unbiased estimators if the bias reduces the total effect of the estimator's MSE and inner-product bias term. In this section, we compare the effects of these terms for the SARAH and SVRG gradient estimators, revealing why SARAH admits better convergence rates.

Beginning with the SARAH estimator, equation (5) reduces to (17), which we reproduce:
$$\eta\,\mathbb{E}_k[F(x_{k+1}) - F(x^\star) + \delta(F(x_{k+1}) - F(x_k))] + \frac{1}{2}\mathbb{E}_k[\|x_{k+1} - x^\star\|^2] \le \frac{1}{2}\|x_k - x^\star\|^2 + \mathbb{E}_k\Big[\frac{\eta(1+\delta)}{2L\lambda}\|e_k - \nabla f(x_k)\|^2 + (1+\delta)\frac{\eta L(\lambda+1) - 1}{2}\|x_{k+1} - x_k\|^2 + \eta(1-\rho_B)\langle \nabla f(x_{k-1}) - e_{k-1},\, x_k - x^\star\rangle\Big].$$
Following the proof of Theorem 18, we set $\lambda = \sqrt{2\Theta}$, which optimizes the step size. The effect of the MSE and bias term over an epoch is then
$$\sum_{k=ms+1}^{m(s+1)-1}\mathbb{E}\Big[\frac{\eta(1+\delta)}{2L\sqrt{2\Theta}}\|\nabla f(x_k) - e_k\|^2 + \eta(1-\rho_B)\langle \nabla f(x_{k-1}) - e_{k-1},\, x_k - x^\star\rangle\Big]$$
$$\overset{1}{\le} \sum_{k=ms+1}^{m(s+1)-1}\mathbb{E}\Big[\Big(\frac{\eta(1+\delta)}{2L\sqrt{2\Theta}} + \frac{m\epsilon\eta(1-\rho_B)}{2}\Big)\|\nabla f(x_k) - e_k\|^2 + \frac{m\eta(1-\rho_B)}{2\epsilon}\|x_{k+1} - x_k\|^2\Big]$$
$$\overset{2}{\le} \sum_{k=ms+1}^{m(s+1)-1}\mathbb{E}\Big[\Big(\frac{\eta(1+\delta)}{2L\sqrt{2\Theta}} + \frac{m\epsilon\eta(1-\rho_B)}{2}\Big)\frac{m}{n}\sum_{i=1}^{n}\|\nabla f_i(x_{k+1}) - \nabla f_i(x_k)\|^2 + \frac{m\eta(1-\rho_B)}{2\epsilon}\|x_{k+1} - x_k\|^2\Big]$$
$$\overset{3}{\le} \sum_{k=ms+1}^{m(s+1)-1}\mathbb{E}\Big[L^2 m\Big(\frac{\eta(1+\delta)}{2L\sqrt{2\Theta}} + \frac{m\epsilon\eta(1-\rho_B)}{2}\Big)\|x_{k+1} - x_k\|^2 + \frac{m\eta(1-\rho_B)}{2\epsilon}\|x_{k+1} - x_k\|^2\Big].$$
Inequality 1 is an application of Lemma 15, 2 is Lemma 27, and 3 uses the Lipschitz continuity of $\nabla f_i$. Setting $\epsilon = (L^2 n)^{-1/2}$ to minimize this bound, setting $\delta = m - 1$, and choosing $m = O(n)$ gives a coefficient of $O(n^{3/2}L\eta)$.

In contrast, B-SVRG admits a larger bound on these terms. For memory-biased estimators, equation (5) reduces to (cf. (11))
$$\eta\,\mathbb{E}_k[F(x_{k+1}) - F(x^\star) + \delta(F(x_{k+1}) - F(x_k))] + \frac{1}{2}\mathbb{E}_k[\|x_{k+1} - x^\star\|^2] \le \frac{1}{2}\|x_k - x^\star\|^2 + \frac{\eta(1+\delta)}{2L\lambda}\mathbb{E}_k[\|e_k - \nabla f(x_k)\|^2] + (1+\delta)\frac{\eta L(\lambda+1) - 1}{2}\mathbb{E}_k[\|x_{k+1} - x_k\|^2] + \frac{\eta L}{n}\Big(1-\frac{1}{\theta}\Big)\sum_{i=1}^{n}\|x_k - \phi^i_k\|^2.$$
The bias term leads to terms of the form $\|x_k - \phi^i_k\|^2$.
Therefore, using $\lambda = \sqrt{2\Theta}$, our goal is to minimize
$$\sum_{k=ms}^{m(s+1)-1}\mathbb{E}\Big[\frac{\eta(1+\delta)}{2L\sqrt{2\Theta}}\|\nabla f(x_k) - e_k\|^2 + \frac{\eta L}{n}\Big(1-\frac{1}{\theta}\Big)\sum_{i=1}^{n}\|x_k - \phi^i_k\|^2\Big]$$
$$\overset{1}{\le} \frac{\eta(1+\delta)}{2L\sqrt{2\Theta}}M_{ms} + \sum_{k=ms}^{m(s+1)-1}\mathbb{E}\Big[\frac{\eta L}{n}\Big(1-\frac{1}{\theta}\Big)\sum_{i=1}^{n}\|x_k - \phi^i_k\|^2\Big]$$
$$\overset{2}{\le} \sum_{k=ms}^{m(s+1)-1}\mathbb{E}\Big[\frac{M_1\eta L(1+\delta)}{2\sqrt{2\Theta}}\|x_{k+1} - x_k\|^2 + \frac{\eta L}{n}\Big(1-\frac{1}{\theta}\Big)\sum_{i=1}^{n}\|x_k - \phi^i_k\|^2\Big] \overset{3}{\le} \sum_{k=ms}^{m(s+1)-1}\mathbb{E}\Big[\frac{M_1\eta L(1+\delta)}{2\sqrt{2\Theta}}\|x_{k+1} - x_k\|^2 + \eta L B_1\Big(1-\frac{1}{\theta}\Big)\|x_{k+1} - x_k\|^2\Big].$$
Inequalities 1 and 2 use the fact that the SVRG estimator satisfies the BMSE property with $\rho_M = 1$. Inequality 3 uses the definition of a memory-biased estimator. With $\delta = \max\{B_1(1-1/\theta)/\sqrt{2\Theta} - 1,\, 0\}$, $B_1 = 3m(m+1)$, $\Theta = M_1$, and $\theta \in (0, 2]$, this gives a coefficient of $O(Lm^2\eta)$.

This difference of $O(n^{1/2})$ between the two bounds is significant. It manifests as an improvement of $\sqrt{n}$ in the convergence rate of SARAH over the rate of B-SVRG.

References

Zeyuan Allen-Zhu. Natasha: Faster non-convex stochastic optimization via strongly non-convex parameter. In Proceedings of the 34th International Conference on Machine Learning, 2017.
Zeyuan Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. Journal of Machine Learning Research, pages 1–51, 2018a.
Zeyuan Allen-Zhu. Katyusha X: Practical momentum method for stochastic sum-of-nonconvex optimization. In Proceedings of the 35th International Conference on Machine Learning, 2018b.
Zeyuan Allen-Zhu. Natasha 2: Faster non-convex optimization than SGD. In Proceedings of the 32nd Annual Conference on Neural Information Processing Systems, 2018c.
Zeyuan Allen-Zhu and Elad Hazan. Variance reduction for faster non-convex optimization. In Proceedings of the 33rd International Conference on Machine Learning, 2016.
Zeyuan Allen-Zhu and Yang Yuan. Improved SVRG for non-strongly-convex or sum-of-non-convex objectives. In Proceedings of the 35th International Conference on Machine Learning, 2018.
Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
Christopher M. Bishop. Pattern recognition and machine learning.
Springer, 2006.
Léon Bottou, Frank E. Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60:223–311, 2018.
Kristian Bredies and Dirk Lorenz. Mathematical Image Processing. Springer, 2018.
Antonin Chambolle, Matthias J. Ehrhardt, Peter Richtárik, and Carola-Bibiane Schönlieb. Stochastic primal-dual hybrid gradient algorithm with arbitrary sampling and imaging applications. SIAM Journal on Optimization, 28(4):2783–2808, 2018.
Aaron Defazio. A simple practical accelerated method for finite sums. In Proceedings of the 30th Annual Conference on Neural Information Processing Systems, 2016.
Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646–1654, 2014a.
Aaron Defazio, Tiberio Caetano, and Justin Domke. Finito: A faster, permutable incremental gradient method for big data problems. In Proceedings of the 31st International Conference on Machine Learning, 2014b.
Derek Driggs, Matthias Ehrhardt, and Carola-Bibiane Schönlieb. Accelerating variance-reduced gradient methods. Mathematical Programming, 2020.
Cong Fang, Chris Junchi Li, Zhouchen Lin, and Tong Zhang. SPIDER: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. In Proceedings of the 32nd Annual Conference on Neural Information Processing Systems, 2018.
Dan Garber and Elad Hazan. Fast and simple PCA via convex optimization. arXiv:1509.05647v4, 2015.
Thomas Hofmann, Aurelien Lucchi, Simon Lacoste-Julien, and Brian McWilliams. Variance reduced stochastic gradient descent with neighbors. In Advances in Neural Information Processing Systems, volume 28, pages 2296–2304, 2015.
Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.
Guanghui Lan, Zhize Li, and Yi Zhou. A unified variance-reduced accelerated gradient method for convex optimization. arXiv:1905.12412, 2019.
Jingwei Liang, Jalal Fadili, and Gabriel Peyré. Activity identification and local linear convergence of Forward–Backward-type methods. SIAM Journal on Optimization, 27(1):408–437, 2017.
Hongzhou Lin, Julien Mairal, and Zaid Harchaoui. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems, 2015.
Pierre-Louis Lions and Bertrand Mercier. Splitting algorithms for the sum of two nonlinear operators. SIAM Journal on Numerical Analysis, 16(6):964–979, 1979.
Julien Mairal. Incremental majorization-minimization optimization with application to large-scale machine learning. Technical report, 2014.
Yurii Nesterov. Introductory lectures on convex optimization. Springer, 2004.
Lam M. Nguyen, Jie Liu, Katya Scheinberg, and Martin Takáč. SARAH: A novel method for machine learning problems using stochastic recursive gradient. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 2613–2621, 2017.
Nhan H. Pham, Lam M. Nguyen, Dzung T. Phan, and Quoc Tran-Dinh. ProxSARAH: An efficient algorithmic framework for stochastic composite nonconvex optimization. arXiv:1902.05679, 2019.
Sashank J. Reddi, Ahmed Hefny, Suvrit Sra, Barnabás Póczos, and Alex Smola. Stochastic variance reduction for nonconvex optimization. In Proceedings of the 33rd International Conference on Machine Learning, 2016a.
Sashank J. Reddi, Suvrit Sra, Barnabás Póczos, and Alex Smola. Fast stochastic methods for nonsmooth nonconvex optimization. In Proceedings of the 30th Annual Conference on Neural Information Processing Systems, 2016b.
Herbert Robbins and Sutton Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22(3):400–407, 1951.
Nicolas Le Roux, Mark Schmidt, and Francis Bach. A stochastic gradient method with an exponential convergence rate for finite training sets.
In Advances in Neural Information Processing Systems, pages 2663–2671, 2012.
Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162:83–112, 2017.
Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14:567–599, 2013.
Fanhua Shang, Licheng Jiao, Kaiwen Zhou, James Cheng, Yan Ren, and Yufei Jin. ASVRG: Accelerated proximal SVRG. In Asian Conference on Machine Learning, volume 95, pages 1–32, 2018.
Zhe Wang, Kaiyi Ji, Yi Zhou, Yingbin Liang, and Vahid Tarokh. SpiderBoost: A class of faster variance-reduced algorithms for nonconvex optimization. arXiv:1810.10690, 2018.
Zhe Wang, Kaiyi Ji, Yi Zhou, Yingbin Liang, and Vahid Tarokh. SpiderBoost and momentum: Faster variance reduction algorithms. In Proceedings of the 33rd Annual Conference on Neural Information Processing Systems, 2019.
Lin Xiao and Tong Zhang. A proximal stochastic gradient method with progressive variance reduction. Technical report, Microsoft Research, 2014.
Kaiwen Zhou, Fanhua Shang, and James Cheng. A simple stochastic variance reduced algorithm with fast convergence rates. In Proceedings of the 35th International Conference on Machine Learning, pages 5975–5984, 2018.