Towards Noise-adaptive, Problem-adaptive (Accelerated) SGD

Sharan Vaswani 1, Benjamin Dubois-Taine 2, Reza Babanezhad 3

Abstract

We aim to make stochastic gradient descent (SGD) adaptive to (i) the noise $\sigma^2$ in the stochastic gradients and (ii) problem-dependent constants. When minimizing smooth, strongly-convex functions with condition number $\kappa$, we prove that $T$ iterations of SGD with exponentially decreasing step-sizes and knowledge of the smoothness can achieve an $O\left(\exp(-T/\kappa) + \sigma^2/T\right)$ rate, without knowing $\sigma^2$. In order to be adaptive to the smoothness, we use a stochastic line-search (SLS) and show (via upper and lower-bounds) that SGD with SLS converges at the desired rate, but only to a neighbourhood of the solution. On the other hand, we prove that SGD with an offline estimate of the smoothness converges to the minimizer. However, its rate is slowed down proportional to the estimation error. Next, we prove that SGD with Nesterov acceleration and exponential step-sizes (referred to as ASGD) can achieve the near-optimal $O\left(\exp(-T/\sqrt{\kappa}) + \sigma^2/T\right)$ rate, without knowledge of $\sigma^2$. When used with offline estimates of the smoothness and strong-convexity, ASGD still converges to the solution, albeit at a slower rate. Finally, we empirically demonstrate the effectiveness of exponential step-sizes coupled with a novel variant of SLS.

1. Introduction

We study unconstrained minimization of a finite-sum objective $f: \mathbb{R}^d \to \mathbb{R}$ prevalent in machine learning,

$$\min_{w \in \mathbb{R}^d} f(w) := \frac{1}{n}\sum_{i=1}^{n} f_i(w). \tag{1}$$

1 Simon Fraser University. 2 DI ENS, École normale supérieure, Université PSL, CNRS, INRIA, 75005 Paris, France. 3 SAIT AI Lab, Montreal. Correspondence to: Sharan Vaswani.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

For supervised learning, $n$ represents the number of training examples, and $f_i$ is the loss on example $i$. We assume $f$ to be a smooth, strongly-convex function and denote by $w^*$ the unique minimizer of the above problem. We study stochastic gradient descent (SGD) and its accelerated variant for minimizing $f$ (Robbins and Monro, 1951; Nemirovski and Yudin, 1983; Nesterov, 2004; Bottou et al., 2018).

The empirical performance and the theoretical convergence of SGD are governed by the choice of its step-size, and there are numerous ways of selecting it. For example, Moulines and Bach (2011); Gower et al. (2019) use a constant step-size for convex and strongly-convex functions. A constant step-size only guarantees convergence to a neighbourhood of the solution. In order to converge to the exact minimizer, a common technique is to decrease the step-size at an appropriate rate, and such decreasing step-sizes have been well-studied (Robbins and Monro, 1951; Ghadimi and Lan, 2012). The rate at which the step-size needs to be decayed depends on the function class under consideration. For example, when minimizing smooth, strongly-convex functions using $T$ iterations of SGD, the step-size is decayed at an $O(1/k)$ rate, where $k$ is the iteration number. This results in a $\Theta(1/T)$ convergence rate for SGD, which is optimal in the stochastic setting (Nguyen et al., 2018). On the other hand, when minimizing a smooth, strongly-convex function with condition number $\kappa$, deterministic (full-batch) gradient descent (GD) with a constant step-size converges linearly at an $O(\exp(-T/\kappa))$ rate.
Augmenting constant step-size GD with Nesterov acceleration can further improve the convergence rate to $\Theta(\exp(-T/\sqrt{\kappa}))$, which is optimal in the deterministic setting (Nesterov, 2004). Hence, the stochastic and deterministic algorithms use different step-size strategies to obtain the optimal rates in their respective settings.

Noise-adaptive SGD: Ideally, we want to design step-size schemes that make SGD adaptive to the noise in the stochastic gradients, matching the optimal convergence rates in both the deterministic and stochastic settings. Furthermore, in order for the algorithm to be practical, it should not require knowledge of the stochasticity (e.g. a bound on $\sigma^2$, the variance in the stochastic gradients). Recently, Khaled and Richtárik (2020); Li et al. (2020) achieved the $O\left(\exp(-T/\kappa) + \sigma^2/T\right)$ rate for smooth functions satisfying the Polyak-Łojasiewicz (PL) condition (Karimi et al., 2016), a generalization of strong-convexity. More importantly, these works are noise-adaptive and do not require the knowledge of $\sigma^2$. For this, Li et al. (2020) use SGD with an exponentially decreasing sequence of step-sizes, while Khaled and Richtárik (2020) use a constant-then-decaying step-size. There are two limitations of these works: (i) they require the knowledge of problem-dependent constants such as the smoothness and strong-convexity of the underlying function, and (ii) they do not match the optimal $\sqrt{\kappa}$ dependence (of the Nesterov accelerated method) in the linear convergence term, and are hence sub-optimal in the deterministic setting. We will address both these limitations in this work.

Towards noise and problem-adaptive SGD: Typically, SGD requires the knowledge of problem-dependent constants to set the step-size. In practice, it is difficult to estimate these quantities, and one can only obtain loose bounds on them. Consequently, there have been numerous methods (Duchi et al., 2011; Li and Orabona, 2019; Kingma and Ba, 2015; Bengio, 2015; Vaswani et al., 2019b; Loizou et al., 2021) that can adapt to the problem and adjust the step-size on the fly. We term such methods problem-adaptive. Unfortunately, it is unclear if such problem-adaptive methods can also be made noise-adaptive. On the other hand, as mentioned above, none of the noise-adaptive methods (Li et al., 2020; Khaled and Richtárik, 2020; Stich, 2019) are problem-adaptive. Amongst these, the noise-adaptive algorithm in Li et al. (2020) only requires knowledge of the smoothness constant, and we try to relax this requirement.

Contribution: In Section 3.2, we use a stochastic line-search (SLS) (Vaswani et al., 2019b) to estimate the smoothness constant on the fly. We prove that SGD in conjunction with exponentially decreasing step-sizes and SLS converges at the desired noise-adaptive rate, but only to a neighbourhood of the solution. This neighbourhood depends on the noise and the error in estimating the smoothness. We prove a corresponding lower-bound that shows the necessity of this neighbourhood term. Our lower-bound shows that if the SGD step-size is adaptively set in an online fashion (using the sampled function), no decreasing sequence of step-sizes can converge to the minimizer.

Contribution: In Section 3.3, we consider estimating the smoothness constant in an offline fashion (before running the algorithm).
We prove that SGD with an offline estimate of the smoothness and exponentially decreasing step-sizes converges to the solution, though its rate is slowed down by a factor proportional to the estimation error in the smoothness. In particular, our upper-bound shows that misestimating the smoothness constant can slow down the convergence rate. We complement this result with a lower-bound showing that this slowdown is unavoidable. Our results thus demonstrate the difficulty of obtaining noise-adaptive rates while being adaptive to problem-dependent parameters.

Noise-adaptive SGD with Nesterov acceleration: We now turn to the second limitation of existing noise-adaptive methods, and aim to use Nesterov acceleration in order to obtain the optimal $O\left(\exp(-T/\sqrt{\kappa}) + \sigma^2/T\right)$ rate, without the knowledge of $\sigma^2$. The work in Jain et al. (2018); Arjevani et al. (2020) satisfies the desired criteria for quadratic functions. For general smooth, strongly-convex functions, Ghadimi and Lan (2013); Kulunchakov and Mairal (2019) obtain the desired rate, but require the knowledge of $\sigma^2$, and are consequently not noise-adaptive. Aybat et al. (2019) propose a multi-stage accelerated algorithm that does not require knowledge of $\sigma^2$. The authors use a dynamical systems analysis and prove that their algorithm achieves the desired optimal rate only for $T \ge 2\sqrt{\kappa}$.

Contribution: In contrast, in Section 4, we use SGD with a stochastic variant of Nesterov acceleration (Cohen et al., 2018; Vaswani et al., 2019a) and the same exponentially decreasing step-sizes. We refer to the resulting method as Accelerated SGD (ASGD). Compared to Aybat et al. (2019), ASGD is a more natural extension of the deterministic Nesterov accelerated gradient method. Under a growth condition (Li et al., 2020; Khaled and Richtárik, 2020; Bottou et al., 2018) similar to Aybat et al. (2019), we use the standard estimating sequences analysis and prove that ASGD achieves the desired rate for all $T$ without the knowledge of $\sigma^2$. Hence, exponentially decreasing step-sizes result in noise-adaptivity for both SGD and ASGD.

Contribution: ASGD requires the knowledge of both the smoothness and strong-convexity parameters. As a step towards problem-adaptivity for ASGD, we analyze its convergence with offline estimates of these problem-dependent constants. To the best of our knowledge, this is the first such result. We prove that, similar to SGD, misspecified ASGD converges to the minimizer, but its rate is slowed down by a factor proportional to the estimation errors.

Contribution: Finally, in Section 5, we evaluate the performance of different step-size schemes on strongly-convex supervised learning problems. We show that (A)SGD with exponential step-sizes consistently outperforms existing noise-adaptive algorithms. We propose a novel variant of SLS that guarantees convergence to the minimizer and demonstrate its practical effectiveness in making (A)SGD problem-adaptive.

Additional contributions: In Appendix B.1.1, we show matching results for SGD on strongly star-convex functions (Hinder et al., 2020), a class of structured non-convex functions. Finally, we prove upper-bounds for non-strongly-convex functions (Appendix B.1.2) and show that even when the smoothness constant is known, exponentially decreasing step-sizes converge to a neighbourhood of the solution. We give some justification as to why polynomially or exponentially decreasing step-sizes are unlikely to be noise-adaptive in this setting.
2. Problem setup and Background

We assume that $f$ and each $f_i$ are differentiable and lower-bounded by $f^*$ and $f_i^*$, respectively. Throughout the paper, we assume that $f$ is $\mu$-strongly convex and that each $f_i$ is convex. We also assume that each function $f_i$ is $L_i$-smooth, implying that $f$ is $L$-smooth with $L := \max_i L_i$ (see Appendix A for the necessary definitions), and define $\kappa := L/\mu$.

We use stochastic gradient descent (SGD) or SGD with Nesterov acceleration (Nesterov, 2004) (referred to as ASGD) to minimize $f$ in Eq. (1). In each iteration $k \in [T]$, SGD selects a function $f_{i_k}$ (typically uniformly) at random, computes its gradient, and takes a descent step. Specifically,

$$w_{k+1} = w_k - \gamma_k \alpha_k \nabla f_{i_k}(w_k), \tag{2}$$

where $w_{k+1}$ and $w_k$ are the SGD iterates, and $\nabla f_{i_k}(\cdot)$ is the gradient of the loss function chosen at iteration $k$. Each stochastic gradient $\nabla f_{i_k}(w)$ is unbiased, implying that $\mathbb{E}_i[\nabla f_i(w)] = \nabla f(w)$. The product of scalars $\eta_k := \gamma_k \alpha_k$ defines the step-size for iteration $k$. The step-size consists of two parts: $\gamma_k$, a problem-dependent scaling term that captures the (local) smoothness of the function; and $\alpha_k$, a problem-independent term that controls the decay of the step-size. Typically, $\alpha_k$ is a decreasing sequence in $k$, and $\lim_{k \to \infty} \alpha_k = 0$. The choice of the $\alpha_k$ sequence depends on the properties of $f$; for example, for smooth, strongly-convex functions, $\alpha_k$ is typically set to be $O(1/k)$. Throughout the paper, we will assume that $T$ is known in advance.

In order to obtain noise-adaptive rates, we consider exponentially decreasing step-sizes (Li et al., 2020) of the form $\alpha_k := \alpha^k$, where $\alpha := \left(\frac{\beta}{T}\right)^{1/T} \le 1$ for a constant $\beta \ge 1$. These step-sizes lie between the constant step-size used in the deterministic setting and the $1/k$ decreasing step-sizes used in the stochastic setting, meaning that for $k \in [T]$, $\frac{\beta}{T} \le \alpha_k \le 1$ (see the sketch below).
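To make the schedule concrete, the following is a minimal sketch (our illustration, not the authors' released implementation) of the SGD update in Eq. (2) with the exponential $\alpha_k$-sequence, applied to a toy least-squares finite sum; all function and variable names here are ours.

```python
import numpy as np

def sgd_exp_stepsizes(grad_i, w0, n, T, L, beta=1.0, seed=0):
    """SGD (Eq. (2)) with gamma_k = 1/L and alpha_k = (beta/T)^(k/T)."""
    rng = np.random.default_rng(seed)
    alpha = (beta / T) ** (1.0 / T)   # per-iteration decay factor; alpha^T = beta/T
    w = np.array(w0, dtype=float)
    for k in range(1, T + 1):
        i = rng.integers(n)           # sample a component f_i uniformly at random
        w -= (1.0 / L) * alpha**k * grad_i(w, i)   # eta_k = gamma_k * alpha_k
    return w

# toy finite sum: f_i(w) = 0.5 * (x_i^T w - y_i)^2, so L_i = ||x_i||^2
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 5)); y = X @ np.ones(5)
grad = lambda w, i: (X[i] @ w - y[i]) * X[i]
L = np.max(np.sum(X**2, axis=1))      # L = max_i L_i
w_T = sgd_exp_stepsizes(grad, np.zeros(5), n=100, T=2000, L=L)
```

Note that $\alpha_1 \approx 1$ (an essentially constant step-size early on) while $\alpha_T = \beta/T$ (the usual $O(1/T)$ magnitude at the end), which is how the schedule interpolates between the two regimes.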
In the next section, we analyze the convergence of SGD with exponentially decreasing step-sizes for smooth, strongly-convex functions.

3. Towards noise & problem adaptive SGD

In this section, we consider approaches for developing noise and problem-adaptive SGD, i.e. we aim to obtain the noise-adaptive rate matching Stich (2019); Li et al. (2020); Khaled and Richtárik (2020), but do so without the knowledge of problem-dependent constants. Instead of the typical assumption of finite gradient noise $z^2 := \mathbb{E}_i\left[\|\nabla f_i(w^*)\|^2\right] < \infty$, we assume a finite optimal objective difference. Specifically, we define the noise as $\sigma^2 := \mathbb{E}_i\left[f_i(w^*) - f_i^*\right] \ge 0$. This notion of noise has been used to study the convergence of constant step-size SGD in the interpolation setting for over-parameterized models (Zhang and Zhou, 2019; Loizou et al., 2021; Vaswani et al., 2020). Note that when interpolation is exactly satisfied, $\sigma = z = 0$. In general, if each function $f_i$ is $\mu$-strongly convex and $L$-smooth, then $\frac{1}{2L} z^2 \le \sigma^2 \le \frac{1}{2\mu} z^2$.

As a warm-up, we first assume knowledge of the smoothness constant in Section 3.1 and analyze the resulting SGD algorithm with exponentially decreasing step-sizes. In Section 3.2, we consider using a stochastic line-search (Vaswani et al., 2019b; 2020) in order to estimate the smoothness constant and set the step-size on the fly. Finally, in Section 3.3, we analyze the convergence of SGD when using an offline estimate of the smoothness.

3.1. Known smoothness

We use the knowledge of smoothness to set the problem-dependent part of the step-size for SGD, specifically, $\gamma_k = 1/L$. With an exponentially decreasing $\alpha_k$-sequence, we prove the following theorem in Appendix C.1.

Theorem 1. Assuming (i) convexity and $L_i$-smoothness of each $f_i$ and (ii) $\mu$ strong-convexity of $f$, SGD (Eq. (2)) with $\gamma_k = \frac{1}{L}$ and $\alpha_k = \left(\frac{\beta}{T}\right)^{k/T}$ converges as,

$$\mathbb{E}\|w_{T+1} - w^*\|^2 \le c_2 \exp\left(-\frac{T}{\kappa}\,\frac{\alpha}{\ln(T/\beta)}\right)\|w_1 - w^*\|^2 + \frac{8\kappa\sigma^2 c_2}{\mu e^2}\,\frac{(\ln(T/\beta))^2}{\alpha^2\, T},$$

where $c_2 = \exp\left(\frac{1}{\kappa}\,\frac{2\beta}{\ln(T/\beta)}\right)$.

Compared to Moulines and Bach (2011), who use polynomially decreasing step-sizes, exponential step-sizes result in a better trade-off between the bias (initial distance to the minimizer) and variance (noise) terms, achieving the desired $O\left(\exp(-T/\kappa) + \sigma^2/T\right)$ noise-adaptive rate. In Lemmas 3 and 4 in Appendix B.3, we show that no polynomially decreasing step-size can result in the desired noise-adaptive rate. In order to interpolate between the stochastic (mini-batch size equal to 1) and fully deterministic (mini-batch size equal to $n$) setting, we show the explicit dependence of $\sigma^2$ on the mini-batch size in Appendix B.2.

Since strongly-convex functions also satisfy the PL condition (Karimi et al., 2016), the above result can be deduced from Li et al. (2020). However, unlike Li et al. (2020), our result does not require the growth condition and uses a weaker notion of noise. Moreover, we use a different proof technique: Li et al. (2020) use the smoothness inequality in the first step and obtain the rate in terms of the function suboptimality $\mathbb{E}[f(w_T) - f^*]$, whereas our proof uses an expansion of the iterates to obtain the rate in terms of the distance to the minimizer, $\mathbb{E}\|w_{T+1} - w^*\|^2$. This change allows us to easily handle the case when the smoothness constant is unknown and needs to be estimated. Next, we use stochastic line-search techniques to estimate the unknown smoothness and set the step-size on the fly.

3.2. Online estimation of unknown smoothness

In this section, we assume that the smoothness constant is unknown, and aim to estimate it and set the step-size in an online fashion. By online estimation, we mean that in iteration $k$, we use the knowledge of the sampled function $i_k$ to set the step-size, i.e. setting $\gamma_k$ depends on $i_k$. We only consider methods that use the knowledge of $i_k$ in iteration $k$ and are not allowed to access the other functions in $f$ (for example, to compute the full-batch gradient at $w_k$). Methods based on a stochastic line-search (Vaswani et al., 2019b; 2020) or the stochastic Polyak step-size (Loizou et al., 2021; Berrada et al., 2020) satisfy this criterion.

We use a stochastic line-search (SLS) to estimate the local Lipschitz constant and set $\gamma_k$, the problem-dependent part of the step-size. SLS is the stochastic analog of the traditional Armijo line-search (Armijo, 1966) used for deterministic gradient descent (Nocedal and Wright, 2006). In iteration $k$, SLS estimates the smoothness constant $L_{i_k}$ of the sampled function using $f_{i_k}$ and $\nabla f_{i_k}$. In particular, starting from a guess $\gamma_{\max}$ of the step-size, SLS uses a backtracking procedure and returns the largest step-size $\gamma_k$ that satisfies $\gamma_k \le \gamma_{\max}$ and

$$f_{i_k}\left(w_k - \gamma_k \nabla f_{i_k}(w_k)\right) \le f_{i_k}(w_k) - c\,\gamma_k \left\|\nabla f_{i_k}(w_k)\right\|^2. \tag{3}$$

Here, $c \in (0, 1)$ is a hyper-parameter to be determined theoretically. SLS guarantees that the resulting step-size $\gamma_k$ lies in the $\left[\min\left\{\frac{2(1-c)}{L_{i_k}}, \gamma_{\max}\right\}, \gamma_{\max}\right]$ range (Lemma 8). If the initial guess is large enough, i.e. $\gamma_{\max} > 1/L_{i_k}$, then the resulting step-size satisfies $\gamma_k \ge \frac{2(1-c)}{L_{i_k}}$. Thus, with $c = 1/2$, SLS can be used to obtain an upper-bound on $1/L_{i_k}$. In the interpolation ($\sigma = 0$) setting, a constant step-size ($\alpha_k = 1$ for all $k$) suffices, and SGD with SLS achieves a linear rate of convergence (for $c \ge 1/2$) when minimizing smooth, strongly-convex functions (Vaswani et al., 2019b).
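The backtracking procedure itself is not spelled out in the text above; the following sketch shows one standard way to search for the largest step-size satisfying Eq. (3). The multiplicative backtracking factor is our assumption.

```python
def stochastic_line_search(f_i, grad_i, w, i, gamma_max, c=0.5, backtrack=0.8):
    """Return the largest gamma <= gamma_max satisfying the SLS condition (Eq. (3))
    on the sampled function f_i (up to the backtracking resolution)."""
    g = grad_i(w, i)
    sq_norm = g @ g
    gamma = gamma_max
    # For an L_i-smooth f_i, the condition holds once gamma <= 2(1 - c)/L_i,
    # so the loop terminates with gamma >= min{backtrack * 2(1-c)/L_i, gamma_max}.
    while f_i(w - gamma * g, i) > f_i(w, i) - c * gamma * sq_norm:
        gamma *= backtrack
    return gamma
```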
In general, for a non-zero $\sigma$, using SGD with SLS and no step-size decay ($\alpha_k = 1$) results in an $O\left(\exp(-T/\kappa) + \gamma_{\max}\sigma^2\right)$ rate (Vaswani et al., 2020), implying convergence to a neighbourhood determined by the $\gamma_{\max}\sigma^2$ term. In order to obtain a similar rate as Theorem 1 but without the knowledge of $L$, we set $\gamma_k$ with SLS and use the same exponentially decreasing $\alpha_k$-sequence. We prove the following theorem in Appendix C.2.

Theorem 2. Under the same assumptions as Theorem 1, SGD (Eq. (2)) with $\alpha_k = \left(\frac{\beta}{T}\right)^{k/T}$ and $\gamma_k$ set as the largest step-size that satisfies $\gamma_k \le \gamma_{\max}$ and Eq. (3) with $c = 1/2$ converges as,

$$\mathbb{E}\|w_{T+1} - w^*\|^2 \le c_1 \exp\left(-\frac{T}{\kappa^*}\,\frac{\alpha}{\ln(T/\beta)}\right)\|w_1 - w^*\|^2 + \frac{8\sigma^2 c_1 (\kappa^*)^2 \gamma_{\max}}{e^2}\,\frac{(\ln(T/\beta))^2}{\alpha^2 T} + \frac{2\sigma^2 c_1 \kappa^* \ln(T/\beta)}{e\,\alpha}\,\gamma_{\text{err}},$$

where $\gamma_{\text{err}} := \gamma_{\max} - \min\left\{\gamma_{\max}, \frac{1}{L}\right\}$, $\kappa^* := \max\left\{\frac{L}{\mu}, \frac{1}{\mu\gamma_{\max}}\right\}$, and $c_1 = \exp\left(\frac{1}{\kappa^*}\,\frac{2\beta}{\ln(T/\beta)}\right)$.

We observe that the first two terms are similar to those in Theorem 1. For $\gamma_{\max} \ge \frac{1}{L}$, $\kappa^* = \kappa$, and the above theorem implies the same $O\left(\exp(-T/\kappa) + \frac{\sigma^2}{T}\right)$ rate of convergence. However, as $T \to \infty$, $w_{T+1}$ does not converge to $w^*$, but rather to a neighbourhood determined by the last term $\frac{2\sigma^2\kappa^* c_1 \ln(T/\beta)}{e\,\alpha}\left[\gamma_{\max} - \min\left\{\gamma_{\max}, \frac{1}{L}\right\}\right]$. The neighbourhood thus depends on the noise $\sigma^2$ and on $\gamma_{\text{err}}$, the estimation error (in the smoothness) of the initial guess. When $\sigma^2 = 0$, this neighbourhood term disappears, and SGD converges to the minimizer despite the estimation error. This matches the result for SLS in the interpolation setting (Vaswani et al., 2019b). Conversely, when the smoothness is known and $\gamma_{\max}$ can be set equal to $\frac{1}{L}$, we also obtain convergence to the minimizer and recover the result of Theorem 1. In fact, if we can guess a value of $\gamma_{\max} \le \frac{1}{L}$, the neighbourhood term becomes zero, thus ensuring convergence to the minimizer. In this case, the stochastic line-search does not decrease the step-size in any iteration, and the algorithm becomes the same as using a constant step-size equal to $\gamma_{\max}$. Finally, we contrast our result with the $\alpha_k = 1$ setting (Vaswani et al., 2020) and observe that instead of the dependence on $\gamma_{\max}$, our neighbourhood term depends on the estimation error in the smoothness. Next, we show the necessity of such a neighbourhood term.

3.2.1. Lower bound on quadratics

In order to prove a lower-bound, we consider a pair of one-dimensional quadratics $f_i(w) = \frac{1}{2}(x_i w - y_i)^2$ for $i = 1, 2$. Here, $w$, $x_i$, $y_i$ are all scalars. The overall function to be minimized is $f(w) = \frac{1}{2}\left[f_1(w) + f_2(w)\right]$. We assume that $|x_1| \ne |x_2|$, and since $L_i = x_i^2$, this assumption implies different smoothness constants for the two functions. For a sufficiently large value of $\gamma_{\max}$, i.e. $\gamma_{\max} \ge \frac{1}{\min_{i \in [2]} L_i}$, using SLS with $c \le 1/2$ (required for convergence) results in $\gamma_k \ge 1/L_{i_k}$ (see Lemma 8).¹ With these choices, we prove the following lower-bound.

¹For one-dimensional quadratics, $\gamma_k = 1/L_{i_k}$ for $c = 1/2$.

Theorem 3. When using $T$ iterations of SGD to minimize the sum $f(w) = \frac{f_1(w) + f_2(w)}{2}$ of two one-dimensional quadratics, $f_1(w) = \frac{1}{2}(w - 1)^2$ and $f_2(w) = \frac{1}{2}\left(2w + \frac{1}{2}\right)^2$, setting $\gamma_k$ using SLS with $\gamma_{\max} \ge 1$ and $c \le 1/2$, any convergent sequence of $\alpha_k$ results in convergence to a neighbourhood of the solution. Specifically, if $w^*$ is the minimizer of $f$ and $w_1 > 0$, then,

$$\mathbb{E}(w_T - w^*) \ge \min\left\{w_1, \frac{3}{8}\right\}.$$

The above result (proved in Appendix D.1) shows that using SGD with SLS to set $\gamma_k$ and any convergent sequence of $\alpha_k$ (including the exponentially-decreasing sequence in Theorem 2) will necessarily result in convergence to a neighbourhood. The neighbourhood term can thus be viewed as the price of misestimating the unknown smoothness constant.
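The construction in Theorem 3 is easy to reproduce numerically. The following small simulation (our illustration, under the assumption that SLS returns exactly $\gamma_k = 1/L_{i_k}$ on these quadratics, as in the footnote above) shows the iterates settling around $3/8$ instead of the minimizer $w^* = 0$.

```python
import numpy as np

# f1(w) = 0.5*(w - 1)^2 with L1 = 1; f2(w) = 0.5*(2w + 0.5)^2 with L2 = 4.
# The minimizer of f = (f1 + f2)/2 is w* = 0.
grads = [lambda w: w - 1.0, lambda w: 4.0 * w + 1.0]
Ls = [1.0, 4.0]

T, beta, runs = 10_000, 1.0, 200
alpha = (beta / T) ** (1.0 / T)
rng = np.random.default_rng(0)
endpoints = []
for _ in range(runs):
    w = 1.0
    for k in range(1, T + 1):
        i = rng.integers(2)
        w -= (1.0 / Ls[i]) * alpha**k * grads[i](w)   # gamma_k = 1/L_{i_k}
    endpoints.append(w)
print(np.mean(endpoints))   # empirically close to 3/8, not to w* = 0
```

The reason is visible in the expected update: $\mathbb{E}[w_{k+1} \mid w_k] = (1 - \alpha_k)\,w_k + \frac{3}{8}\,\alpha_k$, whose fixed point is $3/8$ rather than $w^*$.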
This result is in contrast to the conventional thinking that choosing an $\alpha_k$ sequence such that $\lim_{k\to\infty}\alpha_k = 0$ will always ensure convergence to the minimizer. Note that this result is not specific to SLS and would hold for other related methods (Loizou et al., 2021; Berrada et al., 2020). Since the lower-bound holds for any convergent $\alpha_k$ sequence, a possible reason for this convergence to the neighbourhood is the correlation between $i_k$ and the computation of $\gamma_k$. We investigate this hypothesis in the next section.

3.3. Offline estimation of unknown smoothness

In this section, we consider an offline estimation of the smoothness constant. By offline, we mean that in iteration $k$, $\gamma_k$ is set before sampling $i_k$ and cannot use any information about it. This ensures that $\gamma_k$ is decorrelated from the sampled function $i_k$. The entire sequence of $\gamma_k$ can even be chosen before running SGD. For simplicity of calculations, we consider a fixed $\gamma_k = \gamma$ for all iterations. Here $\gamma$ is an offline estimate of $\frac{1}{L}$ and can be obtained by any method. Without loss of generality, we assume that this offline estimate is off by a multiplicative factor $\nu$, that is, $\gamma = \frac{\nu}{L}$ for some $\nu > 0$. Here $\nu$ quantifies the estimation error in $\gamma$, with $\nu = 1$ corresponding to an exact estimation of $L$. In practice, it is typically possible to obtain lower-bounds on the smoothness constant; hence, the $\nu > 1$ regime is of practical interest.

For SGD with $\gamma_k = \gamma = \frac{\nu}{L}$ and an exponentially decreasing $\alpha_k$-sequence, we prove the following theorem in Appendix C.3.

Theorem 4. Under the same assumptions as Theorem 1, SGD (Eq. (2)) with $\alpha_k = \left(\frac{\beta}{T}\right)^{k/T}$ and $\gamma_k = \frac{\nu}{L}$ converges as,

$$\Delta_{T+1} \le c_2 \exp\left(-\min\{\nu, 1\}\,\frac{T}{\kappa}\,\frac{\alpha}{\ln(T/\beta)}\right)\Delta_1 + \max\{\nu^2, 1\}\,\frac{8 c_2 \kappa \ln(T/\beta)}{\mu\,e^2\,\alpha^2\,T}\left[2\sigma^2\ln(T/\beta) + G\,[\ln(\nu)]_+\right],$$

where $c_2 = \exp\left(\frac{1}{\kappa}\,\frac{2\beta}{\ln(T/\beta)}\right)$, $[x]_+ = \max\{x, 0\}$, $k_0 = \frac{T\,[\ln(\nu)]_+}{\ln(T/\beta)}$, $G = \max_{j\in[k_0]}\{f(w_j) - f^*\}$, and $\Delta_k := \mathbb{E}\|w_k - w^*\|^2$.

The above theorem implies an $O\left(\exp\left(-\min\{\nu,1\}\frac{T}{\kappa}\right) + \max\{\nu^2,1\}\,\frac{\sigma^2 + G\,[\ln(\nu)]_+}{T}\right)$ convergence to the minimizer. The first two terms are similar to those in Theorem 1 and imply an $O\left(\exp(-T/\kappa) + \frac{\sigma^2}{T}\right)$ convergence to the minimizer. Analyzing the third term, we observe that when $\nu \le 1$, it is zero (since $[\ln(\nu)]_+ = 0$), and the rate matches that of Theorem 1 up to constants that depend on $\nu$. The third term depends on $G = \max_{j \le k_0}\{f(w_j) - f^*\}$ because if $\nu > 1$, the step-size $\gamma_k\alpha_k = \frac{\nu}{L}\alpha_k \ge \frac{1}{L}$ initially, and SGD can diverge in this regime. Since $\alpha_k$ is an exponentially decreasing sequence, after $k_0 := \frac{T\ln(\nu)}{\ln(T/\beta)}$ iterations, $\frac{\nu}{L}\alpha_k \le \frac{1}{L}$; the distance to the minimizer decreases after iteration $k_0$, eventually converging to the solution. Furthermore, observe that the second term depends on $\max\{\nu^2, 1\}$, meaning that if we misestimate the smoothness constant by a multiplicative factor of $\nu > 1$, it can slow down the convergence rate by an $O(\nu^2)$ factor. Finally, our theorem implies that even in the deterministic setting, misestimating $L$ can slow down the convergence rate to $O\left(\frac{\nu^2}{T}\right)$ instead of the usual linear rate of convergence. The third term can thus be viewed as the price of misestimating the unknown smoothness constant. Unlike Theorem 2, where this price was convergence to a neighbourhood, here the price of misestimation is slower convergence to the minimizer.

Moulines and Bach (2011) also considered the effect of misspecifying $L$, but in conjunction with polynomially decreasing step-sizes. Specifically, they proved that using a step-size of $\frac{\nu}{L}\,\frac{1}{k^\theta}$ results in the following bounds that depend on $\theta$ and $\nu$ (Moulines and Bach, 2011, Theorem 1).
Below, we show their bounds for three common choices of $\theta \in \{0, 1/2, 1\}$ and emphasize the effect of $\nu$:

$$\Delta_{T+1} = O\left(\exp\left(\left(\nu^2 - \nu/\kappa\right)T\right)\left(\Delta_1 + \sigma^2\right) + \nu\sigma^2\right) \qquad (\text{when } \theta = 0)$$
$$\Delta_{T+1} = O\left(\exp\left(\nu^2\ln(T) - (\nu/\kappa)\sqrt{T}\right)\left(\Delta_1 + \sigma^2\right) + \frac{\nu\sigma^2}{\sqrt{T}}\right) \qquad (\text{when } \theta = 1/2)$$
$$\Delta_{T+1} = O\left(\exp\left(\nu^2 - (\nu/\kappa)\ln(T)\right)\left(\Delta_1 + \sigma^2\right) + \frac{\nu^2\sigma^2}{T^{\nu/\kappa}}\right) \qquad (\text{when } \theta = 1 \text{ and } \nu < 2\kappa)$$
$$\Delta_{T+1} = O\left(\exp\left(\nu^2 - (\nu/\kappa)\ln(T)\right)\left(\Delta_1 + \sigma^2\right) + \frac{\nu^2\sigma^2}{T}\right) \qquad (\text{when } \theta = 1 \text{ and } \nu \ge 2\kappa)$$

Observe that for each regime, the convergence rate depends on $\exp(\nu^2)$. In contrast, the convergence rate in Theorem 4 depends on $O(\nu^2)$. This robustness towards misspecification can be viewed as an additional advantage of using exponentially decreasing step-sizes. In the next section, we justify the dependence on $[\ln(\nu)]_+$ in Theorem 4 by proving a corresponding lower-bound. It is unclear whether the $\nu^2$ dependence in Theorem 4 is tight, and we leave verifying this for future work.

3.3.1. Lower bound on quadratics

In this section, we consider gradient descent on a one-dimensional quadratic and study the effect of misestimating the smoothness constant by a factor of $\nu > 1$. We consider minimizing a single quadratic, ensuring that $\sigma^2 = 0$, and prove the following lower-bound in Appendix D.2.

Theorem 5. When minimizing a one-dimensional quadratic function $f(w) = \frac{1}{2}(xw - y)^2$, GD with $\alpha_k = \left(\frac{\beta}{T}\right)^{k/T}$ and $\gamma_k = \frac{\nu}{L}$ for $\nu > 3$ satisfies

$$w_{k+1} - w^* = (w_1 - w^*)\prod_{i=1}^{k}\left(1 - \nu\alpha_i\right).$$

After $k^* := \frac{T}{\ln(T/\beta)}\ln\left(\frac{\nu}{3}\right)$ iterations, we have that $|w_{k^*+1} - w^*| \ge 2^{k^*}\,|w_1 - w^*|$.

Instantiating this lower-bound, suppose the estimate of $L$ is off by a factor of $\nu = 10$; then $\ln\left(\frac{\nu}{3}\right) \ge 1$, which implies that $k^* \ge \frac{T}{\ln(T/\beta)}$. In other words, we do not make any progress in the first $\frac{T}{\ln(T/\beta)}$ iterations, and at this point the optimality gap has been multiplied by a factor of $2^{T/\ln(T/\beta)}$ compared to the starting optimality gap. This simple example shows the slowdown in the rate of convergence caused by misestimating the smoothness.

4. Towards noise & problem adaptive ASGD

In this section, we will first aim to use SGD with Nesterov acceleration and obtain the optimal $O\left(\exp\left(-\frac{T}{\sqrt{\kappa\rho}}\right) + \frac{\sigma^2}{T}\right)$ rate without knowledge of $\sigma^2$. Subsequently, we will analyze the convergence of ASGD with offline estimates of the smoothness and strong-convexity parameters, quantifying the price of misspecification (similar to Section 3.3).

ASGD maintains two sequences $\{w_k, y_k\}$ and an additional extrapolation parameter $b_k$. ASGD computes the stochastic gradient at the extrapolated point $y_k$ and takes a descent step in that direction. The update in iteration $k$ is:

$$y_k = w_k + b_k\,(w_k - w_{k-1}), \tag{4}$$
$$w_{k+1} = y_k - \gamma_k\alpha_k\,\nabla f_{i_k}(y_k). \tag{5}$$

For analyzing the convergence of ASGD, we will assume that the stochastic gradients satisfy a growth condition similar to Bottou et al. (2018); Li et al. (2020); Khaled and Richtárik (2020):² there exists a pair $(\rho, \sigma)$ with $\rho \ge 1$ and $\sigma \ge 0$, such that for all $w$,

$$\mathbb{E}_i\|\nabla f_i(w)\|^2 \le \rho\,\|\nabla f(w)\|^2 + \sigma^2. \tag{6}$$

In the deterministic setting (when using the full gradient in Eq. (5)), $\rho = 1$ and $\sigma = 0$. Similarly, $\sigma = 0$ when the stochastic gradients satisfy the strong-growth condition when using over-parameterized models (Schmidt and Roux, 2013; Vaswani et al., 2019a).³ Using this growth condition, we prove the following result in Appendix E.3.

Theorem 6. Under the same assumptions as Theorem 1 and (iii) the growth condition in Eq. (6), ASGD (Eqs. (4) and (5)) with $w_1 = y_1$, $\gamma_k = \frac{1}{\rho L}$, $\alpha_k = \left(\frac{\beta}{T}\right)^{k/T}$, $r_k = \sqrt{\frac{\mu}{\rho L}}\left(\frac{\beta}{T}\right)^{k/2T}$, and $b_k = \frac{(1 - r_{k-1})\,r_{k-1}\,\alpha}{r_k + r_{k-1}^2\,\alpha}$ converges as,

$$\Delta_{T+1} \le 2 c_3 \exp\left(-\frac{T}{\sqrt{\kappa\rho}}\,\frac{\alpha}{\ln(T/\beta)}\right)\Delta_1 + \frac{8 c_3\,\sqrt{\kappa\rho}\,\sigma^2}{\rho\mu\,e^2}\,\frac{(\ln(T/\beta))^2}{\alpha^2\,T},$$

where $\Delta_k := \mathbb{E}[f(w_k) - f^*]$ and $c_3 = \exp\left(\frac{1}{\sqrt{\rho\kappa}}\,\frac{2\beta}{\ln(T/\beta)}\right)$.
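A minimal sketch of the ASGD recursion in Eqs. (4)-(5) with this schedule is given below. The $b_k$ formula follows our reading of the garbled statement of Theorem 6 in the source, so treat the exact extrapolation constants as an assumption; the structure of the updates is as in the equations above.

```python
import numpy as np

def asgd_exp(grad_i, w0, n, T, L, mu, rho=1.0, beta=1.0, seed=0):
    """Stochastic Nesterov acceleration (Eqs. (4)-(5)) with exponentially
    decreasing step-sizes alpha_k = (beta/T)^(k/T) and gamma_k = 1/(rho*L)."""
    rng = np.random.default_rng(seed)
    alpha = (beta / T) ** (1.0 / T)
    r = lambda k: np.sqrt(mu / (rho * L)) * (beta / T) ** (k / (2.0 * T))
    w_prev = w = np.array(w0, dtype=float)
    for k in range(1, T + 1):
        # extrapolation weight b_k (our reconstruction of the schedule)
        b = (1 - r(k - 1)) * r(k - 1) * alpha / (r(k) + r(k - 1) ** 2 * alpha)
        y = w + b * (w - w_prev)                                   # Eq. (4)
        i = rng.integers(n)
        w_prev, w = w, y - (alpha**k / (rho * L)) * grad_i(y, i)   # Eq. (5)
    return w
```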
²The growth condition in Bottou et al. (2018, Eq. (4.8)) is $\mathbb{E}_i\|\nabla f(w) - \nabla f_i(w)\|^2 \le M_V\,\|\nabla f(w)\|^2 + M$ for some constants $M$ and $M_V$. Since the stochastic gradient is unbiased, this is the same as Eq. (6) with $\rho = M_V + 1$ and $\sigma^2 = M$.

³Note that the noise-model in Section 3 is equivalent (up to constants) to assuming that the variance of the stochastic gradients at the optimum is bounded. However, the noise-model in Eq. (6) is equivalent to assuming that the variance of the stochastic gradients at any iterate is bounded. Hence, the two noise-models are similar, though Eq. (6) makes a stronger assumption on the noise.

The above theorem implies that ASGD achieves an $O\left(\exp\left(-\frac{T}{\sqrt{\kappa\rho}}\right) + \frac{\sigma^2}{T}\right)$ convergence rate. This improves over the non-accelerated $O\left(\exp\left(-\frac{T}{\kappa}\right) + \frac{\sigma^2}{T}\right)$ noise-adaptive rate obtained in Theorem 1 and in Stich (2019); Khaled and Richtárik (2020); Li et al. (2020).⁴ In the fully-deterministic setting ($\rho = 1$ and $\sigma = 0$), Theorem 6 implies an $O(\exp(-T/\sqrt{\kappa}))$ convergence to the minimizer, matching the optimal rate in the deterministic setting (Nesterov, 2004). Under the strong-growth condition (when $\sigma = 0$), ASGD improves (by a $\sqrt{\rho}$ factor) over the rate in Vaswani et al. (2019a) and matches (up to log factors) the rate in Mishkin (2020). In the general stochastic case, when $\sigma \ne 0$, Cohen et al. (2018); Vaswani et al. (2019a) use constant step-sizes ($\alpha_k = 1$) and prove convergence to a neighbourhood of the solution, whereas we show convergence to the minimizer at a rate governed by the $O(\sigma^2/T)$ term. To smoothly interpolate between the stochastic (batch size equal to 1) and fully deterministic (batch size equal to $n$) setting, we generalize the growth condition to show an explicit dependence on the batch size (Appendix B).

Comparing our result to that in Aybat et al. (2019), we note that they prove the accelerated noise-adaptive rate under two different assumptions on the noise: (i) bounded variance of the stochastic gradients, which is a special case of the growth condition in Eq. (6) with $\rho = 1$; and (ii) a weaker assumption which is equivalent to Eq. (6) up to constants (see Appendix G for a detailed comparison of the two assumptions and the resulting upper-bounds). Using a dynamical systems perspective, Aybat et al. (2019, Corollary 3.8) prove the desired rate only for $T \ge 2\sqrt{\kappa}$, whereas our proof uses the more standard Nesterov estimate sequences, and our result holds for all $T$.

The result in Theorem 6 requires the knowledge of both $\mu$ and $L$ and is thus not problem-adaptive. In the next section, we analyze the convergence of ASGD when it is used with offline estimates of $L$ and $\mu$.

4.1. Offline estimation of unknown smoothness & strong-convexity

Similar to Section 3.3, for simplicity, we will assume that $\gamma_k$ is set using an offline estimate $\hat{L}$ of the smoothness, where, without loss of generality, $\frac{1}{\hat{L}} = \frac{\nu_L}{L}$ for some $\nu_L > 0$. Similarly, we use $\hat{\mu}$ as the offline estimate of the strong-convexity, and assume that $\hat{\mu} = \nu_\mu\mu$. We will only consider the case where we underestimate $\mu$, and hence $\nu_\mu \le 1$. This is the typical case in practice.

⁴Though we typically expect $\rho < \kappa$, in the worst-case stochastic setting, $\rho$ can be as large as $\kappa$, and we cannot obtain an accelerated rate (with a dependence on $\sqrt{\kappa}$) in this case (Liu and Belkin, 2018). Mini-batching (see Appendix B.2) is one way to decrease the effective $\rho$. For a large enough batch-size, $\rho < \kappa$, ensuring a $\sqrt{\kappa}$ dependence in the stochastic setting.
For example, while optimizing regularized convex loss functions in supervised learning (see Section 5 for empirical results), $\hat{\mu}$ is set to the regularization strength and thus underestimates the true strong-convexity parameter. The following theorem (proved in Appendix E.4.1) analyzes the effect of misspecifying $L$ and $\mu$ on the ASGD convergence.

Theorem 7. Under the same assumptions as Theorem 6 and (iv) $\nu := \nu_L\nu_\mu \le \rho\kappa$, ASGD (Eqs. (4) and (5)) with $w_1 = y_1$, $\gamma_k = \frac{1}{\rho\hat{L}} = \frac{\nu_L}{\rho L}$, $\alpha_k = \left(\frac{\beta}{T}\right)^{k/T}$, $\hat{\mu} = \nu_\mu\mu \le \mu$, $r_k = \sqrt{\frac{\hat{\mu}}{\rho\hat{L}}}\left(\frac{\beta}{T}\right)^{k/2T} = \sqrt{\frac{\nu}{\rho\kappa}}\left(\frac{\beta}{T}\right)^{k/2T}$, and $b_k = \frac{(1 - r_{k-1})\,r_{k-1}\,\alpha}{r_k + r_{k-1}^2\,\alpha}$ converges as,

$$\Delta_{T+1} \le 2 c_3 \exp\left(-\sqrt{\min\{\nu, 1\}}\,\frac{T}{\sqrt{\kappa\rho}}\,\frac{\alpha}{\ln(T/\beta)}\right)\Delta_1 + \frac{2 c_3 (\ln(T/\beta))^2}{e^2\,\alpha^2\,T}\left[\frac{\sigma^2}{\rho\mu} + G^2\,\min\left\{\frac{k_0}{T}, 1\right\}\right]\max\left\{\frac{\nu_L}{\nu_\mu}, \nu_L^2\right\},$$

where $\Delta_k := \mathbb{E}[f(w_k) - f^*]$, $c_3 = \exp\left(\frac{1}{\sqrt{\rho\kappa}}\,\frac{2\beta}{\ln(T/\beta)}\right)$, $[x]_+ = \max\{x, 0\}$, $k_0 := \frac{T\,[\ln(\nu_L)]_+}{\ln(T/\beta)}$, and $G = \max_{j\in[k_0]}\|\nabla f(y_j)\|$.

The above theorem implies an $O\left(\exp\left(-\sqrt{\frac{\min\{\nu, 1\}}{\kappa\rho}}\,T\right) + \left[\frac{\sigma^2 + G^2\,[\ln(\nu_L)]_+}{T}\right]\max\left\{\frac{\nu_L}{\nu_\mu}, \nu_L^2\right\}\right)$ convergence to the minimizer. Observe that (i) when the problem-dependent parameters are known ($\nu_\mu = \nu_L = 1$), we recover the rate of Theorem 6; (ii) if $\nu_\mu = 1$ and we misestimate $L$, then, similar to SGD (Theorem 4), ASGD converges to the minimizer at an $O(1/T)$ rate, even in the deterministic setting (when $\sigma = 0$); (iii) if $\nu_L = 1$, underestimating $\mu$ matches the rate in Theorem 6 up to (potentially large) constants, resulting in linear convergence when $\sigma = 0$; and (iv) compared to Theorem 6, the decrease in the bias term is slowed down by an $O\left(\exp\left(\sqrt{\min\{\nu_L\nu_\mu, 1\}}\right)\right)$ factor, whereas the decrease in the variance is slowed by an $O\left(\max\left\{\frac{\nu_L}{\nu_\mu}, \nu_L^2\right\}\right)$ factor.

In the next section, we design an SLS variant that ensures convergence to the minimizer while empirically controlling the misestimation for both SGD and ASGD.

5. Experiments

For comparing different step-size choices, we consider two common supervised learning losses: the squared loss for regression tasks and the logistic loss for classification.⁵ With a linear model and an $\ell_2$ regularization equal to $\frac{\lambda}{2}\|w\|^2$, both objectives are strongly-convex. We use three standard datasets from LIBSVM (Chang and Lin, 2011), namely mushrooms, ijcnn and rcv1, and use $\lambda = 0.01$. For each experiment, we consider 5 independent runs and plot the average result and standard deviation. We use the (full) gradient norm as the performance measure and plot it against the number of gradient evaluations.

⁵The code to reproduce our experiments is available here: https://github.com/R3za/expsls
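For concreteness, the following sketch sets up the regularized logistic objective and the theoretical smoothness upper-bound used below; it is our illustration (the released code linked in the footnote is the authoritative implementation), and the data-loading step is assumed.

```python
import numpy as np

def logistic_setup(X, y, lam=0.01):
    """lambda-regularized logistic loss for labels y_i in {-1, +1}.
    Returns a per-example gradient oracle and the smoothness upper-bound
    L = max_i ||x_i||^2 / 4 + lambda used in Section 5."""
    def grad_i(w, i):
        margin = -y[i] * (X[i] @ w)
        s = 1.0 / (1.0 + np.exp(-margin))     # sigmoid of the negative margin
        return -y[i] * s * X[i] + lam * w     # gradient of f_i at w
    L = np.max(np.sum(X**2, axis=1)) / 4.0 + lam
    mu_hat = lam                              # underestimate of mu, as in Theorem 7
    return grad_i, L, mu_hat
```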
[Figure 1: full gradient norm (log scale) versus iteration on the three datasets, for (a) the squared loss and (b) the logistic loss, comparing K-EXP, SLS-EXP, K-CNST, M-ASG, ACC-K-EXP, ACC-SLS-EXP, ACC-K-CNST and KR-20.]

Figure 1. Comparison for (a) squared loss and (b) logistic loss. Observe that exponentially decreasing step-sizes (i) result in more stable performance compared to using a constant step-size (for both SGD and ASGD), (ii) consistently outperform the noise-adaptive methods KR-20 and M-ASG, and (iii) methods using the SLS in Eq. (7) match the performance of those with known smoothness.

For each dataset, we fix $T = 10n$, use a batch-size of 1, and compare the performance of the following optimization strategies: (i) the noise-adaptive constant-and-then-decaying step-size scheme in Khaled and Richtárik (2020, Theorem 3) (denoted KR-20 in the plots); specifically, for $b = \max\left\{\frac{2L^2}{\mu}, 2\rho L\right\}$, we use a constant step-size equal to $1/b$ when $T < b/\mu$ or $k < \lceil T/2\rceil$, and otherwise we set the step-size at iteration $k$ to be $\frac{2}{\mu\left((2b/\mu) + k - \lceil T/2\rceil\right)}$; (ii) constant step-size SGD with $\gamma_k = \frac{1}{L}$ and $\alpha_k = 1$ for all $k$ (denoted K-CNST); (iii) SGD with an exponentially decreasing step-size and knowledge of the smoothness (Li et al., 2020), i.e. $\gamma_k = \frac{1}{L}$ and $\alpha_k = \alpha^k$ for $\alpha = \left(\frac{\beta}{T}\right)^{1/T}$ (denoted K-EXP); (iv) accelerated SGD (ASGD) with a constant step-size ($\alpha_k = 1$ for all $k$) (Vaswani et al., 2019a; Cohen et al., 2018) (denoted ACC-K-CNST); (v) ASGD with exponentially decreasing step-sizes (Section 4), denoted ACC-K-EXP; and (vi) multistage ASGD (M-ASG) from Aybat et al. (2019) with parameters as in their Corollary 3.8; specifically, we ensure that $T \ge 2\sqrt{\kappa}$ and set $T_1 = T/C$, $T_k \ge 2^k\left\lceil\sqrt{\kappa}\,\log(2(p+2))\right\rceil$, $\alpha_1 = 1/L$ and $\alpha_k = 1/(2^{2k}L)$, where $p$ and $C$ are hyper-parameters.

None of the above strategies is problem-adaptive, and all of them require the knowledge of the smoothness constant $L$. Additionally, KR-20 and the ASGD variants also require knowledge of $\rho$, the parameter of the growth condition in Eq. (6), while the ASGD variants and M-ASG require knowledge of $\mu$. If $x_i$ is the feature vector corresponding to example $i$, then we obtain theoretical upper-bounds on the smoothness and set $L = \max_i \|x_i\|^2 + \lambda$ for the squared loss and $L = \max_i \frac{1}{4}\|x_i\|^2 + \lambda$ for the logistic loss. Similarly, we set $\hat{\mu} = \lambda$ for both the squared and logistic loss. Note that this underestimates the true strong-convexity parameter and is in line with Theorem 7. To set $\rho$, we use a grid search over $\{10, 100, 1000\}$. Similarly, to set $p$ and $C$ for M-ASG, we use a grid search over $\{1, 2, 4\}$ and $\{2, 10, 100\}$ respectively. For each method, we plot the variant that results in the smallest gradient norm.

Using a stochastic line-search (SLS) to estimate $L$ can result in convergence to a neighbourhood (Section 3.2) because of the correlations between $i_k$ and $\gamma_k$. To alleviate this, and still be problem-adaptive, we design a decorrelated, conservative variant of SLS: at iteration $k$ of SGD, we set $\gamma_k$ using a stochastic line-search on the previously sampled function $i_{k-1}$ (we can use a randomly sampled $j_k$ as well). This ensures that there is no correlation between $i_k$ and the computation of $\gamma_k$. The overall procedure can be described as follows: starting from $\gamma_{k-1}$ (the conservative aspect), with $\gamma_0 = \gamma_{\max}$, find the largest step-size $\gamma_k$ that satisfies, for a random or previously sampled index $j_k$,

$$f_{j_k}\left(w_k - \gamma_k\nabla f_{j_k}(w_k)\right) \le f_{j_k}(w_k) - c\,\gamma_k\left\|\nabla f_{j_k}(w_k)\right\|^2, \tag{7}$$

and update $w_k$ according to Eq. (2); a code sketch of this procedure is given below. The above procedure with $c = 1/2$ ensures that $\gamma_k \in \left[\min\{\gamma_{k-1}, 1/L\}, \gamma_{k-1}\right]$. Since there is no correlation between $\gamma_k$ and $i_k$, we can treat $\gamma_k$ as an offline estimate of the smoothness, meaning that $\gamma_k = \nu_k/L$ for some $\nu_k > 0$. Moreover, since we are using a conservative line-search, $\gamma_k \in \left[\min\{\gamma_{k-1}, 1/L\}, \gamma_{k-1}\right]$, meaning that $\nu_k \le \nu_{k-1} \le \ldots \le \nu_1$. Hence the maximum misspecification in the smoothness is given by $\nu_1 > 1$, which is governed by the line-search in the first iteration. Given this, we can use a similar analysis as Theorem 4, upper-bounding $\nu_k$ by $\nu_1$ for each $k$ and obtaining the corresponding result in terms of $\nu_1$.
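The following is a minimal sketch of one SGD iteration with this decorrelated, conservative SLS; the backtracking factor and function names are our assumptions.

```python
def conservative_sls_step(f_i, grad_i, w, j_k, gamma_prev, c=0.5, backtrack=0.8):
    """Largest gamma <= gamma_prev satisfying Eq. (7) on index j_k, which is the
    previously sampled (or a freshly drawn) index, *not* the index i_k used for
    the update. Starting from gamma_prev makes the search conservative."""
    g = grad_i(w, j_k)
    gamma = gamma_prev
    while f_i(w - gamma * g, j_k) > f_i(w, j_k) - c * gamma * (g @ g):
        gamma *= backtrack
    return gamma

# inside the SGD loop (sketch): j_k = i_prev; i_k drawn fresh
# gamma = conservative_sls_step(f_i, grad_i, w, i_prev, gamma)
# w = w - gamma * alpha**k * grad_i(w, i_k)
# i_prev = i_k
```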
We use this variant of SLS with exponentially decreasing step-sizes for both SGD and ASGD, and denote the resulting variants as SLS-EXP and ACC-SLS-EXP respectively. We emphasize that this strategy is both noise-adaptive and problem-adaptive. From Fig. 1, we observe that exponentially decreasing step-sizes (i) result in more stable performance compared to the constant step-size variants (for both SGD and ASGD) and (ii) consistently outperform the noise-adaptive methods KR-20 and M-ASG. We also observe that (iii) the methods using the SLS condition in Eq. (7) (SLS-EXP and ACC-SLS-EXP) consistently match the performance of those with known smoothness (K-EXP and ACC-K-EXP).

6. Conclusion

We used exponentially decreasing step-sizes to make SGD noise-adaptive and considered two strategies for problem-adaptivity. Using upper and lower-bounds, we quantified the price of problem-adaptivity: estimating the smoothness in an online fashion results in convergence to a neighbourhood of the solution, while an offline estimation results in a slower convergence to the minimizer. We then developed an accelerated variant of SGD (ASGD) and proved that it achieves the near-optimal convergence rate. We analyzed the effect of misspecifying the strong-convexity and smoothness parameters for ASGD. Finally, we empirically demonstrated the effectiveness of (A)SGD with exponential step-sizes coupled with a novel variant of SLS.

7. Acknowledgements

We would like to thank Si Yi Meng, Yifan Sun and Frederik Kunstner for helpful feedback on the paper. Benjamin Dubois-Taine would like to acknowledge support from the European Research Council (grant SEQUOIA 724063) and funding by the French government under management of Agence Nationale de la Recherche as part of the "Investissements d'avenir" program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute).

References

Arjevani, Y., Shamir, O., and Srebro, N. (2020). A tight convergence analysis for stochastic gradient descent with delayed updates. In Algorithmic Learning Theory, pages 111-132. PMLR.

Armijo, L. (1966). Minimization of functions having Lipschitz continuous first partial derivatives. Pacific Journal of Mathematics.

Aybat, N. S., Fallah, A., Gurbuzbalaban, M., and Ozdaglar, A. (2019). A universally optimal multistage accelerated stochastic gradient method. Advances in Neural Information Processing Systems, 32:8525-8536.

Bengio, Y. (2015). RMSProp and equilibrated adaptive learning rates for nonconvex optimization. CoRR, abs/1502.04390.

Berrada, L., Zisserman, A., and Kumar, M. P. (2020). Training neural networks for and by interpolation. In International Conference on Machine Learning, pages 799-809. PMLR.

Bottou, L., Curtis, F. E., and Nocedal, J. (2018). Optimization methods for large-scale machine learning. SIAM Review, 60(2):223-311.

Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3):1-27. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Cohen, M., Diakonikolas, J., and Orecchia, L. (2018). On acceleration with noise-corrupted gradients. In International Conference on Machine Learning, pages 1019-1028. PMLR.

Duchi, J. C., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121-2159.

Ghadimi, S. and Lan, G. (2012). Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization I: A generic algorithmic framework.
SIAM Journal on Optimization, 22(4):1469-1492.

Ghadimi, S. and Lan, G. (2013). Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, II: Shrinking procedures and optimal algorithms. SIAM Journal on Optimization, 23(4):2061-2089.

Gower, R., Sebbouh, O., and Loizou, N. (2021). SGD for structured nonconvex functions: Learning rates, minibatching and interpolation. In International Conference on Artificial Intelligence and Statistics, pages 1315-1323. PMLR.

Gower, R. M., Loizou, N., Qian, X., Sailanbayev, A., Shulgin, E., and Richtárik, P. (2019). SGD: General analysis and improved rates. In ICML.

Hinder, O., Sidford, A., and Sohoni, N. (2020). Near-optimal methods for minimizing star-convex functions and beyond. In Conference on Learning Theory, pages 1894-1938. PMLR.

Jain, P., Kakade, S. M., Kidambi, R., Netrapalli, P., and Sidford, A. (2018). Accelerating stochastic gradient descent for least squares regression. In Conference On Learning Theory, pages 545-604. PMLR.

Karimi, H., Nutini, J., and Schmidt, M. (2016). Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 795-811. Springer.

Khaled, A. and Richtárik, P. (2020). Better theory for SGD in the nonconvex world. arXiv preprint arXiv:2002.03329.

Kingma, D. and Ba, J. (2015). Adam: A method for stochastic optimization. In ICLR.

Kleinberg, B., Li, Y., and Yuan, Y. (2018). An alternative view: When does SGD escape local minima? In International Conference on Machine Learning, pages 2698-2707. PMLR.

Kulunchakov, A. and Mairal, J. (2019). Estimate sequences for variance-reduced stochastic composite optimization. In International Conference on Machine Learning, pages 3541-3550. PMLR.

Levy, K. Y., Yurtsever, A., and Cevher, V. (2018). Online adaptive methods, universality and acceleration. In Advances in Neural Information Processing Systems, NeurIPS.

Li, X. and Orabona, F. (2019). On the convergence of stochastic gradient descent with adaptive stepsizes. In AISTATS.

Li, X., Zhuang, Z., and Orabona, F. (2020). A second look at exponential and cosine step sizes: Simplicity, convergence, and performance. arXiv preprint arXiv:2002.05273.

Liu, C. and Belkin, M. (2018). Accelerating SGD with momentum for over-parameterized learning. arXiv preprint arXiv:1810.13395.

Lohr, S. L. (2019). Sampling: Design and Analysis. Chapman and Hall/CRC.

Loizou, N., Vaswani, S., Laradji, I. H., and Lacoste-Julien, S. (2021). Stochastic Polyak step-size for SGD: An adaptive learning rate for fast convergence. In International Conference on Artificial Intelligence and Statistics, pages 1306-1314. PMLR.

Lucas, J., Bae, J., Zhang, M. R., Fort, S., Zemel, R., and Grosse, R. (2021). Analyzing monotonic linear interpolation in neural network loss landscapes. arXiv preprint arXiv:2104.11044.

Mishkin, A. (2020). Interpolation, growth conditions, and stochastic gradient descent. PhD thesis, University of British Columbia.

Moulines, E. and Bach, F. (2011). Non-asymptotic analysis of stochastic approximation algorithms for machine learning. Advances in Neural Information Processing Systems, 24:451-459.

Nemirovski, A. and Yudin, D. B. (1983). Problem Complexity and Method Efficiency in Optimization. Wiley Interscience.

Nesterov, Y. (2004). Introductory Lectures on Convex Optimization: A Basic Course.
Springer Science & Business Media.

Nguyen, P. H., Nguyen, L. M., and van Dijk, M. (2018). Tight dimension independent lower bound on the expected convergence rate for diminishing step sizes in SGD. arXiv preprint arXiv:1810.04723.

Nocedal, J. and Wright, S. (2006). Numerical Optimization. Springer Science & Business Media.

Robbins, H. and Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, pages 400-407.

Schmidt, M. and Roux, N. L. (2013). Fast convergence of stochastic gradient descent under a strong growth condition. arXiv preprint arXiv:1308.6370.

Stich, S. U. (2019). Unified optimal analysis of the (stochastic) gradient method. arXiv preprint arXiv:1907.04232.

Vaswani, S., Bach, F., and Schmidt, M. (2019a). Fast and faster convergence of SGD for over-parameterized models and an accelerated perceptron. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1195-1204. PMLR.

Vaswani, S., Laradji, I. H., Kunstner, F., Meng, S. Y., Schmidt, M., and Lacoste-Julien, S. (2020). Adaptive gradient methods converge faster with over-parameterization (and you can do a line-search).

Vaswani, S., Mishkin, A., Laradji, I., Schmidt, M., Gidel, G., and Lacoste-Julien, S. (2019b). Painless stochastic gradient: Interpolation, line-search, and convergence rates. Advances in Neural Information Processing Systems, 32:3732-3745.

Zhang, L. and Zhou, Z.-H. (2019). Stochastic approximation of smooth and strongly convex functions: Beyond the O(1/t) convergence rate. In Conference on Learning Theory, pages 3160-3179. PMLR.

Organization of the Appendix

A. Definitions
B. Additional theoretical results
C. Upper-bound Proofs for Section 3
D. Lower-bound Proofs for Section 3
E. Proofs for Section 4
F. Helper Lemmas
G. Comparison to Aybat et al. (2019)

A. Definitions

Our main assumptions are that each individual function $f_i$ is differentiable, has a finite minimum $f_i^*$, and is $L_i$-smooth, meaning that for all $v$ and $w$,

$$f_i(v) \le f_i(w) + \langle\nabla f_i(w), v - w\rangle + \frac{L_i}{2}\|v - w\|^2, \tag{Individual Smoothness}$$

which also implies that $f$ is $L$-smooth, where $L$ is the maximum smoothness constant of the individual functions. A consequence of smoothness is the following bound on the norm of the stochastic gradients,

$$\|\nabla f_i(w)\|^2 \le 2L\left(f_i(w) - f_i^*\right).$$

We also assume that each $f_i$ is convex, meaning that for all $v$ and $w$,

$$f_i(v) \ge f_i(w) + \langle\nabla f_i(w), v - w\rangle. \tag{Convexity}$$

Depending on the setting, we will also assume that $f$ is $\mu$ strongly-convex, meaning that for all $v$ and $w$,

$$f(v) \ge f(w) + \langle\nabla f(w), v - w\rangle + \frac{\mu}{2}\|v - w\|^2. \tag{Strong Convexity}$$

B. Additional theoretical results

In this section, we relax the strong-convexity assumption to handle broader function classes in Appendix B.1, prove results that give an explicit dependence on the mini-batch size (Appendix B.2), and show in Appendix B.3 that polynomially decreasing step-sizes cannot obtain the desired noise-adaptive rate.

B.1. Relaxing the assumptions

In this section, we extend our theoretical results to a richer class of functions: strongly quasar-convex functions (Hinder et al., 2020) in Appendix B.1.1, and (non-strongly) convex functions in Appendix B.1.2.

B.1.1. Extension to strongly star-convex functions

We consider the class of smooth, non-convex, but strongly star-convex functions (Hinder et al., 2020; Gower et al., 2021), a subset of strongly quasar-convex functions.
Quasar-convex functions are unimodal along lines that pass through a global minimizer, i.e. the function monotonically decreases along the line to the minimizer and monotonically increases thereafter. In addition to this, strongly quasar-convex functions also have curvature near the global minimizer. Importantly, this property is satisfied for neural networks with common architectures and learning problems (Lucas et al., 2021; Kleinberg et al., 2018). Formally, a function is $(\zeta, \mu)$ strongly quasar-convex if it satisfies the following for all $w$ and minimizers $w^*$,

$$f(w^*) \ge f(w) + \frac{1}{\zeta}\langle\nabla f(w), w^* - w\rangle + \frac{\mu}{2}\|w^* - w\|^2. \tag{8}$$

Strongly star-convex functions are a subset of this class of functions with $\zeta = 1$. If $L$ is known, it is straightforward to show that the results of Theorem 1 carry over to strongly star-convex functions, and we obtain the similar $O\left(\exp(-T/\kappa) + \frac{\sigma^2}{T}\right)$ rate. In the case when $L$ is not known, it was recently shown that SGD with a stochastic Polyak step-size (Gower et al., 2021) results in linear convergence to the minimizer on strongly star-convex functions under interpolation and achieves an $O\left(\exp(-T/\kappa) + \gamma_{\max}\sigma^2\right)$ convergence rate in general. The proposed stochastic Polyak step-size (SPS) does not require knowledge of $L$ and matches the rate achieved for strongly-convex functions (Loizou et al., 2021). However, SPS requires knowledge of $f_i^*$, which is usually zero for machine learning models under interpolation but difficult to get a handle on in the general case. Consequently, we continue to use SLS to estimate the smoothness constant. Our proofs only use strong-convexity between $w$ and a minimizer $w^*$, and hence we can extend all our results from strongly-convex functions to structured non-convex functions satisfying the strong star-convexity property, matching the rates in Theorem 2 and Theorem 4. Finally, we note that given knowledge of $\zeta$, there is no fundamental limitation in extending all our results to strongly quasar-convex functions. In the next section, we relax the strong-convexity assumption in a different way, by considering convex functions without curvature.

B.1.2. Handling (non-strongly)-convex functions

In this section, we analyze the behaviour of exponentially decreasing step-sizes on convex functions (without strong-convexity). As a starting point, we assume that $L$ is known, and the algorithm is only required to adapt to the noise $\sigma^2$. In the following theorem (proved in Appendix C.4), we show that SGD with an exponentially decreasing step-size is not guaranteed to converge to the minimizer, but to a neighbourhood of the solution.

Theorem 8. Assuming (i) convexity and (ii) $L_i$-smoothness of each $f_i$, SGD with step-size $\eta_k = \frac{\alpha_k}{2L}$ has the following convergence rate,

$$\mathbb{E}\left[f(\bar{w}_{T+1}) - f(w^*)\right] \le \frac{2L\|w_1 - w^*\|^2}{\sum_{k=1}^{T}\alpha_k} + \frac{\sigma^2\sum_{k=1}^{T}\alpha_k^2}{\sum_{k=1}^{T}\alpha_k}, \tag{9}$$

where $\bar{w}_{T+1} = \frac{\sum_{k=1}^{T}\alpha_k w_k}{\sum_{k=1}^{T}\alpha_k}$. For $\alpha_k = \left(\frac{\beta}{T}\right)^{k/T}$, the convergence rate is given by,

$$\mathbb{E}\left[f(\bar{w}_{T+1}) - f(w^*)\right] \le \frac{2L\ln(T/\beta)\,\|w_1 - w^*\|^2}{\alpha T - 2\beta} + \frac{\sigma^2\left(T - \frac{\beta^2}{T}\right)}{2\left(\alpha T - 2\beta\right)}.$$

We thus see that even with the knowledge of $L$, SGD converges to a neighbourhood of the solution at an $O(1/T)$ rate. We contrast our result with AdaGrad (Duchi et al., 2011; Levy et al., 2018), which adapts the step-sizes as the algorithm progresses (as opposed to using a predetermined sequence of step-sizes as in our case), is able to adapt to the noise, and achieves an $O\left(\frac{1}{T} + \frac{\sigma}{\sqrt{T}}\right)$ rate.
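The averaged iterate $\bar{w}_{T+1}$ in Theorem 8 is cheap to maintain online; the following sketch (ours) accumulates the $\alpha_k$-weighted average while running SGD with $\eta_k = \alpha_k/(2L)$.

```python
import numpy as np

def sgd_convex_averaged(grad_i, w0, n, T, L, beta=1.0, seed=0):
    """SGD with eta_k = alpha_k/(2L); returns the weighted average iterate
    bar{w}_{T+1} = sum_k alpha_k w_k / sum_k alpha_k from Theorem 8."""
    rng = np.random.default_rng(seed)
    alpha = (beta / T) ** (1.0 / T)
    w = np.array(w0, dtype=float)
    w_bar, weight = np.zeros_like(w), 0.0
    for k in range(1, T + 1):
        a_k = alpha**k
        w_bar += a_k * w            # accumulate alpha_k * w_k (pre-update iterate)
        weight += a_k
        w = w - (a_k / (2 * L)) * grad_i(w, rng.integers(n))
    return w_bar / weight
```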
In order to be noise-adaptive and match the AdaGrad rate, we can use Eq. (9) to infer that a sufficient condition is for the $\alpha_k$-sequence to satisfy the following inequalities: (i) $\sum_{k=1}^{T}\alpha_k \ge C_1\,T$ and (ii) $\sum_{k=1}^{T}\alpha_k^2 \le C_2\sqrt{T}$, where $C_1, C_2$ are constants. Unfortunately, in Lemmas 9 and 10, we prove that it is not possible for any polynomially or exponentially-decreasing sequence to satisfy these sufficient conditions. While we do not have a formal lower-bound in the convex case, it seems unlikely that these $\alpha_k$-sequences can result in the desired rate, and we conjecture a possible lower-bound. Finally, we note that, to the best of our knowledge, the only predetermined (non-adaptive) step-size that achieves the AdaGrad rate is $\min\left\{\frac{1}{2L}, \frac{1}{\sigma\sqrt{T}}\right\}$ (Ghadimi and Lan, 2012). We also conjecture a lower-bound showing that there is no predetermined sequence of step-sizes (that does not use knowledge of $\sigma^2$) that is noise-adaptive and can achieve the $O\left(\frac{1}{T} + \frac{\sigma}{\sqrt{T}}\right)$ rate.

B.2. Dependence on the mini-batch size

In this section, we prove two results in order to explicitly model the dependence on the mini-batch size. We denote a mini-batch as $B$, its size as $B \in [1, n]$, and the corresponding mini-batch gradient as $\nabla f_B(w) = \frac{1}{B}\sum_{i \in B}\nabla f_i(w)$. The mini-batch gradient is also unbiased, i.e. $\mathbb{E}_B[\nabla f_B(w)] = \nabla f(w)$, implying that all the proofs remain unchanged, but we need to use a different growth condition for the ASGD proofs in Section 4 and a different definition of $\sigma$ for the SGD proofs in Section 3. We refine these quantities here and show the explicit dependence on the mini-batch size.

Lemma 1. If $\mathbb{E}_i\|\nabla f_i(w)\|^2 \le \rho\,\|\nabla f(w)\|^2 + \sigma^2$, then

$$\mathbb{E}_B\|\nabla f_B(w)\|^2 \le \left((\rho - 1)\,\frac{n - B}{(n-1)\,B} + 1\right)\|\nabla f(w)\|^2 + \frac{n - B}{(n-1)\,B}\,\sigma^2.$$

Proof.

$$\mathbb{E}_B\|\nabla f_B(w)\|^2 = \mathbb{E}_B\|\nabla f_B(w) - \nabla f(w) + \nabla f(w)\|^2 = \mathbb{E}_B\|\nabla f_B(w) - \nabla f(w)\|^2 + \|\nabla f(w)\|^2,$$

since $\mathbb{E}_B[\nabla f_B(w)] = \nabla f(w)$. Since we are sampling the batch without replacement, using (Lohr, 2019),

$$\mathbb{E}_B\|\nabla f_B(w) - \nabla f(w)\|^2 = \frac{n - B}{(n-1)\,B}\left(\mathbb{E}_i\|\nabla f_i(w)\|^2 - \|\nabla f(w)\|^2\right) \le \frac{n - B}{(n-1)\,B}\left((\rho - 1)\|\nabla f(w)\|^2 + \sigma^2\right),$$

using the growth condition. Combining the two relations,

$$\mathbb{E}_B\|\nabla f_B(w)\|^2 \le \left((\rho - 1)\,\frac{n - B}{(n-1)\,B} + 1\right)\|\nabla f(w)\|^2 + \frac{n - B}{(n-1)\,B}\,\sigma^2.$$

Lemma 2. If $\sigma^2 := \mathbb{E}_i[f_i(w^*) - f_i^*]$ and each function $f_i$ is $\mu$ strongly-convex and $L$-smooth, then

$$\sigma_B^2 := \mathbb{E}_B\left[f_B(w^*) - f_B^*\right] \le \frac{L}{\mu}\,\frac{n - B}{(n-1)\,B}\,\sigma^2.$$

Proof. By strong-convexity of $f_B$,

$$\mathbb{E}_B\left[f_B(w^*) - f_B^*\right] \le \frac{1}{2\mu}\,\mathbb{E}_B\|\nabla f_B(w^*)\|^2.$$

Since we are sampling the batch without replacement, using (Lohr, 2019),

$$\mathbb{E}_B\|\nabla f_B(w^*)\|^2 = \frac{n - B}{(n-1)\,B}\,\mathbb{E}_i\|\nabla f_i(w^*)\|^2 \le \frac{n - B}{(n-1)\,B}\,2L\,\mathbb{E}_i\left[f_i(w^*) - f_i^*\right],$$

by smoothness of $f_i$, which completes the proof.
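As a quick numerical sanity check of the $(n-B)/((n-1)B)$ factor (our script, with random vectors standing in for the per-example gradients):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, B, trials = 50, 3, 10, 20_000
G = rng.standard_normal((n, d))             # stand-ins for gradients grad f_i(w)
g = G.mean(axis=0)                          # full gradient grad f(w)
per_example = np.mean(np.sum((G - g) ** 2, axis=1))
batch = [np.sum((G[rng.choice(n, B, replace=False)].mean(axis=0) - g) ** 2)
         for _ in range(trials)]
# Without-replacement sampling: E_B ||grad f_B - grad f||^2 equals
# (n - B)/((n - 1) B) times the per-example variance; the two printed
# values should agree up to Monte Carlo error.
print(np.mean(batch), (n - B) / ((n - 1) * B) * per_example)
```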
The GD iteration with ηk = 1 L 1 (k+1)δ reads wk+1 = wk 1 L 1 (k + 1)δ x2wk xy = wk 1 1 (k + 1)δ x 1 (k + 1)δ = wk 1 1 (k + 1)δ + w 1 (k + 1)δ wk+1 w = (wk w ) 1 1 (k + 1)δ w T +1 w = (w1 w ) 1 1 (k + 1)δ w T +1 w = (w1 w ) k k + 1 = (w1 w ) 1 T + 1 If 0 < δ < 1 and w1 w > 0, w T +1 w = (w1 w ) 1 1 (k + 1)δ 1 2 2(k + 1)δ We wish to use the inequality 1 2x 2 2 2x which is true for all x [0, 1/2]. In our case it holds for 1 (k + 1)δ 1 Let k0 = 21/δ . Then for T k0, w T +1 w = (w1 w ) 1 1 (k + 1)δ 1 2 2(k + 1)δ Towards Noise-adaptive, Problem-adaptive (Accelerated) SGD Now, for k k0 1, we have that 1 (k+1)δ 1 2δ and thus 1 1 (k + 1)δ For k k0, we have 1 2 2(k+1)δ 2 2 1 (k+1)δ and thus 1 2 2(k + 1)δ 2 2 PT k=k0 1 (k+1)δ = 2 2 PT +1 k=1 1 kδ Pk0 k=1 1 kδ 2 2 PT +1 k=1 1 kδ Using the bound in the proof of Lemma 9, we have 1 kδ 1 + 1 1 δ (T + 1)1 δ 1 Putting this together we have that 2 2 PT +1 k=1 1 kδ 2 2(1+ 1 1 δ((T +1)1 δ 1)) = 41/(1 δ) 4 4 (T +1)1 δ 1 δ = 4 2δ 1 1 δ 4 (T +1)1 δ Putting everything together we get that w T +1 w (w1 w ) 1 1 21/δ 1 4 2δ 1 1 δ 4 (T +1)1 δ The next lemma shows that when δ = 0, namely when the step-size is constant, SGD applied to the sum of two quadratics fails to converge to the minimizer. Lemma 4. When using SGD to minimize the sum f(w) = f1(w)+f2(w) 2 of two one-dimensional quadratics: f1(w) = 1 2(w 1)2 and f2(w) = 1 2(2w + 1/2)2 with a constant step-size η = 1 L, the following holds: whenever |wk w | < 1/8, the next iterate satisfies |wk+1 w | > 1/8. Proof. First observe that w = 0 and that L = 4. The updates then read If ik = 1 : wk+1 = wk η(wk 1) = wk(1 1 If ik = 2 : wk+1 = wk η2(2wk + 1 2) = wk(1 4 Suppose that |wk w | =|wk| < 1/8. We want to show that |wk+1| > 1/8. We can separate the analyses in three cases. If wk ( 1/8, 0) and ik = 1 then If wk (0, 1/8) and ik = 1 then If ik = 2 then Towards Noise-adaptive, Problem-adaptive (Accelerated) SGD implying that in each case, |wk+1| > 1/8. C. Upper-bound Proofs for Section 3 C.1. Proof of Theorem 1 Theorem 1. Assuming (i) convexity and Li-smoothness of each fi, (ii) µ strong-convexity of f, SGD (Eq. (2)) with γk = 1 L, αk = β T k/T converges as, E w T +1 w 2 w1 w 2 c2 exp T κ α ln(T/β) µe2 (ln(T/β))2 where c2 = exp 1 κ 2β ln(T/β) . wk+1 w 2 = wk ηk fik(wk) w 2 = wk w 2 2ηk fik(wk), wk w + η2 k fik(wk) 2 = wk w 2 2γkαk fik(wk), wk w + γ2 kα2 k fik(wk) 2 wk+1 w 2 wk w 2 2γkαk fik(wk), wk w + γ2 kα2 k 2L[fik(wk) f ik] (Smoothness) Lαk fik(wk), wk w + 2 Lα2 k [fik(wk) fik(w )] + 2 Lα2 k [fik(w ) f ik] (Since γk = 1/L.) Taking expectation w.r.t ik, E wk+1 w 2 E wk w 2 2 Lαk f(wk), wk w + 2 Lα2 k [f(wk) f(w )] + 2 Lαk f(wk), wk w + 2 Lαk [f(wk) f(w )] + 2 Lα2 kσ2 (Since αk 1) E wk+1 w 2 1 µαk E wk w 2 + 2 Lα2 kσ2 (By µ-strong convexity of f) Unrolling the recursion starting from w1 and using the exponential step-sizes, E w T +1 w 2 w1 w 2 T Y i=k+1 α2k 1 µαi Writing k = E wk w 2 T +1 1 exp µ k=1 α2k exp µ | {z } :=Bt Towards Noise-adaptive, Problem-adaptive (Accelerated) SGD Using Lemma 5 to lower-bound A, we obtain A αT ln(T/β) 2β ln(T/β). The first term in the above expression can then be bounded as, LA = 1 c2 exp T κ α ln(T/β) where κ = L µ and c2 = exp 1 κ 2β ln(T/β) . Using Lemma 6 to upper-bound Bt, we obtain Bt 4κ2c2(ln(T/β))2 e2α2T , thus bounding the second term. Putting everything together, T +1 1 c2 exp T κ α ln(T/β) Le2 (ln(T/β))2 C.2. Proof of Theorem 2 Theorem 2. Under the same assumptions as Theorem 1, SGD (Eq. (2)) with αk = β T k/T , γk as the largest step-size that satisfies γk γmax and Eq. 
(3) with c = 1/2 converges as, E w T +1 w 2 w1 w 2 c1 exp T κ α ln(T/β) + 8σ2c1(κ )2γmax e2 (ln(T/β))2 + 2σ2c1κ ln(T/β) γerr where γerr := γmax min γmax, 1 κ := max n L µ , 1 µγmax o , c1 = exp 1 κ 2β ln(T/β) . wk+1 w 2 wk w 2 2γkαk fik(wk), wk w + γkα2 k fik(wk) f ik c (By Lemma 8) Setting c = 1/2, = wk w 2 2γkαk fik(wk), wk w + 2γkα2 k [fik(wk) f ik] = wk w 2 2γkαk fik(wk), wk w + 2γkα2 k [fik(wk) fik(w )] + 2γkα2 k [fik(w ) f ik] Adding, subtracting 2γkαk[fik(wk) fik(w )], = wk w 2 + 2γkαk [ fik(wk), wk w + [fik(wk) fik(w )]] 2γkαk[fik(wk) fik(w )] + 2γkα2 k [fik(wk) fik(w )] + 2γkα2 k [fik(w ) f ik] wk w 2 + 2γminαk [ fik(wk), wk w + [fik(wk) fik(w )]] 2γk(αk α2 k)[fik(wk) fik(w )] + 2γmaxα2 k [fik(w ) f ik] where we used convexity of fik to ensure that fik(wk), wk w + [fik(wk) fik(w )] 0. Taking expectation, E wk+1 w 2 wk w 2 + 2γminαk [ f(wk), wk w + [f(wk) f(w )]] Towards Noise-adaptive, Problem-adaptive (Accelerated) SGD (αk α2 k)E [2γk[fik(wk) fik(w )]] + 2γmaxα2 kσ2 E wk w 2 (1 αkγminµ) wk w 2 (αk α2 k)E [2γk[fik(wk) fik(w )]] + 2γmaxα2 kσ2 Since αk 1, and αk α2 k 0, let us analyze E[[γk[fik(wk) fik(w )]]. E[[γk[fik(wk) fik(w )]] = E[[γk[fik(wk) f ik]] E[[γk[f ik fik(w )]] E[[γmin[fik(wk) f ik]] E[[γmax[f ik fik(w )]] (γk γmax) = E[[γmin[fik(wk) f ik]] + γmaxσ2 = E[[γmin[fik(wk) fik(w )]] E[[γmin[fik(w ) f ik]] + γmaxσ2 = γmin[f(wk) f(w )] γminσ2 + γmaxσ2 (γmax γmin)σ2 Putting this relation back, E wk w 2 (1 αkγminµ) wk w 2 + 2(αk α2 k) (γmax γmin)σ2 + 2γmaxα2 kσ2 (1 αkγminµ) wk w 2 + 2αk (γmax γmin)σ2 + 2γmaxα2 kσ2. Setting κ = max{ L µ , 1 µγmax } we get that 1 αkγminµ 1 1 κ . Writing k = E wk w 2 and unrolling the recursion we get 1 + 2γmaxσ2 T X k=1 α2k T Y κ αi + 2 σ2 T X k=1 αk(γmax γmin) + 2γmaxσ2 T X k=1 α2k exp 1 | {z } :=Bt + 2 σ2 (γmax γmin) k=1 αk exp 1 | {z } :=Ct Using Lemma 5 to lower-bound A, we obtain A αT ln(T/β) 2β ln(T/β). The first term in the above expression can then be bounded as, κ A 1 c1 exp T κ α ln(T/β) where c1 = exp 1 κ 2β ln(T/β) . Using Lemma 6 to upper-bound Bt, we obtain Bt 4(κ )2c1(ln(T/β))2 e2α2T , thus bounding the second term. Using Lemma 7 to upper-bound Ct, we obtain Ct c1 κ ln(T/β) eα , thus bounding the third term. Finally, by Lemma 8 we have that γmin min γmax, 1 Putting everything together, T +1 1 c1 exp T κ α ln(T/β) + 8σ2c1(κ )2γmax e2 (ln(T/β))2 α2T + 2c1σ2κ ln(T/β) γmax min γmax, 1 Towards Noise-adaptive, Problem-adaptive (Accelerated) SGD C.3. Proof of Theorem 4 Theorem 4. Under the same assumptions as Theorem 1, SGD (Eq. (2)) with αk = β T k/T , γk = ν L converges as, T +1 1 c2 exp min{ν, 1} T κ α ln(T/β) + max{ν2, 1}8c2κ ln(T/β) µ e2 α2 T 2σ2 ln(T/β) + G [ln(ν)]+ where c2 = exp 1 κ 2β ln(T/β) , [x]+ = max{x, 0}, k0 = T [ln(ν)]+ ln(T/β) , G = maxj [k0]{f(wj) f } and k := wk w 2. Proof. Following the steps from the proof of Theorem 1, wk+1 w 2 wk w 2 2γkαk fik(wk), wk w + 2Lγ2 kα2 k [fik(wk) fik(w )] + 2Lγ2 kα2 k [fik(w ) f ik] Taking expectation wrt ik, and since both γk and αk are independent of ik, E wk+1 w 2 wk w 2 2γkαk f(wk), wk w + 2Lγ2 kα2 k [f(wk) f ] + 2Lγ2 kα2 k σ2 E wk+1 w 2 (1 µγkαk) wk w 2 + 2Lγ2 kα2 k σ2 + [f(wk) f ] (2Lγ2 kα2 k 2γkαk) (By strong convexity) Let us separately consider the ν 1 and ν > 1 case. For the ν 1 case, (2Lγ2 kα2 k 2γkαk) = 2ν2α2 k L 2ναk L = 0. 
Hence, the above equation can be simplified as: E wk+1 w 2 1 µναk wk w 2 + 2ν2α2 k L σ2 Proceeding in the same way as the proof of Theorem 1, define k = E wk w 2 and unroll the recursion, L αk) + 2ν2σ2 i=k+1 (1 µν Bounding the first term similar to Lemma 5, L αk) exp µν κ α ln (T/β) κ 2β ln (T/β) Bounding the second term similar to Lemma 6, i=k+1 (1 µν k=1 α2 k exp k=1 α2 k exp ν κ αk+1 αT +1 Towards Noise-adaptive, Problem-adaptive (Accelerated) SGD = exp ναT +1 k=1 α2 k exp ναk+1 = exp ναT +1 ν2e2α2 ln(T/β)2 κ 2β ln(T/β) ν2e2α2 ln(T/β)2 Putting everything together, we obtain that, T +1 1 exp νT κ α ln (T/β) κ 2β ln (T/β) κ 2β ln(T/β) e2α2 ln(T/β)2 = T +1 1c2 exp νT κ α ln (T/β) e2α2 ln(T/β)2 T (Since ν 1) For the ν > 1 case, E wk+1 w 2 1 µναk wk w 2 + 2ν2α2 k σ2 L + [f(wk) f ] (2Lγ2 kα2 k 2γkαk) wk w 2 + 2ν2α2 k σ2 L + [f(wk) f ] 2ν2α2 k L 2ναk (Since ν > 1) For the last term to be negative, we require αk 1 ν . By definition of αk, this will happen after k k0 := T ln(ν) ln(T/β) iterations. However, until k0 iterations, we observe that (2Lγ2 kα2 k 2γkαk) 2ν (ν 1) For the k < k0 regime, E wk+1 w 2 1 µαk wk w 2 + 2Lγ2 kα2 k σ2 + max j [k0]{f(wj) f } 2ν (ν 1) Writing k = E wk w 2, and unrolling the recursion for the first k0 iterations we get L σ2 + max j [k0]{f(wj) f }2ν(ν 1) L | {z } :=c5 Bounding the first term similar to Lemma 5, Bounding the second term similar to Lemma 6, k=1 α2 k exp Towards Noise-adaptive, Problem-adaptive (Accelerated) SGD k=1 α2 k exp 1 k=1 α2 k exp αk+1 e2α2 k0 ln(T/β)2 Putting everything together, we obtain, + c5 exp αk0 e2α2 k0 ln(T/β)2 Now let us consider the regime k k0 where αk 1 ν , so that we have E wk+1 w 2 1 µαk wk w 2 + 2ν2σ2 Writing k = E wk w 2, and unrolling the recursion from k = k0 to T, Lαk) + 2ν2σ2 Bounding the first term similar to Lemma 5, L αk0 αT +1 Bounding the second term similar to Lemma 6, k=k0 α2 k exp k=k0 α2 k exp 1 κ αk+1 αT +1 = exp αT +1 k=k0 α2 k exp αk+1 = exp αT +1 e2α2 (T k0 + 1) e2α2 (T k0 + 1) ln(T/β)2 Towards Noise-adaptive, Problem-adaptive (Accelerated) SGD Putting everything together, T +1 k0 exp µ L αk0 αT +1 L exp αT +1 e2α2 (T k0 + 1) ln(T/β)2 Combining the above bounds for the two regimes, we get, L αk0 αT +1 + c5 exp µαk0 e2α2 k0 ln(T/β)2 L exp αT +1 e2α2 (T k0 + 1) ln(T/β)2 2e2α2 k0 ln(T/β)2 L exp αT +1 e2α2 (T k0 + 1) ln(T/β)2 Using Lemma 5 to bound the first term, and noting that αT +1 1 α 2β ln(T/β) T +1 1 c2 exp T κ α ln(T/β) e2α2 k0 ln(T/β)2 T 2 + 2ν2σ2 e2α2 (T k0 + 1) ln(T/β)2 where κ = L µ and c2 = exp 1 κ 2β ln(T/β) . Putting in the value of c5 and k0, and rearranging, we get T +1 1 c2 exp T κ α ln(T/β) L T 4c2κ2 ln(T/β)2 + max j [k0]{f(wj) f }2ν(ν 1) ν2e2α2 [ln(ν)]+ ln(T/β) Combining the statements from ν 1 and ν > 1 gives us the theorem statement. Towards Noise-adaptive, Problem-adaptive (Accelerated) SGD C.4. Proof of Theorem 8 Theorem 8. Assuming (i) convexity and (ii) Li-smoothness of each fi, SGD with step-size ηk = 1 2L αk has the following convergence rate, E[f( w T +1) f(w )] 2L w1 w 2 PT k=1 αk + σ2 PT k=1 α2 k PT k=1 αk (9) where w T +1 = PT k=1 αkwk PT k=1 αk . For αk = h β T ik/T , the convergence rate is given by, E[f( w T +1) f(w )] 2L ln(T/β) w1 w 2 αT 2β + σ2 T T β Proof. Following the proof of Theorem 1, wk+1 w 2 wk w 2 2γkαk fik(wk), wk w + 2Lγ2 kα2 k [fik(wk) fik(w )] Lγ2 kα2 k [fik(wk) f ik] wk+1 w 2 wk w 2 αk L fik(wk), wk w + α2 k 2L [fik(wk) fik(w )] + α2 k 2L [fik(wk) f ik] (γk = 1 2L for all k.) 
L [fik(wk) fik(w )] + α2 k 2L [fik(wk) fik(w )] + α2 k 2L [fik(wk) f ik] (By convexity) Taking expectation, E wk+1 w 2 wk w 2 αk L [f(wk) f(w )] + α2 k 2L [f(wk) f(w )] + α2 k 2Lσ2 2L[f(wk) f(w )] + α2 k 2Lσ2 (Since f(wk) f(w ) 0 and αk 1) Rearranging and summing from k = 1 to T, k=1 αk[f(wk) f(w )] 2L w1 w 2 + σ2 T X By averaging and using Jensen. Denote w T +1 = PT k=1 αkwk PT k=1 αk , E[f( w T +1) f(w )] 2L w1 w 2 PT k=1 αk + σ2 PT k=1 α2 k PT k=1 αk Next, we bound PT k=1 αk and PT k=1 α2 k for the exponentially-decreasing αk sequence, when αk = h β T ik/T . From Lemma 5, we know that, k=1 αk αT ln(T/β) 2β ln(T/β). Towards Noise-adaptive, Problem-adaptive (Accelerated) SGD Bounding the ratio PT k=1 α2 k PT k=1 αk = PT k=1 α2k PT k=1 αk where α = h β T i1/T , PT k=1 α2k PT k=1 αk α2 1 α2 1 α α αT +1 = α 1 + α 1 1 αT 1 1 αT = T T β Putting everything together, E[f( w T +1) f(w )] 2L ln(T/β) w1 w 2 αT 2β + σ2 T T β C.5. Additional lemmas for upper-bound proofs t=1 αt αT ln(T/β) 2β ln(T/β) t=1 αt = α αT +1 1 α = α 1 α αT +1 1 α = αβ T(1 α) = β T 2 ln(1/α) = β T 2 1 T ln(T/β) = 2β ln(T/β) (10) where in the inequality we used Lemma 16 and the fact that 1/α > 1. Plugging back into A we get, A α 1 α 2β ln(T/β) α ln(1/α) 2β ln(T/β) (1 x ln( 1 = αT ln(T/β) 2β ln(T/β) Lemma 6. For α = β T 1/T and any κ > 0, k=1 α2k exp 4κ2c2(ln(T/β))2 where c2 = exp 1 κ 2β ln(T/β Towards Noise-adaptive, Problem-adaptive (Accelerated) SGD Proof. First observe that, i=k+1 αi = αk+1 αT +1 1 α = αβ T(1 α) = β T 2 ln(1/α) = β T 2 1 T ln(T/β) = 2β ln(T/β) where in the inequality we used 16 and the fact that 1/α > 1. These relations imply that, i=k+1 αi αk+1 1 α 2β ln(T/β) κ 2β ln(T/β) We then have k=1 α2k exp k=1 α2k exp 1 k=1 α2k 2(1 α)κ 2 (Lemma 17) e2α2 T(1 α)2 e2α2 T(ln(1/α))2 = 4κ2c2(ln(T/β))2 Lemma 7. For α = β T 1/T and any κ > 0, c2 κ ln(T/β) for c2 = exp 1 κ 2β ln(T/β) Proof. Proceeding in the same way as Lemma 6, we obtain the following inequality, k=1 αk exp 1 Further bounding this term, k=1 αk (1 α)κ eαk+1 (Lemma 17) Towards Noise-adaptive, Problem-adaptive (Accelerated) SGD c2 ln(1/α)κT = c2 κ ln(T/β) Lemma 8. If fi is Li-smooth, stochastic lines-searches ensures that γ fi(w) 2 1 c (fi(w) f i ), and min γmax, 2 (1 c) Moreover, if fi is a one-dimensional quadratic, γ = min γmax, 2 (1 c) Proof. Recall that if fi is Li-smooth, then for an arbitrary direction d, fi(w d) fi(w) fi(w), d + Li For the stochastic line-search, d = γ fi(w). The smoothness and the line-search condition are then Smoothness: fi(w γ fi(w)) fi(w) Li 2 γ2 γ fi(w) 2 , Line-search: fi(w γ fi(w)) fi(w) cγ fi(w) 2 . The line-search condition is looser than smoothness if Li 2 γ2 γ fi(w) 2 cγ fi(w) 2 . The inequality is satisfied for any γ [a, b], where a, b are values of γ that satisfy the equation with equality, a = 0, b = 2(1 c)/Li, and the line-search condition holds for γ 2(1 c)/Li. As the line-search selects the largest feasible step-size, γ 2(1 c)/Li. If the step-size is capped at γmax, we have η min{γmax, 2(1 c)/Li}, and the proof for the stochastic line-search is complete. From the previous discussion, observe that if γ > 2(1 c) Li , then we have 2 γ2 γ fi(w) 2 > cγ fi(w) 2 . If f is a one-dimensional quadratic, the smoothness inequality is actually an equality, and thus fi(w γ fi(w)) fi(w) = Li 2 γ2 γ fi(w) 2 So if γ > 2(1 c) fi(w γ fi(w)) fi(w) cγ fi(w) 2 and the line-search condition does not hold. This implies that for one-dimensional quadratics γ = min{γmax, 2(1 c) Towards Noise-adaptive, Problem-adaptive (Accelerated) SGD D. 
Lower-bound proofs for Section 3 D.1. Proof of Theorem 3 Theorem 3. When using T iterations of SGD to minimize the sum f(w) = f1(w)+f2(w) 2 of two onedimensional quadratics, f1(w) = 1 2(w 1)2 and f2(w) = 1 2 (2w + 1/2)2, setting γk using SLS with γmax 1 and c 1/2, any convergent sequence of αk results in convergence to a neighbourhood of the solution. Specifically, if w is the minimizer of f and w1 > 0, then, E(w T w ) min w1, 3 Proof. For SLS with a general c 1/2 on quadratics, we know that γk = 2(1 c) Lik (see Lemma 8 for a formal proof). Recall that we consider two one-dimensional quadratics fi(w) = 1 2(wxi yi)2 for i {1, 2} such that x1 = 1, y1 = 1, x2 = 2, y2 = 1 2. Specifically, 2(w 1)2 L1 = 1 4(w 1)2 + 1 wk+1 = wk αk2(1 c)(wk 1) = 2(1 c)αk + (1 2(1 c)αk)wk wk+1 = wk 2(1 c)αk 2 4(2wk + 1 2) = (1 2(1 c)αk)wk 1 Ewk+1 = (1 2(1 c)αk)wk + 1 22(1 c)αk 1 82(1 c)αk = (1 2(1 c)αk)wk + 3 Ew T = E(w T w ) = (w1 w ) k=1 (1 2(1 c)αk) + 3 k=1 2(1 c)αk i=k+1 (1 2(1 c)αi) Using Lemma 18 and the fact that 2(1 c)αk 1 for all k, we have that if w1 w = w1 > 0, E(w T w ) min w1, 3 Towards Noise-adaptive, Problem-adaptive (Accelerated) SGD D.2. Proof of Theorem 5 Theorem 5. When minimizing a one-dimensional quadratic function f(w) = 1 2(xw y)2, GD with αk = β T k/T , γk = ν L for ν > 3, satisfies wk+1 w = (w1 w ) i=1 (1 ναi). After k := T ln(T/β) ln ν 3 iterations, we have that |wk +1 w | 2k |w1 w |. Proof. One has w = y x and L = x2. Therefore wk+1 w = wk w αkηkx (xwk y) = wk w αk ν LLwk + αk ν Lxy = wk w αkνwk + αkρw = (1 ναk)(wk w ) Iterating gives the first part of the result. Now, for k k , we have 1 ναk 1 ναk 1 να T ln(T/β) (ln ν ln 3) = 1 ν β 1 ln(T/β) (ln ν ln 3) = 1 ν 3 |wk +1 w | =|w1 w | i=1 |1 ναk| |w1 w |2k D.3. Lemmas for convex setting Lemma 9. The polynomial stepsize defined as αk = (1/k)δ for some 0 δ 1 cannot satisfy PT k=1 αk C1T and PT k=1 α2 k C2 T for positive constants C1 and C2. Proof. If δ = 0, αk = 1 for all k, and then PT k=1 α2 k = T. If δ = 1, then PT k=1 αk = Θ(ln T). If 0 < δ < 1, basic calculus shows that 1 kδ 1 + Z T 1 1 δ (T + 1)1 δ 1 1 kδ 1 + 1 1 δ T 1 δ 1 which shows that PT k=1 αk = Θ(T 1 δ), and thus we cannot have PT k=1 αk C1T for all T. Towards Noise-adaptive, Problem-adaptive (Accelerated) SGD Lemma 10. The exponential stepsize defined as αk = αk for some α < 1 cannot satisfy PT k=1 αk C1T and PT k=1 α2 k C2 T for positive constants C1 and C2. Proof. Suppose by contradiction that the exponential stepsize satisfies the two conditions. Then k=1 α2k 1 = By assumption, P2T k=1 αk C12T and PT k=1 α2k C2 T. Therefore k=1 α2k 2C1T 1 But then we obtain which is a contradiction by taking T to infinity. Towards Noise-adaptive, Problem-adaptive (Accelerated) SGD E. Proofs for Section 4 E.1. Reformulation Let us consider a general ASGD update whose parameters satisfy the following conditions. r2 k = (1 rk)r2 k 1 ηk ηk 1 + rkµηk. (11) bk = (1 rk 1)rk 1 ηk ηk 1 rk + r2 k 1 ηk ηk 1 , (12) It can be verified that setting ηk = γkαk = 1 ρL β T k/T , rk = q µ ρL β T k/2T satisfies Eq. (11). We first show that the update in Eq. (4)-Eq. (5) satisfying the conditions in Eq. (12) and Eq. (11) can be written in an equivalent form more amenable to the analysis. Lemma 11. The following update: yk = wk rkqk qk + rkµ(wk zk) (13) wk+1 = yk ηk fik(yk) (14) zk+1 = wk + 1 rk [wk+1 wk] (15) qk+1 = (1 rk)qk + rkµ (16) r2 k = qk+1ηk (17) zk+1 = 1 qk+1 [(1 rk)qkzk + rkµyk rk fik(yk)] (18) is equivalent to the update in Eq. (4)-Eq. (5). First we check the consistency of the update (Eq. 
(15)) and definition (Eq. (18)) of zk. Using Eq. (18), zk+1 = 1 qk+1 [(1 rk)qkzk + rkµyk rk fik(yk)] rk wk rk qk+1 fik(yk) + yk (1 rk)(qk + rkµ) qk+1rk + rkµ rk wk rk qk+1 fik(yk) + yk qk(1 rk) + (rkµ r2 kµ) qk+1rk + r2 kµ qk+1rk rk wk rk qk+1 fik(yk) + yk (qk+1 rkµ) + (rkµ r2 kµ) + r2 kµ qk+1rk (From Eq. (16)) rk [yk ηk fik(yk)] (From Eq. (17)) zk+1 = wk + 1 rk [wk+1 wk] (From Eq. (14)) which recovers Eq. (15) showing that the definition of zk and its update is consistent. Towards Noise-adaptive, Problem-adaptive (Accelerated) SGD Now we check the equivalence of Eq. (11) and Eq. (16)-Eq. (17). Eliminating qk using Eq. (16)-Eq. (17), r2 k ηk = (1 rk)r2 k 1 ηk 1 + rkµ Multiplying by ηk recovers Eq. (11). Since Eq. (5) and Eq. (14) are equivalent, we need to establish the equivalence of Eq. (4) and the updates in Eq. (13)- Eq. (15). From Eq. (15) zk = wk 1 + 1 rk 1 [wk wk 1] = zk wk = 1 rk 1 rk 1 (wk wk 1) Starting from Eq. (13) and using the above relation to eliminate zk, yk = wk + rkqk qk + rkµ 1 rk 1 rk 1 [wk wk 1] which is in the same form as Eq. (4). We now eliminate qk from rkqk qk+rkµ 1 rk 1 rk 1 . From Eq. (16) and Eq. (17), r2 k ηk = (1 rk)qk + rkµ = qk + rkµ = r2 k ηk + rkqk Using this relation, rkqk qk + rkµ 1 rk 1 rk 1 = qkηk rk + qkηk Using Eq. (17), observe that ηkqk = ηk ηk 1 ηk 1qk = ηk ηk 1 r2 k 1. Using this relation, rkqk qk + rkµ 1 rk 1 ηk ηk 1 r2 k 1 rk + ηk ηk 1 r2 k 1 rk 1 = (1 rk 1)rk 1 ηk + r2 k 1 = bk which establishes the equivalence to Eq. (4) and completes the proof. E.2. Estimating sequences Similar to (Nesterov, 2004; Mishkin, 2020), we will use the estimating sequence {φk, λk} k=1 such that λk (0, 1) and λ1 = 1 ; λk+1 = (1 rk)λk, (19) φk(w) = [inf w φk(w)] + qk 2 w zk 2 , (20) and satisfies the following update condition φk(w) (1 λk)f(w) + λkφ1(w) (21) Towards Noise-adaptive, Problem-adaptive (Accelerated) SGD The above definitions impose the following update for φ k := [infw φk(w)], φ k+1 = (1 rk)φ k + rk f(yk) rk 2qk+1 f(yk) 2 + (1 rk) qk 2 yk zk 2 + f(yk), zk yk (22) Finally note that the definition of φk can be used to rewrite Eq. (13) as yk = wk rk qk + rkµ φk(wk). (23) E.3. Proof of Theorem 6 Given the definitions in Appendix E.2, we first prove the descent lemma for ηk = 1 ρLαk, where αk 1 is the exponentially decreasing step-size. Lemma 12. Using the update in Eq. (14) with ηk = 1 ρLαk, we obtain the following descent condition. E[f(wk+1)] E[f(yk)] ηk 2 f(yk) 2 + 1 2ρ2Lα2 kσ2 Proof. By smoothness, and the update in Eq. (14), f(wk+1) f(yk) ηk f(yk), fik(yk) + L 2 η2 k fik(yk) 2 Taking expectation w.r.t. ik, E[f(wk+1)] E[f(yk)] ηk f(yk) 2 + L 2 η2 k E[ fik(yk) 2] (ηk is independent of ik.) E[f(yk)] ηk f(yk) 2 + ρL 2 η2 k E[ f(yk) 2] + L 2 η2 kσ2 (From Eq. (6)) = E[f(yk)] ηk f(yk) 2 + ηkαk 2 E[ f(yk) 2] + 1 2ρ2Lα2 kσ2 E[f(yk)] ηk 2 f(yk) 2 + 1 2ρ2Lα2 kσ2 The main part of the proof is to show that φ k is an upper-bound on f(wk) (upto a factor governed by the noise term Nk depending on σ2) for all k and is proved in the following lemma. Lemma 13. For the estimating sequences defined in Appendix E.2 and the updates in Eq. (13)-Eq. (18), for all k, E[φ k] := E[inf w φk(w)] E[f(wk)] Nk where Nk := σ2 2ρ2L Pk j=1 α2 j Qk i=j+1(1 ri). Proof. We will prove the lemma by induction. For k = 1, we define φ 1 = f(w1), and since Nk 0 for all k, E[φ 1] f(w1) N1, thus satisfying the base-case for the induction. For the induction, we will use the fact that Nk+1 = (1 rk)Nk + 2σ2 Assuming the induction hypothesis, E[φ k] E[f(wk)] Nk, we use Eq. 
(22) to prove the statement for k + 1 as follows. Taking expectations w.r.t to the randomness in j = 1 to k, E[φ k+1] = (1 rk)E[φ k] + rk E f(yk) rk 2qk+1 f(yk) 2 + (1 rk) qk 2 yk zk 2 + f(yk), zk yk Towards Noise-adaptive, Problem-adaptive (Accelerated) SGD (1 rk)E[f(wk) Nk] + rk E f(yk) rk 2qk+1 f(yk) 2 + (1 rk) qk 2 yk zk 2 + f(yk), zk yk (by the induction hypothesis) = (1 rk)E[f(wk)] + rk E[f(yk)] r2 k 2qk+1 E f(yk) 2 + rk(1 rk) qk 2 yk zk 2 + f(yk), zk yk (1 rk)Nk = (1 rk)E[f(wk)] + rk E[f(yk)] ηk 2 E f(yk) 2 + rk(1 rk) qk 2 yk zk 2 + f(yk), zk yk (1 rk)Nk (Using Eq. (17)) By convexity, f(wk) f(yk) + f(yk), wk yk , (1 rk)E[f(yk) + f(yk), wk yk ] + rk E[f(yk)] ηk 2 E f(yk) 2 + rk(1 rk) qk 2 yk zk 2 + f(yk), zk yk (1 rk)Nk = E h f(yk) ηk 2 f(yk) 2i + rk(1 rk) qk 2 yk zk 2 + f(yk), zk yk + (1 rk)E[ f(yk), wk yk ] (1 rk)Nk By Lemma 12, E f(wk+1) 1 2ρ2Lα2 kσ2 + rk(1 rk) qk 2 yk zk 2 + f(yk), zk yk + (1 rk)E[ f(yk), wk yk ] (1 rk)Nk = E [f(wk+1)] + rk(1 rk) qk 2 yk zk 2 + f(yk), zk yk + (1 rk)E[ f(yk), wk yk ] (1 rk)Nk + 1 2ρ2Lα2 kσ2 Since Nk+1 = h (1 rk)Nk + 1 2ρ2Lα2 kσ2i , E[φ k+1] E [f(wk+1)] Nk+1 + (1 rk)E[ f(yk), wk yk ] + rk(1 rk) qk 2 yk zk 2 + f(yk), zk yk Now we show that (1 rk)E h f(yk), wk yk + rk qk µ 2 yk zk 2 + f(yk), zk yk i 0. For this, we use Eq. (13) yk = wk qkrk qk + rkµ(wk zk) = zk yk = zk wk + qkrk qk + rkµ(wk zk) = 1 qkrk qk + rkµ = qk(1 rk) + rkµ (zk wk) = qk+1 qk + rkµ (zk wk) (By Eq. (16)) Towards Noise-adaptive, Problem-adaptive (Accelerated) SGD qk+1 f(yk), zk yk = f(yk), rk qk qk + rkµ (wk zk) = f(yk), yk wk Using this relation to simplify, (1 rk)E f(yk), wk yk + rk qk 2 yk zk 2 + f(yk), zk yk = (1 rk)E rk qk 2 yk zk 2 + (1 rk) [ f(yk), wk yk + f(yk), yk wk ] = (1 rk)E rk qk 2 yk zk 2 0 (Since rk 1.) Putting everything together, E[φ k+1] E [f(wk+1)] Nk+1 and we conclude that E[φ k] E [f(wk)] Nk for all k by induction. We now use the above lemma to prove Theorem 6. Theorem 6. Under the same assumptions of Theorem 1 and (iii) the growth condition in Eq. (6), ASGD (Eqs. (4) and (5)) with w1 = y1, γk = 1 ρL, αk = β T k/T , rk = q µ ρL β T k/2T and bk = (1 rk 1) rk 1 α rk+r2 k 1 α converges as, T +1 2c3 exp T κρ α ln(T/β) ρµe2 (ln(T/β))2 where k := E[f(wk) f ] and c3 = exp 2β ρκ ln(T/β) . Proof. Using the reformulation in Lemma 11 gives us qk = µ for all k and z1 = w1. For the estimating sequences defined in Appendix E.2, using Lemma 13, we know that the (reformulated) updates satisfy the following relation, E[f(w T +1)] E[φ T +1] + NT +1 E[φT +1(w )] + NT +1 From Eq. (21), we know that for all w and k, φk(w) (1 λk)f(w) + λkφ1(w) Using these relations, E[f(w T +1)] (1 λT )f + λT φ1(w ) + NT +1 = E[f(w T +1) f ] λT [φ1(w ) f ] + NT +1 By Eq. (20), λT h φ 1 + q1 2 w z1 2 f i + NT +1 Towards Noise-adaptive, Problem-adaptive (Accelerated) SGD Choosing φ 1 = f(w1), λT h f(w1) f + q1 2 w z1 2i + NT Since z1 = w1, q1 = µ, = E[f(w T +1) f ] λT h f(w1) f + µ 2 w w1 2i + 2σ2 i=j+1 (1 ri) Using the fact that λ1 = 1 and λk+1 = (1 rk)λk, we know that that λT = QT k=1(1 rk), and E[f(w T +1) f ] # h f(w1) f + µ 2 w w1 2i + σ2 i=j+1 (1 ri). Now our task is to upper-bound bound the 1 rk terms. From Eq. 
(17), we know that rk = qk+1ηk = rqk+1 ρL αk rqk+1 ρL αk (Since αk 1 for all k) = (1 rk) 1 rqk+1 Since qk = µ for all k, putting everything together, E[f(w T +1) f ] # h f(w1) f + µ 2 w w1 2i + σ2 Denoting k = E[f(wk) f ], using the exponential step-size αk = α k/T = 1 T k/T and that f(w1) f µ 2 w w1 2, k=1 α2k exp Using Lemma 5, we can bound the first term as 1 2 exp r 1 αT ln(T/β) 2β ln(T/β) = 2c3 exp T κρ α ln(T/β) where c3 = exp 2β ρκ ln(T/β) . We can now bound the second term by a proof similar to Lemma 6. Indeed we have k=1 α2k exp k=1 α2k exp r 1 ρκ αk+1 αT +1 = exp 1 ρκ αT +1 k=1 α2k exp r 1 exp 1 ρκ αT +1 k=1 α2k 2(1 α) ρκ 2 (Lemma 17) Towards Noise-adaptive, Problem-adaptive (Accelerated) SGD = exp 1 ρκ αT +1 e2α2 T(1 α)2 exp 1 ρκ αT +1 e2α2 T ln(1/α)2 = exp 1 ρκ αT +1 4ρκ ln(T/β)2 exp 1 ρκ αT +1 exp 2β ρκ ln(T/β) where the inequality comes from the bound in Eq. (10) in the proof of Lemma 5. Putting everything together we obtain E[f(w T +1) f ] 2c3 exp T κρ α ln(T/β) [f(w1) f ] + 2σ2c3κ ln(T/β)2 E.4. Proof for misspecified ASGD In this section, we assume that L and µ are misspecified by positive νL and νµ and we set ν = νLνµ. In particular, we will use L and µ, offline estimates of the smoothness and strong-convexity parameters. W.l.o.g we will assume that µ = νµµ and L = L νL . Importantly, we will assume that νµ 1 i.e. we will only consider the more practical case where we underestimate the strong-convexity. Since νµ 1, f is also µ strongly-convex. Hence, all the equations of (11) hold where we replace µ with µ. Similarly in the definition of φ k in (22) we can replace µ with µ. With this estimated value, the extrapolation parameter bk is computed as follows: r2 k = (1 rk)r2 k 1 ηk ηk 1 + rk µηk. (24) bk = (1 rk 1)rk 1 ηk ηk 1 rk + r2 k 1 ηk ηk 1 , (25) where ηk = γkαk = 1 ρ Lαk = νL ρL β T k/T , rk = q β T k/2T = q νµ ρL β T k/2T = q ν ρκ β T k/2T satisfy the above equations. It can be verified that the reformulation in Appendix E.1 does not use the specific form of rk and ηk, and only relies on the consistency of the update above. Hence, the update can be be reformulated as a 3 variable sequence as in Appendix E.1 with a different choice of ηk and rk, but with qk and zk defined analogously in terms of ηk and rk. Similarly, it can be verified that the definition of the estimating sequences in Appendix E.2 also does not depend on the specific value of rk and ηk, and hence we use the same definition of φk. Towards Noise-adaptive, Problem-adaptive (Accelerated) SGD E.4.1. PROOF OF THEOREM 7 Given the definitions in Appendix E.2, we first prove the following descent lemma with the misspecified step-size. Lemma 14. Using the update in Eq. (14) with ηk = νL ρLαk, and defining k0 := T [ln(νL)]+ ln(T/β) and G = maxj [k0] f(yj) , we obtain the following descent lemma. E[f(wk+1)] E[f(yk)] ηk 2 f(yk) 2 + ν2 L 2ρ2Lα2 kσ2 + I {k < k0} G2ν2 L 2ρL α2 k Proof. By smoothness, and the update in Eq. (14), f(wk+1) f(yk) ηk f(yk), fik(yk) + L 2 η2 k fik(yk) 2 Taking expectation w.r.t. ik, E[f(wk+1)] E[f(yk)] ηk f(yk) 2 + L 2 η2 k E[ fik(yk) 2] (ηk is independent of the randomness in ik.) E[f(yk)] ηk f(yk) 2 + ρL 2 η2 k f(yk) 2 + L 2 η2 kσ2 (By the growth condition in Eq. (6)) = E[f(yk)] ηk f(yk) 2 + ηkνLαk 2 f(yk) 2 + ν2 L 2ρ2Lα2 kσ2 = E[f(yk)] ηk 2 f(yk) 2 ηk 2 (1 νLαk) f(yk) 2 + ν2 L 2ρ2Lα2 kσ2 For k k0, 1 νLαk 0, implying that E[f(wk+1)] E[f(yk)] ηk 2 f(yk) 2 + ν2 L 2ρ2Lα2 kσ2. 
For k < k0, E[f(wk+1)] E[f(yk)] ηk 2 f(yk) 2 + ηkνLαk 2 f(yk) 2 + ν2 L 2ρ2Lα2 kσ2 E[f(wk+1)] E[f(yk)] ηk 2 f(yk) 2 + α2 kν2 L 2ρL f(yk) 2 + ν2 L 2ρ2Lα2 kσ2 (Since ηk = νL Since G = maxj [k0] f(yj) , we can further upper-bound the RHS by E[f(wk+1)] E[f(yk)] ηk 2 f(yk) 2 + G2ν2 L 2ρL α2 k + ν2 L 2ρ2Lα2 kσ2. Towards Noise-adaptive, Problem-adaptive (Accelerated) SGD Let us now prove the equivalent of Lemma 13 using the above modified descent lemma. Lemma 15. For the estimating sequences defined in Appendix E.2 and the updates in Eq. (13)-Eq. (18), E[φ k] := E[inf w φk(w)] E[f(wk)] Nk where Nk := σ2ν2 L 2ρ2L Pk j=1 α2 j Qk i=j+1(1 ri) + G2ν2 L 2ρL Pmin{k0,k} 1 j=1 α2 j Qk i=j+1(1 ri) Proof. We will prove the lemma by induction. For k = 1, we define φ 1 = f(w1), and since Nk 0 for all k, E[φ ] f(w1) N1, thus satisfying the base-case for the induction. For the induction, we will use the fact that Nk+1 = (1 rk)Nk + 2ν2 Lσ2 ρ2L α2 k + I {k < k0} G2ν2 L 2ρL α2 k. Assuming the induction hypothesis, E[φ k] E[f(wk)] Nk, we use Eq. (22) to prove the statement for k + 1 as follows. Taking expectations w.r.t to the randomness in j = 1 to k, E[φ k+1] = (1 rk)E[φ k] + rk E f(yk) rk 2qk+1 f(yk) 2 + (1 rk) qk 2 yk zk 2 + f(yk), zk yk (1 rk)E[f(wk) Nk] + rk E f(yk) rk 2qk+1 f(yk) 2 + (1 rk) qk 2 yk zk 2 + f(yk), zk yk (by the induction hypothesis) = (1 rk)E[f(wk)] + rk E[f(yk)] r2 k 2qk+1 E f(yk) 2 + rk(1 rk) qk 2 yk zk 2 + f(yk), zk yk = (1 rk)E[f(wk)] + rk E[f(yk)] ηk 2 E f(yk) 2 + rk(1 rk) qk 2 yk zk 2 + f(yk), zk yk (1 rk)Nk (Using Eq. (17)) By convexity, f(wk) f(yk) + f(yk), wk yk , (1 rk)E[f(yk) + f(yk), wk yk ] + rk E[f(yk)] ηk 2 E f(yk) 2 + rk(1 rk) qk 2 yk zk 2 + f(yk), zk yk (1 rk)Nk = E h f(yk) ηk 2 f(yk) 2i + rk(1 rk) qk 2 yk zk 2 + f(yk), zk yk + (1 rk)E[ f(yk), wk yk ] (1 rk)Nk By Lemma 14, E f(wk+1) ν2 L 2ρ2Lα2 kσ2 I {k < k0} G2ν2 L 2ρL α2 k + rk(1 rk) qk 2 yk zk 2 + f(yk), zk yk + (1 rk)E[ f(yk), wk yk ] (1 rk)Nk = E [f(wk+1)] + rk(1 rk) qk 2 yk zk 2 + f(yk), zk yk + (1 rk)E[ f(yk), wk yk ] (1 rk)Nk + ν2 L 2ρ2Lα2 kσ2 + I {k < k0} G2ν2 L 2ρL α2 k Towards Noise-adaptive, Problem-adaptive (Accelerated) SGD Since Nk+1 = h (1 rk)Nk + ν2 L 2ρ2Lα2 kσ2 + I {k < k0} G2ν2 L 2ρL α2 k i , E[φ k+1] E [f(wk+1)] Nk+1 + (1 rk)E[ f(yk), wk yk ] + rk(1 rk) qk 2 yk zk 2 + f(yk), zk yk Similar to Lemma 13, we show that (1 rk)E h f(yk), wk yk + rk qk µ 2 yk zk 2 + f(yk), zk yk i 0. For this, we use modified Eq. (13) yk = wk qkrk qk + rk µ(wk zk) = zk yk = zk wk + qkrk qk + rk µ(wk zk) = 1 qkrk qk + rk µ = qk(1 rk) + rk µ (zk wk) = qk+1 qk + rk µ (zk wk) (By Eq. (16)) qk+1 f(yk), zk yk = f(yk), rk qk qk + rk µ (wk zk) = f(yk), yk wk Using this relation to simplify, (1 rk)E f(yk), wk yk + rk qk 2 yk zk 2 + f(yk), zk yk = (1 rk)E rk qk µ 2 yk zk 2 + (1 rk) [ f(yk), wk yk + f(yk), yk wk ] = (1 rk)E rk qk µ 2 yk zk 2 0 (Since rk 1.) Putting everything together, E[φ k+1] E [f(wk+1)] Nk+1 and we conclude that E[φ k] E [f(wk)] Nk for all k by induction. Towards Noise-adaptive, Problem-adaptive (Accelerated) SGD We now use the above lemma to prove Theorem 7. Theorem 7. Under the same assumptions as Theorem 6 and (iv) ν = νLνµ ρκ, ASGD (Eqs. (4) and (5)) with w1 = y1, γk = 1 ρ L = νL ρL, αk = β T k/T , µ = νµµ µ, rk = q β T k/2T = q ν ρκ β T k/2T and bk = (1 rk 1) rk 1 α rk+r2 k 1 α converges as, T +1 2c3 exp min{ν, 1}T κρ α ln(T/β) + 2c3(ln(T/β))2 ρ + G2 min{k0 T , 1} max{νL νµ , ν2 L}, where k := E[f(wk) f ], c3 = exp 1 ρκ 2β ln(T/β) , [x]+ = max{x, 0}, k0 := T [ln(νL)]+ ln(T/β) and G = maxj [k0] f(yj) . Proof. 
In order to have a valid estimate sequence, we need to restrict values that ν. Since ν ρκ, 0 rk 1 and hence λk (0, 1), as required for a valid estimate sequence. For the estimating sequences defined in Appendix E.2, using Lemma 13, we know that the (reformulated) updates satisfy the following relation E[f(w T +1)] E[φ T +1] + NT +1 E[φT +1(w )] + NT +1 From Eq. (21), we know that for all w and k, φk(w) (1 λk)f(w) + λkφ0(w) Using these relations, E[f(w T +1)] (1 λT )f + λT φ1(w ) + NT = E[f(w T +1) f ] λT [φ1(w ) f ] + NT +1 By Eq. (20), λT h φ 1 + q1 2 w z1 2 f i + NT +1 Choosing φ 1 = f(w1), λT h f(w1) f + q1 2 w z1 2i + NT Since z1 = w1, and we set q1 = µ, = E[f(w T +1) f ] λT f(w1) f + µ + σ2ν2 L 2ρ2L i=j+1 (1 ri) + G2ν2 L 2ρL min{k0,T } 1 X i=j+1 (1 ri) Using the fact that λ1 = 1 and λk+1 = (1 rk)λk, and since ν ρκ, we know that rk 1 and λT = QT k=1(1 rk), and E[f(w T +1) f ] # f(w1) f + µ 2 w w1 2 + σ2ν2 L 2ρ2L i=j+1 (1 ri) + G2ν2 L 2ρL min{k0,T } 1 X i=j+1 (1 ri). Towards Noise-adaptive, Problem-adaptive (Accelerated) SGD Now our task is to upper-bound bound the 1 rk terms. From Eq. (17), we know that rk = qk+1ηk = rqk+1νL ρL αk rqk+1νL ρL αk (Since αk 1 for all k) = (1 rk) 1 rqk+1νL Since qk = µ for all k, putting everything together, E[f(w T +1) f ] # f(w1) f + µ 2 w w1 2 + σ2ν2 L 2ρ2L + G2ν2 L 2ρL min{k0,T } 1 X Denoting k = E[f(wk) f ] and 1 = E[f(w1) f ], using the exponential step-size αk = α k/T = 1 T k/T and that f(w1) f µ 2 w1 w 2. Similar to the proof of Theorem 4, if ν > 1, we can replace ν by 1 and get an upper-bound for the 1 q ν ρκαk term. Hence, we define ˆν := min{1, ν}, and obtain the following upper-bound. 1 + σ2ν2 L 2ρ2L k=1 α2k exp + G2ν2 L 2ρL min{k0,T } 1 X k=1 α2k exp Using Lemma 5, we can bound the first term as αT ln(T/β) 2β ln(T/β) 2 exp 2β ρκ ln(T/β) ˆν κρ α ln(T/β) [f(w1) f ] (since ˆν 1) ˆν κρ α ln(T/β) where c3 = exp 2β ρκ ln(T/β) . We can now bound the second term by a proof similar to Lemma 6. Indeed we have k=1 α2k exp k=1 α2k exp ˆν ρκ αk+1 αT +1 ˆν ρκ αT +1 k=1 α2k exp ˆν ρκ αT +1 k=1 α2k 2(1 α) ρκ 2 (Lemma 17) ˆν ρκ αT +1 ! 4ρκ e2ˆνα2 T(1 α)2 Towards Noise-adaptive, Problem-adaptive (Accelerated) SGD ˆν ρκ αT +1 ! 4ρκ e2ˆνα2 T ln(1/α)2 ˆν ρκ αT +1 ! 4ρκ ln(T/β)2 min{k0,T } 1 X k=1 α2k exp ˆν ρκ αT +1 ! 4ρκ ln(T/β)2 min{k0, T} ˆν ρκ αT +1 ! 4ρκ ln(T/β)2 min n ln(ν) ln(T/β), 1 o ˆν ρκ αT +1 ˆν ρκ ln(T/β) where the inequality comes from the bound in Eq. (10) in the proof of Lemma 5. Putting everything together we obtain E[f(w T +1) f ] 2c3 exp ˆνT κρ α ln(T/β) 1 + 2c3κ ln(T/β)2 e2α2T σ2ν2 L ˆνρL + 2c3κ ln(T/β)2 e2α2T min [ln(νL)]+ ln(T/β) , 1 G2ν2 L Lˆν ˆνT κρ α ln(T/β) 1 + 2c3κ ln(T/β)2 e2α2T σ2 max{ν2 L, νL/νµ} ρL + 2c3κ ln(T/β)2 e2α2T min [ln(νL)]+ ln(T/β) , 1 G2 max{ν2 L, νL/νµ} L ˆνT κρ α ln(T/β) 1 + 2c3 ln(T/β)2 e2α2T σ2 max{ν2 L, νL/νµ} ρµ + 2c3 ln(T/β)2 e2α2T min [ln(νL)]+ ln(T/β) , 1 G2 max{ν2 L, νL/νµ} µ For the second inequality, we consider the term ν2 L ˆν . If ˆν = ν, then ν2 L ˆν = ν2 L ν = νL νµ . If ˆν = 1, then ν2 L ˆν = ν2 L. Putting these two cases together we get max{ν2, νL/νµ}. The last equality comes from the fact that κ Towards Noise-adaptive, Problem-adaptive (Accelerated) SGD F. Helper Lemmas Lemma 16. For all x > 1, 1 x 1 2 ln(x) Proof. For x > 1, we have 1 x 1 2 ln(x) ln(x) < 2x 2 Define f(x) = 2x 2 ln(x). We have f (x) = 2 1 x. Thus for x 1, we have f (x) > 0 so f is increasing on [1, ). Moreover we have f(1) = 2 2 ln(1) = 0 which shows that f(x) 0 for all x > 1 and ends the proof. Lemma 17. For all x, γ > 0, Proof. 
Let x > 0. Define f(γ) = γ ex γ exp( x). We have f(γ) = exp (γ ln(γ) γ ln(ex)) exp( x) f (γ) = γ 1 γ + ln(γ) ln(ex) exp (γ ln(γ) γ ln(ex)) f (γ) 0 1 + ln(γ) ln(ex) 0 γ exp (ln(ex) 1) = x So f is decreasing on (0, x] and increasing on [x, ). Moreover, x exp( x) = 1 x exp( x) = 0 and thus f(γ) 0 for all γ > 0 which proves the lemma. Lemma 18. For any sequence αk k=1 (1 αk) + i=k+1 (1 αi) = 1 Proof. We show this by induction on T. For T = 1, (1 α1) + α1 = 1 Induction step: k=1 (1 αk) + i=k+1 (1 αi) = (1 αT +1) k=1 (1 αk) + i=k+1 (1 αi) = (1 αT +1) k=1 (1 αk) + αT +1 + (1 αT +1) i=k+1 (1 αi) Towards Noise-adaptive, Problem-adaptive (Accelerated) SGD = (1 αT +1) k=1 (1 αk) + i=k+1 (1 αi) (Induction hypothesis) = (1 αT +1) + αT +1 = 1 G. Comparison to Aybat et al. (2019) Aybat et al. (2019) use the condition Ei fi(w) f(w) 2 σ2 + η2 w w 2 (26) In order to relate the above condition to the one in Eq. (6), we use smoothness of f, E fik(xk) f(xk) 2 = E fik(xk) 2 f(xk) 2 σ2 + (ρ 1) f(xk) 2 σ2 + (ρ 1)L2 xk x 2 , and thus η2 = (ρ 1)L2. Similarly, we can use the strong-convexity of f, and obtain the following relation, E fik(xk) 2 = E fik(xk) f(xk) 2 + f(xk) 2 σ2 + η2 xk x 2 + f(xk) 2 µ2 f(xk) 2 + f(xk) 2 , and thus ρ = η2 Hence, the two assumptions are equivalent up to constants. In order to get an idea about the strength of the two results, we compare the two upper-bounds (we note that such a comparison is not ideal, and it may be possible to improve and tailor the upper-bounds to the specific growth condition). Under the assumption in Eq. (26), in Appendix K, the authors prove the following rate, E[f(w T ) f ] O(1) exp T O(1)( κ + η2/µ2) (f(w0) f ) + σ2 If we were to use Eq. (6), since η2 = (ρ 1)L2, this result gives the following upper-bound, E[f(w T ) f ] O(1) exp T O(1)( κ + (ρ 1)κ2) (f(x0) f ) + σ2 Observe that for ρ > 1, this bound does not result in an accelerated rate. On the other hand, if we were to use Eq. (26), since ρ = η2 µ2 + 1 , the result in Theorem 6 can be written as: E[f(w T ) f ] 2c3 exp κ(1 + η2/µ2) (f(w0) f ) + 8σ2c4κ (η2/µ2)Le2 (ln(T/β))2 Observe that for any η > 1, this bound gives an accelerated rate, meaning that the accelerated result in Theorem 6 also holds for the alternative growth condition in Eq. (26).