# High Probability Convergence of Stochastic Gradient Methods

Zijian Liu*1, Ta Duy Nguyen*2, Thien Hang Nguyen*3, Alina Ene2, Huy L. Nguyen3

*Equal contribution. 1 Stern School of Business, New York University. 2 Department of Computer Science, Boston University. 3 Khoury College of Computer Sciences, Northeastern University. Correspondence to: Zijian Liu, Ta Duy Nguyen, Thien Hang Nguyen. Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

In this work, we describe a generic approach to show convergence with high probability for both stochastic convex and non-convex optimization with sub-Gaussian noise. In previous works for convex optimization, either the convergence is only in expectation or the bound depends on the diameter of the domain. Instead, we show high probability convergence with bounds depending on the initial distance to the optimal solution. The algorithms use step sizes analogous to the standard settings and are universal to Lipschitz functions, smooth functions, and their linear combinations. The method can be applied to the non-convex case. We demonstrate an $O((1+\sigma^2\log(1/\delta))/T + \sigma/\sqrt{T})$ convergence rate when the number of iterations $T$ is known and an $O((1+\sigma^2\log(T/\delta))/\sqrt{T})$ convergence rate when $T$ is unknown for SGD, where $1-\delta$ is the desired success probability. These bounds improve over existing bounds in the literature. We also revisit AdaGrad-Norm (Ward et al., 2019) and show a new analysis to obtain a high probability bound that does not require the bounded gradient assumption made in previous works. The full version of our paper contains results for the standard per-coordinate AdaGrad.

1. Introduction

Stochastic optimization is a fundamental area with extensive applications in many domains, ranging from machine learning to algorithm design and beyond. The design and analysis of iterative methods for stochastic optimization has been the focus of a long line of work, leading to a rich understanding of the convergence of paradigmatic iterative methods such as stochastic gradient descent, mirror descent, and accelerated methods for both convex and non-convex optimization. However, most of these works establish convergence guarantees that hold only in expectation. Although very meaningful, such results do not fully capture the convergence behavior of an algorithm when we perform only a small number of runs, as is typical in modern machine learning applications where performing multiple runs of the algorithm carries significant computational and statistical costs (Harvey et al., 2019; Madden et al., 2020; Davis et al., 2021). Thus an important direction is to establish convergence guarantees for a single run of the algorithm that hold not only in expectation but also with high probability. Compared to guarantees that hold in expectation, high probability guarantees are significantly harder to obtain, and they hold in more limited settings with stronger assumptions on the problem and the stochastic noise distribution. Most existing works that establish high probability guarantees focus on the setting where the length of the stochastic noise follows a light-tailed (sub-Gaussian) distribution (Juditsky et al., 2011; Lan, 2012; 2020; Li & Orabona, 2020; Madden et al., 2020; Kavis et al., 2021).
Recent works also study the more challenging heavy-tailed setting, notably under a bounded variance assumption (Nazin et al., 2019; Gorbunov et al., 2020; Cutkosky & Mehta, 2021) or a bounded $p$-th moment assumption (Cutkosky & Mehta, 2021) on the length of the stochastic noise. Both settings are highly relevant in practice (Zhang et al., 2020). Despite this important progress, the convergence of cornerstone methods is not fully understood even in the more structured light-tailed noise setting. Specifically, the existing works for both convex and non-convex optimization with light-tailed noise rely on strong assumptions on the optimization domain and the gradients that significantly limit their applicability:

- The problem domain is restricted to either the unconstrained domain or a constrained domain with bounded Bregman diameter, and the convergence guarantees established depend on the Bregman diameter of the domain instead of the initial distance to the optimum. Even for compact domains, since the diameter can be much larger than the initial distance, these guarantees are pessimistic and diminish the benefits of good initializations. Thus an important direction remains to establish high probability guarantees for general optimization that scale only with the initial Bregman distance.
- The gradients or stochastic gradients are assumed to be bounded even in the smooth setting. These additional assumptions are very restrictive and significantly limit the applicability of the algorithms; for example, they do not apply to important settings such as quadratic optimization. Moreover, the bounded stochastic gradient assumption is more restrictive than other commonly studied assumptions, such as the gradients and the stochastic noise being bounded almost surely.

The above assumptions are not merely a technical artifact; they stem from very important considerations. The high probability convergence guarantees are established via martingale concentration inequalities that impose necessary conditions on how much the martingale sequence can change in each step. However, the natural martingale sequences that arise in optimization depend on quantities such as the distance between the iterates and the optimum and the stochastic gradients, which are not a priori bounded. The aforementioned assumptions ensure that the concentration inequalities can be readily applied, since the relevant stochastic terms are all bounded almost surely. These difficulties are even more pronounced for the analysis of adaptive algorithms in the AdaGrad family that set the step sizes based on the stochastic gradients. The adaptive step sizes introduce correlations between the step sizes and the update directions, and a crucial component is the analysis of the evolution of the adaptive step sizes and the cumulative stochastic noise. If the gradients are bounded, both of these challenges can be overcome by paying error terms proportional to the lengths of the gradients and stochastic gradients. Removing the bounded gradient assumptions requires new technical insights and tools. In addition to requiring stronger assumptions, due to the technical challenges involved, several of the prior works are only able to establish convergence guarantees that do not match the ideal sub-Gaussian rates.
For example, a common approach is to control the relevant quantities across all $T$ iterations of the algorithm via repeated applications of the concentration inequalities, leading to convergence rates that carry additional factors that are poly-logarithmic in $T$. Additionally, achieving noise-adaptive rates that smoothly interpolate between the faster rate in the deterministic setting and the state-of-the-art rate in the stochastic setting is very challenging with existing techniques. This work aims to contribute to this line of work and overcome the aforementioned challenges. To this end, we introduce a novel generic approach to show convergence with high probability under sub-Gaussian gradient noise. Our approach is very general and flexible, and it can be used in both the convex and non-convex settings. Using our approach, we establish high-probability convergence guarantees for several fundamental settings:

- In the convex setting, we analyze stochastic mirror descent and stochastic accelerated mirror descent for general optimization domains and Bregman distances, and we analyze the classical algorithms without any changes. These well-studied algorithms encompass the main algorithmic frameworks for convex optimization with non-adaptive step sizes (Lan, 2020). Our convergence guarantees scale with only the Bregman distance between the initial point and the optimum, and thus they can leverage good initializations. Our high-probability convergence rates are analogous to known results for convergence in expectation (Juditsky et al., 2011; Lan, 2012). The algorithms are universal for both Lipschitz functions and smooth functions.
- In the non-convex setting, we analyze SGD as well as the AdaGrad-Norm algorithm (Ward et al., 2019). Compared to existing works for SGD (Madden et al., 2020; Li & Orabona, 2020), our rates have better dependency on the time horizon and the success probability. For AdaGrad-Norm, our approach allows us to remove the restrictive assumption on the gradients made in previous work (Kavis et al., 2021). In the full version of our paper (available at https://arxiv.org/abs/2302.14843), using a slightly different technique, we give a high probability convergence guarantee for the standard per-coordinate AdaGrad (Duchi et al., 2011). To the best of our knowledge, this is the first result for high probability convergence of AdaGrad.

1.1. Our techniques

Compared to prior works that rely on black-box applications of martingale concentration inequalities such as Freedman's inequality and its extensions (Harvey et al., 2019; Madden et al., 2020), in this work we introduce a white-box concentration argument that leverages existing convergence analyses for first-order methods. More precisely, the high-level approach is to define a novel martingale sequence derived from the standard convergence analyses and to derive concentration results for this sequence from first principles. By leveraging the structure of the optimization problem, we are able to overcome the aforementioned key difficulties associated with black-box applications of martingale concentration results: these concentration results require certain important conditions on how much the martingale sequence can change, which are generally not a priori satisfied for the natural martingales that arise in optimization.
By seamlessly combining the optimization and probability toolkits, we obtain a flexible analysis template that allows us to handle general optimization domains with very large or even unbounded diameter, general objectives that are not globally Lipschitz, and adaptive step sizes.

Our technique is inspired by classical works in concentration inequalities, specifically a type of martingale inequality where the variance of the martingale difference is bounded by a linear function of the previous value. This technique was first applied to showing high probability convergence by Harvey et al. (2019) in the strongly convex setting. Our proof is inspired by the proof of Theorem 7.3 by Chung & Lu (2006). In each time step with iterate $x_t$, let $\xi_t := \widehat{\nabla} f(x_t) - \nabla f(x_t)$ be the error in our gradient estimate. Classical proofs of convergence revolve around analyzing the sum of $\langle \xi_t, x^* - x_t \rangle$, which can be viewed as a martingale sequence. Assuming a bounded domain, the concentration of the sum can be shown via classical martingale inequalities. The key new insight is that instead of analyzing this sum, we analyze a related sum where the coefficients decrease over time to account for the fact that we have a looser grip on the distance to the optimal solution as time increases. Nonetheless, the coefficients are kept within a constant factor of each other and the same asymptotic convergence is attained with high probability.

1.2. Related work

Convex optimization: Nemirovski et al. (2009); Lan (2012) establish high probability bounds for stochastic mirror descent and accelerated stochastic mirror descent with sub-Gaussian noise. The rates shown in these works match the best rates known in expectation, but they depend on the Bregman diameter $\max_{x,y \in X} D_\psi(x, y)$ of the domain, which can be unbounded. Our work complements the analysis with a novel concentration argument that allows us to establish convergence with respect to the distance $D_\psi(x^*, x_1)$ from the initial point. Our analysis applies to the general setting considered in (Lan, 2020) and we use the same sub-Gaussian assumption on the noise. The works by Nazin et al. (2019); Gorbunov et al. (2020) and Parletta et al. (2022) consider the more general setting of bounded variance noise. However, their problem settings are more restricted than ours. Specifically, Nazin et al. (2019) analyze stochastic mirror descent only in the setting where the optimization domain has bounded Bregman diameter. Parletta et al. (2022) analyze modifications of stochastic gradient descent, but only for problems with bounded domains. The works by Gorbunov et al. (2020) for smooth functions and by Gorbunov et al. (2021) for non-smooth functions analyze stochastic gradient descent and accelerated stochastic gradient descent with gradient clipping, for unconstrained optimization in the $\ell_2$ setup. In contrast, our work addresses the sub-Gaussian noise setting but applies to general optimization, and we analyze the classical stochastic mirror descent and accelerated mirror descent without any modifications and with general Bregman distances and optimization domains. The algorithm of Davis et al. (2021) is restricted to well-conditioned objectives that are both smooth and strongly convex, and does not apply to general convex optimization.
Additionally, compared to classical methods such as SGD and stochastic mirror descent, the proposed algorithm solves an auxiliary optimization problem in each iteration and is thus more computationally expensive. The high-probability convergence of SGD is studied in Kakade & Tewari (2008); Rakhlin et al. (2011); Hazan & Kale (2014); Harvey et al. (2019); Dvurechensky & Gasnikov (2016). These works either assume that the function is strongly convex or that the domain is compact. In contrast, our work applies to non-strongly convex optimization with a general domain.

Non-convex optimization: Li & Orabona (2020) demonstrate a high probability bound for an SGD algorithm with momentum, while Madden et al. (2020) and Li & Liu (2022) show bounds for vanilla SGD and generalize to the family of sub-Weibull noise. However, the existing bounds are not optimal, and we improve them in this work using a very different approach. Convergence in high probability of algorithms with adaptive step sizes for non-convex problems has also been studied, for example, by Li & Orabona (2020); Kavis et al. (2021). We note that the algorithm in (Li & Orabona, 2020) is not fully adaptive due to the dependence of the initial step size on the problem parameters, whereas in (Kavis et al., 2021) the gradients or stochastic gradients are required to be uniformly bounded almost surely. Using techniques from Liu et al. (2022) and extending the argument for SGD in Section 4.1, we are able to establish convergence in high probability of the vanilla version of AdaGrad-Norm (Ward et al., 2019; Faw et al., 2022) without any of these additional assumptions. We provide a more detailed comparison with prior work in the subsequent sections.

High probability convergence in the heavy-tailed noise regime has also been studied. However, instead of analyzing existing algorithms, most works propose new algorithms which usually require gradient clipping to ensure convergence. Zhang et al. (2020) propose a gradient clipping algorithm that converges in expectation for noise distributions with heavier tails that satisfy the assumption that the $p$-th moments are bounded for $1 < p \le 2$. Cutkosky & Mehta (2021) propose a more complex clipped SGD algorithm with momentum under the same noise assumption, for which they show high probability convergence. In another line of work, Zhang & Cutkosky (2022) consider parameter-free algorithms that adapt to the initial distance in the heavy-tailed regime. In contrast, we focus here on vanilla algorithms that have been successfully employed, including stochastic mirror descent, stochastic gradient descent, and AdaGrad-Norm with sub-Gaussian noise, and fill in the missing pieces in the literature. We believe our techniques are general and may lead to further progress in the heavy-tailed setting, and we leave this direction to future work.

2. Preliminaries

We consider the problem $\min_{x \in X} f(x)$, where $f : \mathbb{R}^d \to \mathbb{R}$ is the objective function and $X$ is the domain of the problem. In the convex case, we consider the general setting where $f$ is potentially not strongly convex and the domain $X$ is convex but not necessarily compact. The distance between solutions in $X$ is measured by a general norm $\|\cdot\|$, and we let $\|\cdot\|_*$ denote its dual norm. In the non-convex case, we consider the setting where $X$ is $\mathbb{R}^d$ and $\|\cdot\|$ is the $\ell_2$ norm. In this paper, we use the following assumptions: (1) Existence of a minimizer: There exists $x^* = \arg\min_{x \in X} f(x)$.
(2) Unbiased estimator: We assume to have access to a history independent, unbiased gradient estimator b f(x) for any x X, that is, E h b f(x) | x i = f(x). (3) Sub-Gaussian noise: b f(x) f(x) is a σ-sub Gaussian random variable (Definition 2.1). There are several equivalent definitions of sub-Gaussian random variables up to an absolute constant scaling (see, e.g., Proposition 2.5.2 in (Vershynin, 2018)). For convenience, we use the following property as the definition. Definition 2.1. A random variable X is σ-sub-Gaussian if E exp λ2X2 exp λ2σ2 λ such that |λ| 1 We will also use the following helper lemma whose proof we defer to the Appendix. Lemma 2.2. For any a 0, 0 b 1 2σ and an σ-sub Gaussian random variable X, E 1 + b2X2 + (a X + b2X2)i exp 3 a2 + b2 σ2 . When b = 0, the upper bound improves to exp 2a2σ2 . 3. Convex case: Stochastic Mirror Descent and Accelerated Stochastic Mirror Descent In this section, we analyze the Stochastic Mirror Descent algorithm (Algorithm 1) and Accelerated Stochastic Mirror Descent algorithm (Algorithm 2) for convex optimization. We define the Bregman divergence Dψ (x, y) = ψ (x) ψ (y) ψ (y) , x y where ψ : Rd R is a 1-strongly convex mirror map with respect to on X. We remark that the domain of ψ is defined as Rd for simplicity, though it is not necessary. Algorithm 1 Stochastic Mirror Descent Algorithm Parameters: initial point x1 X, step sizes {ηt}, strongly convex mirror map ψ for t = 1 to T: xt+1 = arg minx X n ηt D b f (xt) , x E + Dψ (x, xt) o T PT t=1 xt 3.1. Analysis of Stochastic Mirror Descent The end result of this section is the convergence guarantee of Algorithm 1 for constant step sizes (when the time horizon T is known) and time-varying step sizes (when T is unknown) presented in Theorem 3.1. However, we will emphasize presenting the core idea of our approach, which will serve as the basis for the analysis in subsequent sections. For simplicity, here we consider the non-smooth setting, and assume that f is G-Lipschitz continuous, i.e., we have f(x) G for all x X. However, this is not necessary. The analysis for the smooth setting follows via a simple modification to the analysis presented here as well as the analysis for the accelerated setting given in the next section. Theorem 3.1. Assume f is G-Lipschitz continuous and satisfies Assumptions (1), (2), (3), with probability at least 1 δ, the iterate sequence (xt)t 1 output by Algorithm 1 satisfies (1) Setting ηt = r Dψ(x ,x1) 6(G2+σ2(1+log( 1 δ)))T , then Dψ (x , x T +1) 4Dψ (x , x1), and t=1 (f (xt) f (x )) Dψ (x , x1) G2 + σ2 1 + log 1 (2) Setting ηt = r Dψ(x ,x1) 6(G2+σ2(1+log( 1 δ)))t, then Dψ (x , x T +1) 2(2 + log T)Dψ (x , x1), and t=1 (f (xt) f (x )) 2 T (2 + log T) Dψ (x , x1) G2 + σ2 1 + log 1 We define ξt := b f (xt) f (xt) and let Ft = σ (ξ1, . . . , ξt 1) denote the natural filtration. Note that xt is Ft-measurable. The starting point of our analysis is the following inequality that follows from the standard stochastic mirror descent analysis (see, e.g., (Lan, 2020)). We include the proof in the Appendix for completeness. High Probability Convergence of Stochastic Gradient Methods Lemma 3.2. (Lan, 2020) For every iteration t, we have At := ηt (f (xt) f (x )) η2 t G2 + Dψ (x , xt+1) Dψ (x , xt) ηt ξt, x xt + η2 t ξt 2 . We now turn our attention to our main concentration argument. 
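Before turning to that concentration argument, the following is a minimal runnable sketch of Algorithm 1 for the special case of the Euclidean mirror map $\psi(x) = \frac{1}{2}\|x\|^2$, where the mirror step reduces to a projected stochastic gradient step. The least-squares instance, the Gaussian noise oracle, the projection radius, and the constants $G$, $D_\psi(x^*, x_1)$, $T$, $\delta$ below are illustrative placeholders rather than anything prescribed by the paper, and the step size only follows the form given in Theorem 3.1.

```python
import numpy as np

def smd_euclidean(grad_oracle, project, x1, T, eta):
    """Algorithm 1 with psi(x) = 0.5*||x||^2: each mirror step
    argmin_x eta*<g, x> + D_psi(x, x_t) becomes a projected gradient step.
    Returns the average iterate (1/T) * sum_{t=1}^T x_t."""
    x = x1.copy()
    avg = np.zeros_like(x1)
    for _ in range(T):
        avg += x / T                      # running average of x_1, ..., x_T
        g = grad_oracle(x)                # unbiased gradient estimate with sub-Gaussian noise
        x = project(x - eta * g)          # mirror/projection step in the Euclidean setup
    return avg

# --- hypothetical instance: least squares over an l2 ball of radius 5 ---
rng = np.random.default_rng(0)
d = 20
A, b = rng.standard_normal((100, d)), rng.standard_normal(100)
f = lambda x: 0.5 * np.mean((A @ x - b) ** 2)
grad = lambda x: A.T @ (A @ x - b) / len(b)
sigma = 0.5
oracle = lambda x: grad(x) + sigma * rng.standard_normal(d)   # additive Gaussian noise is sigma-sub-Gaussian
project = lambda x: x / max(1.0, np.linalg.norm(x) / 5.0)

# step size of the form in Theorem 3.1(1); D1, G, delta are illustrative guesses
T, delta, G, D1 = 2000, 0.01, 5.0, 10.0
eta = np.sqrt(D1 / (6 * (G**2 + sigma**2 * (1 + np.log(1 / delta))) * T))
print(f(smd_euclidean(oracle, project, np.zeros(d), T, eta)))
```

For a general mirror map, the update inside the loop would instead solve $x_{t+1} = \arg\min_{x \in X}\{\eta_t \langle \widehat{\nabla} f(x_t), x\rangle + D_\psi(x, x_t)\}$, as in the listing of Algorithm 1.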
Towards our goal of obtaining a high-probability convergence rate, we analyze the moment generating function for a random variable that is closely related to the left-hand side of the inequality above. We let {wt} be a sequence where wt 0 for all t. We define Zt = wt At vt Dψ (x , xt) , 1 t T, where vt = 6σ2η2 t w2 t , i=t Zi, 1 t T + 1. Before proceeding with the analysis, we provide intuition for our approach. If we consider S1, we see that it combines the gains in function value gaps with weights given by the sequence {wt} and the losses given by the Bregman divergence terms Dψ (x , xt) with coefficients vt chosen based on the step size ηt and wt. The intuition here is that we want to transfer the error from the stochastic error terms on the RHS of Lemma 3.2 into the loss term vt Dψ (x , xt) then leverage the progression of the Bregman divergence Dψ (x , xt+1) Dψ (x , xt) to absorb this loss. For the first step, we can do that by setting the coefficient vt to equalize coefficient of divergence term that will appear from the RHS of Lemma 3.2. For the second step, we can aim at making all the divergence terms telescope, by selecting vt and wt such that wt + vt wt 1 to have a telescoping sum of the terms wt Dψ (x , xt+1) wt 1Dψ (x , xt). In the end we will obtain a bound for the function value gaps in terms of only the deterministic quantities, namely ηt, wt, G and the initial distance. In Theorem 3.3, we upper bound the moment generating function of S1 and derive a set of conditions for the weights {wt} that allow us to absorb the stochastic errors. In Corollary 3.4, we show how to choose the weights {wt} and obtain a convergence rate that matches the standard rates that hold in expectation. We now give our main concentration argument that bounds the moment generating function of St inspired by the proof of Theorem 7.3 in (Chung & Lu, 2006). Theorem 3.3. Suppose that wtη2 t 1 4σ2 for every 1 t T. For every 1 t T + 1, we have E [exp (St) | Ft] exp Proof. We proceed by induction on t. Consider the base case t = T + 1. We have the inequality holds true trivially. Next, we consider 1 t T. We have E [exp (St) | Ft] = E [exp (Zt + St+1) | Ft] = E [E [exp (Zt + St+1) | Ft+1] | Ft] . (1) We now analyze the inner expectation. Conditioned on Ft+1, Zt is fixed. Using the inductive hypothesis, we obtain E [exp (Zt + St+1) | Ft+1] exp (Zt) exp i=t+1 wiη2 i Plugging into (1), we obtain E [exp (St) | Ft] E [exp (Zt) | Ft] exp i=t+1 wiη2 i By Lemma 3.2 exp (Zt) = exp wt ηt (f (xt) f (x )) η2 t G2 + Dψ (x , xt+1) Dψ (x , xt) vt Dψ (x , xt) exp wtηt ξt, x xt + wtη2 t ξt 2 exp ( vt Dψ (x , xt)) . Next, we analyze the first term in the last line of the above inequality in expectation. Let Xt = ξt, x xt . Using Taylor expansion of ex, and that E [Xt | Ft] = 0, we have E h exp wt Xt + wtη2 t ξt 2 | Ft i 1 + wtη2 t ξt 2 wt Xt + wtη2 t ξt 2 i | Ft 1 + wtη2 t ξt 2 wtηt x xt ξt + wtη2 t ξt 2 i | Ft (b) exp 3σ2 w2 t η2 t x xt 2 + wtη2 t (c) exp 3σ2 2w2 t η2 t Dψ (x , xt) + wtη2 t . (4) For (a), we use Cauchy-Schwartz and obtain Xt = ηt ξt, x xt ηt ξt x xt . For (b), we apply Lemma 2.2 with X = ξt , a = wtηt x xt , and High Probability Convergence of Stochastic Gradient Methods b2 = wtη2 t 1 4σ2 . For (c), we use that Dψ (x , xt) 1 2 x xt 2 from the strong convexity of ψ. Plugging back into (3) and using that vt = 6σ2η2 t w2 t , we obtain the desired inequality E [exp (St) | Ft] 6σ2η2 t w2 t vt Dψ (x , xt) + 3σ2 T X Using Markov s inequality, we obtain the following convergence guarantee. Corollary 3.4. 
Suppose the sequence {wt} satisfies the conditions of Theorem 3.3 and that wt + 6σ2η2 t w2 t wt 1. For any δ > 0, with probability at least 1 δ: t=1 wtηt (f (xt) f (x )) + w T Dψ (x , x T +1) w0Dψ (x , x1) + G2 + 3σ2 T X t=1 wtη2 t + log 1 With the above result in hand, we complete the convergence analysis by showing how to define the sequence {wt} with the desired properties. Below we give the choice of ηt and wt for fixed step sizes. The choice for time-varying step sizes can be found in Corollary B.1 in the appendix. Corollary 3.5. Suppose we run the Stochastic Mirror Descent algorithm with fixed step sizes ηt = η T . Let w T = 1 12σ2η2 and wt 1 = wt + 6 T σ2η2w2 t for all 1 t T. The sequence {wt} satisfies the conditions required by Corollary 3.4. By Corollary 3.4, for any δ > 0, the following events hold with probability at least 1 δ: Dψ (x , x T +1) 2Dψ (x , x1) + 12 G2 + σ2 1 + log 1 t=1 (f (xt) f (x )) 1 T 2Dψ (x , x1) G2 + σ2 1 + log 1 In particular, setting ηt = r Dψ(x ,x1) 6(G2+σ2(1+log( 1 δ)))T we ob- tain the first case of Theorem 3.1. Proof. Recall from Corollary 3.4 that the sequence {wt} needs to satisfy the following conditions for all 1 t T: wt + 6σ2η2 t w2 t wt 1; and wtη2 t 1 4σ2 . Algorithm 2 Accelerated Stochastic Mirror Descent Algorithm (Lan, 2020). Parameters: initial point x0 = y0 = z0 X, step size η, strongly convex mirror map ψ for t = 1 to T: Set αt = 2 t+1 xt = (1 αt) yt 1 + αtzt 1 zt = arg minx X ηt D b f(xt), x E + Dψ (x, zt 1) yt = (1 αt) yt 1 + αtzt return y T Let C = 6σ2η2. We set w T = 1 C+6σ2η2 = 1 2C . For 1 t T, we set wt so that the first condition holds with equality wt 1 = wt + 6σ2w2 t η2 t = wt + 6 T σ2η2w2 t . We can show by induction that, for every 1 t T, The base case t = T follows from the definition of w T . Consider 1 t T. Using the definition of wt 1 and the inductive hypothesis, we obtain wt 1 = wt + 6 T σ2η2t + 6σ2η2 T σ2η2t C + 6 T σ2η2(t 1) C + 6 T σ2η2 (t 1) C + 6 T σ2η2 (t 1) as needed. This fact implies the second condition as follows: wtη2 t = wt η2 6σ2η2t = 1 6σ2 . Thus, using Corollary 3.4, w T = 1 2C , and 1 2C wt 1 C for all 0 t T, we obtain the desired inequalities. 3.2. Analysis of Accelerated Stochastic Mirror Descent In this section, we extend the analysis detailed in the previous section to analyze the Accelerated Stochastic Mirror Descent Algorithm (Algorithm 2). We assume that f satisfies the following condition: for all x, y X f(y) f(x) + f (x) , y x + G y x + L 2 y x 2 . (5) High Probability Convergence of Stochastic Gradient Methods Note that L-smooth functions, G-Lipschitz functions, and their sums all satisfy the above condition. Here, we obtain the following guarantees in Theorem 3.6. Theorem 3.6. Assume f satisfies Assumptions (1), (2), (3) and condition (5). Then, with probability at least 1 δ, the output y T of the Accelerated Stochastic Mirror Descent algorithm (Algorithm 2) satisfies (1) Setting ηt = min G2+σ2(1+log( 1 then Dψ (x , z T ) 4Dψ (x , z0) and f (y T ) f (x ) 16LDψ (x , z0) Dψ (x , z0) G2 + 1 + log 1 (2) Setting ηt = min G2+σ2(1+log( 1 then Dψ (x , z T ) 2(2 + log T)Dψ (x , z0) and f (y T ) f (x ) 16LDψ (x , z0) 6(2 + log T) Dψ (x , z0) G2 + 1 + log 1 We will only highlight the application of the previous analysis here. Define ξt := b f (xt) f (xt). We start with the inequalities shown in the standard analysis, e.g, from (Lan, 2020) (proof in the Appendix). Lemma 3.7. 
(Lan, 2020) For every iteration t, we have αt (f (yt) f (x )) αt (f (yt 1) f (x )) η2 t 1 Lαtηt G2 + Dψ (x , zt) Dψ (x , zt 1) ηt ξt, x zt 1 + η2 t 1 Lαtηt ξt 2 . We now turn our attention to our main concentration argument. Similar to the previous section, we define Zt = wt Bt vt Dψ (x , zt 1) , 1 t T, where vt = 6σ2w2 t η2 t , i=t Zi, 1 t T + 1. Notice that we are following the same steps as in the previous section. By transferring the error terms in the RHS of Lemma 3.7 into the Bregman divergence terms Dψ (x , zt 1), we can absorb them by setting the coefficients appropriately. In the same manner, we can show the following Theorem: Theorem 3.8. Suppose that wtη2 t 1 Lαtηt 1 4σ2 for every 0 t T. Then, for every 1 t T + 1, we have E [exp (St) | Ft] exp i=t wi η2 i 1 Lαiηi Corollary 3.9. Suppose the sequence {wt} satisfies the conditions of Theorem 3.8. For any δ > 0, the following event holds with probability at least 1 δ: ηt αt (f (yt) f (x )) αt (f (yt 1) f (x )) + w T Dψ (x , z T ) w0Dψ (x , z0) + G2 + 3σ2 T X t=1 wt η2 t 1 Lαtηt + log 1 With the above result in hand, we can complete the convergence analysis by showing how to define the sequence {wt} with the desired properties. For the algorithm with known T, we set αt = 2 t+1, ηt = ηt for η 1 4L, w T = 1 3σ2η2T (T +1)(2T +1) and wt 1 = wt + 6σ2η2 t w2 t for all 1 t T. For the algorithm with unknown T, we set αt = 2 t+1, ηt = min{ t t}, w T = 1 12σ2( PT t=1 η2 t) and wt 1 = wt + 6σ2η2 t w2 t for all 1 t T. In the Appendix, we show that these choices have the desired properties (Corollaries B.2 and B.3). 4. Nonconvex case: Stochastic Gradient Descent and Ada Grad-Norm In this section, we analyze the Stochastic Gradient Descent (SGD) algorithm (Algorithm 3) and the adaptive version, commonly known as Ada Grad-Norm (Algorithm 4) for nonconvex optimization, where we look to find an approximate stationary point of f. Here, we assume that the optimization problem has domain X = Rd, and that f is an L-smooth function, i.e., the gradients of f is L-Lipschitz: f(x) f(y) L x y , x, y Rd. This implies the following inequality on f at any x, y Rd: f(y) f(x) f(x), y x + L 2 y x 2 . (6) High Probability Convergence of Stochastic Gradient Methods Algorithm 3 Stochastic Gradient Descent (SGD) Parameters: initial point x1, step sizes {ηt} for t = 1 to T do xt+1 = xt ηt b f(xt) 4.1. Analysis of Stochastic Gradient Descent In this section, we provide a high probability analysis of SGD (Algorithm 3) that is tighter than previous works. Our main result is presented in Theorem 4.1. Theorem 4.1. Assume that f is L-smooth and satisfies Assumptions (1), (2), (3), and let 1 := f(x1) f(x ). Then, with probability at least 1 δ, the iterate sequence (xt)t 1 output by Algorithm 3 satisfies (1) Setting ηt = min 1 L; q t=1 f(xt) 2 2 1L T + 12σ2 log 1 (2) Setting ηt = 1 L t=1 f(xt) 2 2 1L + 3σ2 (1 + log T) + 12σ2 log 1 Comparison with prior work: When the time horizon T is known to the algorithm, by choosing the step size η in part (1) of Theorem 4.1, the bound is adaptive to noise, i.e, when σ = 0 we recover O( 1 T ) convergence rate of the (deterministic) gradient descent algorithm. Notice that the bound in this case does not have a log T term incurred. When T is unknown, the extra log T appears as a result of setting a time-varying step size ηt = 1 L t. This log T appears as an additive term to the log 1 δ term, as opposed to being multiplicative, i.e, log T log 1 δ as in previous works (Li & Orabona, 2020; Madden et al., 2020; Li & Liu, 2022). 
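To make the two step-size regimes of Theorem 4.1 concrete, here is a minimal sketch of Algorithm 3 on a hypothetical smooth objective with an additive-Gaussian-noise oracle. The objective, the noise level, and the bound $\Delta_1$ are placeholders, and the known-horizon step size only mirrors the form $\min\{1/L, \Theta(\sqrt{\Delta_1/(\sigma^2 L T)})\}$ without reproducing the exact constant used in the theorem.

```python
import numpy as np

def sgd(grad_oracle, x1, T, L, sigma, delta1=None):
    """Algorithm 3 (SGD). If delta1 (an upper bound on f(x_1) - f(x*)) is given,
    T is treated as known and a constant step size of the form
    min(1/L, sqrt(delta1 / (sigma^2 * L * T))) is used (constant illustrative);
    otherwise the time-varying step size eta_t = 1 / (L * sqrt(t)) is used."""
    if delta1 is not None:
        eta_const = 1.0 / L if sigma == 0 else min(1.0 / L, np.sqrt(delta1 / (sigma**2 * L * T)))
    x = x1.copy()
    iterates = [x.copy()]
    for t in range(1, T + 1):
        eta = eta_const if delta1 is not None else 1.0 / (L * np.sqrt(t))
        x = x - eta * grad_oracle(x)
        iterates.append(x.copy())
    return iterates

# hypothetical L-smooth non-convex objective f(x) = sum_i x_i^2 / (1 + x_i^2), with L = 2
rng = np.random.default_rng(1)
grad = lambda x: 2 * x / (1 + x**2) ** 2
sigma = 0.3
oracle = lambda x: grad(x) + sigma * rng.standard_normal(x.shape)
xs = sgd(oracle, rng.standard_normal(10), T=5000, L=2.0, sigma=sigma, delta1=5.0)
print(min(np.linalg.norm(grad(x)) for x in xs))   # proxy for min_t ||grad f(x_t)||
```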
To proceed to the analysis, we define for t 1 t := f(xt) f(x ); ξt := b f(xt) f(xt). We let Ft := σ (ξ1, . . . , ξt 1) denote the natural filtration. Note that xt is Ft-measurable. The following lemma serves as the fundamental step of our analysis, the proof of which can be found in the appendix. Lemma 4.2. For t 1, we have f(xt) 2 + t+1 t Lη2 t ηt f(xt), ξt + Lη2 t 2 ξt 2 . (7) Now we can follow the similar concentration argument from the convex setting. The difference now is the error term in the RHS of (7) can be transferred into the gradient term f(xt) 2 instead of a function value gap term. This actually makes things easier since this term can be readily absorbed by the gradient term in Ct, and we do not have to carefully impose an additional condition on wt to make a telescoping sum. For wt 0, we define Zt = wt Ct vt f(xt) 2 , 1 t T, where vt = 3σ2w2 t η2 t (ηt L 1)2, i=t Zi, 1 t T + 1. Using the same technique as in the previous section, we can prove the following key inequality. Theorem 4.3. Suppose that ηt and wt satisfy 0 wtη2 t L 1 2σ2 for all 1 t T. Then E [exp (St) | Ft] exp Markov s inequality gives us the following guarantee. Corollary 4.4. For all 1 t T, if ηt L 1 and 0 wtη2 t L 1 2σ2 , then f(xt) 2 + w T T +1 t=2 (wt wt 1) t + 3σ2 T X wtη2 t L 2 + log 1 Equipped with Lemmas 4.2 and 4.3, we are ready to prove Theorem 4.1 by specifying the choice of wt that satisfy the condition of Lemma 4.3. In the first case, we choose ηt = η, wt = w = 1 6σ2η where η = min{ 1 1 σ2LT }. In the second case, we set ηt = η t and wt = w = 1 6σ2η, where η = 1 L. We show the full proof in the Appendix. 4.2. Analysis of Ada Grad-Norm In this section, we show that Ada Grad-Norm (Algorithm 4) converges with high probability under minimal assumptions. Our main result is presented in Theorem 4.5. Algorithm 4 Ada Grad-Norm Parameters: x1, η, b0 for t = 1 to T do b2 0 + Pt i=1 b f(xi) 2 xt+1 = xt η High Probability Convergence of Stochastic Gradient Methods Theorem 4.5. Assume f is L-smooth and satisfies Assumptions (1), (2), (3). With probability at least 1 3δ, the iterate sequence (xt)t 1 output by Algorithm 4 satisfies t=1 f(xt) 2 4σ2T + r(δ) ηL log 4σ2T + r(δ) b2 0 + g(δ) 32ηL log 64ηL b0 + 16g(δ) 2 , where r(δ) = 2b2 0 + 4σ2 log 1 δ = O(1 + σ2 log 1 and g(δ) = 1 η + f(x1) 2 4b2 0 + 8σ2 b2 0 1 + log T δ = O(1 + σ2 log T Comparison with prior work: (Ward et al., 2019; Faw et al., 2022) show the convergence of this algorithm with polynomial dependency on 1 δ where 1 δ is the success probability. The latter relaxes several assumptions made in the former, including the boundedness of the gradients and noise variance. When assuming a sub-Gaussian noise, (Kavis et al., 2021) show a convergence in high probability, but still assume that the gradients are bounded. We remove this assumption and establish the convergence of Algorithm 4 in the following theorem. We highlight that the bound in 4.5 is adaptive to noise. When σ = 0, we obtain the O( 1 T ) convergence of the deterministic Ada Grad-Norm. We next give an overview of the technique. We will start from Lemma 4.6 (proof in the Appendix) and proceed to bound each term in the RHS of (10). In contrast to the techniques used in (Kavis et al., 2021), in which they multiply both sides of (23) by bt to separate bt from the term f(xt), ξt , we rely on the insight from (Liu et al., 2022) and multiply by bt 2bt b0 . This factor is but a small deviation from a constant, which helps us obtain a coefficient for f(xt), ξt that depends on bt 1. 
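Before continuing with this argument, the following is a minimal sketch of Algorithm 4 (AdaGrad-Norm) itself: a single scalar accumulator $b_t^2 = b_0^2 + \sum_{i \le t} \|\widehat{\nabla} f(x_i)\|^2$ scales the update $x_{t+1} = x_t - (\eta / b_t)\,\widehat{\nabla} f(x_t)$, with no knowledge of $L$ or $\sigma$ required. The objective and noise oracle reuse the hypothetical instance from the SGD sketch above and are not from the paper.

```python
import numpy as np

def adagrad_norm(grad_oracle, x1, T, eta, b0):
    """Algorithm 4 (AdaGrad-Norm): scalar adaptive step size eta / b_t with
    b_t^2 = b_0^2 + sum of squared stochastic-gradient norms observed so far."""
    x = x1.copy()
    b_sq = b0 ** 2
    traj = [x.copy()]
    for _ in range(T):
        g = grad_oracle(x)
        b_sq += float(np.dot(g, g))          # accumulate ||grad estimate||^2
        x = x - (eta / np.sqrt(b_sq)) * g    # x_{t+1} = x_t - (eta / b_t) * grad estimate
        traj.append(x.copy())
    return traj

# same hypothetical smooth objective and noisy oracle as in the SGD sketch
rng = np.random.default_rng(2)
grad = lambda x: 2 * x / (1 + x**2) ** 2
oracle = lambda x: grad(x) + 0.3 * rng.standard_normal(x.shape)
traj = adagrad_norm(oracle, rng.standard_normal(10), T=5000, eta=1.0, b0=1.0)
print(min(np.linalg.norm(grad(x)) for x in traj))
```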
This makes the term f(xt),ξt 2bt 1 b0 a sub-Gaussian random variable. To bound PT t=1 f(xt),ξt 2bt 1 b0 , we follow an argument similar to the proof of Lemma 4.3. Finally, by bounding PT t=1 ξt 2 via a simple concentration argument, we can obtain a relationship between b T and PT t=1 f(xt) 2. Combining this with Lemma 4.6, we arrive at Theorem 4.5 via a self-bounding argument as used in (Li & Orabona, 2019). Lemma 4.6. For t 1, let ξt = b f(xt) f(xt), and Mt = maxi t ξi 2 then we have 2bt b0 ηLMT η b T T +1 η (2b T b0) 2 log b2 T b2 0 2bt 1 b0 . (10) Now, notice that f(xt),ξt 2bt 1 b0 follows a sub-Gaussian distribution with mean 0, we can obtain a bound for PT t=1 f(xt),ξt 2bt 1 b0 in the next lemma. The choice of the coefficient w is crucial but will be specified later. Lemma 4.7. For any w > 0, with probability at least 1 δ t=1 f(xt), ξt b2 0 log b2 T b2 0 + 4wσ2 f(xt 1) 2 (2bt 1 b0)2 + 2wσ2 f(x1) 2 Returning to Lemma 4.6, by choosing an appropriate coefficient w in Lemma 4.7, we can use a fraction of the LHS of (10) to cancel out the term PT t=2 4wσ2 f(xt 1) 2 (2bt 1 b0)2 in (11). It is also known that with probability at least 1 δ, MT σ2 1 + log T δ (Li & Orabona, 2020; Liu et al., 2022). Further, we have a relationship between PT t=1 f(xt) 2 and b T : v u u tb2 0 + 2 t=1 2 f(xt) 2. (12) The term PT t=1 ξt 2 can be bounded by σ2T + σ2 log 1 δ with high probability as in Lemma C.1. Finally for the LHS of 4.6, we have PT t=1 f(xt) 2 1 b T PT t=1 f(xt) 2. Now we can solve a system combin- ing the two relationships between PT t=1 f(xt) 2 and b T to obtain the desired bound. 5. Conclusion In this work, we present a generic approach to prove high probability convergence of stochastic gradient methods under sub-Gaussian noise. In the convex case, we show high probability bounds for (accelerated) SMD that depend on the distance from the initial solution to the optimal solution and do not require the bounded domain or bounded Bregman divergence assumption. In the non-convex case, we apply the same approach and obtain a high probability bound for SGD that improves over existing works. We also show that the boundedness of the gradients can be removed when showing high probability convergence of Ada Grad-Norm. Acknowledgement TDN and AE were supported in part by NSF CAREER grant CCF-1750333, NSF grant III-1908510, and an Alfred P. Sloan Research Fellowship. THN and HN were supported by NSF CAREER grant CCF-1750716. High Probability Convergence of Stochastic Gradient Methods Chung, F. and Lu, L. Concentration inequalities and martingale inequalities: a survey. Internet mathematics, 3(1): 79 127, 2006. Cutkosky, A. and Mehta, H. High-probability bounds for non-convex stochastic optimization with heavy tails. Advances in Neural Information Processing Systems, 34: 4883 4895, 2021. Davis, D., Drusvyatskiy, D., Xiao, L., and Zhang, J. From low probability to high confidence in stochastic convex optimization. The Journal of Machine Learning Research, 22(1):2237 2274, 2021. Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of machine learning research, 12(7), 2011. Dvurechensky, P. and Gasnikov, A. Stochastic intermediate gradient method for convex problems with stochastic inexact oracle. Journal of Optimization Theory and Applications, 171(1):121 145, 2016. Faw, M., Tziotis, I., Caramanis, C., Mokhtari, A., Shakkottai, S., and Ward, R. 
The power of adaptivity in sgd: Selftuning step sizes with unbounded gradients and affine variance. ar Xiv preprint ar Xiv:2202.05791, 2022. Gorbunov, E., Danilova, M., and Gasnikov, A. Stochastic optimization with heavy-tailed noise via accelerated gradient clipping. Advances in Neural Information Processing Systems, 33:15042 15053, 2020. Gorbunov, E., Danilova, M., Shibaev, I., Dvurechensky, P., and Gasnikov, A. Near-optimal high probability complexity bounds for non-smooth stochastic optimization with heavy-tailed noise. ar Xiv preprint ar Xiv:2106.05958, 2021. Harvey, N. J., Liaw, C., Plan, Y., and Randhawa, S. Tight analyses for non-smooth stochastic gradient descent. In Conference on Learning Theory, pp. 1579 1613. PMLR, 2019. Hazan, E. and Kale, S. Beyond the regret minimization barrier: optimal algorithms for stochastic strongly-convex optimization. The Journal of Machine Learning Research, 15(1):2489 2512, 2014. Juditsky, A., Nemirovski, A., and Tauvel, C. Solving variational inequalities with stochastic mirror-prox algorithm. Stochastic Systems, 1(1):17 58, 2011. Kakade, S. M. and Tewari, A. On the generalization ability of online strongly convex programming algorithms. Advances in Neural Information Processing Systems, 21, 2008. Kavis, A., Levy, K. Y., and Cevher, V. High probability bounds for a class of nonconvex algorithms with adagrad stepsize. In International Conference on Learning Representations, 2021. Lan, G. An optimal method for stochastic composite optimization. Mathematical Programming, 133(1):365 397, 2012. Lan, G. First-order and stochastic optimization methods for machine learning. Springer, 2020. Li, S. and Liu, Y. High probability guarantees for nonconvex stochastic gradient descent with heavy tails. In International Conference on Machine Learning, pp. 12931 12963. PMLR, 2022. Li, X. and Orabona, F. On the convergence of stochastic gradient descent with adaptive stepsizes. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 983 992. PMLR, 2019. Li, X. and Orabona, F. A high probability analysis of adaptive sgd with momentum. ar Xiv preprint ar Xiv:2007.14294, 2020. Liu, Z., Nguyen, T. D., Ene, A., and Nguyen, H. L. On the convergence of adagrad on Rd: Beyond convexity, non-asymptotic rate and acceleration. ar Xiv preprint ar Xiv:2209.14827, 2022. Madden, L., Dall Anese, E., and Becker, S. High probability convergence and uniform stability bounds for nonconvex stochastic gradient descent. ar Xiv preprint ar Xiv:2006.05610, 2020. Nazin, A. V., Nemirovsky, A. S., Tsybakov, A. B., and Juditsky, A. B. Algorithms of robust stochastic optimization based on mirror descent method. Automation and Remote Control, 80(9):1607 1627, 2019. Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. Robust stochastic approximation approach to stochastic programming. SIAM Journal on optimization, 19(4):1574 1609, 2009. Parletta, D. A., Paudice, A., Pontil, M., and Salzo, S. High probability bounds for stochastic subgradient schemes with heavy tailed noise. ar Xiv preprint ar Xiv:2208.08567, 2022. Rakhlin, A., Shamir, O., and Sridharan, K. Making gradient descent optimal for strongly convex stochastic optimization. ar Xiv preprint ar Xiv:1109.5647, 2011. Vershynin, R. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018. High Probability Convergence of Stochastic Gradient Methods Ward, R., Wu, X., and Bottou, L. Adagrad stepsizes: Sharp convergence over nonconvex landscapes. 
In International Conference on Machine Learning, pp. 6677 6686. PMLR, 2019. Zhang, J. and Cutkosky, A. Parameter-free regret in high probability with heavy tails. ar Xiv preprint ar Xiv:2210.14355, 2022. Zhang, J., Karimireddy, S. P., Veit, A., Kim, S., Reddi, S., Kumar, S., and Sra, S. Why are adaptive methods good for attention models? Advances in Neural Information Processing Systems, 33:15383 15393, 2020. High Probability Convergence of Stochastic Gradient Methods A. Proof of Lemma 2.2 Proof. Consider two cases either a 1/(2σ) or a 1/(2σ). First suppose a 1/(2σ). We use the inequality uv u2 here to first obtain a X + b2X2 i a X + b2X2 i a |X| + b2X2 i 1 4σ2 X2 + a2σ2 + b2X2 i . Thus, we have 1 i! a X + b2X2 i # 4σ2 X2 + a2σ2 + b2X2 i# = E b2X2 + exp 1 4σ2 + b2 X2 + a2σ2 1 4σ2 + b2 X2 a2σ2 4σ2 + b2 X2 + a2σ2 1 4σ2 X2 a2σ2 4σ2 + b2 σ2 + a2σ2 exp 2a2σ2 + b2σ2 exp 3 a2 + b2 σ2 . Next, let c = max(a, b) 1/(2σ). We have 1 i! a X + b2X2 i # = E exp a X + b2X2 a X E a X + exp a2X2 exp b2X2 a X = E exp a2 + b2 X2 + a X exp b2X2 1 E exp a2 + b2 X2 + c |X| exp c2X2 1 E exp a2 + b2 X2 + exp 2c2X2 1 E exp a2 + b2 + 2c2 X2 exp a2 + b2 + 2c2 σ2 exp 3 a2 + b2 σ2 . In the first inequality, we use the inequality ex x ex2 x. In the third inequality, we use x ex2 1 e2x2 1 x. This inequality can be proved with the Taylor expansion. 1 i! x2i + x2i+2 The case when b = 0 simply follows from the above proof. High Probability Convergence of Stochastic Gradient Methods B. Missing Proofs from Section 3 B.1. Stochastic Mirror Descent Proof of Lemma 3.2. By the optimality condition, we have D ηt b f(xt) + x Dψ (xt+1, xt) , x xt+1 E 0 and thus D ηt b f(xt), xt+1 x E x Dψ (xt+1, xt) , x xt+1 . x Dψ (xt+1, xt) , x xt+1 = ψ (xt+1) ψ (xt) , x xt+1 = Dψ (x , xt) Dψ (xt+1, xt) Dψ (x , xt+1) ηt D b f(xt), xt+1 x E Dψ (x , xt) Dψ (x , xt+1) Dψ (xt+1, xt) Dψ (x , xt) Dψ (x , xt+1) 1 2 xt+1 xt 2 , where we have used that Dψ (xt+1, xt) 1 2 xt+1 xt 2 by the strong convexity of ψ. By convexity, f (xt) f (x ) f (xt) , xt x = ξt, x xt + D b f (xt) , xt x E . Combining the two inequalities, we obtain ηt (f (xt) f (x )) + Dψ (x , xt+1) Dψ (x , xt) ηt ξt, x xt + ηt D b f(xt), xt xt+1 E 1 2 xt+1 xt 2 ηt ξt, x xt + η2 t 2 Using the triangle inequality and the bounded gradient assumption f(x) G, we obtain b f(xt) 2 = ξt + f(xt) 2 2 ξt 2 + 2 f(xt) 2 2 ξt 2 + G2 . Thus ηt (f (xt) f (x )) + Dψ (x , xt+1) Dψ (x , xt) ηt ξt, x xt + η2 t ξt 2 + G2 Proof of Corollary 3.4. Let K = 3σ2 T X t=1 wtη2 t + log 1 By Theorem 3.3 and Markov s inequality, we have Pr [S1 K] Pr [exp (S1) exp (K)] exp ( K) E [exp (S1)] exp ( K) exp High Probability Convergence of Stochastic Gradient Methods Note that since vt + wt wt 1 t=1 wtηt (f (xt) f (x )) G2 T X t=1 wtη2 t + t=1 (wt Dψ (x , xt+1) (vt + wt)Dψ (x , xt)) t=1 wtηt (f (xt) f (x )) G2 T X t=1 wtη2 t + t=1 (wt Dψ (x , xt+1) wt 1Dψ (x , xt)) t=1 wtηt (f (xt) f (x )) G2 T X t=1 wtη2 t + w T Dψ (x , x T +1) w0Dψ (x , x1) . Therefore, with probability at least 1 δ, we have t=1 wtηt (f (xt) f (x )) + w T Dψ (x , x T +1) w0Dψ (x , x1) + G2 + 3σ2 T X t=1 wt+1η2 t + log 1 Next we extend the analysis to the setting where the T is not known and we use the step sizes ηt = η t to complete the proof of Theorem 3.1. Corollary B.1. Suppose we run the Stochastic Mirror Descent algorithm with time-varying step sizes ηt = η t. Let w T = 1 12σ2η2( PT t=1 1 t) and wt 1 = wt + 6σ2η2 t w2 t for all 1 t T. The sequence {wt} satisfies the conditions required by Corollary 3.4. 
By Corollary 3.4, for any δ > 0, the following events hold with probability at least 1 δ: Dψ (x , x T +1) 2Dψ (x , x1) + 12 G2 + σ2 1 + log 1 δ η2(1 + log T), and t=1 (f (xt) f (x )) 1 T 2Dψ (x , x1) G2 + σ2 1 + log 1 η(1 + log T). In particular, setting ηt = r Dψ(x ,x1) 6(G2+σ2(1+ln( 1 δ)))t we obtain the second case of Theorem 3.1. Proof of Corollary B.1. Recall from Corollary 3.4 that the sequence {wt} needs to satisfy the following conditions for all 1 t T: wt + 6σ2η2 t w2 t wt 1 and wtη2 t 1 4σ2 . Let Mt = 6σ2 Pt i=1 η2 i and C = MT = 6σ2η2 PT t=1 1 t . We set w T = 1 C+MT . For 1 t T, we set wt so that the first condition holds with equality wt 1 = wt + 6σ2η2 t w2 t . We can show by induction that, for every 1 t T, we have wt 1 C + Mt . The base case t = T follows from the definition of w T . Consider 1 t T. Using the definition of wt and the inductive High Probability Convergence of Stochastic Gradient Methods hypothesis, we obtain wt 1 = wt + 6σ2η2 t w2 t 1 C + Mt + 6σ2η2 t (C + Mt)2 1 C + Mt + (C + Mt) (C + Mt 1) (C + Mt) (C + Mt 1) = 1 C + Mt 1 Using this fact, we now show that {wt} satisfies the second condition. For every 1 t T, we have wtη2 t η2 t C η2 t 6σ2η2 t = 1 6σ2 Thus, by Corollary 3.4, with probability 1 δ, we have t=1 wtηt (f (xt) f (x )) + w T Dψ (x , x T +1) w0Dψ (x , x1) + G2 + 3σ2 T X t=1 wtη2 t + log 1 Note that w T = 1 2C and 1 2C wt 1 C for all 1 t T. Thus, we obtain t=1 (f (xt) f (x )) + 1 2C Dψ (x , x T +1) 1 C Dψ (x , x1) + G2 + 3σ2 1 t=1 η2 t + log 1 Plugging in ηt = η t and simplifying, we obtain t=1 (f (xt) f (x )) + Dψ (x , x T +1) 2Dψ (x , x1) + 2G2 + 6σ2 η2 T X = 2Dψ (x , x1) + 2G2 + 6σ2 1 + 2 log 1 Thus, we have t=1 (f (xt) f (x )) 1 2Dψ (x , x1) η + 2G2 + 6σ2 1 + 2 log 1 Dψ (x , x T +1) 2Dψ (x , x1) + 2G2 + 6σ2 1 + 2 log 1 High Probability Convergence of Stochastic Gradient Methods B.2. Accelerated Stochastic Mirror Descent Proof of Lemma 3.7. Starting with smoothness, we obtain f (yt) f (xt) + f (xt) , yt xt + G yt xt + L 2 yt xt 2 x X = f (xt) + f (xt) , yt 1 xt + f (xt) , yt yt 1 + G yt xt + L = (1 αt) (f (xt) + f (xt) , yt 1 xt ) | {z } convexity +αt (f (xt) + f (xt) , yt 1 xt ) | {z } convexity + αt f (xt) , zt yt 1 + G yt xt + L (1 αt) f (yt 1) + αtf (xt) + αt f (xt) , zt xt + G yt xt | {z } =αt zt zt 1 2 yt xt 2 | {z } =α2 t zt zt 1 2 = (1 αt) f (yt 1) + αtf (xt) + αt f (xt) , zt xt + Gαt zt zt 1 + L 2 α2 t zt zt 1 2 . By the optimality condition for zt, ηt D b f(xt), zt x E x Dψ (zt, zt 1) , x zt = Dψ (x , zt 1) Dψ (zt, zt 1) Dψ (x , zt) . Rearranging, we obtain Dψ (x , zt) Dψ (x , zt 1) + Dψ (zt, zt 1) ηt D b f (xt) , x zt E = ηt f (xt) + ξt, x zt . By combining the two inequalities, we obtain f (yt) + αt ηt (Dψ (x , zt) Dψ (x , zt 1) + Dψ (zt, zt 1)) (1 αt) f (yt 1) + αt (f (xt) + f (xt) , x xt ) | {z } convexity + Gαt zt zt 1 + L 2 α2 t zt zt 1 2 + αt ξt, x zt (1 αt) f (yt 1) + αtf (x ) + Gαt zt zt 1 + L 2 α2 t zt zt 1 2 + αt ξt, x zt . Subtracting f (x ) from both sides, rearranging, and using that Dψ (zt, zt 1) 1 2 zt zt 1 2, we obtain f (yt) f (x ) + αt ηt (Dψ (x , zt) Dψ (x , zt 1)) (1 αt) (f (yt 1) f (x )) + αt ξt, x zt + Gαt zt zt 1 αt 1 Lαtηt 2ηt zt zt 1 2 = (1 αt) (f (yt 1) f (x )) + αt ξt, x zt 1 + αt ξt, zt zt 1 + Gαt zt zt 1 αt 1 Lαtηt 2ηt zt zt 1 2 (1 αt) (f (yt 1) f (x )) + αt ξt, x zt 1 + αt zt zt 1 ( ξt + G) αt 1 Lαtηt 2ηt zt zt 1 2 (1 αt) (f (yt 1) f (x )) + αt ξt, x zt 1 + αtηt 2 (1 Lαtηt) ( ξt + G)2 . 
High Probability Convergence of Stochastic Gradient Methods Finally, we divide by αt ηt , and obtain ηt αt (f (yt) f (x )) + Dψ (x , zt) Dψ (x , zt 1) αt (1 αt) (f (yt 1) f (x )) + ηt ξt, x zt 1 + η2 t 2 (1 Lαtηt) ( ξt + G)2 αt (1 αt) (f (yt 1) f (x )) + ηt ξt, x zt 1 + η2 t 1 Lαtηt ξt 2 + G2 . Proof of Theorem 3.8. We proceed by induction on t. Consider the base case t = T + 1, the inequality trivially holds. Next, we consider t T. We have E [exp (St) | Ft] = E [exp (Zt + St+1) | Ft] = E [E [exp (Zt + St+1) | Ft+1] | Ft] . (13) We now analyze the inner expectation. Conditioned on Ft+1, Zt is fixed. Using the inductive hypothesis, we obtain E [exp (Zt + St+1) | Ft+1] exp (Zt) exp i=t+1 wi η2 i 1 Lαiηi Let Xt = ηt ξt, x zt 1 . By Lemma 3.7, we have ηt αt (f (yt) f (x )) ηt αt (1 αt) (f (yt 1) f (x )) η2 t 1 Lαtηt G2 + Dψ (x , zt) Dψ (x , zt 1) Xt + η2 t (1 Lαtηt) ξt 2 , Zt wt Xt + wt η2 t 1 Lαtηt ξt 2 vt Dψ (x , zt 1) . Plugging into (14), we obtain E [exp (Zt + St+1) | Ft+1] wt Xt vt Dψ (x , zt 1) + wt η2 t 1 Lαtηt ξt 2 + 3σ2 T X i=t+1 wi η2 i 1 Lαiηi Plugging into (13), we obtain E [exp (St) | Ft] vt Dψ (x , zt 1) + 3σ2 T X i=t+1 wi η2 i 1 Lαiηi E exp wt Xt + wt η2 t 1 Lαtηt ξt 2 High Probability Convergence of Stochastic Gradient Methods Next, we analyze the expectation on the RHS of the above inequality. We have E exp wt Xt + wt η2 t 1 Lαtηt ξt 2 wt Xt + wt η2 t 1 Lαtηt ξt 2 1 + wt η2 t 1 Lαtηt ξt 2 + wt Xt + wt η2 t 1 Lαtηt ξt 2 1 + wt η2 t 1 Lαtηt ξt 2 + wtηt x zt 1 ξt + wt η2 t 1 Lαtηt ξt 2 exp 3 w2 t η2 t x zt 1 2 + wt η2 t 1 Lαtηt exp 3 2w2 t η2 t Dψ (x , zt 1) + wt η2 t 1 Lαtηt On the first line we used the Taylor expansion of ex, and on the second line we used that E [Xt | Ft] = 0. On the third line, we used Cauchy-Schwartz and obtained Xt = ηt ξt, x zt 1 ηt ξt x zt 1 . On the fourth line, we applied Lemma 2.2 with X = ξt , a = wtηt x zt 1 , and b2 = wt η2 t 1 Lαtηt 1 4σ2 . On the fifth line, we used that Dψ (x , zt 1) 1 2 x zt 1 2, which follows from the strong convexity of ψ. Plugging in (16) into (15) and using that vt = 6σ2w2 t η2 t , we obtain E [exp (St) | Ft] exp i=t wi η2 i 1 Lαiηi Proof of Corollary 3.9. Let K = 3σ2 T X t=1 wt η2 t 1 Lαtηt + log 1 By Theorem 3.8 and Markov s inequality, we have Pr [S1 K] Pr [exp (S1) exp (K)] exp ( K) E [exp (S1)] exp ( K) exp t=1 wt η2 t 1 Lαtηt High Probability Convergence of Stochastic Gradient Methods Note that since vt + wt wt 1 αt (f (yt) f (x )) ηt (1 αt) αt (f (yt 1) f (x )) t=1 wt Dψ (x , zt) (vt + wt)Dψ (x , zt 1) G2 T X t=1 wt η2 t 1 Lαtηt αt (f (yt) f (x )) ηt (1 αt) αt (f (yt 1) f (x )) t=1 wt Dψ (x , zt) wt 1Dψ (x , zt 1) G2 T X t=1 wt η2 t 1 Lαtηt αt (f (yt) f (x )) ηt (1 αt) αt (f (yt 1) f (x )) + w T Dψ (x , z T ) w0Dψ (x , z0) G2 T X t=1 wt η2 t 1 Lαtηt . Therefore, with probability at least 1 δ, we have αt (f (yt) f (x )) ηt (1 αt) αt (f (yt 1) f (x )) + w T Dψ (x , z T ) w0Dψ (x , z0) + G2 + 3σ2 T X t=1 wt η2 t 1 Lαtηt + log 1 Corollary B.2. Suppose we run the Accelerated Stochastic Mirror Descent algorithm with the standard choices αt = 2 t+1 and ηt = ηt with η 1 4L. Let w T = 1 3σ2η2T (T +1)(2T +1) and wt 1 = wt + 6σ2η2 t w2 t for all 1 t T. The sequence {wt}0 t T satisfies the conditions required by Corollary 3.9. By Corollary 3.9, with probability at least 1 δ, Dψ (x , z T ) 2Dψ (x , z0) + 12 G2 + 1 + log 1 δ σ2 η2T 3 and f (y T ) f (x ) 4Dψ (x , z0) ηT 2 + 24 G2 + 1 + log 1 In particular, setting η = min G2+σ2(1+log( 1 , we obtain the first case of Theorem 3.6. Proof of Corollary B.2. 
Recall from Corollary 3.9 that the sequence {wt} needs to satisfy the following conditions: wt + 6σ2η2 t w2 t wt 1, 1 t T, (17) wtη2 t 1 Lαtηt 1 4σ2 , 0 t T. (18) We will set {wt} so that it satisfies the following additional condition, which will allow us to telescope the sum on the RHS of Corollary 3.9: wt 1 ηt 1 αt 1 wt ηt (1 αt) αt , 1 t T. (19) High Probability Convergence of Stochastic Gradient Methods Given w T , we set wt 1 for every 1 t T so that the first condition (17) holds with equality: wt 1 = wt + 6σ2η2 t w2 t = wt + 6σ2η2t2w2 t . Let C = σ2η2T (T + 1) (2T + 1). We set C + 6σ2η2 PT i=1 i2 = 1 C + σ2η2T (T + 1) (2T + 1) = 1 2σ2η2T (T + 1) (2T + 1). Given this choice for w T , we now verify that, for all 0 t T, we have wt 1 C + 6σ2η2 Pt i=1 i2 = 1 C + σ2η2t (t + 1) (2t + 1). We proceed by induction on t. The base case t = T follows from the definition of w T . Consider t T. Using the definition of wt 1 and the inductive hypothesis, we obtain wt 1 = wt + 6σ2η2t2w2 t 1 C + 6σ2η2 Pt i=1 i2 + 6σ2η2t2 C + 6σ2η2 Pt i=1 i2 2 1 C + 6σ2η2 Pt i=1 i2 + C + 6σ2η2 Pt i=1 i2 C + 6σ2η2 Pt 1 i=1 i2 C + 6σ2η2 Pt i=1 i2 C + 6σ2η2 Pt 1 i=1 i2 C + 6σ2η2 Pt 1 i=1 i2 as needed. Let us now verify that the second condition (18) also holds. Using that 2t t+1 2, Lη 1 4, and T 2, we obtain wtη2 t 1 Lαtηt = wtη2t2 t+1 2wtη2t2 2η2t2 C + 6σ2η2t2 σ2T (T + 1) (2T + 1) + 3σ2t2 1 σ2 (2T + 1) + 3σ2 1 4σ2 Let us now verify that the third condition (19) also holds. Since ηt = ηt and αt = 2 t+1, we have ηt 1 αt 1 = ηt(1 αt) αt = ηt(t 1) 2 . Since wt wt 1, it follows that condition (19) holds. We now turn our attention to the convergence. By Corollary 3.9, with probability 1 δ, we have αt (f (yt) f (x )) ηt (1 αt) αt (f (yt 1) f (x )) + w T Dψ (x , z T ) w0Dψ (x , z0) + G2 + 3σ2 T X t=1 wt η2 t 1 Lαtηt + log 1 Grouping terms on the LHS and using that α1 = 1, we obtain wt ηt αt wt+1 ηt+1 (1 αt+1) (f (yt) f (x )) + w T ηT αT (f (y T ) f (x )) + w T Dψ (x , z T ) w0Dψ (x , z0) + G2 + 3σ2 T X t=1 wt η2 t 1 Lαtηt + log 1 High Probability Convergence of Stochastic Gradient Methods Since {wt} satisfies condition (19), the coefficient of f (yt) f (x ) is non-negative and thus we can drop the above sum. We obtain w T ηT αT (f (y T ) f (x )) + w T Dψ (x , z T ) w0Dψ (x , z0) + G2 + 3σ2 T X t=1 wt η2 t 1 Lαtηt + log 1 Using that w T = 1 2C and wt 1 C for all 0 t T 1, we obtain 1 2C ηT αT (f (y T ) f (x )) + 1 2C Dψ (x , z T ) C Dψ (x , z0) + 1 C G2 + 3σ2 T X η2 t 1 Lαtηt + log 1 ηT αT (f (y T ) f (x )) + Dψ (x , z T ) 2Dψ (x , z0) + 2 G2 + 3σ2 T X η2 t 1 Lαtηt + 2C log 1 = 2Dψ (x , z0) + 2 G2 + 3σ2 T X η2 t 1 Lαtηt + 2σ2 log 1 η2T (T + 1) (2T + 1) . Using that Lη 1 4 and 2t t+1 2, we obtain η2 t 1 Lαtηt = t=1 2η2t2 = 1 3η2T (T + 1) (2T + 1) . Plugging in and using that ηT = ηT and αT = 2 T +1, we obtain η T (T + 1) 2 (f (y T ) f (x )) + Dψ (x , z T ) 2Dψ (x , z0) + 2 3G2 + 2 1 + log 1 σ2 η2T (T + 1) (2T + 1) 2Dψ (x , z0) + 2 G2 + 1 + log 1 σ2 η2T (T + 1) (2T + 1) . We can further simplify the bound by lower bounding T (T + 1) T 2 and upper bounding T (T + 1) (2T + 1) 6T 3. We obtain ηT 2 (f (y T ) f (x )) + 2Dψ (x , z T ) 4Dψ (x , z0) + 24 G2 + 1 + log 1 Thus we obtain f (y T ) f (x ) 4Dψ (x , z0) ηT 2 + 24 G2 + 1 + log 1 Dψ (x , z T ) 2Dψ (x , z0) + 12 G2 + 1 + log 1 High Probability Convergence of Stochastic Gradient Methods Corollary B.3. Suppose we run the Accelerated Stochastic Mirror Descent algorithm with the standard choices αt = 2 t+1 and ηt = min n t 4L, η o . 
Let w T = 1 12σ2 PT i=1 η2 t and wt 1 = wt + 6σ2η2 t w2 t for all 1 t T. The sequence {wt}0 t T satisfies the conditions required by Corollary 3.9. By Corollary 3.9, with probability at least 1 δ, Dψ (x , z T ) 2Dψ (x , z0) + 12 G2 + 1 + log 1 δ σ2 η2(1 + log T) and f (y T ) f (x ) 16L T 2 Dψ (x , z0) + 2 T 1/2η 2Dψ (x , z0) + 12 G2 + 1 + log 1 σ2 η2(1 + log T) . In particular, setting ηt = min G2+σ2(1+log( 1 , we obtain the second case of Theorem 3.6. Proof of Corollary B.3. Recall from Corollary 3.9 that the sequence {wt} needs to satisfy the following conditions: wt + 6σ2η2 t w2 t wt 1, 1 t T, (20) wtη2 t 1 Lαtηt 1 4σ2 , 0 t T. (21) We will set {wt} so that it satisfies the following additional condition, which will allow us to telescope the sum on the RHS of Corollary 3.9: wt 1 ηt 1 αt 1 wt ηt (1 αt) αt , 1 t T 1. (22) Given w T , we set wt 1 for every 1 t T so that the first condition (20) holds with equality: wt 1 = wt + 6σ2η2 t w2 t = wt + 6σ2η2t2w2 t . Let C = 6σ2 PT i=1 η2 t . We set 12σ2 PT i=1 η2 t = 1 Given this choice for w T , we now verify that, for all 0 t T, we have wt 1 C + 6σ2 Pt i=1 η2 i . We proceed by induction on t. The base case t = T follows from the definition of w T . Consider t T. Using the definition of wt 1 and the inductive hypothesis, we obtain wt 1 = wt + 6σ2η2 t w2 t 1 C + 6σ2 Pt i=1 η2 i + 6σ2η2 t C + 6σ2 Pt i=1 η2 i 2 1 C + 6σ2 Pt i=1 η2 i + C + 6σ2 Pt i=1 η2 i C + 6σ2 Pt 1 i=1 η2 i C + 6σ2 Pt i=1 η2 i C + 6σ2 Pt 1 i=1 η2 i C + 6σ2 Pt 1 i=1 η2 i Let us now verify that the second condition (21) also holds. Using that Lηt t 4, and T 2, we obtain wtη2 t 1 Lαtηt wtη2 t 1 t 4 2 t+1 2wtη2 t 2η2 t 6σ2 PT i=1 η2 t + 6σ2 Pt i=1 η2 i 2η2 t 12σ2η2 t 1 4σ2 High Probability Convergence of Stochastic Gradient Methods Let us now verify that the third condition (22) also holds. Since αt = 2 t+1, we have ηt 1 αt 1 = ηt 1t αt = ηt (t 1) If ηt 1 = t 1 4L then we have ηt t 4L and ηt(1 αt) αt ηt 1 αt 1 = t(t 1) 8L . If ηt 1 = η t 1 then ηt = η t,we also have αt 1 . Since wt wt 1, it follows that condition (22) holds. We now turn our attention to the convergence. By Corollary 3.9, with probability 1 δ, we have αt (f (yt) f (x )) ηt (1 αt) αt (f (yt 1) f (x )) + w T Dψ (x , z T ) w0Dψ (x , z0) + G2 + 3σ2 T X t=1 wt η2 t 1 Lαtηt + log 1 Grouping terms on the LHS and using that α1 = 1, we obtain wt ηt αt wt+1 ηt+1 (1 αt+1) (f (yt) f (x )) + w T ηT αT (f (y T ) f (x )) + w T Dψ (x , z T ) w0Dψ (x , z0) + G2 + 3σ2 T X t=1 wt η2 t 1 Lαtηt + log 1 Since {wt} satisfies condition (22), the coefficient of f (yt) f (x ) is non-negative and thus we can drop the above sum. 
We now turn our attention to the convergence. By Corollary 3.9, the same derivation as in the previous proof (grouping terms on the LHS, using $\alpha_1=1$, and dropping the non-negative sum thanks to condition (22)) yields, with probability at least $1-\delta$,
$$\frac{w_T\eta_T}{\alpha_T}\left(f(y_T)-f(x^*)\right)+w_TD_\psi(x^*,z_T)\le w_0D_\psi(x^*,z_0)+\left(G^2+3\sigma^2\right)\sum_{t=1}^T\frac{w_t\eta_t^2}{1-L\alpha_t\eta_t}+\log\frac1\delta.$$
Using that $w_T=\frac1{2C}$ and $w_t\le\frac1C$ for all $0\le t\le T-1$, we obtain
$$\frac1{2C}\cdot\frac{\eta_T}{\alpha_T}\left(f(y_T)-f(x^*)\right)+\frac1{2C}D_\psi(x^*,z_T)\le\frac1CD_\psi(x^*,z_0)+\frac1C\left(G^2+3\sigma^2\right)\sum_{t=1}^T\frac{\eta_t^2}{1-L\alpha_t\eta_t}+\log\frac1\delta.$$
Thus
$$\frac{\eta_T}{\alpha_T}\left(f(y_T)-f(x^*)\right)+D_\psi(x^*,z_T)\le2D_\psi(x^*,z_0)+2\left(G^2+3\sigma^2\right)\sum_{t=1}^T\frac{\eta_t^2}{1-L\alpha_t\eta_t}+2C\log\frac1\delta.$$
Using that $L\eta_t\le\frac t4$, we obtain
$$\sum_{t=1}^T\frac{\eta_t^2}{1-L\alpha_t\eta_t}\le\sum_{t=1}^T2\eta_t^2=\frac C{3\sigma^2}.$$
Plugging in and using that $\alpha_T=\frac2{T+1}$, we obtain
$$\frac{\eta_T(T+1)}2\left(f(y_T)-f(x^*)\right)+D_\psi(x^*,z_T)\le2D_\psi(x^*,z_0)+\frac{2G^2+6\left(1+\log\frac1\delta\right)\sigma^2}{3\sigma^2}\,C.$$
We consider two cases depending on which term achieves the minimum in $\eta_T=\min\left\{\frac T{4L},\frac\eta{\sqrt T}\right\}$.

Case 1: $\frac T{4L}\le\frac\eta{\sqrt T}$, which means $T^{3/2}\le4L\eta$; then $\eta_T=\frac T{4L}$, and in fact $\eta_t=\frac t{4L}$ for all $1\le t\le T$. We have
$$C=6\sigma^2\sum_{t=1}^T\eta_t^2=\frac{6\sigma^2}{16L^2}\sum_{t=1}^Tt^2\le\frac{3\sigma^2T^3}{8L^2},$$
and $\frac{\eta_T(T+1)}2=\frac{T(T+1)}{8L}\ge\frac{T^2}{8L}$, hence
$$\frac{T^2}{8L}\left(f(y_T)-f(x^*)\right)+D_\psi(x^*,z_T)\le2D_\psi(x^*,z_0)+\frac{\left(2G^2+6\left(1+\log\frac1\delta\right)\sigma^2\right)T^3}{8L^2},$$
which entails, using $T^{3/2}\le4L\eta$,
$$f(y_T)-f(x^*)\le\frac{16L}{T^2}D_\psi(x^*,z_0)+\frac{\left(2G^2+6\left(1+\log\frac1\delta\right)\sigma^2\right)T}{L}\le\frac{16L}{T^2}D_\psi(x^*,z_0)+\frac{24\left(G^2+\left(1+\log\frac1\delta\right)\sigma^2\right)\eta}{\sqrt T},$$
$$D_\psi(x^*,z_T)\le2D_\psi(x^*,z_0)+\frac{\left(2G^2+6\left(1+\log\frac1\delta\right)\sigma^2\right)T^3}{8L^2}\le2D_\psi(x^*,z_0)+12\left(G^2+\left(1+\log\frac1\delta\right)\sigma^2\right)\eta^2.$$
Case 2: $\frac\eta{\sqrt T}\le\frac T{4L}$; then $\eta_T=\frac\eta{\sqrt T}$. Let $T_0$ be the largest $t$ such that $\frac\eta{\sqrt t}\ge\frac t{4L}$, so that $T_0^3\le16L^2\eta^2$. We have
$$C=6\sigma^2\sum_{t=1}^{T_0}\eta_t^2+6\sigma^2\sum_{t=T_0+1}^T\eta_t^2\le\frac{6\sigma^2}{16L^2}\sum_{t=1}^{T_0}t^2+6\sigma^2\eta^2\sum_{t=T_0+1}^T\frac1t\le\frac{6\sigma^2}{16L^2}T_0^3+6\sigma^2\eta^2\log T\le6\sigma^2\eta^2(1+\log T).$$
Hence, using $\frac{\eta_T(T+1)}2\ge\frac{\eta\sqrt T}2$,
$$\frac{\eta\sqrt T}2\left(f(y_T)-f(x^*)\right)+D_\psi(x^*,z_T)\le2D_\psi(x^*,z_0)+12\left(G^2+\left(1+\log\frac1\delta\right)\sigma^2\right)\eta^2(1+\log T),$$
which gives
$$f(y_T)-f(x^*)\le\frac2{T^{1/2}\eta}\left(2D_\psi(x^*,z_0)+12\left(G^2+\left(1+\log\frac1\delta\right)\sigma^2\right)\eta^2(1+\log T)\right),$$
$$D_\psi(x^*,z_T)\le2D_\psi(x^*,z_0)+12\left(G^2+\left(1+\log\frac1\delta\right)\sigma^2\right)\eta^2(1+\log T).$$
Overall, in both cases we have
$$f(y_T)-f(x^*)\le\frac{16L}{T^2}D_\psi(x^*,z_0)+\frac2{T^{1/2}\eta}\left(2D_\psi(x^*,z_0)+12\left(G^2+\left(1+\log\frac1\delta\right)\sigma^2\right)\eta^2(1+\log T)\right).$$

C. Missing Proofs from Section 4

C.1. Stochastic Gradient Descent

Proof of Lemma 4.2. We start from the smoothness of $f$:
$$f(x_{t+1})-f(x_t)\le\langle\nabla f(x_t),x_{t+1}-x_t\rangle+\frac L2\|x_{t+1}-x_t\|^2=-\eta_t\left\langle\nabla f(x_t),\widehat\nabla f(x_t)\right\rangle+\frac{L\eta_t^2}2\left\|\widehat\nabla f(x_t)\right\|^2.$$
By writing $\widehat\nabla f(x_t)=\xi_t+\nabla f(x_t)$ we have
$$f(x_{t+1})-f(x_t)\le-\eta_t\langle\nabla f(x_t),\xi_t+\nabla f(x_t)\rangle+\frac{L\eta_t^2}2\|\xi_t+\nabla f(x_t)\|^2=-\eta_t\|\nabla f(x_t)\|^2-\eta_t\langle\nabla f(x_t),\xi_t\rangle+\frac{L\eta_t^2}2\|\xi_t\|^2+\frac{L\eta_t^2}2\|\nabla f(x_t)\|^2+L\eta_t^2\langle\nabla f(x_t),\xi_t\rangle.$$
Rearranging the terms, we obtain inequality (7):
$$\eta_t\left(1-\frac{\eta_tL}2\right)\|\nabla f(x_t)\|^2+f(x_{t+1})-f(x_t)\le\eta_t(\eta_tL-1)\langle\nabla f(x_t),\xi_t\rangle+\frac{\eta_t^2L}2\|\xi_t\|^2.\qquad(7)$$

Proof of Theorem 4.3. We prove the claim by backward induction. The base case $t=T+1$ trivially holds. Consider $1\le t\le T$; we have
$$\mathbb E[\exp(S_t)\mid\mathcal F_t]=\mathbb E\big[\mathbb E[\exp(Z_t+S_{t+1})\mid\mathcal F_{t+1}]\mid\mathcal F_t\big]=\mathbb E\big[\exp(Z_t)\,\mathbb E[\exp(S_{t+1})\mid\mathcal F_{t+1}]\mid\mathcal F_t\big].$$
From the induction hypothesis we have $\mathbb E[\exp(S_{t+1})\mid\mathcal F_{t+1}]\le\exp\left(3\sigma^2\sum_{i=t+1}^T\frac{w_i\eta_i^2L}2\right)$, hence
$$\mathbb E[\exp(S_t)\mid\mathcal F_t]\le\exp\left(3\sigma^2\sum_{i=t+1}^T\frac{w_i\eta_i^2L}2\right)\mathbb E[\exp(Z_t)\mid\mathcal F_t].$$
We then have (here $\Delta_t:=f(x_t)-f^*$ and $v_t:=3\sigma^2w_t^2\eta_t^2(1-\eta_tL)^2$, following the notation of Section 4)
$$\mathbb E[\exp(Z_t)\mid\mathcal F_t]=\mathbb E\left[\exp\left(w_t\left(\eta_t\left(1-\frac{\eta_tL}2\right)\|\nabla f(x_t)\|^2+\Delta_{t+1}-\Delta_t\right)-v_t\|\nabla f(x_t)\|^2\right)\Big|\,\mathcal F_t\right]\le\mathbb E\left[\exp\left(w_t\left(\eta_t(\eta_tL-1)\langle\nabla f(x_t),\xi_t\rangle+\frac{\eta_t^2L}2\|\xi_t\|^2\right)-v_t\|\nabla f(x_t)\|^2\right)\Big|\,\mathcal F_t\right]$$
$$=\exp\left(-v_t\|\nabla f(x_t)\|^2\right)\mathbb E\left[\exp\left(w_t\eta_t(\eta_tL-1)\langle\nabla f(x_t),\xi_t\rangle+\frac{w_t\eta_t^2L}2\|\xi_t\|^2\right)\Big|\,\mathcal F_t\right]\le\exp\left(-v_t\|\nabla f(x_t)\|^2\right)\exp\left(3\sigma^2\left(w_t^2\eta_t^2(\eta_tL-1)^2\|\nabla f(x_t)\|^2+\frac{w_t\eta_t^2L}2\right)\right)=\exp\left(\frac{3\sigma^2w_t\eta_t^2L}2\right),$$
where the first inequality is due to (7) in Lemma 4.2 and the last inequality is due to Lemma 2.2. Therefore
$$\mathbb E[\exp(S_t)\mid\mathcal F_t]\le\exp\left(3\sigma^2\sum_{i=t}^T\frac{w_i\eta_i^2L}2\right),$$
which is what we needed to show.

Proof of Corollary 4.4. In Lemma 4.3, let $t=1$; we obtain
$$\mathbb E[\exp(S_1)]\le\exp\left(3\sigma^2\sum_{t=1}^T\frac{w_t\eta_t^2L}2\right),$$
hence by Markov's inequality we have
$$\Pr\left[S_1\ge3\sigma^2\sum_{t=1}^T\frac{w_t\eta_t^2L}2+\log\frac1\delta\right]\le\delta.$$
In other words, with probability at least $1-\delta$ (once the condition in Lemma 4.3 is satisfied),
$$\sum_{t=1}^T\left(\left(w_t\eta_t\left(1-\frac{\eta_tL}2\right)-3\sigma^2w_t^2\eta_t^2(1-\eta_tL)^2\right)\|\nabla f(x_t)\|^2+w_t\left(\Delta_{t+1}-\Delta_t\right)\right)\le3\sigma^2\sum_{t=1}^T\frac{w_t\eta_t^2L}2+\log\frac1\delta,$$
and therefore, rearranging the telescoping terms,
$$\sum_{t=1}^T\left(w_t\eta_t\left(1-\frac{\eta_tL}2\right)-3\sigma^2w_t^2\eta_t^2(1-\eta_tL)^2\right)\|\nabla f(x_t)\|^2+w_T\Delta_{T+1}\le w_1\Delta_1+\sum_{t=2}^T(w_t-w_{t-1})\Delta_t+3\sigma^2\sum_{t=1}^T\frac{w_t\eta_t^2L}2+\log\frac1\delta,\qquad(9)$$
which is inequality (9).
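Inequality (7) is a deterministic consequence of smoothness, so it can be checked numerically along an SGD trajectory. The sketch below uses a quadratic objective with Gaussian noise; all problem data are illustrative assumptions and not taken from the paper:

```python
import numpy as np

# Numeric check (illustrative only) of the rearranged smoothness inequality (7)
# on f(x) = 0.5 * x^T A x with L = lambda_max(A):
#   eta*(1 - eta*L/2)*||grad f(x)||^2 + f(x+) - f(x)
#     <= eta*(eta*L - 1)*<grad f(x), xi> + (eta^2 * L / 2)*||xi||^2,
# where x+ = x - eta*(grad f(x) + xi) is one SGD step with noise xi.
rng = np.random.default_rng(0)
d = 10
M = rng.standard_normal((d, d))
A = M @ M.T / d                      # symmetric PSD Hessian
L = np.linalg.eigvalsh(A).max()
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

eta, sigma = 0.5 / L, 0.3
x = rng.standard_normal(d)
for _ in range(1000):
    g, xi = grad(x), sigma * rng.standard_normal(d)
    x_next = x - eta * (g + xi)
    lhs = eta * (1 - eta * L / 2) * (g @ g) + f(x_next) - f(x)
    rhs = eta * (eta * L - 1) * (g @ xi) + 0.5 * eta**2 * L * (xi @ xi)
    assert lhs <= rhs + 1e-9, "smoothness-based inequality violated"
    x = x_next
print("inequality (7) holds along the trajectory")
```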
Proof of Theorem 4.1. First case. Starting from inequality (9), we will specify the choice of $\eta_t$ and $w_t$ to obtain the bound. Consider $\eta_t=\eta$ with $\eta L\le1$ and $w_t=w=\frac1{6\sigma^2\eta}$. Note that $w_t\eta_t^2L=\frac{\eta L}{6\sigma^2}\le\frac1{2\sigma^2}$ satisfies the condition of Lemma 4.3. We have
$$\text{LHS of (9)}=w\Delta_{T+1}+\sum_{t=1}^T\left(w\eta\left(1-\frac{\eta L}2\right)-3\sigma^2w^2\eta^2(\eta L-1)^2\right)\|\nabla f(x_t)\|^2=w\Delta_{T+1}+\sum_{t=1}^Tw\eta\left(1-\frac{\eta L}2-\frac{(\eta L-1)^2}2\right)\|\nabla f(x_t)\|^2\ge w\Delta_{T+1}+\frac{w\eta}2\sum_{t=1}^T\|\nabla f(x_t)\|^2,$$
where the last inequality is due to $1-\frac{\eta L}2-\frac{(1-\eta L)^2}2\ge\frac12$ when $0\le\eta L\le1$. Besides,
$$\text{RHS of (9)}=w\Delta_1+\frac{3\sigma^2}2w\eta^2LT+\log\frac1\delta.$$
Hence with probability at least $1-\delta$,
$$\sum_{t=1}^T\|\nabla f(x_t)\|^2+\frac{2\Delta_{T+1}}\eta\le\frac{2\Delta_1}\eta+3\sigma^2\eta LT+12\sigma^2\log\frac1\delta.$$
Finally, by choosing $\eta=\min\left\{\frac1L,\cdot\right\}$ as in the statement of Theorem 4.1 (the second argument is the square-root expression given there, which balances $\frac{2\Delta_1}\eta$ against $3\sigma^2\eta LT$) and noticing $\Delta_{T+1}\ge0$, we obtain the desired inequality.

Second case. Consider $\eta_t=\frac\eta{\sqrt t}$ with $\eta L\le1$ and $w_t=w=\frac1{6\sigma^2\eta}$. Again, we have $w_t\eta_t^2L=\frac{\eta L}{6\sigma^2t}\le\frac1{2\sigma^2}$, so the condition of Lemma 4.3 is satisfied. Then
$$\text{LHS of (9)}=w\Delta_{T+1}+\sum_{t=1}^T\left(w\eta_t\left(1-\frac{\eta_tL}2\right)-3\sigma^2w^2\eta_t^2(1-\eta_tL)^2\right)\|\nabla f(x_t)\|^2=w\Delta_{T+1}+\sum_{t=1}^Tw\eta_t\left(1-\frac{\eta_tL}2-\frac{(1-\eta_tL)^2}{2\sqrt t}\right)\|\nabla f(x_t)\|^2\ge w\Delta_{T+1}+\frac w2\sum_{t=1}^T\eta_t\|\nabla f(x_t)\|^2\ge w\Delta_{T+1}+\frac{w\eta}{2\sqrt T}\sum_{t=1}^T\|\nabla f(x_t)\|^2,$$
where the first inequality uses $(1-\eta_tL)^2\le1-\eta_tL$ and $\frac1{2\sqrt t}\le\frac12$ when $0\le\eta_tL\le1$, and the second uses $\eta_t\ge\frac\eta{\sqrt T}$. Besides,
$$\text{RHS of (9)}=w\Delta_1+\frac{3\sigma^2}2w\eta^2L(1+\log T)+\log\frac1\delta.$$
Therefore, with probability at least $1-\delta$,
$$\frac1{\sqrt T}\sum_{t=1}^T\|\nabla f(x_t)\|^2+\frac{2\Delta_{T+1}}\eta\le\frac{2\Delta_1}\eta+3\sigma^2\eta L(1+\log T)+12\sigma^2\log\frac1\delta.$$
Choosing $\eta=\frac1L$ and noticing $\Delta_{T+1}\ge0$, we obtain
$$\frac1{\sqrt T}\sum_{t=1}^T\|\nabla f(x_t)\|^2\le2\Delta_1L+3\sigma^2(1+\log T)+12\sigma^2\log\frac1\delta.$$

C.2. AdaGrad-Norm

Proof of Lemma 4.6. Starting from the smoothness of $f$,
$$f(x_{t+1})-f(x_t)\le\langle\nabla f(x_t),x_{t+1}-x_t\rangle+\frac L2\|x_{t+1}-x_t\|^2=-\frac\eta{b_t}\left\langle\nabla f(x_t),\widehat\nabla f(x_t)\right\rangle+\frac{L\eta^2}{2b_t^2}\left\|\widehat\nabla f(x_t)\right\|^2=-\frac\eta{b_t}\|\nabla f(x_t)\|^2-\frac\eta{b_t}\langle\nabla f(x_t),\xi_t\rangle+\frac{L\eta^2}{2b_t^2}\left\|\widehat\nabla f(x_t)\right\|^2.\qquad(23)$$
Multiplying both sides by $\frac{b_t}{\eta(2b_t-b_0)}$ and rearranging, we obtain
$$\frac{\|\nabla f(x_t)\|^2}{2b_t-b_0}\le-\frac{\langle\nabla f(x_t),\xi_t\rangle}{2b_t-b_0}+\frac{b_t(\Delta_t-\Delta_{t+1})}{\eta(2b_t-b_0)}+\frac{\eta L}{2b_t(2b_t-b_0)}\left\|\widehat\nabla f(x_t)\right\|^2=\left(\frac1{2b_{t-1}-b_0}-\frac1{2b_t-b_0}\right)\langle\nabla f(x_t),\xi_t\rangle-\frac{\langle\nabla f(x_t),\xi_t\rangle}{2b_{t-1}-b_0}+\frac{b_t(\Delta_t-\Delta_{t+1})}{\eta(2b_t-b_0)}+\frac{\eta L}{2b_t(2b_t-b_0)}\left\|\widehat\nabla f(x_t)\right\|^2.\qquad(24)$$
Note that by the smoothness of $f$ we also have $\|\nabla f(x_t)\|^2\le2L\Delta_t$. Combining with the Cauchy–Schwarz and Young inequalities, and using $\frac b{2b-b_0}=\frac12+\frac{b_0/2}{2b-b_0}$, we have
$$\left(\frac1{2b_{t-1}-b_0}-\frac1{2b_t-b_0}\right)\langle\nabla f(x_t),\xi_t\rangle\le\left(\frac1{2b_{t-1}-b_0}-\frac1{2b_t-b_0}\right)\left(\frac{b_0}{4\eta L}\|\nabla f(x_t)\|^2+\frac{\eta L}{b_0}\|\xi_t\|^2\right)\le\left(\frac{b_{t-1}}{2b_{t-1}-b_0}-\frac{b_t}{2b_t-b_0}\right)\frac{\Delta_t}\eta+\frac{\eta L}{b_0}\left(\frac1{2b_{t-1}-b_0}-\frac1{2b_t-b_0}\right)\|\xi_t\|^2.$$
Plugging into (24) we obtain
$$\frac{\|\nabla f(x_t)\|^2}{2b_t-b_0}\le\frac{\eta L}{b_0}\left(\frac1{2b_{t-1}-b_0}-\frac1{2b_t-b_0}\right)\|\xi_t\|^2-\frac{\langle\nabla f(x_t),\xi_t\rangle}{2b_{t-1}-b_0}+\frac{b_{t-1}\Delta_t}{\eta(2b_{t-1}-b_0)}-\frac{b_t\Delta_{t+1}}{\eta(2b_t-b_0)}+\frac{\eta L}{2b_t(2b_t-b_0)}\left\|\widehat\nabla f(x_t)\right\|^2.$$
Summing up from $t=1$ to $T$ and telescoping, we obtain
$$\sum_{t=1}^T\frac{\|\nabla f(x_t)\|^2}{2b_t-b_0}\le\frac{\eta L}{b_0}\sum_{t=1}^T\left(\frac1{2b_{t-1}-b_0}-\frac1{2b_t-b_0}\right)\|\xi_t\|^2-\sum_{t=1}^T\frac{\langle\nabla f(x_t),\xi_t\rangle}{2b_{t-1}-b_0}+\frac{\Delta_1}\eta-\frac{b_T\Delta_{T+1}}{\eta(2b_T-b_0)}+\frac{\eta L}2\log\frac{b_T^2}{b_0^2},$$
where in the last term we used $2b_t-b_0\ge b_t$ and $\sum_{t=1}^T\frac{\|\widehat\nabla f(x_t)\|^2}{b_t^2}\le\log\frac{b_T^2}{b_0^2}$.

Proof of Lemma 4.7. For $1\le t\le T$, the noise assumption gives $\mathbb E\left[\exp\left(\lambda^2\|\xi_t\|^2\right)\mid\mathcal F_t\right]\le\exp\left(\lambda^2\sigma^2\right)$ for $|\lambda|\le\frac1\sigma$. Thus $\left\langle\frac{\nabla f(x_t)}{\|\nabla f(x_t)\|},\xi_t\right\rangle$ is a centered $\sigma$-sub-Gaussian random variable given $\mathcal F_t$, and we can apply Lemma 2.2 with $a=\frac{w\|\nabla f(x_t)\|}{2b_{t-1}-b_0}$ and $b=0$, for some constant $w>0$, to get
$$\mathbb E\left[\exp\left(-\frac{w\langle\nabla f(x_t),\xi_t\rangle}{2b_{t-1}-b_0}\right)\Big|\,\mathcal F_t\right]\le\exp\left(\frac{2w^2\sigma^2\|\nabla f(x_t)\|^2}{(2b_{t-1}-b_0)^2}\right).$$
By a simple induction argument we obtain
$$\mathbb E\left[\exp\left(\sum_{t=1}^T\left(-\frac{w\langle\nabla f(x_t),\xi_t\rangle}{2b_{t-1}-b_0}-\frac{2w^2\sigma^2\|\nabla f(x_t)\|^2}{(2b_{t-1}-b_0)^2}\right)\right)\right]\le1.$$
Hence, by Markov's inequality,
$$\Pr\left[\sum_{t=1}^T\left(-\frac{w\langle\nabla f(x_t),\xi_t\rangle}{2b_{t-1}-b_0}-\frac{2w^2\sigma^2\|\nabla f(x_t)\|^2}{(2b_{t-1}-b_0)^2}\right)\ge\log\frac1\delta\right]\le\delta,$$
which implies that with probability at least $1-\delta$, we have
$$-\sum_{t=1}^T\frac{w\langle\nabla f(x_t),\xi_t\rangle}{2b_{t-1}-b_0}\le\sum_{t=1}^T\frac{2w^2\sigma^2\|\nabla f(x_t)\|^2}{(2b_{t-1}-b_0)^2}+\log\frac1\delta.$$
However, we now have a mismatch between the index of the numerator and the denominator of the first term on the RHS. To resolve this, observe that $\|\nabla f(x_t)\|^2\le2\|\nabla f(x_t)-\nabla f(x_{t-1})\|^2+2\|\nabla f(x_{t-1})\|^2$, and use the smoothness of $f$ for the first term, i.e., $\|\nabla f(x_t)-\nabla f(x_{t-1})\|\le L\|x_t-x_{t-1}\|=\frac{L\eta}{b_{t-1}}\left\|\widehat\nabla f(x_{t-1})\right\|$, to get
$$\sum_{t=1}^T\frac{2w^2\sigma^2\|\nabla f(x_t)\|^2}{(2b_{t-1}-b_0)^2}\le\sum_{t=2}^T\frac{4w^2\eta^2L^2\sigma^2\left\|\widehat\nabla f(x_{t-1})\right\|^2}{b_{t-1}^2(2b_{t-1}-b_0)^2}+\sum_{t=2}^T\frac{4w^2\sigma^2\|\nabla f(x_{t-1})\|^2}{(2b_{t-1}-b_0)^2}+\frac{2w^2\sigma^2\|\nabla f(x_1)\|^2}{b_0^2}.$$
Finally, since $2b_{t-1}-b_0\ge b_0$ and $\sum_{t=2}^T\frac{\|\widehat\nabla f(x_{t-1})\|^2}{b_{t-1}^2}\le\log\frac{b_T^2}{b_0^2}$, the proof is completed.

Lemma C.1. With probability at least $1-\delta$,
$$\sum_{t=1}^T\|\xi_t\|^2\le\sigma^2T+\sigma^2\log\frac1\delta.$$
Proof of Lemma C.1. It is not hard to verify from the noise assumption that $\mathbb E\left[\exp\left(\frac1{\sigma^2}\sum_{t=1}^T\|\xi_t\|^2\right)\right]\le\exp(T)$. Thus by Markov's inequality,
$$\Pr\left[\sum_{t=1}^T\|\xi_t\|^2\ge\sigma^2T+\sigma^2\log\frac1\delta\right]\le\mathbb E\left[\exp\left(\frac1{\sigma^2}\sum_{t=1}^T\|\xi_t\|^2\right)\right]\exp\left(-T-\log\frac1\delta\right)\le\delta.$$
Therefore with probability at least $1-\delta$,
$$\sum_{t=1}^T\|\xi_t\|^2\le\sigma^2T+\sigma^2\log\frac1\delta.$$
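For reference, the update analyzed in this subsection can be sketched as follows, assuming the standard AdaGrad-Norm recursion of Ward et al. (2019), $b_t^2=b_{t-1}^2+\|\widehat\nabla f(x_t)\|^2$ and $x_{t+1}=x_t-\frac\eta{b_t}\widehat\nabla f(x_t)$; the paper's exact definitions are given in Section 4, and the problem data below are illustrative only:

```python
import numpy as np

# A minimal AdaGrad-Norm loop (a sketch under the standard Ward et al. (2019)
# recursion b_t^2 = b_{t-1}^2 + ||g_hat_t||^2, x_{t+1} = x_t - (eta/b_t) g_hat_t).
rng = np.random.default_rng(1)
d, T = 20, 5000
A = np.diag(np.linspace(0.1, 5.0, d))          # smooth quadratic objective
grad = lambda x: A @ x
sigma, eta, b0 = 0.5, 1.0, 1.0

x, b = rng.standard_normal(d), b0
grad_norms = []
for t in range(T):
    g_hat = grad(x) + sigma * rng.standard_normal(d)   # stochastic gradient
    b = np.sqrt(b**2 + g_hat @ g_hat)                   # b_t update
    x = x - (eta / b) * g_hat                           # AdaGrad-Norm step
    grad_norms.append(np.linalg.norm(grad(x)) ** 2)

# The analysis above controls sum_t ||grad f(x_t)||^2; report the average as a sanity check.
print("average squared gradient norm:", np.mean(grad_norms))
```

Note that the step size requires no knowledge of $L$ or $\sigma$, which is exactly why the analysis above cannot rely on a bounded-gradient assumption.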
Proof of Theorem 4.5. From Lemmas 4.6 and 4.7 (dividing the bound of Lemma 4.7 by $w$, and bounding $\frac{\eta L}{b_0}\sum_{t=1}^T\left(\frac1{2b_{t-1}-b_0}-\frac1{2b_t-b_0}\right)\|\xi_t\|^2\le\frac{\eta LM_T}{b_0^2}$ with $M_T:=\max_{1\le t\le T}\|\xi_t\|^2$), we have with probability at least $1-\delta$,
$$\sum_{t=1}^T\frac{\|\nabla f(x_t)\|^2}{2b_t-b_0}\le\frac{\eta LM_T}{b_0^2}+\frac{\Delta_1}\eta+\frac{2w\sigma^2\|\nabla f(x_1)\|^2}{b_0^2}+\frac{4w\eta^2L^2\sigma^2}{b_0^2}\log\frac{b_T^2}{b_0^2}+\sum_{t=2}^T\frac{4w\sigma^2\|\nabla f(x_{t-1})\|^2}{(2b_{t-1}-b_0)^2}+\frac1w\log\frac1\delta+\frac{\eta L}2\log\frac{b_T^2}{b_0^2}.$$
Here we choose $w=\frac{b_0}{8\sigma^2}\min\left\{1,\frac{b_0}{\eta L}\right\}$ and simplify the result to get
$$\sum_{t=1}^T\frac{\|\nabla f(x_t)\|^2}{2b_t-b_0}-\sum_{t=2}^T\frac{4w\sigma^2\|\nabla f(x_{t-1})\|^2}{(2b_{t-1}-b_0)^2}\le\frac{\eta LM_T}{b_0^2}+\frac{\Delta_1}\eta+\frac{\|\nabla f(x_1)\|^2}{4b_0}+\eta L\log\frac{b_T^2}{b_0^2}+\frac{8\sigma^2}{b_0}\max\left\{1,\frac{\eta L}{b_0}\right\}\log\frac1\delta.$$
Note that by the choice of $w$, in the LHS of the above,
$$\frac{4w\sigma^2}{(2b_{t-1}-b_0)^2}\le\frac{b_0}{2(2b_{t-1}-b_0)^2}\le\frac1{2(2b_{t-1}-b_0)}.$$
Hence, we have
$$\sum_{t=1}^T\frac{\|\nabla f(x_t)\|^2}{2b_t-b_0}-\sum_{t=2}^T\frac{4w\sigma^2\|\nabla f(x_{t-1})\|^2}{(2b_{t-1}-b_0)^2}\ge\frac12\sum_{t=1}^T\frac{\|\nabla f(x_t)\|^2}{2b_t-b_0}\ge\frac1{4b_T}\sum_{t=1}^T\|\nabla f(x_t)\|^2,$$
which implies
$$\frac1{4b_T}\sum_{t=1}^T\|\nabla f(x_t)\|^2\le\frac{\eta LM_T}{b_0^2}+\frac{\Delta_1}\eta+\frac{\|\nabla f(x_1)\|^2}{4b_0}+\eta L\log\frac{b_T^2}{b_0^2}+\frac{8\sigma^2}{b_0}\max\left\{1,\frac{\eta L}{b_0}\right\}\log\frac1\delta.$$
It is known that with probability at least $1-\delta$, $M_T\le\sigma^2\left(1+\log\frac T\delta\right)$ (Li & Orabona, 2020; Liu et al., 2022). By the union bound, we have with probability at least $1-2\delta$,
$$\sum_{t=1}^T\|\nabla f(x_t)\|^2\le4b_T\Bigg(\eta L\log\frac{b_T^2}{b_0^2}+\underbrace{\frac{\Delta_1}\eta+\frac{\|\nabla f(x_1)\|^2}{4b_0}+\frac{\eta L\sigma^2\left(1+\log\frac T\delta\right)}{b_0^2}+\frac{8\sigma^2}{b_0}\max\left\{1,\frac{\eta L}{b_0}\right\}\log\frac1\delta}_{g(\delta)=O\left(1+\sigma^2\log\frac T\delta\right)}\Bigg),$$
and moreover
$$b_T=\sqrt{b_0^2+\sum_{t=1}^T\left\|\widehat\nabla f(x_t)\right\|^2}\le\sqrt{b_0^2+2\sum_{t=1}^T\|\xi_t\|^2+2\sum_{t=1}^T\|\nabla f(x_t)\|^2}.$$
Now we consider the following two cases.

Case 1: $\sum_{t=1}^T2\|\nabla f(x_t)\|^2\le b_0^2+2\sum_{t=1}^T\|\xi_t\|^2$. Then we have
$$\sum_{t=1}^T\|\nabla f(x_t)\|^2\le4\sqrt{2b_0^2+4\sum_{t=1}^T\|\xi_t\|^2}\left(\eta L\log\frac{2b_0^2+4\sum_{t=1}^T\|\xi_t\|^2}{b_0^2}+g(\delta)\right).$$
Case 2: $\sum_{t=1}^T2\|\nabla f(x_t)\|^2>b_0^2+2\sum_{t=1}^T\|\xi_t\|^2$. Then $b_T\le2\sqrt{\sum_{t=1}^T\|\nabla f(x_t)\|^2}$ and we have
$$\sum_{t=1}^T\|\nabla f(x_t)\|^2\le8\sqrt{\sum_{t=1}^T\|\nabla f(x_t)\|^2}\left(\eta L\log\frac{4\sum_{t=1}^T\|\nabla f(x_t)\|^2}{b_0^2}+g(\delta)\right),$$
which gives
$$\sqrt{\sum_{t=1}^T\|\nabla f(x_t)\|^2}\le16\eta L\log\frac{2\sqrt{\sum_{t=1}^T\|\nabla f(x_t)\|^2}}{b_0}+8g(\delta)\le\frac12\sqrt{\sum_{t=1}^T\|\nabla f(x_t)\|^2}+16\eta L\log\frac{64\eta L}{b_0}+8g(\delta),$$
where we used $\log\frac{2\sqrt S}{b_0}=\log\frac{\sqrt S}{32\eta L}+\log\frac{64\eta L}{b_0}$ and $\log x\le x$ with $S=\sum_{t=1}^T\|\nabla f(x_t)\|^2$. Hence
$$\sum_{t=1}^T\|\nabla f(x_t)\|^2\le\left(32\eta L\log\frac{64\eta L}{b_0}+16g(\delta)\right)^2.$$
Combining the two cases, we have
$$\sum_{t=1}^T\|\nabla f(x_t)\|^2\le4\sqrt{2b_0^2+4\sum_{t=1}^T\|\xi_t\|^2}\left(\eta L\log\frac{2b_0^2+4\sum_{t=1}^T\|\xi_t\|^2}{b_0^2}+g(\delta)\right)+\left(32\eta L\log\frac{64\eta L}{b_0}+16g(\delta)\right)^2.$$
The final step is to use Lemma C.1 and the union bound to get, with probability at least $1-3\delta$,
$$\sum_{t=1}^T\|\nabla f(x_t)\|^2\le4\sqrt{2b_0^2+4\sigma^2T+4\sigma^2\log\frac1\delta}\left(\eta L\log\frac{2b_0^2+4\sigma^2T+4\sigma^2\log\frac1\delta}{b_0^2}+g(\delta)\right)+\left(32\eta L\log\frac{64\eta L}{b_0}+16g(\delta)\right)^2.$$
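As a final illustration, the tail bound of Lemma C.1 can be checked by simulation. The sketch below assumes Gaussian noise scaled so that $\mathbb E[\exp(\|\xi_t\|^2/\sigma^2)]\le e$, which is one standard form of the sub-Gaussian assumption used in the proof above; all constants are illustrative:

```python
import numpy as np

# Monte Carlo check (illustrative) of the Lemma C.1 tail bound
#   P( sum_t ||xi_t||^2 > sigma^2 T + sigma^2 log(1/delta) ) <= delta,
# for Gaussian noise scaled so that E[exp(||xi_t||^2 / sigma^2)] <= e.
rng = np.random.default_rng(2)
sigma, d, T, delta = 1.0, 5, 100, 0.05
# For xi ~ N(0, s^2 I_d): E[exp(||xi||^2/sigma^2)] = (1 - 2 s^2/sigma^2)^(-d/2);
# requiring this to be <= e gives s^2 <= sigma^2 * (1 - exp(-2/d)) / 2.
s = np.sqrt(sigma**2 * (1 - np.exp(-2.0 / d)) / 2)

threshold = sigma**2 * T + sigma**2 * np.log(1 / delta)
trials = 20000
# sum_t ||xi_t||^2 is s^2 times a chi-square with T*d degrees of freedom
totals = s**2 * rng.chisquare(df=T * d, size=trials)
print("empirical failure probability:", np.mean(totals > threshold), "<= delta =", delta)
```

The empirical failure frequency should come out well below $\delta$, reflecting the slack in the Markov-inequality argument.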