# Convergence of a Stochastic Gradient Method with Momentum for Non-Smooth Non-Convex Optimization

Vien V. Mai¹  Mikael Johansson¹

Stochastic gradient methods with momentum are widely used in applications and at the core of optimization subroutines in many popular machine learning libraries. However, their sample complexities have not been obtained for problems beyond those that are convex or smooth. This paper establishes the convergence rate of a stochastic subgradient method with a momentum term of Polyak type for a broad class of non-smooth, non-convex, and constrained optimization problems. Our key innovation is the construction of a special Lyapunov function for which the proven complexity can be achieved without any tuning of the momentum parameter. For smooth problems, we extend the known complexity bound to the constrained case and demonstrate how the unconstrained case can be analyzed under weaker assumptions than the state-of-the-art. Numerical results confirm our theoretical developments.

## 1. Introduction

We study the stochastic optimization problem

$$\underset{x \in X}{\text{minimize}} \quad f(x) := \mathbb{E}_P[f(x; S)] = \int_{\mathcal{S}} f(x; s)\, dP(s), \tag{1}$$

where $S \sim P$ is a random variable; $f(x; s)$ is the instantaneous loss parameterized by $x$ on a sample $s \in \mathcal{S}$; and $X \subseteq \mathbb{R}^n$ is a closed convex set. In this paper, we move beyond convex and/or smooth optimization and consider $f$ that belongs to a broad class of non-smooth and non-convex functions called $\rho$-weakly convex, meaning that $x \mapsto f(x) + \frac{\rho}{2}\|x\|_2^2$ is convex.

*¹Division of Decision and Control Systems, EECS, KTH Royal Institute of Technology, Stockholm, Sweden. Correspondence to: V. V. Mai, M. Johansson. Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).*

This function class is very rich and important in optimization (Rockafellar, 1982; Vial, 1983). It trivially includes all convex functions, all smooth functions with Lipschitz continuous gradient, and all additive composites of functions from these two classes. More broadly, it includes all compositions of the form

$$f(x) = h(c(x)), \tag{2}$$

where $h : \mathbb{R}^m \to \mathbb{R}$ is convex and $L_h$-Lipschitz and $c : \mathbb{R}^n \to \mathbb{R}^m$ is a smooth map with $L_c$-Lipschitz Jacobian. Indeed, the composite function $f = h \circ c$ is then weakly convex with $\rho = L_h L_c$ (Drusvyatskiy & Paquette, 2019); a small code illustration of this composite structure is given at the end of this opening discussion. Representative applications in this problem class include nonlinear least squares (Drusvyatskiy, 2017), robust phase retrieval (Duchi & Ruan, 2018a), robust PCA (Candès et al., 2011), robust low-rank matrix recovery (Charisopoulos et al., 2019), optimization of the Conditional Value-at-Risk (Rockafellar & Uryasev, 2000), graph synchronization (Singer, 2011), and many others.

Stochastic optimization algorithms for solving (1), based on random samples $S_k$ drawn from $P$, are of fundamental importance in many applied sciences (Bottou, 2010; Shapiro et al., 2014). Since the introduction of the classical stochastic (sub)gradient descent method (SGD) in (Robbins & Monro, 1951), several modifications of SGD have been proposed to improve its practical and theoretical performance. A notable example is the use of a momentum term to construct the update direction (Gupal & Bazhenov, 1972; Polyak, 1987; Ruszczyński & Syski, 1983; Tseng, 1998; Sutskever et al., 2013; Ghadimi & Lan, 2016).
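To make the composite class (2) concrete, the sketch below (ours, for illustration only; all names are placeholders) evaluates the robust phase retrieval loss revisited in Section 5, $f(x) = \frac{1}{m}\|c(x)\|_1$ with $c_i(x) = \langle a_i, x\rangle^2 - b_i$, together with a chain-rule subgradient $\nabla c(x)^\top g$ for some $g \in \partial h(c(x))$:

```python
import numpy as np

def phase_retrieval_loss(x, A, b):
    """Composite loss f(x) = h(c(x)) with h = (1/m)*||.||_1 and c_i(x) = <a_i,x>^2 - b_i.

    h is (1/m)-Lipschitz and convex; c has a Lipschitz Jacobian on bounded sets,
    so f is weakly convex (cf. Drusvyatskiy & Paquette, 2019)."""
    residuals = (A @ x) ** 2 - b          # c(x)
    return np.mean(np.abs(residuals))     # h(c(x))

def phase_retrieval_subgrad(x, A, b):
    """A subgradient: Jac(c)(x)^T g with g in the subdifferential of h at c(x)."""
    inner = A @ x
    g = np.sign(inner ** 2 - b) / len(b)  # subgradient of h(y) = mean |y_i|
    return A.T @ (2.0 * inner * g)        # chain rule through c_i(x) = <a_i,x>^2 - b_i

# Tiny usage example with random data (illustrative only).
rng = np.random.default_rng(0)
A = rng.standard_normal((300, 100))
x_true = rng.standard_normal(100)
x_true /= np.linalg.norm(x_true)
b = (A @ x_true) ** 2
x0 = rng.standard_normal(100)
print(phase_retrieval_loss(x0, A, b), np.linalg.norm(phase_retrieval_subgrad(x0, A, b)))
```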
The basic form of such a method (when $X = \mathbb{R}^n$) reads:

$$x_{k+1} = x_k - \alpha_k z_k, \tag{3a}$$
$$z_{k+1} = \beta_k g_{k+1} + (1 - \beta_k) z_k, \tag{3b}$$

where $z_k$ is the search direction, $g_k$ is a stochastic subgradient, $\alpha_k$ is the stepsize, and $\beta_k \in (0, 1]$ is the momentum parameter. For instance, the scheme (3) reduces to the stochastic heavy ball method (SHB) (Polyak, 1987):

$$x_{k+1} = x_k - \eta_k g_k + \lambda_k (x_k - x_{k-1}),$$

with $\eta_k = \alpha_k \beta_{k-1}$ and $\lambda_k = (1 - \beta_{k-1})\,\alpha_k/\alpha_{k-1}$. Methods of this type enjoy widespread empirical success in large-scale convex and non-convex optimization, especially in training neural networks, where they have been used to produce several state-of-the-art results on important learning tasks, e.g., (Krizhevsky et al., 2012; Sutskever et al., 2013; He et al., 2016; Huang et al., 2017).

Sample complexity, namely, the number of observations $S_0, \ldots, S_K$ required to reach a desired accuracy $\epsilon$, has been the most widely adopted metric for evaluating the performance of stochastic optimization algorithms. Although sample complexity results for the standard SGD on problems of the form (1) have been obtained for convex and/or smooth problems (Nemirovski et al., 2009; Ghadimi & Lan, 2013), much less is known in the non-smooth non-convex case (Davis & Drusvyatskiy, 2019). The gap is even more pronounced for momentum-based methods, for which there are virtually no known complexity results beyond problems that are convex or smooth.

### 1.1. Related work

As many applications in modern machine learning and signal processing cannot be captured by convex models, (stochastic) algorithms for solving non-convex problems have been studied extensively. Below we review the topics most closely related to our work.

**Stochastic weakly convex minimization** Earlier work on this topic dates back to Nurminskii, who showed subsequential convergence to stationary points for the subgradient method applied to deterministic problems (Nurminskii, 1973). The work (Ruszczyński, 1987) proposes a stochastic gradient averaging-based method and shows the first almost sure convergence for this problem class. Basic sufficient conditions for convergence of stochastic projected subgradient methods are established in (Ermol'ev & Norkin, 1998). Thanks to recent advances in statistical learning and signal processing, the problem class has been reinvigorated with several new theoretical results and practical applications (see, e.g., (Duchi & Ruan, 2018a; 2016; Davis & Grimmer, 2019; Davis & Drusvyatskiy, 2019) and references therein). In particular, based on the theory of non-convex differential inclusions, almost sure convergence is derived in (Duchi & Ruan, 2018b) for a collection of model-based minimization strategies, albeit no rates of convergence are given. An important step toward non-asymptotic convergence of stochastic methods is made in (Davis & Grimmer, 2019). There, the authors employ a proximal point technique for which they can show the sample complexity $O(1/\epsilon^2)$ with a certain stationarity measure. Later, the work (Davis & Drusvyatskiy, 2019) shows that the (approximate) proximal point step in (Davis & Grimmer, 2019) is not necessary and establishes a similar complexity for a class of model-based methods including the standard SGD. We also note that there has been a large volume of work in smooth non-convex optimization, e.g., (Ghadimi & Lan, 2013; 2016).
**Stochastic momentum for non-convex functions** Optimization algorithms based on momentum averaging techniques go back to Polyak (1964), who proposed the heavy ball method. In 1983, Nesterov introduced the accelerated gradient method and showed its optimal iteration complexity for the minimization of smooth convex functions. In the last few decades, research on accelerated first-order methods has exploded both in theory and in practice (Beck, 2017; Nesterov, 2018; Bubeck, 2015). The effectiveness of such techniques in the deterministic context has inspired researchers to incorporate momentum terms into stochastic optimization algorithms (Polyak, 1987; Ruszczyński, 1987; Tseng, 1998; Sutskever et al., 2013; Ghadimi & Lan, 2016). Despite evident success, especially in training neural networks (Krizhevsky et al., 2012; Sutskever et al., 2013; He et al., 2016; Zagoruyko & Komodakis, 2016; Huang et al., 2017), the theory for stochastic momentum methods is not as clear as that of their deterministic counterparts (cf. Gitman et al. (2019)). As a result, there has been growing interest in obtaining convergence guarantees for these methods under noisy gradients (Hu et al., 2009; Gitman et al., 2019; Yan et al., 2018; Gadat et al., 2018; Ghadimi & Lan, 2016). In non-convex optimization, almost sure convergence of Algorithm (3) for smooth and unconstrained problems is derived in (Ruszczyński & Syski, 1983). Under a bounded gradient hypothesis, the convergence rate of the same algorithm has been established in (Yan et al., 2018). The work (Ghadimi et al., 2020) obtains the complexity of a gradient averaging-based method for constrained problems. In (Ghadimi & Lan, 2016), the authors study a variant of Nesterov acceleration and establish a similar complexity for smooth and unconstrained problems, while for the constrained case, a mini-batch of samples is required at each iteration to guarantee convergence.

### 1.2. Contributions

Minimization of weakly convex functions is a challenging task, especially for stochastic problems, as the objective is neither smooth nor convex. With the recent breakthrough in (Davis & Drusvyatskiy, 2019), this problem class has become the widest one for which a provable sample complexity of the standard SGD is known. It is thus intriguing to ask whether such a result can also be obtained for momentum-based methods. The work in this paper aims to address this question. To that end, we make the following contributions:

- We establish the sample complexity of a stochastic subgradient method with momentum of Polyak type for a broad class of non-smooth, non-convex, and constrained optimization problems. Concretely, using a special Lyapunov function, we show the complexity $O(1/\epsilon^2)$ for the minimization of weakly convex functions. The proven complexity is attained in a parameter-free and single time-scale fashion, namely, the stepsize and the momentum constant are independent of any problem parameters and they have the same scale with respect to the iteration count. To the best of our knowledge, this is the first complexity guarantee for a stochastic method with momentum on non-smooth and non-convex problems.
- We also study the sample complexity of the considered algorithm for smooth and constrained optimization problems. Note that even in this setting, no complexity guarantee of SGD with Polyak momentum had been established before.
  Under a bounded gradient assumption, we obtain a similar $O(1/\epsilon^2)$ complexity without the need to form a batch of samples at each iteration, which is commonly required for constrained non-convex stochastic optimization (Ghadimi et al., 2016). We then demonstrate how the unconstrained case can be analyzed without this assumption. Interestingly, the stated result is achieved in the regime where $\beta$ can be as small as $O(1/\sqrt{K})$, i.e., one can put much more weight on the momentum term than on the fresh subgradient in a search direction. This complements the complexity of SGD, which is attained as $\beta \to 1$. Note that the worst-case complexity $O(1/\epsilon^2)$ is unimprovable in the smooth and unconstrained case (Arjevani et al., 2019).

## 2. Background

In this section, we first introduce the notation and then provide the necessary preliminaries for the paper. For any $x, y \in \mathbb{R}^n$, we denote by $\langle x, y \rangle$ the Euclidean inner product of $x$ and $y$. We use $\|\cdot\|_2$ to denote the Euclidean norm. For a closed and convex set $X$, $\Pi_X$ denotes the orthogonal projection onto $X$, i.e., $y = \Pi_X(x)$ if $y \in X$ and $\|y - x\|_2 = \min_{z \in X} \|z - x\|_2$; $I_X(\cdot)$ denotes the indicator function of $X$, i.e., $I_X(x) = 0$ if $x \in X$ and $+\infty$ otherwise. Finally, we denote by $\mathcal{F}_k := \sigma(S_0, \ldots, S_k)$ the $\sigma$-field generated by the first $k+1$ random variables $S_0, \ldots, S_k$.

For a function $f : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$, the Fréchet subdifferential of $f$ at $x$, denoted by $\partial f(x)$, consists of all vectors $g \in \mathbb{R}^n$ such that

$$f(y) \ge f(x) + \langle g, y - x \rangle + o(\|y - x\|) \quad \text{as } y \to x.$$

The Fréchet and conventional subdifferentials coincide for convex functions, while for smooth functions $f$, $\partial f(x)$ reduces to the singleton $\{\nabla f(x)\}$. A point $x \in \mathbb{R}^n$ is said to be stationary for problem (1) if $0 \in \partial(f + I_X)(x)$.

The following lemma collects standard properties of weakly convex functions (Vial, 1983).

**Lemma 2.1** (Weak convexity). Let $f : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ be a $\rho$-weakly convex function. Then the following hold:

1. For any $x, y \in \mathbb{R}^n$ with $g \in \partial f(x)$, we have
   $$f(y) \ge f(x) + \langle g, y - x \rangle - \frac{\rho}{2}\,\|y - x\|_2^2.$$
2. For all $x, y \in \mathbb{R}^n$, $\alpha \in [0, 1]$, and $z = \alpha x + (1 - \alpha) y$:
   $$f(z) \le \alpha f(x) + (1 - \alpha) f(y) + \frac{\rho\,\alpha(1 - \alpha)}{2}\,\|x - y\|_2^2.$$

Weakly convex functions admit an implicit smooth approximation through the Moreau envelope:

$$f_\lambda(x) = \inf_{y \in \mathbb{R}^n} \Big\{ f(y) + \frac{1}{2\lambda}\,\|y - x\|_2^2 \Big\}. \tag{4}$$

For small enough $\lambda$, the point achieving $f_\lambda(x)$ in (4), denoted by $\mathrm{prox}_{\lambda f}(x)$, is unique and given by:

$$\mathrm{prox}_{\lambda f}(x) = \operatorname*{argmin}_{y \in \mathbb{R}^n} \Big\{ f(y) + \frac{1}{2\lambda}\,\|y - x\|_2^2 \Big\}. \tag{5}$$

The lemma below summarizes two well-known properties of the Moreau envelope and its associated proximal map (Hiriart-Urruty & Lemaréchal, 1993).

**Lemma 2.2** (Moreau envelope). Suppose that $f : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ is a $\rho$-weakly convex function. Then, for a fixed parameter $\lambda^{-1} > \rho$, the following hold:

1. $f_\lambda$ is $C^1$-smooth with the gradient given by
   $$\nabla f_\lambda(x) = \lambda^{-1}\big(x - \mathrm{prox}_{\lambda f}(x)\big).$$
2. $f_\lambda$ is $\lambda^{-1}$-smooth, i.e., for all $x, y \in \mathbb{R}^n$:
   $$\big| f_\lambda(y) - f_\lambda(x) - \langle \nabla f_\lambda(x), y - x \rangle \big| \le \frac{1}{2\lambda}\,\|x - y\|_2^2.$$

**Failure of the stationarity test** A major source of difficulty in the convergence analysis of non-smooth optimization methods is the lack of a controllable stationarity measure. For smooth functions, it is natural to use the norm of the gradient as a surrogate for near-stationarity. However, this rule does not make sense in the non-smooth case, even if the function is convex and its gradient exists at all iterates. For example, the convex function $f(x) = |x|$ has $\|\nabla f(x)\|_2 = 1$ at each $x \neq 0$, no matter how close $x$ is to the stationary point $x = 0$. To circumvent this difficulty, we adopt the techniques pioneered in (Davis & Drusvyatskiy, 2019) for convergence of stochastic methods on weakly convex problems.
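To see Lemma 2.2 and the failure of the naive stationarity test in action, take $f(x) = |x|$: its proximal map is the classical soft-thresholding operator, so $\nabla f_\lambda$ is available in closed form. A minimal sketch (ours, not from the paper) showing that $|\nabla f_\lambda(x)|$ vanishes as $x$ approaches the minimizer even though the subgradient norm stays at 1:

```python
import numpy as np

def prox_abs(x, lam):
    """prox_{lam f}(x) for f = |.|: soft-thresholding (closed form of (5))."""
    return np.sign(x) * max(abs(x) - lam, 0.0)

def moreau_grad_abs(x, lam):
    """grad f_lam(x) = (x - prox_{lam f}(x)) / lam   (Lemma 2.2, part 1)."""
    return (x - prox_abs(x, lam)) / lam

lam = 1.0
for x in [2.0, 0.5, 1e-2, 1e-6]:
    # |f'(x)| = 1 at every x != 0, while |grad f_lam(x)| = min(|x|, lam)/lam -> 0:
    print(f"x = {x: .0e}   |subgradient| = 1.0   "
          f"|grad f_lam| = {abs(moreau_grad_abs(x, lam)):.1e}")
```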
More concretely, we rely on the connection of the Moreau envelope to (near) stationarity: for any $x \in \mathbb{R}^n$, the point $\hat{x} = \mathrm{prox}_{\lambda F}(x)$, where $F(x) = f(x) + I_X(x)$, satisfies:

$$\begin{cases} \|x - \hat{x}\|_2 = \lambda\,\|\nabla F_\lambda(x)\|_2, \\ \mathrm{dist}\big(0, \partial F(\hat{x})\big) \le \|\nabla F_\lambda(x)\|_2. \end{cases} \tag{6}$$

Therefore, a small gradient norm $\|\nabla F_\lambda(x)\|_2$ implies that $x$ is close to a point $\hat{x} \in X$ that is near-stationary for $F$. Note that $\hat{x}$ is just a virtual point; there is no need to compute it.

## 3. Algorithm and convergence analysis

We assume that the only access to $f$ is through a stochastic subgradient oracle. In particular, we study algorithms that attempt to solve problem (1) using i.i.d. samples $S_0, S_1, \ldots, S_K \sim P$.

**Assumption A** (Stochastic oracle). Fix a probability space $(\mathcal{S}, \mathcal{F}, P)$. Let $S$ be a sample drawn from $P$ and $f'(x, S) \in \partial f(x, S)$. We make the following assumptions:

- (A1) For each $x \in \mathrm{dom}(f)$, we have $\mathbb{E}_P[f'(x, S)] \in \partial f(x)$.
- (A2) There exists a real $L > 0$ such that for all $x \in X$: $\mathbb{E}_P\big[\|f'(x, S)\|_2^2\big] \le L^2$.

The above assumptions are standard in stochastic optimization of non-smooth functions (see, e.g., (Nemirovski et al., 2009; Davis & Drusvyatskiy, 2019)).

**Algorithm** To solve problem (1), we use an iterative procedure that starts from $x_0 \in X$, $z_0 \in \partial f(x_0, S_0)$ and generates sequences of points $x_k \in X$ and $z_k \in \mathbb{R}^n$ by repeating the following steps for $k = 0, 1, 2, \ldots$:

$$x_{k+1} = \operatorname*{argmin}_{x \in X} \Big\{ \langle z_k, x - x_k \rangle + \frac{1}{2\alpha}\,\|x - x_k\|_2^2 \Big\}, \tag{7a}$$
$$z_{k+1} = \beta g_{k+1} + \frac{1-\beta}{\alpha}\,(x_k - x_{k+1}), \tag{7b}$$

where $g_{k+1} \in \partial f(x_{k+1}, S_{k+1})$. When $X = \mathbb{R}^n$, this algorithm reduces to procedure (3) with constant parameters. For a general convex set $X$, this scheme is known as the iPiano method in the smooth and deterministic setting (Ochs et al., 2014). For simplicity, we refer to Algorithm (7) as stochastic heavy ball (SHB); a standalone code sketch of the procedure is given below, after Lemma 3.1. Throughout the paper, we will frequently use the following two quantities:

$$p_k = \frac{1-\beta}{\beta}\,(x_k - x_{k-1}) \quad \text{and} \quad d_k = \frac{1}{\alpha}\,(x_{k-1} - x_k).$$

Before detailing our convergence analysis, we note that most proofs of $O(1/\epsilon^2)$ sample complexity for subgradient-based methods rely on establishing an iterate relationship of the form (see, e.g., (Nemirovski et al., 2009; Davis & Drusvyatskiy, 2019; Ghadimi & Lan, 2013)):

$$\mathbb{E}[V_{k+1}] \le \mathbb{E}[V_k] - \alpha\,\mathbb{E}[e_k] + \alpha^2 C^2, \tag{8}$$

where $e_k$ denotes some stationarity measure, such as $f(\cdot) - f^*$ for convex and $\|\nabla f(\cdot)\|_2^2$ for smooth (possibly non-convex) problems; $V_k$ are certain Lyapunov functions; $\alpha$ is the stepsize; and $C$ is some constant. Once (8) is given, a simple manipulation yields the desired complexity provided that $\alpha$ is chosen appropriately. The case with decaying stepsize can be analyzed in the same way with minor adjustments. We follow the same route and identify a Lyapunov function that allows us to establish relation (8) for the quantity $\|\nabla F_\lambda(\cdot)\|_2^2$ in (6). Since our Lyapunov function is non-trivial, we shall build it up through a series of key results. We start by presenting the following lemma, which quantifies the averaged progress made by one step of the algorithm.

**Lemma 3.1.** Let Assumptions (A1)–(A2) hold. Let $\beta = \nu\alpha$ for some constant $\nu > 0$ such that $\beta \in (0, 1]$. Let $x_k$ be generated by procedure (7). It holds for any $k \in \mathbb{N}$ that

$$(1-\beta)\,f(x_k) + \mathbb{E}\Big[\tfrac{\nu}{2}\|p_{k+1}\|_2^2 \,\Big|\, \mathcal{F}_{k-1}\Big] \le (1-\beta)\,f(x_{k-1}) + \tfrac{\nu}{2}\|p_k\|_2^2 - \alpha\,\mathbb{E}\big[\|d_{k+1}\|_2^2 \,\big|\, \mathcal{F}_{k-1}\big] + \alpha^2\Big(\tfrac{\rho(1-\beta)}{2} + \nu\Big) L^2. \tag{9}$$

*Proof.* See Appendix A.

In view of (8), the lemma shows that the quantity $\mathbb{E}[\|d_k\|_2^2]$ can be made arbitrarily small. However, this alone is not sufficient to show convergence to stationary points. Nonetheless, we shall show that a small $\mathbb{E}[\|d_k\|_2^2]$ indeed implies a small (averaged) value of the Moreau-envelope gradient norm at a specific point.
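Before turning to the Lyapunov construction, here is a minimal sketch of procedure (7) as reconstructed above (ours, not the authors' code; `subgrad_oracle` and `project` are placeholder names for the stochastic oracle and $\Pi_X$):

```python
import numpy as np

def shb(x0, subgrad_oracle, project, alpha, beta, num_iters):
    """Stochastic heavy ball, procedure (7), for minimizing f over a convex set X.

    Step (7a) is the projected step x_{k+1} = Pi_X(x_k - alpha * z_k);
    step (7b) mixes a fresh stochastic subgradient into the search direction."""
    x = project(np.asarray(x0, dtype=float))
    z = subgrad_oracle(x)                      # z_0 in the subdifferential at x_0
    for _ in range(num_iters):
        x_next = project(x - alpha * z)        # (7a)
        d = (x - x_next) / alpha               # d_{k+1}
        z = beta * subgrad_oracle(x_next) + (1.0 - beta) * d   # (7b); = (3b) if X = R^n
        x = x_next
    return x

# Usage sketch on a toy problem: minimize E|<a, x> - S| over the unit ball.
rng = np.random.default_rng(1)
a = rng.standard_normal(5)
oracle = lambda x: np.sign(a @ x - rng.standard_normal()) * a  # stochastic subgradient
project = lambda x: x / max(1.0, np.linalg.norm(x))            # Pi_X for the unit ball
K = 10_000
alpha = 1.0 / np.sqrt(K + 1)        # alpha0 = 1 and nu = 1/alpha0, so beta = alpha
x_hat = shb(np.ones(5), oracle, project, alpha=alpha, beta=alpha, num_iters=K)
print(x_hat)
```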
Toward this goal, we first need to detail the points $x$ and $\hat{x}$ in (6). Taking the most natural candidate $x = x_k$ seems unlikely to produce the desired result. Instead, we rely on the following iterates:

$$\tilde{x}_k := x_k + \frac{1-\beta}{\beta}\,(x_k - x_{k-1}),$$

and construct corresponding virtual reference points:

$$\hat{x}_k = \operatorname*{argmin}_{x \in \mathbb{R}^n} \Big\{ F(x) + \frac{1}{2\lambda}\,\|x - \tilde{x}_k\|_2^2 \Big\}$$

for $\lambda < 1/\rho$. By Lemma 2.2, $\nabla F_\lambda(\tilde{x}_k) = \lambda^{-1}(\tilde{x}_k - \hat{x}_k)$, where $F_\lambda(\cdot)$ is the Moreau envelope of $F(\cdot) = f(\cdot) + I_X(\cdot)$. With these definitions, we can now state the next lemma.

**Lemma 3.2.** Assume the same setting as Lemma 3.1. Let $\lambda > 0$ be such that $\lambda^{-1} \ge 2\rho$. Let $\xi = (1-\beta)/\nu$ and define the function:

$$V_k = F_\lambda(\tilde{x}_k) + \frac{\nu\xi^2}{4\lambda^2}\,\|p_k\|_2^2 + \Big(\frac{(1-\beta)\xi^2}{2\lambda^2} + \frac{\xi}{\lambda}\Big)\, f(x_{k-1}). \tag{10}$$

Then, for any $k \in \mathbb{N}$,

$$\mathbb{E}\big[V_{k+1} \,\big|\, \mathcal{F}_{k-1}\big] \le V_k - \frac{\alpha}{2}\,\|\nabla F_\lambda(\tilde{x}_k)\|_2^2 + \frac{\gamma\alpha^2 L^2}{2\lambda}, \tag{11}$$

where $\gamma = \xi^2\big(\rho(1-\beta)/2 + \nu\big)/\lambda + \rho\xi/2 + 1$.

The proof of this lemma is rather involved and can be found in Appendix B. Lemma 3.2 establishes a relation akin to (8) with the Lyapunov function $V_k$ defined in (10). We can now use standard arguments to obtain our sample complexity.

**Theorem 1.** Let Assumptions (A1)–(A2) hold. Let $k^*$ be sampled uniformly at random from $\{0, \ldots, K\}$. Let $f^* = \inf_{x \in X} f(x)$ and denote $\Delta = f(x_0) - f^*$. If we set $\alpha = \frac{\alpha_0}{\sqrt{K+1}}$ and $\nu = 1/\alpha_0$ for some real $\alpha_0 > 0$, then under the same setting as Lemma 3.2:

$$\mathbb{E}\big[\|\nabla F_\lambda(\tilde{x}_{k^*})\|_2^2\big] \le \frac{2\big(\gamma_1 \Delta + \gamma L^2 \alpha_0^2/(2\lambda)\big)}{\alpha_0\sqrt{K+1}}, \tag{12}$$

where $\gamma \le \rho^2\alpha_0^2 + 3\rho\alpha_0 + 1$ and $\gamma_1 \le 2\rho^2\alpha_0^2 + 2\rho\alpha_0 + 1$. Furthermore, if $\alpha_0$ is set to $1/\rho$, we obtain

$$\mathbb{E}\big[\|\nabla F_{1/(2\rho)}(\tilde{x}_{k^*})\|_2^2\big] \le \frac{10\,(\rho\Delta + L^2)}{\sqrt{K+1}}.$$

*Proof.* Taking the expectation on both sides of (11) and summing the result over $k = 0, \ldots, K$ yield

$$\mathbb{E}[V_{K+1}] \le V_0 - \frac{\alpha_0}{2\sqrt{K+1}}\sum_{k=0}^{K} \mathbb{E}\big[\|\nabla F_\lambda(\tilde{x}_k)\|_2^2\big] + \frac{\gamma L^2 \alpha_0^2}{2\lambda}.$$

Letting $\gamma_1 = 1 + (1-\beta)\xi^2/(2\lambda^2) + \xi/\lambda$, the left-hand side of the above inequality can be lower-bounded by $\gamma_1 f^*$. Using the facts that $F_\lambda(x_0) \le f(x_0)$ and $x_{-1} = x_0$, we get $V_0 \le \gamma_1 f(x_0)$. Consequently,

$$\mathbb{E}\big[\|\nabla F_\lambda(\tilde{x}_{k^*})\|_2^2\big] \le \frac{2\big(\gamma_1 \Delta + \gamma L^2 \alpha_0^2/(2\lambda)\big)}{\alpha_0\sqrt{K+1}},$$

where the last expectation is taken with respect to all random sequences generated by the method and the uniformly distributed random variable $k^*$. Note that $\nu = 1/\alpha_0$, $\xi = (1-\beta)/\nu$, and $1-\beta \le 1$. Thus, letting $\lambda = 1/(2\rho)$, the constants $\gamma$ and $\gamma_1$ can be upper-bounded by $\rho^2\alpha_0^2 + 3\rho\alpha_0 + 1$ and $2\rho^2\alpha_0^2 + 2\rho\alpha_0 + 1$, respectively. Therefore, if we let $\alpha_0 = 1/\rho$, we arrive at

$$\mathbb{E}\big[\|\nabla F_{1/(2\rho)}(\tilde{x}_{k^*})\|_2^2\big] \le \frac{10\,(\rho\Delta + L^2)}{\sqrt{K+1}},$$

as desired. ∎

Some remarks regarding Theorem 1 are in order:

i) The choice $\nu = 1/\alpha_0$ is just for simplicity; one can pick any constant such that $\beta = \nu\alpha \in (0, 1]$. Note that the stepsize used to achieve the rate in (12) does not depend on any problem parameters. Once $\alpha$ is set, the momentum parameter selection is completely parameter-free. Since both $\alpha$ and $\beta$ scale like $O(1/\sqrt{K})$, Algorithm (7) can be seen as a single time-scale method (Ghadimi et al., 2020; Ruszczyński & Syski, 1983). Such methods contrast with those that require at least two time-scales to ensure convergence. For example, stochastic dual averaging for convex optimization (Xiao, 2010) requires one fast scale $O(1/K)$ for averaging the subgradients, and one slower scale $O(1/\sqrt{K})$ for the stepsize. To show almost sure convergence of SHB for smooth and unconstrained problems, the work (Gitman et al., 2019) requires that both the stepsize and the momentum parameter tend to zero, with the former doing so at a faster rate (note that our $\beta$ corresponds to $1-\beta$ in (Gitman et al., 2019)).
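In code, the parameter rule of Theorem 1 and remark i) reduces to a few lines (a sketch under our reading of the theorem; `alpha0` is the only free constant):

```python
import numpy as np

def theorem1_parameters(K, alpha0):
    """Single time-scale choice of Theorem 1: alpha = alpha0 / sqrt(K+1) and
    nu = 1 / alpha0, so beta = nu * alpha = 1 / sqrt(K+1) regardless of alpha0."""
    alpha = alpha0 / np.sqrt(K + 1)
    beta = alpha / alpha0                 # = nu * alpha with nu = 1/alpha0
    assert 0.0 < beta <= 1.0              # required by Lemma 3.1
    return alpha, beta

# Both parameters decay like O(1/sqrt(K)); no problem constants are needed.
print(theorem1_parameters(K=10_000, alpha0=0.5))
```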
ii) To some extent, Theorem 1 supports the use of a small momentum parameter such as $\beta = 0.1$ or $\beta = 0.01$, corresponding to the default value $1-\beta = 0.9$ in PyTorch (https://pytorch.org/) and the value $1-\beta = 0.99$ suggested in (Goh, 2017). Indeed, the theorem allows $\beta$ to be as small as $O(1/\sqrt{K})$, i.e., one can put much more weight on the momentum term than on the fresh subgradient and still preserve the complexity. Recall also that SHB reduces to SGD as $\beta \to 1$, which admits a similar complexity. One can thus set $\beta$ quite flexibly without sacrificing the worst-case complexity. We refer to (Gitman et al., 2019, Theorem 2) for a similar discussion in the context of almost sure convergence on smooth problems.

iii) In view of (6), the theorem indicates that $\tilde{x}_k$ is close to a near-stationary point $\hat{x}_k$. Since $\tilde{x}_k$ may not belong to $X$, it is preferable to have a similar guarantee for the iterate $x_k$. Indeed, we have

$$\lambda^{-2}\|x_k - \hat{x}_k\|_2^2 \le 2\lambda^{-2}\|\tilde{x}_k - \hat{x}_k\|_2^2 + 2\lambda^{-2}\|x_k - \tilde{x}_k\|_2^2 = 2\,\|\nabla F_\lambda(\tilde{x}_k)\|_2^2 + 2\lambda^{-2}\|x_k - \tilde{x}_k\|_2^2 = 2\,\|\nabla F_\lambda(\tilde{x}_k)\|_2^2 + 2\lambda^{-2}\xi^2\|d_k\|_2^2.$$

Since both terms on the right converge at the rate $O(1/\sqrt{K})$, the same guarantee immediately holds for the term on the left, as desired. In summary, we have established the convergence rate $O(1/\sqrt{K})$, or equivalently the sample complexity $O(1/\epsilon^2)$, of SHB for the minimization of weakly convex functions.

## 4. Extension to smooth non-convex functions

In this section, we study the convergence properties of Algorithm (7) for the minimization of $\rho$-smooth functions:

$$\|\nabla f(x) - \nabla f(y)\|_2 \le \rho\,\|x - y\|_2, \quad \forall x, y \in \mathrm{dom}\, f.$$

Note that $\rho$-smooth functions are automatically $\rho$-weakly convex. In this setting, it is more common to replace Assumption (A2) by the following.

**Assumption (A3).** There exists a real $\sigma > 0$ such that for all $x \in X$:

$$\mathbb{E}\big[\|f'(x, S) - \nabla f(x)\|_2^2\big] \le \sigma^2.$$

Deriving convergence rates of stochastic schemes with momentum for non-convex functions under Assumption (A3) can be quite challenging. Indeed, even in unconstrained optimization, previous studies often need to assume that the true gradient is bounded, i.e., $\|\nabla f(x)\|_2 \le G$ for all $x \in \mathbb{R}^n$ (see, e.g., (Yan et al., 2018; Gitman et al., 2019)). This assumption is strong and fails even for convex quadratic functions. It is more realistic in constrained problems, for example when $X$ is compact, albeit the constant $G$ could then be large.

Our objective in this section is twofold. First, we aim to extend the convergence results of SHB from the previous section to smooth optimization problems under Assumption (A3). Note that even in this setting, the sample complexity of SHB had not been established before. The rate is obtained without forming a batch of samples at each iteration, which is commonly required for constrained non-convex stochastic optimization (Ghadimi & Lan, 2016; Ghadimi et al., 2016). Second, for unconstrained problems, we demonstrate how to achieve the same complexity without the bounded gradient assumption above.

Let $h(x) = \frac{1}{2}\|x\|_2^2 + I_X(x)$ and let $h^*(z)$ be its convex conjugate. Our convergence analysis relies on the function:

$$\phi_k = h^*(x_k - \alpha z_k) - \frac{1}{2}\|x_k\|_2^2 + \alpha\,\langle x_k, z_k \rangle. \tag{13}$$

The use of this function is inspired by (Ruszczyński, 1987). Roughly speaking, $\phi_k$ is the negative of the optimal value of the minimization on the right-hand side of (7a), and hence $\phi_k \ge 0$ for all $k$. This function also underpins the analysis of the dual averaging scheme in (Nesterov, 2009). The following result plays a similar role as Lemma 3.1.

**Lemma 4.1.** Let Assumptions (A1) and (A3) hold. Let $\alpha \in (0, 1/(4\rho)]$ and $\beta = \nu\alpha$ for some constant $\nu > 0$ such that $\beta \in (0, 1]$.
Let $\xi = (1-\beta)/\nu$ and define the function:

$$W_k = 2 f(x_k) + \phi_k.$$

Then, it holds for any $k \in \mathbb{N}$ that

$$\mathbb{E}\big[W_{k+1} \,\big|\, \mathcal{F}_k\big] \le W_k - \alpha\,\mathbb{E}\big[\|d_{k+1}\|_2^2 \,\big|\, \mathcal{F}_k\big] + 4\nu\alpha^2\sigma^2. \tag{14}$$

*Proof.* Since the proof is rather technical, we defer the details to Appendix C and sketch only the main arguments here. By smoothness of $h^*$, weak convexity of $f$, and the optimality condition for the update formula (7a), we get

$$\mathbb{E}\big[f(x_{k+1}) + \phi_{k+1} \,\big|\, \mathcal{F}_k\big] \le f(x_k) + \phi_k - \alpha\Big(2 - \frac{\rho\alpha}{2}\Big)\,\mathbb{E}\big[\|d_{k+1}\|_2^2 \,\big|\, \mathcal{F}_k\big] + \frac{1}{2\nu}\,\mathbb{E}\big[\|z_k - z_{k+1}\|_2^2 \,\big|\, \mathcal{F}_k\big].$$

The proof of this relation can be found in Lemma C.1. The preceding inequality has very useful properties: it contains terms that form a telescoping sum, and the constant multiplying $\|d_{k+1}\|_2^2$ has the right order-dependence on the stepsize. However, a term $\|z_k - z_{k+1}\|_2^2$ remains. Thus, in view of relation (8), our next step is to bound this term in a way that keeps all the favourable features described above and introduces at most an additional term of order $O(\alpha^2\sigma^2)$. As shown in Lemma C.2, we can establish the following inequality:

$$\frac{1}{2\nu}\,\mathbb{E}\big[\|z_{k+1} - z_k\|_2^2 \,\big|\, \mathcal{F}_k\big] \le f(x_k) - \mathbb{E}\big[f(x_{k+1}) \,\big|\, \mathcal{F}_k\big] + \big(\alpha^3\rho^2 + 3\rho\alpha^2\big)\,\mathbb{E}\big[\|d_{k+1}\|_2^2 \,\big|\, \mathcal{F}_k\big] + 4\nu\alpha^2\sigma^2.$$

Now, (14) follows immediately from combining the two previous inequalities and the fact that $\alpha \in (0, 1/(4\rho)]$. ∎

We remark that Lemma 4.1 does not require the bounded gradient assumption and readily implies the convergence rate $O(1/\sqrt{K})$ for $\mathbb{E}[\|d_k\|_2^2]$. However, to establish the rate for $\mathbb{E}[\|\nabla F_\lambda(\tilde{x}_k)\|_2^2]$, we need to impose such an assumption in the theorem below. Nonetheless, the assumption is much more realistic in this constrained setting than in the unconstrained case.

**Theorem 2.** Let Assumptions (A1) and (A3) hold. Assume further that $\|\nabla f(x)\|_2 \le G$ for all $x \in X$. Let $k^*$, $\tilde{x}_k$, $\lambda$, $\Delta$, $\gamma$, and $\gamma_1$ be defined as in Theorem 1. If we set $\alpha = \frac{\alpha_0}{\sqrt{K+1}}$ and $\nu = 1/\alpha_0$ for some real $\alpha_0 > 0$, then

$$\mathbb{E}\big[\|\nabla F_\lambda(\tilde{x}_{k^*})\|_2^2\big] \le \frac{2\big(\gamma_1\Delta + \gamma\,(\sigma^2 + G^2)\,\alpha_0^2/(2\lambda)\big)}{\alpha_0\sqrt{K+1}}.$$

Furthermore, if $\alpha_0$ is set to $1/\rho$, we obtain

$$\mathbb{E}\big[\|\nabla F_{1/(2\rho)}(\tilde{x}_{k^*})\|_2^2\big] \le \frac{10\,\big(\rho\Delta + \sigma^2 + G^2\big)}{\sqrt{K+1}}.$$

*Proof.* The proof is a verbatim copy of that of Theorem 1 with $L^2$ replaced by $\sigma^2 + G^2$; see Appendix D. ∎

Some remarks are in order: i) To the best of our knowledge, this is the first convergence rate result for a stochastic (or even deterministic) method with Polyak momentum on smooth, non-convex, and constrained problems. ii) The algorithm enjoys the same single time-scale and parameter-free properties as in the non-smooth case. iii) The rate in the theorem readily translates into an analogous estimate for the norm of the so-called gradient mapping $G_{1/\rho}$, which is commonly adopted in the literature, e.g., (Ghadimi et al., 2016). This is because for $\rho$-smooth functions (Drusvyatskiy & Paquette, 2019):

$$\|G_{1/\rho}(x)\|_2 \le \frac{3}{2}\,\|\nabla F_{1/(2\rho)}(x)\|_2, \quad \forall x \in \mathbb{R}^n$$

(a small numerical illustration is given at the end of this section).

Since the bounded gradient assumption is rather restrictive in the unconstrained case, our final result shows how the desired complexity can be attained without it.

**Theorem 3.** Let Assumptions (A1) and (A3) hold. Let $\lambda^{-1} \in (3\rho/2, 2\rho]$. Let $k^*$ be sampled uniformly at random from $\{-1, \ldots, K-1\}$. If we set $\alpha = \frac{\alpha_0}{\sqrt{K+1}}$ and $\nu = 1/\alpha_0$, where $\alpha_0 \in (0, 1/(4\rho)]$, then under the same setting as Lemma 4.1:

$$\mathbb{E}\big[\|\nabla F_\lambda(\tilde{x}_{k^*})\|_2^2\big] \le \frac{c\,\Big(\big(1 + 2\alpha_0^2/\lambda^2\big)\Delta + \big(1 + 8\alpha_0/\lambda\big)\frac{\sigma^2\alpha_0^2}{2\lambda}\Big)}{\alpha_0\sqrt{K+1}},$$

where $c = 2\lambda^{-1}/(2\lambda^{-1} - 3\rho)$. Furthermore, letting $\lambda = 1/(2\rho)$, we obtain

$$\mathbb{E}\big[\|\nabla F_{1/(2\rho)}(\tilde{x}_{k^*})\|_2^2\big] \le \frac{4\big(1 + 8\rho^2\alpha_0^2\big)\Delta + 4\big(\rho + 16\rho^2\alpha_0\big)\sigma^2\alpha_0^2}{\alpha_0\sqrt{K+1}}.$$

*Proof.* See Appendix E. ∎

It should be mentioned that a similar result was attained very recently in (Ghadimi et al., 2020) using a different analysis, albeit with no sample complexity for the non-smooth case. It is still an open question whether one can preserve the complexity of Theorem 2 without the bounded gradient hypothesis.
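To make remark iii) on the gradient mapping concrete, the sketch below (ours; a toy $\rho$-smooth quadratic over the unit ball, with the prox computed by an inner projected-gradient loop) evaluates $G_{1/\rho}(x) = \rho\big(x - \Pi_X(x - \rho^{-1}\nabla f(x))\big)$ next to the Moreau-envelope gradient that bounds it:

```python
import numpy as np

rho = 2.0                                         # smoothness constant of f below
Q = np.diag([2.0, 0.5])
c = np.array([1.0, -1.0])
grad_f = lambda x: Q @ x + c                      # f(x) = 0.5 x'Qx + c'x is rho-smooth
proj = lambda x: x / max(1.0, np.linalg.norm(x))  # Pi_X for X = unit ball

def gradient_mapping(x):
    """G_{1/rho}(x) = rho * (x - Pi_X(x - grad f(x) / rho))."""
    return rho * (x - proj(x - grad_f(x) / rho))

def moreau_grad(x, lam=1.0 / (2.0 * rho), iters=500):
    """grad F_lam(x) = (x - prox_{lam F}(x)) / lam, with prox_{lam F}(x) computed
    by projected gradient descent on y -> f(y) + ||y - x||^2 / (2*lam)."""
    y = proj(x)
    step = 1.0 / (rho + 1.0 / lam)                # 1 / (smoothness of prox objective)
    for _ in range(iters):
        y = proj(y - step * (grad_f(y) + (y - x) / lam))
    return (x - y) / lam

x = np.array([0.8, 0.3])
# Per remark iii) (reconstructed constant 3/2), the left value should not exceed the right:
print(np.linalg.norm(gradient_mapping(x)), 1.5 * np.linalg.norm(moreau_grad(x)))
```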
## 5. Numerical evaluations

In this section, we perform experiments to validate our theoretical developments and to demonstrate that, despite sharing the same worst-case complexity, SHB can be better than SGD in terms of speed and robustness to problem and algorithm parameters. We consider the robust phase retrieval problem (Duchi & Ruan, 2018a;b): given a set of $m$ measurements $(a_i, b_i) \in \mathbb{R}^n \times \mathbb{R}$, the phase retrieval problem seeks a vector $x$ such that $\langle a_i, x \rangle^2 \approx b_i$ for most $i = 1, \ldots, m$. Whenever the problem is corrupted with gross outliers, a natural exact penalty form of this (approximate) system of equations yields the minimization problem:

$$\underset{x \in \mathbb{R}^n}{\text{minimize}} \quad \frac{1}{m}\sum_{i=1}^{m} \big| \langle a_i, x \rangle^2 - b_i \big|.$$

*Figure 1. The function gap $f(x_k) - f(x^*)$ versus iteration count for phase retrieval with $p_{\mathrm{fail}} = 0.2$, $\beta = 10/\sqrt{K}$. Panels: (a) $\kappa = 10$, $\alpha_0 = 0.1$; (b) $\kappa = 10$, $\alpha_0 = 0.25$; (c) $\kappa = 1$, $\alpha_0 = 0.1$; (d) $\kappa = 1$, $\alpha_0 = 0.15$.*

This objective function is non-smooth and non-convex. In view of (2), it is the composition of the Lipschitz-continuous convex function $h(y) = \frac{1}{m}\|y\|_1$ and the smooth map $c$ with $c_i(x) = \langle a_i, x \rangle^2 - b_i$. Hence, it is weakly convex. In each experiment, we set $m = 300$, $n = 100$ and select $x^*$ uniformly from the unit sphere. We generate $A$ as $A = QD$, where $Q \in \mathbb{R}^{m \times n}$ is a matrix with standard normal distributed entries, and $D$ is a diagonal matrix with linearly spaced elements between $1/\kappa$ and $1$, with $\kappa \ge 1$ playing the role of a condition number. The elements $b_i$ of the vector $b$ are generated as

$$b_i = \langle a_i, x^* \rangle^2 + \delta\,\zeta_i, \quad i = 1, \ldots, m,$$

where $\zeta_i \sim N(0, 25)$ models the corruptions and $\delta \in \{0, 1\}$ is a binary random variable taking the value 1 with probability $p_{\mathrm{fail}}$, so that on average $p_{\mathrm{fail}}\, m$ measurements are corrupted. The algorithms are all randomly initialized at $x_0 \sim N(0, I)$. The stochastic subgradient is simply given as an element of the subdifferential of $g(x) = |\langle a, x \rangle^2 - b|$:

$$\partial g(x) = 2\,\langle a, x \rangle\, a \cdot \begin{cases} \operatorname{sign}\big(\langle a, x \rangle^2 - b\big) & \text{if } \langle a, x \rangle^2 \neq b, \\ [-1, 1] & \text{otherwise.} \end{cases}$$

In each of our experiments, we set the stepsize as $\alpha_k = \alpha_0/\sqrt{k+1}$, where $\alpha_0$ is an initial stepsize. We note that this stepsize can often make slightly faster progress (for both SGD and SHB) at the beginning of the optimization process than the constant one $\alpha_k = \alpha_0/\sqrt{K+1}$. However, after a few iterations, both of them yield very similar results, and the choice does not change the qualitative aspects of our plots in any way. We refer to $m$ stochastic iterations as one epoch (one pass over the data). Within each individual run, we allow the considered stochastic methods to perform $K = 400m$ iterations. We conduct 50 experiments for each stepsize and report the median of the epochs-to-$\epsilon$-accuracy; the shaded areas in each plot cover the 10th to 90th percentile of convergence times.

*Figure 2. The number of epochs to achieve $\epsilon$-accuracy versus initial stepsize $\alpha_0$ for phase retrieval with $\kappa = 10$. Panels: (a) $p_{\mathrm{fail}} = 0.3$, $\beta = 1/\sqrt{K}$; (b) $p_{\mathrm{fail}} = 0.3$, $\beta = 1/(\alpha_0\sqrt{K})$.*

*Figure 3. The number of epochs to achieve $\epsilon$-accuracy versus initial stepsize $\alpha_0$ for phase retrieval with $\kappa = 10$ and popular choices of $\beta$. Panels: (a) $p_{\mathrm{fail}} = 0.3$, $\beta = 0.1$; (b) $p_{\mathrm{fail}} = 0.3$, $\beta = 0.01$.*

Figure 1 shows the function gap versus iteration count for different values of $\kappa$ and $\alpha_0$, with $p_{\mathrm{fail}} = 0.2$ and $\beta = 10/\sqrt{K}$. It is evident that SHB converges with a theoretically justified parameter $\beta$ and is much less sensitive to problem and algorithm parameters than vanilla SGD.
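The experimental setup just described compresses to the following sketch (our reconstruction of the protocol in this section, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, kappa, p_fail = 300, 100, 10.0, 0.2

# Data: A = Q D, Q standard normal, D with linearly spaced diagonal in [1/kappa, 1].
Q = rng.standard_normal((m, n))
A = Q @ np.diag(np.linspace(1.0 / kappa, 1.0, n))
x_star = rng.standard_normal(n)
x_star /= np.linalg.norm(x_star)                   # x* uniform on the unit sphere
b = (A @ x_star) ** 2
corrupt = rng.random(m) < p_fail                   # delta_i = 1 w.p. p_fail
b[corrupt] += rng.normal(0.0, 5.0, corrupt.sum())  # zeta_i ~ N(0, 25)

def subgrad(x, i):
    """A stochastic subgradient of g(x) = |<a_i, x>^2 - b_i|."""
    inner = A[i] @ x
    return 2.0 * inner * np.sign(inner ** 2 - b[i]) * A[i]

# SHB, scheme (3), with alpha_k = alpha0/sqrt(k+1) and beta = 10/sqrt(K).
alpha0, K = 0.1, 400 * m
beta = 10.0 / np.sqrt(K)
x = rng.standard_normal(n)                         # x0 ~ N(0, I)
z = subgrad(x, rng.integers(m))                    # z0: a fresh subgradient
for k in range(K):
    x = x - alpha0 / np.sqrt(k + 1) * z                          # (3a)
    z = beta * subgrad(x, rng.integers(m)) + (1.0 - beta) * z    # (3b)

print("final loss:", np.mean(np.abs((A @ x) ** 2 - b)))
```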
Note that the sensitivity issue of SGD is rather well documented; a slight change in its parameters can have a severe effect on the overall performance of the algorithm (Nemirovski et al., 2009; Asi & Duchi, 2019). For example, Fig. 1d shows that SGD exhibits a transient exponential growth before eventual convergence. This behaviour can occur even when minimizing the smooth quadratic function $\frac{1}{2}x^2$ (Asi & Duchi, 2019, Example 2). In contrast, SHB converges in all settings of the figure, suggesting that a momentum term can help improve the robustness of the standard SGD. This is expected, as the update formula (3b) acts like a low-pass filter, averaging past stochastic subgradients, which may have a stabilizing effect on the sequence $\{x_k\}$.

To further examine this observation, in the next set of experiments we test the sensitivity of SHB and SGD to the initial stepsize $\alpha_0$. Figure 2 shows the number of epochs required to reach $\epsilon$-accuracy for phase retrieval with $\kappa = 10$ and $p_{\mathrm{fail}} = 0.3$. We can see that the standard SGD performs well only for a narrow range of stepsizes, while a wider convergence range is achieved with SHB. Finally, the picture would be incomplete without experiments for some of the most popular momentum parameters used by practitioners. Figure 3 shows a similar story to Fig. 2 for the parameters $1-\beta = 0.9$ and $1-\beta = 0.99$, as discussed in remark ii) after Theorem 1. Together with Fig. 2, this demonstrates that SHB is able to find good approximate solutions for diverse values of the momentum constant over a wider (often significantly so) range of algorithm parameters than SGD.

## 6. Conclusion

Using a carefully constructed Lyapunov function, we established the first sample complexity results for the SHB method on a broad class of non-smooth, non-convex, and constrained optimization problems. The complexity is attained in a parameter-free fashion on a single time-scale. A notable feature of our results is that they justify the use of a large amount of momentum in search directions. We also improved some complexity results for SHB on smooth problems. Numerical results show that SHB exhibits good performance and low sensitivity to problem and algorithm parameters compared to the standard SGD.

## Acknowledgements

This work was supported in part by the Knut and Alice Wallenberg Foundation, the Swedish Research Council, and the Swedish Foundation for Strategic Research.

## References

Arjevani, Y., Carmon, Y., Duchi, J. C., Foster, D. J., Srebro, N., and Woodworth, B. Lower bounds for non-convex stochastic optimization. arXiv preprint arXiv:1912.02365, 2019.

Asi, H. and Duchi, J. C. Stochastic (approximate) proximal point methods: Convergence, optimality, and adaptivity. SIAM Journal on Optimization, 29(3):2257–2290, 2019.

Beck, A. First-Order Methods in Optimization, volume 25. SIAM, 2017.

Bottou, L. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT 2010, pp. 177–186. Springer, 2010.

Bubeck, S. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231–357, 2015.

Candès, E. J., Li, X., Ma, Y., and Wright, J. Robust principal component analysis? Journal of the ACM (JACM), 58(3):1–37, 2011.

Charisopoulos, V., Chen, Y., Davis, D., Díaz, M., Ding, L., and Drusvyatskiy, D. Low-rank matrix recovery with composite optimization: good conditioning and rapid convergence. arXiv preprint arXiv:1904.10020, 2019.
Davis, D. and Drusvyatskiy, D. Stochastic model-based minimization of weakly convex functions. SIAM Journal on Optimization, 29(1):207–239, 2019.

Davis, D. and Grimmer, B. Proximally guided stochastic subgradient method for nonsmooth, nonconvex problems. SIAM Journal on Optimization, 29(3):1908–1930, 2019.

Drusvyatskiy, D. The proximal point method revisited. arXiv preprint arXiv:1712.06038, 2017.

Drusvyatskiy, D. and Paquette, C. Efficiency of minimizing compositions of convex functions and smooth maps. Mathematical Programming, 178(1-2):503–558, 2019.

Duchi, J. C. and Ruan, F. Asymptotic optimality in stochastic optimization. arXiv preprint arXiv:1612.05612, 2016.

Duchi, J. C. and Ruan, F. Solving (most) of a set of quadratic equalities: Composite optimization for robust phase retrieval. Information and Inference: A Journal of the IMA, 8(3):471–529, 2018a.

Duchi, J. C. and Ruan, F. Stochastic methods for composite and weakly convex optimization problems. SIAM Journal on Optimization, 28(4):3229–3259, 2018b.

Ermol'ev, Y. M. and Norkin, V. Stochastic generalized gradient method for nonconvex nonsmooth stochastic optimization. Cybernetics and Systems Analysis, 34(2):196–215, 1998.

Gadat, S., Panloup, F., and Saadane, S. Stochastic heavy ball. Electronic Journal of Statistics, 12(1):461–529, 2018.

Ghadimi, E., Feyzmahdavian, H. R., and Johansson, M. Global convergence of the heavy-ball method for convex optimization. In 2015 European Control Conference (ECC), pp. 310–315. IEEE, 2015.

Ghadimi, S. and Lan, G. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.

Ghadimi, S. and Lan, G. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathematical Programming, 156(1-2):59–99, 2016.

Ghadimi, S., Lan, G., and Zhang, H. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Mathematical Programming, 155(1-2):267–305, 2016.

Ghadimi, S., Ruszczyński, A., and Wang, M. A single time-scale stochastic approximation method for nested stochastic optimization. SIAM Journal on Optimization, 2020. Accepted for publication (arXiv preprint arXiv:1812.01094).

Gitman, I., Lang, H., Zhang, P., and Xiao, L. Understanding the role of momentum in stochastic gradient methods. In Advances in Neural Information Processing Systems, pp. 9630–9640, 2019.

Goh, G. Why momentum really works. Distill, 2017. doi: 10.23915/distill.00006. URL http://distill.pub/2017/momentum.

Gupal, A. M. and Bazhenov, L. G. Stochastic analog of the conjugate-gradient method. Cybernetics and Systems Analysis, 8(1):138–140, 1972.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

Hiriart-Urruty, J.-B. and Lemaréchal, C. Convex Analysis and Minimization Algorithms, volume 305. Springer Science & Business Media, 1993.

Hu, C., Pan, W., and Kwok, J. T. Accelerated gradient methods for stochastic optimization and online learning. In Advances in Neural Information Processing Systems, pp. 781–789, 2009.

Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K. Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708, 2017.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.

Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.

Nesterov, Y. A method for solving the convex programming problem with convergence rate O(1/k²). In Dokl. Akad. Nauk SSSR, volume 269, pp. 543–547, 1983.

Nesterov, Y. Primal-dual subgradient methods for convex problems. Mathematical Programming, 120(1):221–259, 2009.

Nesterov, Y. Lectures on Convex Optimization, volume 137. Springer, 2018.

Nurminskii, E. A. The quasigradient method for the solving of the nonlinear programming problems. Cybernetics, 9(1):145–150, 1973.

Ochs, P., Chen, Y., Brox, T., and Pock, T. iPiano: Inertial proximal algorithm for nonconvex optimization. SIAM Journal on Imaging Sciences, 7(2):1388–1419, 2014.

Polyak, B. T. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.

Polyak, B. T. Introduction to Optimization. Optimization Software, 1987.

Robbins, H. and Monro, S. A stochastic approximation method. The Annals of Mathematical Statistics, pp. 400–407, 1951.

Rockafellar, R. T. Favorable classes of Lipschitz continuous functions in subgradient optimization. In Nurminski, E. A. (ed.), Progress in Nondifferentiable Optimization, CP-82-S8, pp. 125–143, 1982.

Rockafellar, R. T. and Uryasev, S. Optimization of conditional value-at-risk. Journal of Risk, 2:21–42, 2000.

Ruszczyński, A. A linearization method for nonsmooth stochastic programming problems. Mathematics of Operations Research, 12(1):32–49, 1987.

Ruszczyński, A. and Syski, W. Stochastic approximation method with gradient averaging for unconstrained problems. IEEE Transactions on Automatic Control, 28(12):1097–1105, 1983.

Shapiro, A., Dentcheva, D., and Ruszczyński, A. Lectures on Stochastic Programming: Modeling and Theory. SIAM, 2014.

Singer, A. Angular synchronization by eigenvectors and semidefinite programming. Applied and Computational Harmonic Analysis, 30(1):20–36, 2011.

Sutskever, I., Martens, J., Dahl, G., and Hinton, G. On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning, pp. 1139–1147, 2013.

Tseng, P. An incremental gradient(-projection) method with momentum term and adaptive stepsize rule. SIAM Journal on Optimization, 8(2):506–531, 1998.

Vial, J.-P. Strong and weak convexity of sets and functions. Mathematics of Operations Research, 8(2):231–259, 1983.

Xiao, L. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11(Oct):2543–2596, 2010.

Yan, Y., Yang, T., Li, Z., Lin, Q., and Yang, Y. A unified analysis of stochastic momentum methods for deep learning. In International Joint Conference on Artificial Intelligence, 2018.

Zagoruyko, S. and Komodakis, N. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.

Zavriev, S. and Kostyuk, F. Heavy-ball method in nonconvex optimization problems. Computational Mathematics and Modeling, 4(4):336–341, 1993.