# Stochastic Weakly Convex Optimization beyond Lipschitz Continuity

Wenzhi Gao¹ Qi Deng²

Abstract: This paper considers stochastic weakly convex optimization without the standard Lipschitz continuity assumption. Based on new robust regularization (stepsize) strategies, we show that a wide class of stochastic algorithms, including the stochastic subgradient method, preserves the $O(1/\sqrt{K})$ convergence rate with a constant failure probability. Our analyses rest on rather weak assumptions: the Lipschitz parameter can either be bounded by a general growth function of $\|x\|$ or be locally estimated through independent random samples. Numerical experiments demonstrate the efficiency and robustness of our proposed stepsize policies.

1. Introduction

This paper studies the stochastic optimization problem
$$\min_{x \in \mathbb{R}^n} \ \psi(x) := f(x) + \omega(x), \qquad (1)$$
where $f(x) := \mathbb{E}_{\xi \sim \Xi}[f(x, \xi)]$. Here, $f(x, \xi)$ is a continuous function in $x$, with $\xi$ being a random sample drawn from a particular distribution $\Xi$. The function $\omega(x)$ is lower-semicontinuous, and its proximal mapping is easy to evaluate. We assume both $f(x, \xi)$ and $\omega(x)$ are weakly convex functions. A function $g$ is $\lambda$-weakly convex if $g + \frac{\lambda}{2}\|\cdot\|^2$ is convex for some $\lambda \geq 0$; when $\lambda$ is unspecified, $g$ is simply called weakly convex. Weak convexity has found many important applications, including phase retrieval, robust PCA, reinforcement learning, and many others (Duchi and Ruan, 2019; Charisopoulos et al., 2021; Wang et al., 2023). Recent years have witnessed a surge of interest in weakly convex optimization, leading to a substantial body of work on efficient algorithms with finite-time complexity guarantees (Davis and Drusvyatskiy,

¹Institute for Computational and Mathematical Engineering, Stanford University. ²Antai College of Economics and Management, Shanghai Jiao Tong University. Correspondence to: Wenzhi Gao, Qi Deng.
Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

2019; Davis et al., 2018b; Deng and Gao, 2021; Davis et al., 2019; Mai and Johansson, 2020). In particular, under the global Lipschitz continuity assumption, Davis and Drusvyatskiy (2019) develop a model-based approach and analyze the convergence of several stochastic algorithms under a unified framework. While this global Lipschitz assumption is valid for many problems, such as piecewise linear functions, it can be overly restrictive. To illustrate, consider the weakly convex function $\psi(x) = |e^x + e^{-x} - 3|$, whose subgradient $\psi'(x)$ explodes exponentially as $x$ grows (Figure 1). Hence, treating the Lipschitz constant as any fixed constant in algorithm design can lead to highly unstable iterations and, potentially, to algorithm divergence.

[Figure 1: Exponential growth of $\psi(x) = |e^x + e^{-x} - 3|$.]

To address this issue, a straightforward strategy is to impose an explicit convex compact set constraint, such as $\{x : \|x\| \leq B\}$. However, this introduces extra parameter tuning and may lead to a significantly overestimated Lipschitz constant. The latter phenomenon is evident in the toy example, where the Lipschitz constant grows exponentially with the domain set's diameter. One research direction for dealing with non-globally-Lipschitz settings is shifting from standard Euclidean geometry to Bregman divergence. Lu (2019) shows that when convex non-Lipschitz functions exhibit relative Lipschitz continuity under a carefully chosen divergence kernel, mirror descent still obtains the desired sublinear rate of convergence to optimality, defined in the sense of Bregman divergence. However, there are trade-offs to consider. Compared to SGD, a mirror descent update is more expensive, often involving a nontrivial root-finding procedure.
Additionally, choosing the right kernel is a nuanced and critical task, heavily reliant on an in-depth understanding of the subgradient's growth dynamics (Davis et al., 2018a; Zhang and He, 2018). Alternatively, recent works aim to develop new algorithms and analyses under relaxed Lipschitz assumptions. For example, Asi and Duchi (2019) show that for the stochastic proximal point method, the algorithmic dependency on the global Lipschitz constant can be relaxed to $\mathbb{E}[\|f'(x^\star, \xi)\|^2]$, which only relies on the optimal solution $x^\star$. Mai and Johansson (2021) show that, for stochastic convex optimization with quadratic growth, subgradient methods incorporating a clipping stepsize still ensure convergence, even if the Lipschitz constant exhibits arbitrary growth. In weakly convex optimization, Li et al. (2023b) show that when the Moreau envelope of the objective has a bounded level-set, local Lipschitz continuity alone is sufficient to ensure convergence of the subgradient method. Nevertheless, extending their analysis to stochastic optimization remains challenging. Grimmer (2019) establishes the convergence of the normalized subgradient method in convex optimization without Lipschitz continuity by considering an upper bound of the form
$$f(x) - f(x^\star) \leq G(\|x - x^\star\|), \qquad (2)$$
where $G$ is a growth function that allows fast growth of $f$. In (Zhu et al., 2023), the authors propose a relaxed subgradient bound for weakly convex optimization:
$$\mathbb{E}[\|f'(x, \xi)\|^2] \leq c_0 + c_1\|x\|, \qquad c_0, c_1 \geq 0, \qquad (3)$$
which naturally induces a bound on the local Lipschitzness as a function of $\|x\|$. Whether SGD still converges in the case of arbitrary non-Lipschitzness, especially for objectives not conforming to the bounded-growth assumption in (3), remains an open area of investigation. The primary difficulty in analyzing stochastic optimization without the standard Lipschitz assumption stems from stability issues.
We consider a stochastic algorithm stable if it produces iterates that stay in a bounded set with probability greater than 0. Unlike the deterministic case, establishing stability in the face of randomness is not straightforward, especially when dealing with non-convex functions. This challenge motivates exploring an appropriate definition of non-Lipschitzness and the development of efficient algorithms for stochastic weakly convex optimization in this non-standard setting.

1.1. Contributions

This paper provides an affirmative answer to the question: "Can we optimize stochastic weakly convex problems without assuming global Lipschitz continuity?" We show that carefully chosen robust stepsizes can effectively adapt to arbitrary non-Lipschitzness in stochastic model-based weakly convex optimization. Our contributions are as follows:

1) When the Lipschitz constant is not uniformly bounded above but instead depends on a general growth function $G(\cdot)$, we design a novel robust adaptive stepsize strategy such that stochastic weakly convex optimization achieves the $O(1/\sqrt{K})$ convergence rate with a constant failure probability. Our analysis does not assume any specific form of $G$, such as those implied by (3). To our knowledge, this is the first result on stochastic weakly convex optimization for arbitrary non-Lipschitz objectives. Our analysis applies to a broad class of model-based algorithms (Davis and Drusvyatskiy, 2019; Deng and Gao, 2021), including SGD as a special case. Compared to Davis and Drusvyatskiy (2019), our analysis relaxes the global Lipschitz assumption and makes the model-based framework applicable to a broader range of settings.

2) Even if the growth function $G$ is unknown, we show that achieving the same convergence guarantee is still possible. To this end, we introduce a new robust stepsize based on the concept of "reference Lipschitz continuity", which allows us to estimate the Lipschitz parameter of a stochastic model function using local samples.
Our algorithm is highly flexible and can be applied to most weakly convex problems of interest. Moreover, our analyses can be extended to solving convex stochastic optimization without Lipschitz continuity. A more detailed discussion is left to Section E.

Model-based Optimization. Model-based optimization, as proposed by Davis and Drusvyatskiy (2019), serves as a general framework for analyzing stochastic weakly convex optimization. This framework has been leveraged by several papers (Davis et al., 2018a; Chadha et al., 2022; Deng and Gao, 2021; Gao and Deng, 2024) to obtain convergence rates for a broad class of algorithms. Our analysis also builds on this framework and extends it to several algorithms.

Other Related Works. Adaptive stepsizes and gradient clipping are two essential tools adopted in our algorithmic framework. On the one hand, stepsize selection has been an important topic in stochastic optimization, and it has been justified that adaptive stepsizes benefit stochastic first-order methods both in theory and in practice (Duchi et al., 2011; Kingma and Ba, 2014; Li and Orabona, 2019; Hinton et al., 2012; Defazio and Mishchenko, 2023; Ivgi et al., 2023; Malitsky and Mishchenko, 2023). On the other hand, gradient clipping (Zhang et al., 2019) is employed as a technique in this paper. In theory, gradient clipping was initially identified as a tool for solving problems with a generalized Lipschitz smoothness condition (Li et al., 2023a; Xie et al., 2023; Zhang et al., 2019). Recent works (Gorbunov et al., 2020; Koloskova et al., 2023) show that gradient clipping can effectively deal with problems with heavy-tailed noise. It is also observed that gradient clipping improves the robustness and stability of SGD (Mai and Johansson, 2020) in stochastic convex optimization. In our analysis, a generalized version of gradient clipping is developed to alleviate the instability arising from stochastic noise.
2. Preliminaries

Notations. Throughout the paper, $\langle \cdot, \cdot \rangle$ and $\|\cdot\|$ denote the Euclidean inner product and norm. The subdifferential of $f$ is given by $\partial f(x) := \{v : f(y) \geq f(x) + \langle v, y - x \rangle + o(\|x - y\|), \ y \to x\}$, and $f'(x) \in \partial f(x)$ is called a subgradient. A growth function $G(\cdot)$ is a continuous non-decreasing function mapping from $\mathbb{R}_+$ to $\mathbb{R}_+$.

Model Function and Model-based Optimization. Our main algorithm will be presented in a model-based fashion, which encompasses several first-order methods, including the most widely used (proximal) subgradient method. Model-based optimization (Davis and Drusvyatskiy, 2019) contains two components: a stochastic model function and a stepsize (regularization) parameter. In each iteration, we construct a local approximation of $f(x)$ based on a random sample $\xi_k$ and the current iterate $x_k$. The stochastic function, denoted by $f_{x_k}(\cdot, \xi_k) + \omega(\cdot)$, is called a model function. Then we take a parameter $\gamma_k$ and minimize this local approximation under the quadratic regularization $\frac{\gamma_k}{2}\|x - x_k\|^2$ to obtain the next iterate $x_{k+1}$. Typical models include

(Sub)gradient. Given $\mathbb{E}[f'(x, \xi)] = f'(x) \in \partial f(x)$, $f_x(y, \xi) = f(x, \xi) + \langle f'(x, \xi), y - x \rangle$.

Prox-linear. Given $f(x, \xi) = h(c(x, \xi))$, $f_x(y, \xi) = h(c(x, \xi) + \langle \nabla c(x, \xi), y - x \rangle)$.

Truncated. Given a known lower bound $\ell$ of the model, $f_x(y, \xi) = \max\{f(x, \xi) + \langle f'(x, \xi), y - x \rangle, \ell\}$.

Algorithm 1 summarizes model-based optimization.

Algorithm 1 Stochastic model-based optimization
  Input: $x_1$
  for $k = 1, 2, \ldots$ do
    Sample data $\xi_k$ and choose regularization $\gamma_k > 0$
    $x_{k+1} = \arg\min_x \ f_{x_k}(x, \xi_k) + \omega(x) + \frac{\gamma_k}{2}\|x - x_k\|^2$

We see that 1) the model function $f_x(\cdot, \xi)$ and 2) the regularization parameter $\gamma_k$ are the two core components of our algorithm design. Throughout this paper, we show how a properly chosen $\gamma_k$ improves convergence beyond Lipschitz continuity. We start by making assumptions.

Envelope Smoothing. Our analysis adopts the Moreau envelope as the potential function for weakly convex optimization. Let $f$ be a $\lambda$-weakly convex function.
Given $\rho > \lambda$, the Moreau envelope and the associated proximal mapping of $f$ are given by
$$f_{1/\rho}(x) := \min_y \ f(y) + \tfrac{\rho}{2}\|y - x\|^2, \qquad \mathrm{prox}_{f/\rho}(x) := \arg\min_y \ f(y) + \tfrac{\rho}{2}\|y - x\|^2.$$
The Moreau envelope can be interpreted as a smooth approximation of the original function: $f_{1/\rho}(x)$ is differentiable with gradient
$$\nabla f_{1/\rho}(x) = \rho(x - \mathrm{prox}_{f/\rho}(x)).$$
If $\|\nabla f_{1/\rho}(x)\| \leq \varepsilon$, then $x$ is in the proximity of a near-stationary point $\mathrm{prox}_{f/\rho}(x)$. An important observation is that the existence of $f_{1/\rho}(x)$ relies on weak convexity rather than Lipschitz continuity of $f$.

Assumptions. We make the following assumptions.

A1: It is possible to generate i.i.d. samples $\{\xi_k\}$.
A2: $\omega(x)$ is $\kappa$-weakly convex and $L_\omega$-Lipschitz continuous for all $x \in \mathrm{dom}\,\omega$.
A3: $\mathbb{E}_\xi[f_x(x, \xi)] = f(x)$ for all $x \in \mathrm{dom}\,\omega$ and $\mathbb{E}_\xi[f_x(y, \xi) - f(y)] \leq \frac{\tau}{2}\|x - y\|^2$ for all $x, y \in \mathrm{dom}\,\omega$. Moreover, $f_x(y, \xi)$ is convex in $y$ for all $x, y \in \mathrm{dom}\,\omega$ and $\xi \in \Xi$.
A4: $\psi_{1/\rho}(x)$ is lower bounded by $\Lambda \geq 0$.
A5: The $v$-level-set of $\psi_{1/\rho}(x)$, $\mathcal{L}_v = \{x : \psi_{1/\rho}(x) \leq v\}$, has a bounded diameter $\mathrm{diam}(\mathcal{L}_v) \leq B_v < \infty$.

Remark 1. Given A1 to A3, it follows that $f(x)$ is $\tau$-weakly convex and $\psi(x)$ is $(\tau + \kappa)$-weakly convex.

Remark 2. In A3, we adopt a convex model function since all the models considered in this paper are convex. But the result directly generalizes to weakly convex model functions: for example, if $f_x(y, \xi)$ is $\lambda$-weakly convex, then $f_x(y, \xi) + \omega(y) = \big(f_x(y, \xi) + \frac{\lambda}{2}\|y - x\|^2\big) + \big(\omega(y) - \frac{\lambda}{2}\|y - x\|^2\big)$, and we can redefine $f_x$ and $\omega$ to push the weak convexity into the proximal term.

Remark 3. A5 is a key assumption in dealing with non-Lipschitzness. It implies that, by bounding the Moreau envelope as the potential function, we can control the Lipschitzness as a function of $\|x\|$. Typically, A5 is satisfied when $\psi$ is coercive (Li et al., 2023b), which is natural in our non-Lipschitz (e.g., a high-order polynomial) context, as the function growth is faster than linear.
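To make the envelope concrete, here is a small numerical illustration of our own (not from the paper): for $f(x) = |x|$, which is $0$-weakly convex, $\mathrm{prox}_{f/\rho}$ is the soft-thresholding operator, and $\nabla f_{1/\rho}(x) = \rho(x - \mathrm{prox}_{f/\rho}(x))$ is exactly the clipped (Huber-type) gradient.

```python
import numpy as np

def prox_abs(x, rho):
    """prox_{f/rho}(x) = argmin_y |y| + (rho/2)(y - x)^2, i.e. soft-thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - 1.0 / rho, 0.0)

def moreau_grad(x, rho):
    """grad f_{1/rho}(x) = rho * (x - prox_{f/rho}(x))."""
    return rho * (x - prox_abs(x, rho))

rho = 2.0
for x in [-3.0, -0.1, 0.2, 5.0]:
    g = moreau_grad(x, rho)
    # For |x| > 1/rho the envelope gradient equals sign(x);
    # for |x| <= 1/rho it equals rho * x (smooth near the kink).
    print(x, g)
```

In particular, `moreau_grad` vanishes exactly at the minimizer $x = 0$, illustrating why a small $\|\nabla f_{1/\rho}(x)\|$ certifies near-stationarity.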
It is important to note that non-Lipschitzness can also arise in other cases, such as in the interpolation setting, which may require different analyses (Li et al., 2023b).

Structure of the Paper. The paper is organized as follows. Section 3 discusses the convergence of weakly convex optimization under the standard Lipschitz condition, which serves as a benchmark and provides intuition for the algorithm design in more challenging scenarios. Sections 4 and 5 discuss two cases where the standard Lipschitz condition fails to hold. Section 6 conducts numerical experiments to verify our results.

3. Optimization of Standard Lipschitzness

The results in this section are already available in the literature (Davis and Drusvyatskiy, 2019); the goal is to provide a benchmark result and establish some basic intuitions. We assume that

B1: For any $x, y$ and $\xi \in \Xi$, $f_x(x, \xi) - f_x(y, \xi) \leq L_f(\xi)\|x - y\|$ and $\mathbb{E}[L_f(\xi)^2] \leq L_f^2$.

B1 is common in nonsmooth optimization, and the following descent property is available.

Lemma 3.1. Let $\hat{x}_k = \mathrm{prox}_{\psi/\rho}(x_k)$. Suppose that A1 to A3 as well as B1 hold. Then, given $\rho > \kappa + \tau$ and $\gamma_k > \rho$,
$$\mathbb{E}_k[\psi_{1/\rho}(x_{k+1})] \leq \psi_{1/\rho}(x_k) - \frac{\rho(\rho - \tau - \kappa)}{2(\gamma_k - \kappa)}\|\hat{x}_k - x_k\|^2 + \frac{2\rho L_f^2}{(\gamma_k - \rho)(\gamma_k - \kappa)}, \qquad (5)$$
where $\mathbb{E}_k[\cdot] := \mathbb{E}[\,\cdot\,|\,\xi_1, \ldots, \xi_k]$ denotes the conditional expectation taken over $\xi_1, \ldots, \xi_k$.

Since $\gamma_k$ is generally much larger than the other constants in the algorithm, Lemma 3.1 reveals the following relation:
$$\mathbb{E}_k[\psi_{1/\rho}(x_{k+1})] \leq \psi_{1/\rho}(x_k) - O(\gamma_k^{-1})\|\nabla\psi_{1/\rho}(x_k)\|^2 + O(L_f^2\gamma_k^{-2}), \qquad (6)$$
where $O(L_f^2\gamma_k^{-2})$ characterizes the error from both stochastic noise and nonsmoothness. Taking $\gamma_k \equiv O(\sqrt{K})$ and telescoping Lemma 3.1, we get the convergence result.

Theorem 3.1. Under the same conditions as Lemma 3.1, if we take $\gamma_k \equiv \rho + \kappa + \alpha\sqrt{K}$, then
$$\min_{1 \leq k \leq K} \mathbb{E}[\|\nabla\psi_{1/\rho}(x_k)\|^2] \leq \frac{2\rho}{\rho - \tau - \kappa}\Big[\frac{\rho D}{K} + \frac{\alpha D}{\sqrt{K}} + \frac{2\rho L_f^2}{\alpha\sqrt{K}}\Big],$$
where $D = \psi(x_1) - \inf_x \psi(x)$. Theorem 3.1 is standard in the literature (Davis and Drusvyatskiy, 2019).
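As a concrete illustration of Algorithm 1 in this benchmark setting, note that under the subgradient model with $\omega \equiv 0$, the subproblem $\arg\min_x f(x_k, \xi_k) + \langle f'(x_k, \xi_k), x - x_k \rangle + \frac{\gamma_k}{2}\|x - x_k\|^2$ has the closed form $x_{k+1} = x_k - f'(x_k, \xi_k)/\gamma_k$, i.e., SGD with stepsize $1/\gamma_k$. The following minimal sketch (our own toy instance and constants, not the authors' code) runs this update with the constant regularization $\gamma_k \equiv \rho + \kappa + \alpha\sqrt{K}$ of Theorem 3.1:

```python
import numpy as np

def sgd_constant_gamma(grad_fn, x0, K, rho=1.0, kappa=0.0, alpha=1.0, seed=0):
    """Algorithm 1 with the subgradient model and omega = 0: the proximal
    subproblem reduces to x_{k+1} = x_k - f'(x_k, xi_k) / gamma_k, with the
    constant regularization gamma_k = rho + kappa + alpha*sqrt(K)."""
    rng = np.random.default_rng(seed)
    gamma = rho + kappa + alpha * np.sqrt(K)
    x = x0.copy()
    for _ in range(K):
        x = x - grad_fn(x, rng) / gamma
    return x

# Toy Lipschitz instance: f(x) = E|<a, x>| with a ~ N(0, I), minimized at x* = 0.
grad = lambda x, rng: (lambda a: np.sign(a @ x) * a)(rng.standard_normal(x.size))
x = sgd_constant_gamma(grad, x0=np.full(5, 3.0), K=10000)
print(np.linalg.norm(x))  # small residual norm around the minimizer
```

The $1/\gamma_k = O(1/\sqrt{K})$ stepsize is exactly the conservative choice that balances the two terms in (6).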
One important intuition we want to establish is that the choice $\gamma_k \equiv O(\sqrt{K})$ is a consequence of the following trade-off: suppose we telescope (6) directly; then
$$\frac{1}{K}\sum_{k=1}^K O(\gamma_k^{-1})\,\mathbb{E}[\|\nabla\psi_{1/\rho}(x_k)\|^2] \leq \frac{D}{K} + \frac{1}{K}\sum_{k=1}^K O(L_f^2\gamma_k^{-2}).$$
First, we need a large $\gamma_k$, in other words a conservative stepsize, so that the error of potential reduction $O(\sum_k L_f^2\gamma_k^{-2})$ is properly bounded. Meanwhile, a large $\gamma_k$ also makes the amount of potential reduction $O(\gamma_k^{-1})\|\nabla\psi_{1/\rho}(x_k)\|^2$ small. The optimal trade-off is $\gamma_k \equiv O(\sqrt{K})$, giving an $O(1/\sqrt{K})$ rate of convergence. As we will show in the following sections, with non-Lipschitzness, the error $O(L_f^2\gamma_k^{-2})$ cannot be bounded by choosing a constant $\gamma_k$. What we do instead is adaptively find a suitable and robust $\gamma_k$ to reduce the error. Using A5 and a probabilistic analysis, we achieve this goal without compromising the convergence rate.

4. Optimization of Generalized Lipschitzness

Before achieving our goal of solving non-Lipschitz weakly convex optimization problems, we start from a less challenging case characterized as follows.

C1: For all $x, y, z$ and $\xi \in \Xi$, $f_x(y, \xi) - f_x(z, \xi) \leq L_f(\xi)G(\|x\|)\|y - z\|$ and $\mathbb{E}[L_f(\xi)^2] \leq L_f^2$.

Recall that the growth function $G$ is monotonically non-decreasing. This assumption implies that our model function is globally Lipschitz, but the Lipschitz constant has a known dependency on the norm of the model's expansion point $x$. Our analysis also applies if we can only estimate a non-trivial upper bound of $G$; for brevity of analysis, we take this upper bound to be $G$ itself. Many real-life applications have this structure, especially when the source of non-Lipschitzness is a high-order polynomial.

Example 4.1 (Phase retrieval). Consider $f(x, \xi) = |\langle a, x \rangle^2 - b|$, $a \in \mathbb{R}^n$, $b \in \mathbb{R}_+$. The subgradient model
$$f_x(y, \xi) = f(x, \xi) + \langle f'(x, \xi), y - x \rangle = f(x, \xi) + 2\,\mathrm{sign}(\langle a, x \rangle^2 - b)\langle a, x \rangle \langle a, y - x \rangle$$
satisfies $f_x(y, \xi) - f_x(z, \xi) \leq 2\|a\|^2\|x\|\,\|y - z\|$.

Example 4.2 (Subgradient with known growth).
SGD corresponds to the model function $f_x(y, \xi) = f(x, \xi) + \langle f'(x, \xi), y - x \rangle$. Then, C1 is satisfied if $\|f'(x, \xi)\| \leq L_f(\xi)G(\|x\|)$, and it follows that $\mathbb{E}[\|f'(x, \xi)\|^2] \leq L_f^2 G^2(\|x\|)$. In particular, (3) corresponds to the case of $G^2(\cdot)$ being a linear function.

One direct consequence of C1 is that the Lipschitz constant of $f_x(\cdot, \xi)$ can go to $\infty$ as $\|x\| \to \infty$. Moreover, we cannot directly rely on the Lipschitzness of $f(x)$. Taking the subgradient update as an example, this implies $f'(x, \xi)$ can have a large norm, leading to a higher chance of divergence. From the perspective of convergence analysis, the error term $O(L_f^2\gamma_k^{-2})$ in (6) becomes hard to bound, and Lemma 4.1 quantifies this hardness.

Lemma 4.1. Suppose A1 to A3 as well as C1 hold. Then, given $\rho > \kappa + \tau$ and $\gamma_k > \rho$,
$$\mathbb{E}_k[\psi_{1/\rho}(x_{k+1})] \leq \psi_{1/\rho}(x_k) - \frac{\rho(\rho - \tau - \kappa)}{2(\gamma_k - \kappa)}\|\hat{x}_k - x_k\|^2 + \frac{\rho}{2\gamma_k(\gamma_k - \kappa)}(G(\|x_k\|)L_f + L_\omega)^2,$$
where $\gamma_k$ is independent of $\xi_k$.

According to Lemma 4.1, the error of potential reduction involves the norm of $x_k$. Since $G(\|x_k\|)$ is not necessarily bounded above, the previous constant-stepsize analysis is not applicable. As a natural fix, we can take $\gamma_k = O(G(\|x_k\|)\sqrt{K})$ to reduce the error. However, according to the trade-off we previously mentioned, unless $G(\|x_k\|)$ is bounded by some constant independent of $K$, the reduction in the potential function can be arbitrarily small, and we still cannot obtain a convergence rate. To resolve this issue, we essentially need to show the boundedness of $\{\|x_k\|\}$, and our solution is to associate the boundedness of $\{\|x_k\|\}$ with another quantity that stays bounded during the algorithm: $\mathbb{E}[\psi_{1/\rho}(x_k)]$. Intuitively, since $\mathbb{E}[\psi_{1/\rho}(x_k)]$ is reduced by the algorithm, it remains bounded in expectation, and from A5 we know that boundedness of $\psi_{1/\rho}(x_k)$ implies boundedness of $\|x_k\|$, giving bounded $G(\|x_k\|)$ and the $O(1/\sqrt{K})$ rate we want. The following asymptotic result confirms our intuition.

Theorem 4.1.
Under the same conditions as Lemma 4.1 together with A4, A5, if $\gamma_k = \rho + \kappa + (G(\|x_k\|) + 1)k^\zeta$, $\zeta \in (\frac{1}{2}, 1)$, then as $k \to \infty$, $\{\|x_k\|\}$ is bounded with probability 1; moreover, the sequence $\{\inf_{j \leq k} \|\nabla\psi_{1/\rho}(x_j)\|\}$ converges to 0 almost surely.

While it is relatively easy to show asymptotic convergence, we need a more careful analysis of the algorithm's behavior to obtain a finite-time convergence rate. One significant difficulty is that the boundedness of $\mathbb{E}[\psi_{1/\rho}(x_k)]$ may not directly provide information about $\|x_k\|$, since this relation holds only in expectation. To deal with this issue, we resort to probabilistic tools and establish a new probabilistic argument in the following subsection.

4.1. Stability of the Iterations

In this subsection, we aim to analyze the stability of the iterates of a stochastic algorithm on a non-Lipschitz function. The intuition is straightforward: if a stochastic algorithm reduces some potential function that has a bounded level-set, then the iterates will stay in a bounded region with high probability. We provide the basic proof sketch and leave a more rigorous argument to the appendix. Our analysis relies on two simple facts that we gain from the robust stepsize.

Lemma 4.2 (Informal). Under the same conditions as Lemma 4.3, if $\gamma_k = O((G(\|x_k\|) + 1)\sqrt{K})$, then for all $k = 2, \ldots, K$,
$$\mathbb{E}[\|x_{k+1} - x_k\|] \leq O(1/\sqrt{K}), \qquad (7)$$
$$\mathbb{E}[\psi_{1/\rho}(x_k)] \leq O(1). \qquad (8)$$

Relation (7) says that with our robust stepsize strategy, we cautiously explore the feasible region: at each iteration, we only take a small step of $O(1/\sqrt{K})$. The second relation (8) comes directly from Lemma 4.1. Indeed, with $\gamma_k = O((G(\|x_k\|) + 1)\sqrt{K})$ we have $\frac{\rho(G(\|x_k\|)L_f + L_\omega)^2}{2\gamma_k(\gamma_k - \kappa)} = O(1/K)$, and if we telescope Lemma 4.1 and take expectation over all the randomness, for $k = 2, \ldots, K$ we have $\mathbb{E}[\psi_{1/\rho}(x_k)] \leq \psi_{1/\rho}(x_1) + \sum_{j=1}^k O(1/K) = O(1)$. Each of the two relations alone may not offer helpful information, as they both hold only in expectation. However, when they are combined, a more useful result is available.
Our argument is as follows. Consider the event "$\|x_k\|$ is large", whose probability we wish to upper-bound. We have the following facts:

1. If $\|x_k\|$ is large, then since $\|x_{k+1} - x_k\| \leq O(1/\sqrt{K})$, the triangle inequality implies $\|x_{k+1}\|$ is also large.
2. If $\|x_{k+1}\|$ is large, then $\psi_{1/\rho}(x_{k+1})$ is large by A5.
3. $\mathbb{E}[\psi_{1/\rho}(x_{k+1})]$ is bounded by some constant.

In other words, conditioned on the event "$\|x_k\|$ is large", to ensure that $\mathbb{E}[\psi_{1/\rho}(x_{k+1})]$ is still bounded, either Case 1): the event happens with low probability, or Case 2): $x_{k+1}$ has to immediately jump back to a bounded region of smaller radius. However, since our robust stepsize restricts the jump between two consecutive iterations, Case 2) cannot happen. Therefore, it is unlikely that "$\|x_k\|$ is large". This argument yields the following tail bound, which characterizes the behavior of $\|x_k\|$ as a random variable.

Lemma 4.3. Under the same conditions as Lemma 4.1 together with A4, A5, if we take $\gamma_k = \rho + \kappa + \tau + \alpha(G(\|x_k\|) + 1)\sqrt{K}$, then the tail bound
$$\mathbb{P}\Big[\|x_k\| \geq B_{a\Delta} + \frac{4(L_f + L_\omega)}{\alpha}\Big] \leq \frac{1}{a}$$
holds for all $2 \leq k \leq K$ and $a > 0$, where $\Delta = \psi_{1/\rho}(x_1) + \Lambda + \frac{\rho}{\alpha^2}(L_f + L_\omega)^2$, and recall that $\mathrm{diam}(\mathcal{L}_{a\Delta}) \leq B_{a\Delta}$.

Lemma 4.3 provides a useful characterization of the tail probability of the norm of the iterates. Now that the bound holds for all $x_k$, we can condition on the event that $\Theta(K)$ iterations lie in the bounded set to retrieve an $O(1/\sqrt{K})$ convergence rate.

Theorem 4.2. Under the same conditions as Lemma 4.3, given $\delta \in (0, 1/4)$, with probability at least $1 - p$, $p \in (2\delta, 1)$, at least $(1 - 2p^{-1}\delta)K$ iterations are bounded by $R(\delta) = B_{\delta^{-1}\Delta} + \frac{4(L_f + L_\omega)}{\alpha}$, and conditioned on these iterations,
$$\min_{1 \leq k \leq K} \mathbb{E}[\|\nabla\psi_{1/\rho}(x_k)\|^2] \leq \frac{pM}{p - 2\delta}\Big(\frac{\rho + \tau + \kappa}{K} + \frac{\alpha(G_\delta + 1)}{\sqrt{K}}\Big),$$
where $M = \frac{2\rho}{\rho - \tau - \kappa}D + \frac{\rho}{2\alpha^2}(L_f + L_\omega)^2$ and $G_\delta := \max_{0 \leq z \leq R(\delta)} G(z)$.

Theorem 4.2 shows that, with constant probability, we retrieve the $O(1/\sqrt{K})$ convergence rate after $K$ iterations. This probability argument can be further improved, for example, by running the algorithm independently multiple times (Davis and Grimmer, 2019).
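A minimal sketch of the robust stepsize of Lemma 4.3 applied to the phase-retrieval loss of Example 4.1 (our own illustration with illustrative constants $\rho, \kappa, \tau, \alpha$ and synthetic data, not the authors' implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, K = 10, 200, 2000
A = rng.standard_normal((m, n))
x_true = rng.standard_normal(n)
b = (A @ x_true) ** 2                      # clean phase-retrieval measurements

def subgrad(x, a, bi):
    """Stochastic subgradient of f(x, xi) = |<a, x>^2 - b_i|."""
    r = np.dot(a, x)
    return 2.0 * np.sign(r * r - bi) * r * a

G = lambda t: t                            # growth function G(||x||) = ||x|| (cf. Example 4.1)
rho, kappa, tau, alpha = 1.0, 0.0, 0.0, 1.0

x = x_true + 0.5 * rng.standard_normal(n)  # warm start near the signal
for k in range(K):
    i = rng.integers(m)
    g = subgrad(x, A[i], b[i])
    # Robust regularization: gamma_k = rho + kappa + tau + alpha*(G(||x_k||)+1)*sqrt(K)
    gamma = rho + kappa + tau + alpha * (G(np.linalg.norm(x)) + 1.0) * np.sqrt(K)
    x = x - g / gamma                      # subgradient-model update of Algorithm 1

err = min(np.linalg.norm(x - x_true), np.linalg.norm(x + x_true))
print(err)                                 # distance to the signal (up to sign)
```

Because $\gamma_k$ scales with $G(\|x_k\|)$, the effective stepsize $1/\gamma_k$ shrinks automatically wherever the subgradient norm can be large, which is precisely the stability mechanism formalized above.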
The analysis in this section serves as a stepping stone for the next section, where we deal with non-Lipschitz optimization without knowing $G(\cdot)$.

5. Optimization of Unknown Lipschitzness

While the analysis in Section 4 extends the solvability of weakly convex optimization to non-Lipschitz functions, it relies on knowledge of an explicit growth function $G(\|x\|)$ to bound the local Lipschitzness. However, it is possible that either access to $G(\|x\|)$ is not viable, or the bound lacks a predefined functional form. In these cases, we assume that the growth function is unknown a priori.

D1: For each fixed $x \in \mathrm{dom}\,\omega$, $f_x(z, \xi) - f_x(y, \xi) \leq \mathrm{Lip}(x, \xi)\|z - y\|$ for all $y, z$ and $\xi \in \Xi$.

Although D1 and C1 look similar, they are fundamentally different. The most direct consequence is that $\mathrm{Lip}(x, \xi)$ is sample-dependent, which means any stepsize strategy based on it will introduce bias into the stochastic algorithm. To resolve this issue, our new stepsize policy relies on constructing an estimator of $\mathrm{Lip}(x, \xi)$. We introduce the property of admitting a reliable estimate of the Lipschitz constant, which we call reference Lipschitz continuity.

5.1. Reference Lipschitz Continuity

Definition 1 (Reference Lipschitz continuity). A stochastic model $f_x(y, \xi)$ satisfies reference Lipschitz continuity if 1) given $x$, $f_x(\cdot, \xi)$ is globally Lipschitz with Lipschitz constant $\mathrm{Lip}(x, \xi)$; 2) given independent $\xi, \xi' \sim \Xi$, $\mathbb{E}_{\xi, \xi'}[|\mathrm{Lip}(x, \xi) - \mathrm{Lip}(x, \xi')|^2] \leq \sigma^2 < \infty$.

The first property, entirely determined by the stochastic model function, is typical, as most model functions are compositions of Lipschitz functions and linear expansions. The second property indicates that the expected difference between the models' Lipschitz constants is bounded by the noise parameter $\sigma$. This property is also assumed in the literature (Mai and Johansson, 2021) when dealing with the stochastic subgradient of a function with arbitrary growth. As we will demonstrate in the examples, for most functions it can be deduced from a bounded variance assumption.
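For the subgradient model, forming the reference estimate amounts to evaluating the subgradient norm at an independent sample and clipping it from below. A minimal sketch of this estimator and the induced regularization $\gamma_k = O(\max\{\mathrm{Lip}(x_k, \xi'), \alpha\}\sqrt{K})$ (our own illustration; constants and the toy instance are not from the paper):

```python
import numpy as np

def reference_lipschitz_gamma(x, grad_fn, sample, rng, alpha, K,
                              rho=1.0, kappa=0.0, tau=0.0):
    """Clipped reference-Lipschitz regularization for the subgradient model:
    draw xi' independently of the update sample and set
    gamma_k = rho + tau + kappa + max{||f'(x, xi')||, alpha} * sqrt(K)."""
    xi_ref = sample(rng)                          # independent reference sample
    lip_ref = np.linalg.norm(grad_fn(x, xi_ref))  # Lip(x, xi') = ||f'(x, xi')||
    return rho + tau + kappa + max(lip_ref, alpha) * np.sqrt(K)

# Toy check on f(x, xi) = |<a_xi, x>^2 - b_xi|: the estimate scales with the
# local subgradient norm, so the stepsize 1/gamma_k adapts to where x sits.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 4)); b = rng.random(50)

def grad_fn(x, i):
    r = A[i] @ x
    return 2.0 * np.sign(r * r - b[i]) * r * A[i]

sample = lambda rng: rng.integers(50)
g_near = reference_lipschitz_gamma(np.ones(4), grad_fn, sample, rng, alpha=1.0, K=400)
g_far = reference_lipschitz_gamma(100.0 * np.ones(4), grad_fn, sample, rng, alpha=1.0, K=400)
print(g_near, g_far)   # clipped regularization at two reference points
```

The clipping floor $\alpha$ guarantees $\gamma_k \geq \rho + \tau + \kappa + \alpha\sqrt{K}$ even when the sampled subgradient happens to be tiny, which keeps a single noisy estimate from producing an excessively aggressive step.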
One direct outcome of reference Lipschitz continuity is that we can cheaply estimate $\mathrm{Lip}(x, \xi)$ based on $\mathrm{Lip}(x, \xi')$, where $\xi$ and $\xi'$ are two independent samples drawn from $\Xi$.

Example 5.1 ((Sub)gradient). $f_x(y, \xi) = f(x, \xi) + \langle \nabla f(x, \xi), y - x \rangle$. The model is Lipschitz with $\mathrm{Lip}(x, \xi) = \|\nabla f(x, \xi)\|$. If $\mathbb{E}[\|\nabla f(x) - \nabla f(x, \xi)\|^2] \leq \sigma^2$, then
$$\mathbb{E}[|\mathrm{Lip}(x, \xi) - \mathrm{Lip}(x, \xi')|] \leq \mathbb{E}[\|\nabla f(x, \xi) - \nabla f(x, \xi')\|] \leq 2\sigma.$$
Even if $f$ is nonsmooth, the property may still hold. One example is the composition problem $f(x, \xi) = h(c(x, \xi))$, where $h$ is $L_h$-Lipschitz continuous and $c$ is differentiable; then $f'(x, \xi) \in \nabla c(x, \xi)^\top \partial h(c(x, \xi))$. If we estimate the Lipschitz constant with $L_h\|\nabla c(x, \xi)\|$, then $\mathbb{E}[|\mathrm{Lip}(x, \xi) - \mathrm{Lip}(x, \xi')|] \leq 2L_h\sigma$.

Example 5.2 (Proximal linear). $f_x(y, \xi) = h(c(x, \xi) + \langle \nabla c(x, \xi), y - x \rangle)$. When $h$ is $L_h$-Lipschitz continuous, the model is globally Lipschitz with $\mathrm{Lip}(x, \xi) = L_h\|\nabla c(x, \xi)\|$. If $\mathbb{E}[\|\nabla c(x) - \nabla c(x, \xi)\|^2] \leq \sigma^2$, then
$$\mathbb{E}[|\mathrm{Lip}(x, \xi) - \mathrm{Lip}(x, \xi')|] \leq L_h\,\mathbb{E}[\|\nabla c(x, \xi) - \nabla c(x, \xi')\|] \leq 2L_h\sigma.$$

Example 5.3 (Truncated model). $f_x(y, \xi) = \max\{f(x, \xi) + \langle \nabla f(x, \xi), y - x \rangle, \ell\}$. The model is $\|\nabla f(x, \xi)\|$-Lipschitz, and the reasoning for reference Lipschitz continuity is the same as in Example 5.1. Note that the truncated model encompasses the stochastic Polyak stepsize as a special case (Schaipp et al., 2023).

In this section, we assume that our model satisfies reference Lipschitz continuity.

D2: The stochastic model $f_x(\cdot, \xi)$ satisfies reference Lipschitz continuity with noise parameter $\sigma$.

5.2. Algorithm Design and Analysis

As we did in Section 4, before getting down to the algorithm design, we first need to see what happens to our potential reduction in this new setting. First, Lemma 5.1 characterizes the descent property of our potential function under the assumption that $\gamma_k$ is independent of $\xi_k$.

Lemma 5.1.
Suppose A1 to A3 as well as D1, D2 hold. Then, given $\rho > \kappa + \tau$,
$$\mathbb{E}_k[\psi_{1/\rho}(x_{k+1})] \leq \psi_{1/\rho}(x_k) - \frac{\rho(\rho - \tau - \kappa)}{2(\gamma_k - \kappa)}\|\hat{x}_k - x_k\|^2 + \mathbb{E}_k\Big[\frac{\rho}{2\gamma_k(\gamma_k - \kappa)}(\mathrm{Lip}(x_k, \xi_k) + L_\omega)^2\Big],$$
where $\gamma_k$ is chosen to be independent of $\xi_k$ and is considered deterministic here.

Compared to Lemma 4.1, bounding the error term $\frac{\rho(\mathrm{Lip}(x_k, \xi_k) + L_\omega)^2}{2\gamma_k(\gamma_k - \kappa)}$ becomes more challenging due to the lack of information about its growth. However, thanks to reference Lipschitz continuity, by drawing $\xi'$, an independent copy of $\xi_k$, we can utilize $\mathrm{Lip}(x_k, \xi')$ as a surrogate for $\mathrm{Lip}(x_k, \xi_k)$. The preference for $\mathrm{Lip}(x_k, \xi')$ over $\mathrm{Lip}(x_k, \xi_k)$ is driven by the fact that $\mathrm{Lip}(x_k, \xi_k)$ is correlated with $x_{k+1}$, which significantly complicates the analysis (Andradöttir, 1996). To mitigate the impact of large noise $\sigma$ on the accuracy of our estimate, we clip the estimator by a threshold $\alpha > 0$: $\max\{\mathrm{Lip}(x_k, \xi'), \alpha\}$. Consequently, we set $\gamma_k = O(\max\{\mathrm{Lip}(x_k, \xi'), \alpha\}\sqrt{K})$.

Remark 4. Our stepsize policy can be seen as a generalization of the gradient clipping stepsize to the model-based optimization setting. In particular, when $f_x(y, \xi) = f(x, \xi) + \langle \nabla f(x, \xi), y - x \rangle$ and $\xi' = \xi$, we retrieve the clipped subgradient method.

The following theorem establishes an asymptotic result and confirms our intuition.

Theorem 5.1. Under the same conditions as Lemma 5.1 together with A4, A5, if $\gamma_k = \rho + \kappa + \tau + \max\{\mathrm{Lip}(x_k, \xi'), \alpha\}k^\zeta$, $\zeta \in (\frac{1}{2}, 1)$, then as $k \to \infty$, $\{\|x_k\|\}$ is bounded with probability 1, and $\{\inf_{j \leq k} \|\nabla\psi_{1/\rho}(x_j)\|\}$ converges to 0 almost surely.

To obtain a non-asymptotic result, we again apply the probabilistic analysis to derive a tail bound.

Lemma 5.2. Under the same conditions as Lemma 5.1 together with A4, A5, if we take $\gamma_k = \rho + \tau + \kappa + \max\{\mathrm{Lip}(x_k, \xi'), \alpha\}\sqrt{K}$, then the tail bound
$$\mathbb{P}\Big[\|x_k\| \geq B_{a\Delta} + \frac{4(\alpha + \sigma + L_\omega)}{\alpha}\Big] \leq \frac{1}{a}$$
holds for all $2 \leq k \leq K$ and $a > 0$, where $\Delta = \psi_{1/\rho}(x_1) + \Lambda + \frac{\rho}{\alpha^2}(\alpha + \sigma + L_\omega)^2 > 0$.

Theorem 5.2.
Assume the conditions of Lemma 5.2 hold. Then, given $\delta \in (0, 1/4)$, with probability at least $1 - p$, $p \in (2\delta, 1)$, at least $(1 - 2p^{-1}\delta)K$ iterations lie in the ball of radius $R(\delta) = B_{\delta^{-1}\Delta} + \frac{4(\alpha + \sigma + L_\omega)}{\alpha}$, and conditioned on these iterations,
$$\min_{1 \leq k \leq K} \mathbb{E}[\|\nabla\psi_{1/\rho}(x_k)\|^2] \leq \frac{pM}{p - 2\delta}\Big(\frac{\rho + \tau + \kappa}{K} + \frac{G_\delta + \alpha}{\sqrt{K}}\Big),$$
where $M = \frac{2\rho}{\rho - \tau - \kappa}D + \frac{\rho}{\alpha^2}(\alpha + \sigma + L_\omega)^2$ and $G_\delta := \max_{\|x\| \leq R(\delta)} \sup_{\xi \in \Xi} \mathrm{Lip}(x, \xi)$.

Remark 5. We need to assume $G_\delta < \infty$ in the analysis, which is satisfied for finite-sum optimization or when the support of the data distribution $\Xi$ is bounded.

6. Experiments

In this section, we perform experiments to demonstrate the effectiveness of our proposed methods. We consider the following robust nonlinear regression problem:
$$\min_x \ \frac{1}{m}\sum_{i=1}^m |r(x, a_i) - b_i| =: \frac{1}{m}\sum_{i=1}^m f(x, \xi_i), \qquad (9)$$
where, given observations $\{a_i\}$ from $A \in \mathbb{R}^{m \times n}$, regression model $r(x, a)$, and target labels $b_i$, we aim to fit the model coefficient $x$ given problem data. Table 1 summarizes the regression models and their Lipschitz properties.

Table 1: Nonlinear regression models. $r(x, a) = \langle a, x \rangle^2$ represents the standard robust phase retrieval problem; $r(x, a) = \langle a, x \rangle^5 + \langle a, x \rangle^3 + 1$ is a high-order polynomial of $\langle a, x \rangle$; while $e^{\langle a, x \rangle} + 10$ exhibits exponential growth.

| Loss | $r(x, a)$ | $f'(x, \xi)$ | $G(\|x\|)$ | $\mathrm{Lip}(x, \xi)$ |
|------|-----------|--------------|------------|------------------------|
| $r_1$ | $\langle a, x \rangle^2$ | $2\,\mathrm{sign}(\langle a, x \rangle^2 - b)\langle a, x \rangle a$ | $\|x\|$ | $2|\langle a, x \rangle|\,\|a\|$ |
| $r_2$ | $\langle a, x \rangle^5 + \langle a, x \rangle^3 + 1$ | $\mathrm{sign}(r_2 - b)(5\langle a, x \rangle^4 + 3\langle a, x \rangle^2)a$ | $5(\|x\|^4 + \|x\|^2)$ | $5(\|a\|^4\|x\|^4 + \|a\|^2\|x\|^2)$ |
| $r_3$ | $e^{\langle a, x \rangle} + 10$ | $\mathrm{sign}(r_3 - b)e^{\langle a, x \rangle}a$ | $e^{A\|x\|}$ ($A = 3$) | $e^{\langle a, x \rangle}\|a\|$ |

6.1. Experiment Setup

Dataset. We let $m = 300$, $n = 100$. Data generation is consistent with (Deng and Gao, 2021): given a condition number parameter $\kappa \geq 1$, we compute $A = QD$, $Q \in \mathbb{R}^{m \times n}$, where each element of $Q$ is drawn from $N(0, 1)$; $D = \mathrm{diag}(d)$, $d \in \mathbb{R}^n$, $d_i \in [1/\kappa, 1]$ for all $i$. Then a true signal $\hat{x} \sim N(0, I)$ is generated, giving the measurements $b_i = r(\hat{x}, a_i)$. We randomly perturb a $p_{\mathrm{fail}}$-fraction of the measurements with additive $N(0, 25)$ noise to simulate data corruption. 1) Dataset.
We follow (Deng and Gao, 2021) and set $\kappa \in \{1, 10\}$ and $p_{\mathrm{fail}} \in \{0.2, 0.3\}$. 2) Initial point. We generate $x \sim N(0, I_n)$ and start from $x_1 = 10x/\|x\|$ for $r_1$ and $x_1 = x/\|x\|$ for $r_2, r_3$. 3) Stopping criterion. We run the algorithms for 400 epochs ($K = 400m$). Algorithms stop if $f \leq 1.2 f(\hat{x})$. 4) Stepsize. We let $\gamma_k = \theta\sqrt{K}$ for the vanilla algorithms; $\gamma_k = \theta G(\|x_k\|)\sqrt{K}$ for the robust stepsize with known growth condition; and $\gamma_k = \theta\max\{\mathrm{Lip}(x_k, \xi'), \alpha\}\sqrt{K}$ for the robust stepsize with unknown growth condition. $\theta \in [10^{-2}, 10^1]$ serves as a hyper-parameter. 5) Clipping. The clipping parameter $\alpha$ is set to 1.0. 6) Mirror descent. For experiments on mirror descent, we use the kernels for $r_1, r_2$ from (Davis et al., 2018a). We leave the detailed kernel choices to Appendix A.

6.2. Comparing Different Stepsizes

Figures 2, 3 and 4 investigate the number of iterations each stepsize needs to converge under different choices of $\theta$. As the experiments suggest, our robust choices tend to be conservative when the function exhibits low-order growth. However, when the function exhibits high-order growth, our robust stepsizes converge within a reasonable range of stepsizes. It is worth noticing that for problem $r_2$, SGD diverges over part of the tested range of $\theta$, while our proposed approaches work robustly. Moreover, we notice that our robust stepsize based on the reference Lipschitz property never diverges in practice, although it is sometimes conservative on problems where function growth is mild (such as $r_1$).

6.3. Comparison with Mirror Descent

Last, we compare our proposed method with the commonly adopted mirror descent approach for non-Lipschitz problems. We test both mirror descent and our proposed SGD-based approaches. As Figure 5 shows, mirror descent indeed often exhibits more stable performance compared to vanilla SGD. However, our approaches still exhibit superior convergence performance. It is important to note that the comparison in terms of iteration counts does not fully capture the efficiency of our approach.
Specifically, our method tackles a much simpler proximal subproblem than the nontrivial root-finding subproblem in mirror descent, which further demonstrates the advantage of our robust stepsize strategy.

7. Conclusions

We develop novel robust stepsize (regularization) strategies and show that, for weakly convex objectives without Lipschitz continuity, stochastic model-based methods can still converge at the desirable O(1/√K) rate with constant failure probability. To our knowledge, this is achieved under the least restrictive assumptions known to date. A promising direction for future research is the adaptation of our analyses to more sophisticated methods, such as momentum-based or adaptive gradient methods.

Acknowledgement and Disclosure of Funding

The authors are grateful to the Area Chairs and the anonymous reviewers for their constructive comments. This research is partially supported by the Major Program of the National Natural Science Foundation of China (Grants 72394360, 72394364).

Impact Statement

This paper discusses stochastic weakly convex optimization without standard Lipschitz continuity assumptions. We develop methods that can benefit the optimization community and have no negative social impacts.

Figure 2: Problem r1. Left two: (κ, p_fail) = (10, 0.2); right two: (κ, p_fail) = (10, 0.3). x-axis: parameter θ; y-axis: number of iterations. SGD denotes vanilla SGD; SGD-G denotes SGD robust to known Lipschitzness; SGD-R denotes SGD robust to unknown Lipschitzness. The same applies to SPL.

Figure 3: Problem r2. Left two: (κ, p_fail) = (1, 0.2); right two: (κ, p_fail) = (10, 0.3).
x-axis: parameter θ; y-axis: number of iterations.

Figure 4: Problem r3. Left two: (κ, p_fail) = (1, 0.2); right two: (κ, p_fail) = (1, 0.3). x-axis: parameter θ; y-axis: number of iterations.

Figure 5: Left two: Problem r1, (κ, p_fail) = (1, 0.3); right two: Problem r2, (κ, p_fail) = (1, 0.3). x-axis: parameter θ; y-axis: number of iterations.

References

Sigrún Andradöttir. A scaled stochastic approximation algorithm. Management Science, 42(4):475–498, 1996.

Hilal Asi and John C Duchi. Stochastic (approximate) proximal point methods: Convergence, optimality, and adaptivity. SIAM Journal on Optimization, 29(3):2257–2290, 2019.

Karan Chadha, Gary Cheng, and John Duchi. Accelerated, optimal and parallel: Some results on model-based stochastic optimization. In International Conference on Machine Learning, pages 2811–2827. PMLR, 2022.

Vasileios Charisopoulos, Yudong Chen, Damek Davis, Mateo Díaz, Lijun Ding, and Dmitriy Drusvyatskiy. Low-rank matrix recovery with composite optimization: good conditioning and rapid convergence. Foundations of Computational Mathematics, 21(6):1505–1593, 2021.

Damek Davis and Dmitriy Drusvyatskiy. Stochastic model-based minimization of weakly convex functions. SIAM Journal on Optimization, 29(1):207–239, 2019.

Damek Davis and Benjamin Grimmer. Proximally guided stochastic subgradient method for nonsmooth, nonconvex problems. SIAM Journal on Optimization, 29(3):1908–1930, 2019.

Damek Davis, Dmitriy Drusvyatskiy, and Kellie J MacPhee. Stochastic model-based minimization under high-order growth. arXiv preprint arXiv:1807.00255, 2018a.
Damek Davis, Dmitriy Drusvyatskiy, Kellie J MacPhee, and Courtney Paquette. Subgradient methods for sharp weakly convex functions. Journal of Optimization Theory and Applications, 179:962–982, 2018b.

Damek Davis, Dmitriy Drusvyatskiy, and Vasileios Charisopoulos. Stochastic algorithms with geometric step decay converge linearly on sharp functions. arXiv preprint arXiv:1907.09547, 2019.

Aaron Defazio and Konstantin Mishchenko. Learning-rate-free learning by D-adaptation. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 7449–7479. PMLR, 2023.

Qi Deng and Wenzhi Gao. Minibatch and momentum model-based methods for stochastic weakly convex optimization. Advances in Neural Information Processing Systems, 34:23115–23127, 2021.

John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(7), 2011.

John C Duchi and Feng Ruan. Solving (most) of a set of quadratic equalities: Composite optimization for robust phase retrieval. Information and Inference: A Journal of the IMA, 8(3):471–529, 2019.

Wenzhi Gao and Qi Deng. Delayed algorithms for distributed stochastic weakly convex optimization. Advances in Neural Information Processing Systems, 36, 2024.

Eduard Gorbunov, Marina Danilova, and Alexander Gasnikov. Stochastic optimization with heavy-tailed noise via accelerated gradient clipping. Advances in Neural Information Processing Systems, 33:15042–15053, 2020.

Benjamin Grimmer. Convergence rates for deterministic and stochastic subgradient methods without Lipschitz continuity. SIAM Journal on Optimization, 29(2):1350–1365, 2019.

Geoffrey Hinton, Nitish Srivastava, and Kevin Swersky. Neural networks for machine learning, lecture 6a: Overview of mini-batch gradient descent.
2012.

Maor Ivgi, Oliver Hinder, and Yair Carmon. DoG is SGD's best friend: A parameter-free dynamic step size schedule. arXiv preprint arXiv:2302.12022, 2023.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Anastasia Koloskova, Hadrien Hendrikx, and Sebastian U Stich. Revisiting gradient clipping: Stochastic bias and tight convergence guarantees. arXiv preprint arXiv:2305.01588, 2023.

Haochuan Li, Jian Qian, Yi Tian, Alexander Rakhlin, and Ali Jadbabaie. Convex and non-convex optimization under generalized smoothness. arXiv preprint arXiv:2306.01264, 2023a.

Xiao Li, Lei Zhao, Daoli Zhu, and Anthony Man-Cho So. Revisiting subgradient method: Complexity and convergence beyond Lipschitz continuity. arXiv preprint arXiv:2305.14161, 2023b.

Xiaoyu Li and Francesco Orabona. On the convergence of stochastic gradient descent with adaptive stepsizes. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 983–992. PMLR, 2019.

Haihao Lu. Relative continuity for non-Lipschitz nonsmooth convex optimization using stochastic (or deterministic) mirror descent. INFORMS Journal on Optimization, 1(4):288–303, 2019.

Vien Mai and Mikael Johansson. Convergence of a stochastic gradient method with momentum for non-smooth non-convex optimization. In International Conference on Machine Learning, pages 6630–6639. PMLR, 2020.

Vien V Mai and Mikael Johansson. Stability and convergence of stochastic gradient clipping: Beyond Lipschitz continuity and smoothness. In International Conference on Machine Learning, pages 7325–7335. PMLR, 2021.

Yura Malitsky and Konstantin Mishchenko. Adaptive proximal gradient method for convex optimization. arXiv preprint arXiv:2308.02261, 2023.

Herbert Robbins and David Siegmund. A convergence theorem for non-negative almost supermartingales and some applications.
In Optimizing Methods in Statistics, pages 233–257. Elsevier, 1971.

Fabian Schaipp, Robert M. Gower, and Michael Ulbrich. A stochastic proximal Polyak step size. Transactions on Machine Learning Research, 2023. URL https://openreview.net/forum?id=jWr41htaB3.

Qiuhao Wang, Chin Pang Ho, and Marek Petrik. Policy gradient in robust MDPs with global convergence guarantee. In International Conference on Machine Learning, pages 35763–35797. PMLR, 2023.

Chenghan Xie, Chenxi Li, Chuwen Zhang, Qi Deng, Dongdong Ge, and Yinyu Ye. Trust region methods for nonconvex stochastic optimization beyond Lipschitz smoothness. arXiv preprint arXiv:2310.17319, 2023.

Jingzhao Zhang, Tianxing He, Suvrit Sra, and Ali Jadbabaie. Why gradient clipping accelerates training: A theoretical justification for adaptivity. arXiv preprint arXiv:1905.11881, 2019.

Siqi Zhang and Niao He. On the convergence rate of stochastic mirror descent for nonsmooth nonconvex optimization. arXiv preprint arXiv:1806.04781, 2018.

Daoli Zhu, Lei Zhao, and Shuzhong Zhang. A unified analysis for the subgradient methods minimizing composite nonconvex, nonsmooth and non-Lipschitz functions. arXiv preprint arXiv:2308.16362, 2023.

Table of Contents

A  Choice of Mirror Descent Kernel
B  Proof of Results in Section 3
   B.1  Proof of Lemma 3.1
   B.2  Proof of Theorem 3.1
C  Proof of Results in Section 4
   C.1  Proof of Lemma 4.1
   C.2  Proof of Theorem 4.1
   C.3  Proof of Lemma 4.2 and 4.3
   C.4  Proof of Theorem 4.2
D  Proof of Results in Section 5
   D.1  Auxiliary Lemmas
   D.2  Proof of Lemma 5.1
   D.3  Proof of Theorem 5.1
   D.4  Proof of Lemma 5.2
   D.5  Proof of Theorem 5.2
E  Stochastic Convex Optimization beyond Lipschitz Continuity
   E.1  Convex Optimization under Standard Lipschitzness
   E.2  Convex Optimization under Generalized Lipschitzness
   E.3  Convex Optimization under Unknown Lipschitzness
   E.4  Proof of Results in Subsection E.1
   E.5  Proof of Results in Subsection E.2
   E.6  Proof of Results in Subsection E.3

Structure of the appendix. The appendix is organized as follows. In Section B, Section C and Section D, we respectively prove the results for weakly convex optimization under the different Lipschitz continuity assumptions. In Section E and its subsections, we extend our results to convex optimization.

A.
Choice of Mirror Descent Kernel

For r(x, a) = ⟨a, x⟩², we have ‖f′(x, a)‖ = ‖2⟨a, x⟩a‖ ≤ 2‖a‖²‖x‖, and the Bregman divergence kernel is taken to be d(x) = (1/2)‖x‖² + (1/4)‖x‖⁴. For r(x, a) = ⟨a, x⟩⁵ + ⟨a, x⟩³ + 1, we have ‖f′(x, a)‖ ≤ (5⟨a, x⟩⁴ + 3⟨a, x⟩²)‖a‖ ≤ 5(‖a‖⁵‖x‖⁴ + ‖a‖³‖x‖²), and the divergence kernel is taken to be d(x) = (1/2)‖x‖² + (1/10)‖x‖¹⁰. Note that to ensure strong convexity we always keep (1/2)‖x‖² in the kernel.

B. Proof of Results in Section 3

B.1. Proof of Lemma 3.1

By the optimality conditions of the proximal subproblems (4), we have, for any ξ_k ∈ Ξ, that

f_{x_k}(x_{k+1}, ξ_k) + ω(x_{k+1}) + (γ_k/2)‖x_{k+1} − x_k‖² ≤ f_{x_k}(x̂_k, ξ_k) + ω(x̂_k) + (γ_k/2)‖x̂_k − x_k‖² − ((γ_k − κ)/2)‖x_{k+1} − x̂_k‖²,
f(x̂_k) + ω(x̂_k) + (ρ/2)‖x̂_k − x_k‖² ≤ f(x_{k+1}) + ω(x_{k+1}) + (ρ/2)‖x_{k+1} − x_k‖².

Summing over the above two relations, we deduce that

((γ_k − ρ)/2)‖x_{k+1} − x_k‖² − ((γ_k − ρ)/2)‖x̂_k − x_k‖² + ((γ_k − κ)/2)‖x_{k+1} − x̂_k‖² ≤ f(x_{k+1}) − f_{x_k}(x_{k+1}, ξ_k) + f_{x_k}(x̂_k, ξ_k) − f(x̂_k).

Conditioned on ξ_1, …, ξ_{k−1} and taking expectation with respect to ξ_k, we have

((γ_k − ρ)/2)E_k[‖x_{k+1} − x_k‖²] − ((γ_k − ρ)/2)‖x̂_k − x_k‖² + ((γ_k − κ)/2)E_k[‖x_{k+1} − x̂_k‖²]
≤ E_k[f(x_{k+1}) − f_{x_k}(x_k, ξ_k)] + E_k[L(ξ_k)‖x_{k+1} − x_k‖] + (τ/2)‖x̂_k − x_k‖²   (10)
= E_k[f(x_{k+1})] − f(x_k) + E_k[L(ξ_k)‖x_{k+1} − x_k‖] + (τ/2)‖x̂_k − x_k‖²   (11)
≤ L_f E_k[‖x_{k+1} − x_k‖] + E_k[L(ξ_k)‖x_{k+1} − x_k‖] + (τ/2)‖x̂_k − x_k‖²,   (12)

where (10) uses the L_f(ξ)-Lipschitzness of f_{x_k}(x, ξ); (11) uses the quadratic bound from A3; and (12) applies the L_f-Lipschitzness of f(x) (Davis and Drusvyatskiy, 2019). Re-arranging the terms, we have

((γ_k − κ)/2)E_k[‖x_{k+1} − x̂_k‖²] ≤ ((γ_k − ρ)/2)‖x̂_k − x_k‖² − ((γ_k − ρ)/2)E_k[‖x_{k+1} − x_k‖²] + E_k[(L(ξ_k) + L_f)‖x_{k+1} − x_k‖] + (τ/2)‖x̂_k − x_k‖²
≤ ((γ_k − ρ + τ)/2)‖x̂_k − x_k‖² + 2L_f²/(γ_k − ρ)   (13)
= ((γ_k − κ)/2)‖x̂_k − x_k‖² − ((ρ − τ − κ)/2)‖x̂_k − x_k‖² + 2L_f²/(γ_k − ρ),

where (13) uses the relation −(a/2)x² + bx ≤ b²/(2a) and E_ξ[(L(ξ) + L_f)²] ≤ 4L_f².
Dividing both sides by γk κ Ek[ xk+1 ˆxk 2] ˆxk xk 2 ρ τ κ γk κ ˆxk xk 2 + 4L2 f (γk ρ)(γk κ) Stochastic Weakly Convex Optimization beyond Lipschitz Continuity and the potential function is reduced by Ek[ψ1/ρ(xk+1)] = min x {f(x) + ω(x) + ρ 2 x xk+1 2} f(ˆxk) + ω(ˆxk) + ρ 2 ˆxk xk+1 2 f(ˆxk) + ω(ˆxk) + ρ 2 ˆxk xk 2 ρ(ρ τ κ) 2(γk κ) ˆxk xk 2 + 2ρL2 f (γk ρ)(γk κ) = ψ1/ρ(xk) ρ(ρ τ κ) 2(γk κ) ˆxk xk 2 + 2ρL2 f (γk ρ)(γk κ), which completes the proof. B.2. Proof of Theorem 3.1 Given fixed stepsize γk γ = ρ + κ + α K, where we have, after telescoping, that k=1 E[ ˆxk xk 2] ψ1/ρ(x1) E[ψ1/ρ(x K+1)] + 2ρL2 f K (γ ρ)(γ κ). Re-arranging the terms and summing over k = 1, . . . , K, we have min 1 k K E[ ψ1/ρ(xk) 2] 2ρ ρ τ κ K + 2ρL2 f γ ρ K + 2ρL2 f α where D = ψ1/ρ(x1) infx ψ(x) ψ1/ρ(x1) E[ψ1/ρ(x K)] and this completes the proof. C. Proof of Results in Section 4 For brevity of notation, in the proof we define Gk := G( xk ) (14) and use them interchangeably in this section. We also note that Gk is a random variable whose randomness comes from samples from previous iterations ξ1, . . . , ξk 1. C.1. Proof of Lemma 4.1 Firstly, we still use optimality condition to get, for a given ξk, that fxk(xk+1, ξk) + ω(xk+1) + γk 2 xk+1 xk 2 fxk(ˆxk, ξk) + ω(ˆxk) + γk 2 ˆxk xk 2 γk κ 2 xk+1 ˆxk 2 f(ˆxk) + ω(ˆxk) + ρ 2 ˆxk xk 2 f(xk) + ω(xk) Summing over the above two relations, we deduce that γk 2 xk+1 xk 2 γk ρ 2 ˆxk xk 2 + γk κ 2 xk+1 ˆxk 2 f(xk) fxk(xk+1, ξk) + fxk(ˆxk, ξk) f(ˆxk) + Lω xk+1 xk (15) f(xk) f(xk, ξk) + fxk(ˆxk, ξk) f(ˆxk) + (Gk Lf(ξk) + Lω) xk+1 xk , (16) where (15) applies Lω-Lipschitz continuity of ω(x); (16) applies C1. Dividing both sides by γk κ 2 and re-arranging the terms, we have xk+1 ˆxk 2 γk ρ γk κ ˆxk xk 2 γk γk κ xk+1 xk 2 + 2 γk κ[f(xk) f(xk, ξk) + fxk(ˆxk, ξk) f(ˆxk) + (Gk Lf(ξk) + Lω) xk+1 xk ] Stochastic Weakly Convex Optimization beyond Lipschitz Continuity Next conditioned on ξ1, . . . 
, ξk 1, taking expectation with respect to ξk, and recalling that G( xk ), and therefore γk is fixed given ξ1, . . . , ξk 1, we have Ek[ xk+1 ˆxk 2] γk κ ˆxk xk 2 + 2 γk κEk[(Gk Lf(ξk) + Lω) xk+1 xk γk 2 xk+1 xk 2] (17) γk κ ˆxk xk 2 + (Gk Lf + Lω)2 γk(γk κ) (18) = ˆxk xk 2 ρ τ κ γk κ ˆxk xk 2 + (Gk Lf + Lω)2 where (17) applies f(xk) Ek[f(xk, ξk)] = 0, Ek[fxk(ˆxk, ξk)] f(ˆxk) τ 2 xk ˆxk 2; (18) applies the relation a 2x2 + bx b2 2a and E[Lf(ξ)]2 E[Lf(ξ)2] L2 f. Now we can deduce reduction of the potential function by Ek[ψ1/ρ(xk+1)] = min x {f(x) + ω(x) + ρ 2 x xk+1 2} f(ˆxk) + ω(ˆxk) + ρ 2 ˆxk xk+1 2 f(ˆxk) + ω(ˆxk) + ρ 2 ˆxk xk 2 ρ(ρ τ κ) 2(γk κ) ˆxk xk 2 + ρ(Gk Lf + Lω)2 = ψ1/ρ(xk) ρ(ρ τ κ) 2(γk κ) ˆxk xk 2 + ρ(Gk Lf + Lω)2 and this completes the proof. C.2. Proof of Theorem 4.1 First we introduce the following lemma. Lemma C.1 (Robbins-Siegmund (Robbins and Siegmund, 1971)). Let Ak, Bk, Ck and Vk be nonnegative random variables adapted to the filtration Fk and satisfying E[Vk+1|Fk] (1 + Ak)Vk + Bk Ck. Then on the event {P k=1 Ak < , P k=1 Bk < }, there is a random variable V such that Vk a.s. V and P k=0 Ck < almost surely. Now we get down to the proof. Recall that in Lemma 4.1 we have shown that Ek[ψ1/ρ(xk+1)] ψ1/ρ(xk) ρ(ρ τ κ) 2(γk κ) ˆxk xk 2 + ρ(Gk Lf + Lω)2 and we can bound Gk Lf + Lω γk κ = Gk Lf + Lω ρ + τ + kζ(Gk + 1) Gk Lf + Lω kζ(Gk + 1) Lf + Lω Ek[ψ1/ρ(xk+1) + Λ] [ψ1/ρ(xk) + Λ] ρ(ρ κ τ) 2(γk κ) ˆxk xk 2 + ρ 2k2ζ (Lf + Lω)2 Then we invoke Lemma C.1, plugging in the relation Ak = 0, Bk = ρ(Lf + Lω)2 2k2ζ , Ck = ρ(ρ κ τ) 2(γk κ) ˆxk xk 2, Vk = ψ1/ρ(xk) + Λ 0 (20) Then with ζ ( 1 2, 1) P k=1 Bk = ρ(Lf +Lω)2 2k2ζ < we know that {ψ1/ρ(xk) + Λ} ψ1/ρ(x ) + Λ < and that P k=1 ρ(ρ κ τ) 2(γk κ) ˆxk xk 2 < . By A5, xk is bounded with probability 1 and Gk is bounded almost surely. Finally P k=1 1 γk κ = infj k ˆxj xj 0 almost surely and this completes the proof since ˆxk xk = ρ 1 ψ1/ρ(xk) . Stochastic Weakly Convex Optimization beyond Lipschitz Continuity C.3. 
Proof of Lemma 4.2 and 4.3 Following (19), we first we bound the error of potential reduction by ρ(Gk Lf +Lω)2 2γk(γk κ) ρ(Lf +Lω)2 Then a telescopic sum gives, for all 2 k K that E[ψ1/ρ(xk) + Λ] ψ1/ρ(x1) + Λ + ρ(Lf + Lω)2 Here is a constant that only depends on the initialization of the algorithm. Next we consider one step of the algorithm fxk(xk+1, ξk) + ω(xk+1) + γk 2 xk+1 xk 2 fxk(xk, ξk) + ω(xk) and a re-arrangement gives 2 xk+1 xk 2 fxk(xk, ξk) fxk(xk+1, ξk) + ω(xk) ω(xk+1) (Lf(ξk)Gk + Lω) xk+1 xk , where we use C1 and Lipschitz continuity of ω(x). Dividing both sides by xk+1 xk , we have xk+1 xk 2(Lf(ξk)Gk + Lω) γk 2(Lf(ξk)Gk + Lω) Conditioned on ξ1, . . . , ξk 1 and taking expectation with respect to ξk, we get Ek[ xk+1 xk ] 2Eξk[Lf(ξk)]Gk + 2Lω K 2Lf Gk + 2Lω K 2(Lf + Lω) This completes the proof of Lemma 4.2. By Markov s inequality, we know that, for any 2 k K, the following bound holds Pξk Ξ n xk+1 xk 4(Lf + Lω) ξ1, . . . , ξk 1o 1 and without loss of generality we let Z = 4(Lf +Lω) K , and clearly Z = O(1/ This relation says it s likely that xk and xk+1 are close , and we leverage this intuition to derive a tail-bound on xk . So far, we have the following properties in hand: 1. E[ψ1/ρ(xk)] is bounded by a constant Λ for all k 2. If xk Bv, then ψ1/ρ(xk) v 3. If xk Bv, it s likely that xk+1 Bv O(1/ Our reasoning is as follows: given large a > 1, conditioned on xk Ba , then it is likely that xk+1 Ba since xk+1 xk is likely to be small. And it implies E[ψ1/ρ(xk+1)] a > . However, we know that E[ψ1/ρ(xk+1)] , and this will therefore reversely bound the probability that xk Ba + O(1/ K). We formalize the proof as follows. First recall that by A5, xk Bv implies ψ1/ρ(xk) v. Taking v = a , a > 1, we have xk Ba ψ1/ρ(xk) av. 
Now we consider the event xk Ba + Z and apply law of total expectation to get E[ψ1/ρ(xk+1) + Λ] = E[ψ1/ρ(xk+1) + Λ| xk Ba + Z] P{ xk Ba + Z} + E[ψ1/ρ(xk+1) + Λ| xk Ba + Z] P{ xk Ba + Z} E[ψ1/ρ(xk+1) + Λ| xk Ba + Z] P{ xk Ba + Z}, (21) Stochastic Weakly Convex Optimization beyond Lipschitz Continuity where (21) uses ψ1/ρ(x) + Λ 0 for all x. Next we consider the expectation E ψ1/ρ(xk+1) + Λ| xk Ba + Z , (22) and successively deduce that E[ψ1/ρ(xk+1) + Λ| xk Ba + Z] = E[ψ1/ρ(xk+1) + Λ| xk Ba + Z, xk+1 xk Z] P{ xk+1 xk Z} + E[ψ1/ρ(xk+1) + Λ| xk Ba + Z, xk+1 xk Z] P{ xk+1 xk Z} 2E[ψ1/ρ(xk+1) + Λ| xk Ba + Z, xk+1 xk Z] (23) 2(a + Λ), (24) where (23) is by Markov s inequality P{ xk+1 xk Z} 0.5 and (24) uses the triangle inequality xk+1 = xk+1 xk + xk xk xk+1 xk xk Z Ba and we recall that xk+1 Ba implies ψ1/ρ(xk+1) a . Chaining the above inequalities, we arrive at 2 P{ xk Ba + Z} Dividing both sides by a +Λ 2 gives the desired tail bound P n xk Ba + Z o 2 a + Λ. Since Λ 0, the bound is nontrivial when a > 2, and this completes the proof. C.4. Proof of Theorem 4.2 Given δ (0, 1/4), we know, from Lemma 4.3 that for 2 k K, P xk Bδ 1 + Z 2 δ 1 + Λ 2δ. Then denote Ik = I xk Bδ 1 +Z , and we have PK k=1 E[1 Ik] 2δK. By Markov s inequality, given p (2δ, 1), k=1 (1 Ik) 2δK 2δK 2p 1δK = p and a re-arrangement gives P PK k=1 Ik K(1 2p 1δ) 1 p. Now we telescope Lemma 4.1 and get k=1 E ρ(ρ κ τ) 2γk ˆxk xk 2 ψ1/ρ(x1) E[ψ1/ρ(x K+1)] + ρ 2α2 (Lf + Lω)2 (25) ψ1/ρ(x1) inf x ψ(x) + ρ 2α2 (Lf + Lω)2, (26) = D + ρ 2α2 (Lf + Lω)2 where (25) uses the previously established bound Ek h ρ(Gk Lf +Lω)2 2γk(γk κ) i ρ(Lf +Lω)2 α2K . (26) uses the fact that ψ1/ρ(x K+1) infx ψ(x). 
Next we notice, conditioned on the event PK k=1 Ik K(1 2p 1δ) and these iterations (the set {j : Ij = 1}), Stochastic Weakly Convex Optimization beyond Lipschitz Continuity defining Gδ := maxz G(z), z Bδ 1 + Z, that k=1 E ρ(ρ κ τ) 2γk ˆxk xk 2 k {j:Ij=0} E ρ(ρ κ τ) 2γk ˆxk xk 2 + X k {j:Ij=1} E ρ(ρ κ τ) 2γk ˆxk xk 2 k {j:Ij=1} E ρ(ρ κ τ) 2γk ˆxk xk 2 ρ(ρ κ τ) 2(ρ + κ + τ + α(Gδ + 1) K) E[ ˆxk xk 2] (27) ρ(ρ κ τ)(1 2p 1δ)K 2(ρ + κ + τ + α(Gδ + 1) K) min k {j:Ij=1} E[ ˆxk xk 2], (28) where (27) applies the fact that if Ij = 1, xj Gδ; (28) utilizes the fact that |{k : Ij = 1}| K(1 2p 1δ). Putting the inequalities together, we have min k {j:Ij=1} E[ ˆxk xk 2] 2(ρ + κ + τ + α(Gδ + 1) K) ρ(ρ κ τ)(1 2p 1δ)K D + ρ(Lf + Lω)2 = 2(D + ρ(Lf +Lω)2 2α2 ) ρ(ρ κ τ)(1 2p 1δ) K + α(Gδ + 1) Recalling that ˆxk xk = ρ 1 ψ1/ρ(xk) , at least with probability 1 p, min 1 k K E[ ψ1/ρ(xk) 2] min k {k:Ik=1} E[ ψ1/ρ(xk) 2] p p 2δ 2ρ ρ τ κ D + ρ(Lf + Lω)2 K + α(Gδ + 1) and this completes the proof. D. Proof of Results in Section 5 For brevity of expression, we define Lk f := Lip(xk, ξk), L f := Lip(xk, ξ ) (k is hidden when clear from the context) and use them interchangeably. D.1. Auxiliary Lemmas Lemma D.1. Given independent nonnegative random variables X and Y . If EX,Y [|X Y |2] σ2, then EX,Y h X2 max{Y 2,α2} i σ+α α 2, EX,Y h X max{Y 2,α2} i σ α2 + 1 α, EX,Y h X max{Y,α} i σ where α > 0. Proof. 
For the first relation, we successively deduce that EX,Y h X2 max{Y 2,α2} i = EX,Y h (X Y +Y )2 max{Y 2,α2} i = EX,Y h (X Y )2+Y 2+2Y (X Y ) max{Y 2,α2} i = EX,Y h (X Y )2 max{Y 2,α2} i + EY h Y 2 max{Y 2,α2} i + EX,Y h 2Y (X Y ) max{Y 2,α2} i α2 + 1 + EX,Y h 2Y |X Y | max{Y,α} max{Y,α} i (29) α + 1 = σ+α Stochastic Weakly Convex Optimization beyond Lipschitz Continuity where (29) uses a max{b,c} a c and the last inequality applies 2Y |X Y | max{Y,α} max{Y,α} = 2Y max{Y,α} |X Y | max{Y,α} 2 |X Y | Similarly we can deduce that EX,Y h X max{Y 2,α2} i EX,Y h |X Y | max{Y 2,α2} i + EY h Y max{Y 2,α2} i σ α2 + 1 EX,Y h X max{Y,α} i EX,Y h |X Y | max{Y,α} i + EX,Y h Y max{Y,α} i σ which completes the proof. D.2. Proof of Lemma 5.1 By the optimality condition, we have fxk(xk+1, ξk) + ω(xk+1) + γk 2 xk+1 xk 2 fxk(ˆxk, ξk) + ω(ˆxk) + γk 2 ˆxk xk 2 γk κ 2 xk+1 ˆxk 2 f(ˆxk) + ω(ˆxk) + ρ 2 ˆxk xk 2 f(xk) + ω(xk) and summation over the two relations gives γk 2 xk+1 xk 2 γk ρ 2 ˆxk xk 2 + γk κ 2 xk+1 ˆxk 2 f(xk) fxk(xk+1, ξk) + fxk(ˆxk, ξk) f(ˆxk) + Lω xk+1 xk f(xk) f(xk, ξk) + fxk(ˆxk, ξk) f(ˆxk) + (Lk f + Lω) xk+1 xk , (31) where (31) applies D1. Fixing ξ , we divide both sides by γk κ xk+1 ˆxk 2 γk ρ γk κ ˆxk xk 2 γk γk κ xk+1 xk 2 + 2 γk κ[f(xk) f(xk, ξk) + fxk(ˆxk, ξk) f(ˆxk) + (Lk f + Lω) xk+1 xk ] Conditioned on ξ1, . . . , ξk 1 and taking expectation with respect to ξk, we have Ek[ xk+1 ˆxk 2] γk κ ˆxk xk 2 + 2 γk κEk[(Lk f + Lω) xk+1 xk γk 2 xk+1 xk 2] (32) γk κ ˆxk xk 2 + Ek h 1 γk(γk κ)(Lk f + Lω)2i (33) where (32) again uses f(xk) Ek[f(xk, ξk)] = 0, Ek[fxk(ˆxk, ξk)] f(ˆxk) τ 2 xk ˆxk 2; (33) uses the relation a 2x2 + bx b2 2a. Now we have Ek[ xk+1 ˆxk 2] ˆxk xk 2 ρ τ κ γk κ ˆxk xk 2 + Ek h 1 γk(γk κ)(Lk f + Lω)2i . 
In view of our potential function, we have Ek[ψ1/ρ(xk+1)] = min x {f(x) + ω(x) + ρ 2 x xk+1 2} f(ˆxk) + ω(ˆxk) + ρ 2 ˆxk xk+1 2 f(ˆxk) + ω(ˆxk) + ρ 2 ˆxk xk 2 ρ(ρ τ κ) 2(γk κ) ˆxk xk 2 + Ek h ρ 2γk(γk κ)(Lk f + Lω)2i = ψ1/ρ(xk) ρ(ρ τ κ) 2(γk κ) ˆxk xk 2 + Ek h ρ 2γk(γk κ)(Lk f + Lω)2i and this completes the proof. Stochastic Weakly Convex Optimization beyond Lipschitz Continuity D.3. Proof of Theorem 5.1 Our reasoning is the same as in Theorem 4.1, and we start by bounding Eξ Ek h (Lk f +Lω)2 γk(γk κ) i . For brevity we omit Ek[ ] and notice that, for γk 2κ, that, (Lk f + Lω)2 γk(γk κ) 2(Lk f + Lω)2 γ2 k = (Lk f)2 γ2 k + 2Lk f Lω γ2 k + L2 ω γ2 k so that we can bound the three terms respectively using Eξ h (Lk f )2 i Eξ h (Lk f )2 max{(L f )2,α2}k2ζ i α + σ where we invoked Lemma D.1 and take X = Lk f, Y = L f and we recall that by reference Lipschitz property D2: E[|Lk f L f|2] σ2. Similarly, we can deduce that Eξ h 2Lk f Lω γ2 k i Eξ h 2Lk f Lω max{αL f ,α2}k2ζ i α + σ and L2 ω γ2 k L2 ω α2k2ζ since γk αkζ. Putting the things together, we have Eξ h (Lk f +Lω)2 i (α + σ)2 + 2Lω(α + σ) + L2 ω α2k2ζ = (α + σ + Lω)2 Ek[ψ1/ρ(xk+1)] ψ1/ρ(xk) Eξ ρ(ρ τ κ) ˆxk xk 2 + ρ(α + σ + Lω)2 Ek[ψ1/ρ(xk+1) + Λ] [ψ1/ρ(xk) + Λ] Eξ ρ(ρ τ κ) ˆxk xk 2 + ρ(α + σ + Lω)2 Invoking Lemma C.1, plugging in the relation Ak = 0, Vk = ψ1/ρ(xk) + Λ 0, Bk = ρ(α+σ+Lω)2 α2k2ζ and Ck = Eξ ρ(ρ τ κ) γk κ ˆxk xk 2. Note that since Eξ ρ(ρ τ κ) γk κ is determined by xk, we can view Ck = g( xk ) ˆxk xk 2 for some function g and the rest of reasoning is the same as in Lemma 4.1. D.4. Proof of Lemma 5.2 Similar to Lemma 4.3 we first bound the error of potential reduction Ek h ρ(Lk f +Lω)2 2γk(γk κ) i , and according to the proof of Lemma 5.1, " ρ(Lk f + Lω)2 ρ(α + σ + Lω)2 Telescoping the relation E[ψ1/ρ(xk+1)] ψ1/ρ(xk) + ρ(α+σ+Lω)2 E[ψ1/ρ(xk) + Λ] ψ1/ρ(x1) + Λ + ρ(α + σ + Lω)2 Next we show that Ek[ xk+1 xk ] is bounded. 
By the optimality condition we have fxk(xk+1, ξk) + ω(xk+1) + γk 2 xk+1 xk 2 fxk(xk, ξk) + ω(xk) γk κ 2 xk+1 xk 2 2 xk+1 xk 2 fxk(xk, ξk) fxk(xk+1, ξk) + ω(xk) ω(xk+1) (Lk f + Lω) xk+1 xk . Stochastic Weakly Convex Optimization beyond Lipschitz Continuity Dividing both sides by xk+1 xk , we get xk+1 xk 2(Lk f +Lω) γk 2Lk f γk + 2Lω K . Taking expectation on both sides, we have Eξ Ek[Lk f/γk] α+σ K , where we invoke Lemma D.1 with X = Lk f, Y = L f again. Now Ek[ xk+1 xk ] 2(α + σ + Lω) The rest of the reasoning is consistent with Lemma 4.3 up to difference in constants. And we still present them for completeness. Applying Markov s inequality, we know that Pξk,ξ Ξ{ xk+1 xk 4(α+σ+Lω) K ξ1, . . . , ξk 1} 1 By A5, xk Bv implies ψ1/ρ(xk) v. Taking v = a , we have xk Ba ψ1/ρ(xk) av. Without loss of generality, let Z = 4(α+σ+Lω) K > 0, and we condition on the event xk Ba + Z to deduce that E[ψ1/ρ(xk+1) + Λ] = E[ψ1/ρ(xk+1) + Λ| xk Ba + Z] P{ xk Ba + Z} + E[ψ1/ρ(xk+1) + Λ| xk Ba + Z] P{ xk Ba + Z} E[ψ1/ρ(xk+1) + Λ| xk Ba + Z] P{ xk Ba + Z}, (34) where (34) uses ψ1/ρ(x) + Λ 0 for all x. Next we consider the expectation E[ψ1/ρ(xk+1) + Λ| xk Ba + Z] and we successively deduce that E[ψ1/ρ(xk+1) + Λ| xk Ba + Z] = E[ψ1/ρ(xk+1) + Λ| xk Ba + Z, xk+1 xk Z] P{ xk+1 xk Z} + E[ψ1/ρ(xk+1) + Λ| xk Ba + Z, xk+1 xk Z] P{ xk+1 xk Z} P{ xk+1 xk Z}, (35) where (35) is by P{ xk+1 xk Z} 0.5 and that, conditioned on xk+1 xk Z, xk+1 = xk+1 xk + xk xk xk+1 xk xk Z Ba . Chaining the above inequalities, we arrive at P{ xk Ba + Z} Dividing both sides by a +Λ 2 gives the desired tail bound. D.5. Proof of Theorem 5.2 The proof again follows the clue of Theorem 4.2. Recall that Z = 4(α+σ+Lω) K and given δ (0, 1/4), P{ xk Bδ 1 + Z} 2 δ 1 + Λ 2δ. Denoting Ik = I{ xk Bδ 1 + Z}, we have PK k=1 E[1 Ik] 2δK and by Markov s inequality, given p (2δ, 1), we have k=1 (1 Ik) 2δK o 2δK 2p 1δK = p and P{PK k=1 Ik K(1 2p 1δ)} 1 p. 
Now we telescope over Lemma 5.1 and get Stochastic Weakly Convex Optimization beyond Lipschitz Continuity k=1 E ρ(ρ τ κ) 2γk ˆxk xk 2 ψ1/ρ(x1) E[ψ1/ρ(x K+1)] + ρ(α + σ + Lω)2 ψ1/ρ(x1) inf x ψ(x) + ρ 2α2 (α + σ + Lω)2 = D + ρ 2α2 (α + σ + Lω)2. Next define Gδ := max x sup ξ Ξ Lip(x, ξ) subject to x Bδ 1 + Z, and we have, conditioned on the event PK k=1 Ik K(1 2p 1δ), k=1 E ρ(ρ τ κ) 2γk ˆxk xk 2 k {j:Ij=0} E ρ(ρ τ κ) 2γk ˆxk xk 2 + X k {j:Ij=1} E ρ(ρ τ κ) 2γk ˆxk xk 2 k {j:Ij=1} E ρ(ρ τ κ) 2γk ˆxk xk 2 k {j:Ij=1} E ρ(ρ τ κ) 2(ρ + κ + τ + (α + Gδ) K) ˆxk xk 2 = ρ(ρ τ κ) 2(ρ + κ + τ + (α + Gδ) k {j:Ij=1} E[ ˆxk xk 2] ρ(ρ τ κ)(1 2p 1δ)K 2(ρ + κ + τ + (α + Gδ) K) min k {j:Ij=1} E[ ˆxk xk 2] Re-arranging the terms, we have, at least with probability 1 p, that min 1 k K E[ ψ1/ρ(xk) 2] min k {k:Ik=1} E[ ψ1/ρ(xk) 2] p p 2δ 2ρ ρ τ κ α2 (α + σ + Lω)2i ρ + λ and this completes the proof after re-arrangement. E. Stochastic Convex Optimization beyond Lipschitz Continuity In this section, we consider applying the above mentioned ideas to convex optimization. When dealing with convex optimization problems, instead of relying on Moreau envelope smoothing, we have a better potential function x x directly relevant to distance to optimal set X . This turns out greatly simplifies our assumptions. E1: f(x, ξ) is convex for all ξ Ξ. λ = κ = 0 and τ = 0. It is rather straight-forward to extend our results to convex optimization. And we remark that our analysis is different from (Mai and Johansson, 2021), where the authors focus on subgradient method and assume quadratic growth condition. E.1. Convex Optimization under Standard Lipschitzness Lemma E.1. Suppose that A1 to A3, E1 as well as B1 holds, then given γk > 0 Ek[ xk+1 x 2] xk x 2 2 γk [ψ(xk) ψ(x )] + (Lf + Lω)2 γ2 k , (36) Stochastic Weakly Convex Optimization beyond Lipschitz Continuity where x X is any optimal solution. Theorem E.1. 
Under the same assumptions as Lemma E.1, if we take γk γ = α min 1 k K E[ψ(xk) ψ(x )] 1 2 h x1 x 2α + (Lf + Lω)2 where x X is an optimal solution. Remark 6. We observe the same trade-off as in weakly convex optimization, where we have, given telescopic sum of (36), that 1 K k=1 O(γ 1 k )E[ψ(xk) ψ(x )] O 1 k=1 O(L2 fγ 2 k ). Compared with weakly convex case k=1 O(γ 1 k )E[ ψ1/ρ(xk) 2] O 1 k=1 O(L2 fγ 2 k ), this resemblance implies our previous analysis for weakly convex optimization is immediately applicable. E.2. Convex Optimization under Generalized Lipschitzness Lemma E.2. Suppose A1 to A3, E1 as well as C1 holds, then given γk > 0, Ek[ xk+1 x 2] xk x 2 2 γk [ψ(xk) ψ(x )] + (G( xk )Lf + Lω)2 where x X is an optimal solution. Theorem E.2. With the same conditions as Lemma E.2, if γk = (G( xk ) + 1)kζ, ζ 1 2, 1 , then as k , { xk } is bounded with probability 1 and {infj k f(xj) f(x )} converges to 0 almost surely. Lemma E.3. Under the same conditions as Lemma E.2, if we take γk = α(G( xk ) + 1) K, then the tail bound P xk x 2(Lf + Lω) holds for all 2 k K, where = x1 x + Lf +Lω Theorem E.3. Under the same conditions as Lemma E.2, given δ (0, 1/4), p (2δ, 1), (1 2p 1δ)K iterations will lie in the ball centered around x with radius R(δ) = δ 1 + 2(Lf +Lω) min 1 k K E[ψ(xk) ψ(x )] p p 2δ Gδ + 1 x1 x 2α + (Lf + Lω)2 where Gδ := maxz G(z), z x δ 1 + 2(Lf +Lω) Remark 7. We note that x is actually arbitrary. Therefore we can take it to be a minimum norm optimal solution to get a tighter bound. E.3. Convex Optimization under Unknown Lipschitzness Lemma E.4. Suppose that A1 to A3, E1 as well as D1, D2 hold, then given γ > 0, Ek[ xk+1 x 2] xk x 2 2 γk [ψ(xk) ψ(x )] + Ek (Lip(xk, ξk)) + Lω)2 where γk is chosen to be independent of ξk and is considered deterministic here. Stochastic Weakly Convex Optimization beyond Lipschitz Continuity Theorem E.4. 
With the same conditions as Lemma E.4, if γk = max{Lip(xk, ξk)), α}kζ, ζ ( 1 2, 1), then as k , { xk } is bounded with probability 1 and {infj k f(xj) f(x )} converges to 0 almost surely. Lemma E.5. Under the same conditions as Lemma E.4, if we take γk = max{Lip(f(xk, ξk), α} K, then the tail bound P xk x 2(α + σ + Lω) holds for all 2 k K, where = x1 x + α+σ+Lω Remark 8. Now that in convex optimization our potential function x x 2 itself already defines a bounded set. The proof of E.5 can also be done based on a conditional probability argument. Theorem E.5. Under the same conditions as Lemma E.4, given δ (0, 1/4), p (2δ, 1), (1 2p 1δ)K iterations will lie in the ball centered around x with radius R(δ) = δ 1 + 2(α+σ+Lω) min 1 k K E[ψ(xk) ψ(x )] p p 2δ Gδ + α x1 x 2α + (α + σ + Lω)2 where Gδ := maxx supξ Ξ Lip(x, ξ), x x δ 1 + 2(α+σ+Lω) E.4. Proof of Results in Subsection E.1 E.4.1. PROOF OF LEMMA E.1 Let x X be an optimal solution to the problem. Then by three-point lemma, we have fxk(xk+1, ξk) + ω(xk+1) + γk 2 xk+1 xk 2 fxk(x , ξk) + ω(x ) + γk 2 xk x 2 γk 2 xk+1 x 2. Re-arranging the terms, we deduce that 2 xk x 2 γk 2 xk+1 xk 2 + fxk(x , ξk) + ω(x ) fxk(xk+1, ξk) ω(xk+1) 2 xk x 2 γk 2 xk+1 xk 2 + (Lf(ξk) + Lω) xk+1 xk (40) + fxk(x , ξk) + ω(x ) f(xk) ω(xk), where (40) applies B1 to get fxk(xk+1, ξk) fxk(xk, ξk) Lf(ξk) xk+1 xk . Dividing both sides by γk xk+1 x 2 xk x 2 xk+1 xk 2 + 2(Lf(ξk) + Lω) γk [f(x , ξk) + ω(x ) f(xk) ω(xk)] + 2 γk [fxk(x , ξ) f(x , ξ)]. Conditioned on x1, . . . , xk and taking expectation with respect to ξk, we successively deduce that Ek[ xk+1 x 2] xk x 2 Ek[ xk+1 xk 2] + Ek[2γ 1 k (Lf(ξk) + Lω) xk+1 xk ] γk [f(x ) + ω(x ) f(xk) ω(xk)] (41) xk x 2 + (Lf + Lω)2 γk [f(x ) + ω(x ) f(xk) ω(xk)] (42) = xk x 2 + (Lf + Lω)2 γk [ψ(x ) ψ(xk)], Stochastic Weakly Convex Optimization beyond Lipschitz Continuity where (41) applies E1 to get Eξk[fxk(x , ξ) f(x , ξ)] 0; (42) uses a 2x2 + bx b2 2a and that E[Lf(ξ)2] L2 f. 
Re-arranging the terms, we arrive at
$$\mathbb{E}_k[\|x_{k+1} - x^\star\|^2] \le \|x_k - x^\star\|^2 - \tfrac{2}{\gamma_k}[\psi(x_k) - \psi(x^\star)] + \tfrac{(L_f + L_\omega)^2}{\gamma_k^2},$$
and this completes the proof.

E.4.2. PROOF OF THEOREM E.1

Taking $\gamma_k \equiv \gamma = \alpha\sqrt{K}$ and telescoping from $1, \ldots, K$, we have
$$\min_{1\le k\le K} \mathbb{E}[\psi(x_k) - \psi(x^\star)] \le \frac{1}{K}\sum_{k=1}^K \mathbb{E}[\psi(x_k) - \psi(x^\star)] \le \frac{1}{2\sqrt{K}}\Big[\alpha\|x_1 - x^\star\|^2 + \frac{(L_f + L_\omega)^2}{\alpha}\Big],$$
and this completes the proof.

E.5. Proof of Results in Subsection E.2

E.5.1. PROOF OF LEMMA E.2

Define $G_k := G(\|x_k\|)$ and let $x^\star \in X^\star$ be an optimal solution to the problem. Similarly, we have
$$f_{x_k}(x_{k+1}, \xi_k) + \omega(x_{k+1}) + \tfrac{\gamma_k}{2}\|x_{k+1} - x_k\|^2 \le f_{x_k}(x^\star, \xi_k) + \omega(x^\star) + \tfrac{\gamma_k}{2}\|x_k - x^\star\|^2 - \tfrac{\gamma_k}{2}\|x_{k+1} - x^\star\|^2.$$
Re-arranging the terms, for $\xi_k \in \Xi$, we deduce that
$$\begin{aligned}
\tfrac{\gamma_k}{2}\|x_{k+1} - x^\star\|^2 &\le \tfrac{\gamma_k}{2}\|x_k - x^\star\|^2 - \tfrac{\gamma_k}{2}\|x_{k+1} - x_k\|^2 + f_{x_k}(x^\star, \xi_k) + \omega(x^\star) - f_{x_k}(x_{k+1}, \xi_k) - \omega(x_{k+1})\\
&\le \tfrac{\gamma_k}{2}\|x_k - x^\star\|^2 - \tfrac{\gamma_k}{2}\|x_{k+1} - x_k\|^2 + (G_k L_f(\xi_k) + L_\omega)\|x_{k+1} - x_k\|\\
&\quad + f(x^\star, \xi_k) + \omega(x^\star) - f(x_k, \xi_k) - \omega(x_k), \qquad (43)
\end{aligned}$$
where (43) applies convexity. Dividing both sides by $\tfrac{\gamma_k}{2}$, we have
$$\|x_{k+1} - x^\star\|^2 \le \|x_k - x^\star\|^2 - \|x_{k+1} - x_k\|^2 + \tfrac{2(G_k L_f(\xi_k) + L_\omega)}{\gamma_k}\|x_{k+1} - x_k\| + \tfrac{2}{\gamma_k}[f(x^\star, \xi_k) + \omega(x^\star) - f(x_k, \xi_k) - \omega(x_k)].$$
Next, conditioning on $x_1, \ldots, x_k$ and taking expectation with respect to $\xi_k$, we successively deduce that
$$\begin{aligned}
\mathbb{E}_k[\|x_{k+1} - x^\star\|^2] &\le \|x_k - x^\star\|^2 - \mathbb{E}_k[\|x_{k+1} - x_k\|^2] + \mathbb{E}_k[2\gamma_k^{-1}(G_k L_f(\xi_k) + L_\omega)\|x_{k+1} - x_k\|]\\
&\quad + \tfrac{2}{\gamma_k}[f(x^\star) + \omega(x^\star) - f(x_k) - \omega(x_k)]\\
&\le \|x_k - x^\star\|^2 + \tfrac{(G_k L_f + L_\omega)^2}{\gamma_k^2} + \tfrac{2}{\gamma_k}[f(x^\star) + \omega(x^\star) - f(x_k) - \omega(x_k)] \qquad (44)\\
&= \|x_k - x^\star\|^2 + \tfrac{(G_k L_f + L_\omega)^2}{\gamma_k^2} + \tfrac{2}{\gamma_k}[\psi(x^\star) - \psi(x_k)],
\end{aligned}$$
where (44) again uses $-\tfrac{a}{2}x^2 + bx \le \tfrac{b^2}{2a}$ and the assumption $\mathbb{E}_\xi[L_f(\xi)^2] \le L_f^2$. Re-arranging the terms, we get
$$\mathbb{E}_k[\|x_{k+1} - x^\star\|^2] \le \|x_k - x^\star\|^2 - \tfrac{2}{\gamma_k}[\psi(x_k) - \psi(x^\star)] + \tfrac{(G_k L_f + L_\omega)^2}{\gamma_k^2}, \qquad (45)$$
and this completes the proof.

E.5.2. PROOF OF THEOREM E.2

Now the recursive potential reduction becomes
$$\mathbb{E}_k[\|x_{k+1} - x^\star\|^2] \le \|x_k - x^\star\|^2 - \tfrac{2}{\gamma_k}[\psi(x_k) - \psi(x^\star)] + \tfrac{(G_k L_f + L_\omega)^2}{\gamma_k^2},$$
and we bound
$$\frac{G_k L_f + L_\omega}{\gamma_k} = \frac{G_k L_f + L_\omega}{(G_k + 1)k^\zeta} \le \frac{L_f + L_\omega}{k^\zeta},$$
so that
$$\mathbb{E}_k[\|x_{k+1} - x^\star\|^2] \le \|x_k - x^\star\|^2 - \tfrac{2}{\gamma_k}[\psi(x_k) - \psi(x^\star)] + \tfrac{(L_f + L_\omega)^2}{k^{2\zeta}}.$$
Invoking Lemma C.1 with $A_k = 0$, $V_k = \|x_k - x^\star\|^2 \ge 0$, $B_k = \tfrac{(L_f + L_\omega)^2}{k^{2\zeta}}$ and $C_k = \tfrac{2}{\gamma_k}[\psi(x_k) - \psi(x^\star)]$, we complete the proof with the same argument as in Theorem 4.1.

E.5.3. PROOF OF LEMMA E.3

Our proof duplicates that of Lemma 4.3 with a different potential function.
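The constant-stepsize policy of Theorem E.1 can be sanity-checked numerically. Below is a minimal sketch (not the authors' code) on the deterministic instance $\psi(x) = |x|$, for which $L_f + L_\omega = 1$ and the model-based step reduces to a subgradient step of length $1/\gamma$; the function name `best_gap` and all constants are illustrative.

```python
import math

def best_gap(K, alpha=1.0, x1=1.0):
    # Subgradient method on psi(x) = |x| with the constant stepsize
    # gamma = alpha * sqrt(K) from Theorem E.1; returns min_k psi(x_k).
    gamma = alpha * math.sqrt(K)
    x, best = x1, abs(x1)
    for _ in range(K):
        x -= (1.0 if x > 0.0 else -1.0) / gamma  # subgradient of |x|
        best = min(best, abs(x))
    return best

for K in (100, 400, 1600):
    # Theorem E.1 right-hand side with ||x_1 - x*|| = 1, L_f + L_omega = 1, alpha = 1
    rhs = (1.0 + 1.0) / (2.0 * math.sqrt(K))
    assert best_gap(K) <= rhs
```

As expected, the best objective gap observed within $K$ iterations stays below the $\mathcal{O}(1/\sqrt{K})$ bound.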
We start by bounding the error term of the potential reduction by
$$\frac{(G_k L_f + L_\omega)^2}{\gamma_k^2} \le \frac{(L_f + L_\omega)^2}{\alpha^2 K}.$$
Then telescoping (45) gives us, for all $2 \le k \le K$, that
$$\mathbb{E}[\|x_k - x^\star\|]^2 \le \mathbb{E}[\|x_k - x^\star\|^2] \le \|x_1 - x^\star\|^2 + \frac{(L_f + L_\omega)^2}{\alpha^2} \le \Big[\|x_1 - x^\star\| + \frac{L_f + L_\omega}{\alpha}\Big]^2 = \Delta^2,$$
and hence $\mathbb{E}[\|x_k - x^\star\|] \le \Delta$. Next consider, by the optimality condition, that
$$f_{x_k}(x_{k+1}, \xi_k) + \omega(x_{k+1}) + \tfrac{\gamma_k}{2}\|x_{k+1} - x_k\|^2 \le f_{x_k}(x_k, \xi_k) + \omega(x_k) - \tfrac{\gamma_k}{2}\|x_{k+1} - x_k\|^2.$$
A re-arrangement gives $\gamma_k\|x_{k+1} - x_k\|^2 \le (L_f(\xi_k)G_k + L_\omega)\|x_{k+1} - x_k\|$. Dividing both sides by $\|x_{k+1} - x_k\|$,
$$\|x_{k+1} - x_k\| \le \frac{L_f(\xi_k)G_k + L_\omega}{\gamma_k} = \frac{L_f(\xi_k)G_k + L_\omega}{\alpha(G_k + 1)\sqrt{K}} \le \frac{L_f(\xi_k) + L_\omega}{\alpha\sqrt{K}}.$$
Taking expectation, we get $\mathbb{E}_k[\|x_{k+1} - x_k\|] \le \tfrac{L_f + L_\omega}{\alpha\sqrt{K}}$. Then, by Markov's inequality,
$$\mathbb{P}\Big\{\|x_{k+1} - x_k\| \ge \tfrac{2(L_f + L_\omega)}{\alpha\sqrt{K}} \,\Big|\, \xi_1, \ldots, \xi_{k-1}\Big\} \le \tfrac{1}{2},$$
and without loss of generality we let $Z = \tfrac{2(L_f + L_\omega)}{\alpha\sqrt{K}}$ and take $z > 0$. Then we successively deduce that
$$\begin{aligned}
\mathbb{E}[\|x_{k+1} - x^\star\|] &= \mathbb{E}[\|x_{k+1} - x^\star\| \mid \|x_k - x^\star\| \ge Z + z]\,\mathbb{P}\{\|x_k - x^\star\| \ge Z + z\}\\
&\quad + \mathbb{E}[\|x_{k+1} - x^\star\| \mid \|x_k - x^\star\| < Z + z]\,\mathbb{P}\{\|x_k - x^\star\| < Z + z\}\\
&\ge \mathbb{E}[\|x_{k+1} - x^\star\| \mid \|x_k - x^\star\| \ge Z + z]\,\mathbb{P}\{\|x_k - x^\star\| \ge Z + z\}.
\end{aligned}$$
Next, consider the conditional expectation $\mathbb{E}[\|x_{k+1} - x^\star\| \mid \|x_k - x^\star\| \ge Z + z]$ and we successively deduce that
$$\begin{aligned}
\mathbb{E}[\|x_{k+1} - x^\star\| \mid \|x_k - x^\star\| \ge Z + z] &= \mathbb{E}[\|x_{k+1} - x^\star\| \mid \|x_k - x^\star\| \ge Z + z, \|x_{k+1} - x_k\| \le Z]\,\mathbb{P}\{\|x_{k+1} - x_k\| \le Z\}\\
&\quad + \mathbb{E}[\|x_{k+1} - x^\star\| \mid \|x_k - x^\star\| \ge Z + z, \|x_{k+1} - x_k\| > Z]\,\mathbb{P}\{\|x_{k+1} - x_k\| > Z\}\\
&\ge \tfrac{z}{2}, \qquad (46)
\end{aligned}$$
where (46) is by $\mathbb{P}\{\|x_{k+1} - x_k\| \le Z\} \ge \tfrac{1}{2}$ and that, conditioned on $\|x_{k+1} - x_k\| \le Z$,
$$\|x_{k+1} - x^\star\| = \|x_{k+1} - x_k + x_k - x^\star\| \ge \|x_k - x^\star\| - \|x_{k+1} - x_k\| \ge z.$$
Chaining the above inequalities, we arrive at
$$\Delta \ge \mathbb{E}[\|x_{k+1} - x^\star\|] \ge \tfrac{z}{2}\,\mathbb{P}\{\|x_k - x^\star\| \ge Z + z\}.$$
Dividing both sides by $\tfrac{z}{2}$ and taking $z = a\Delta$ gives the desired tail bound.

E.5.4. PROOF OF THEOREM E.3

Given $\delta \in (0, 1/4)$, we have $\mathbb{P}\{\|x_k - x^\star\| \ge Z + \delta^{-1}\Delta\} \le 2\delta$. Denote $I_k = \mathbb{I}\{\|x_k - x^\star\| \le \delta^{-1}\Delta + Z\}$.
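The only probabilistic ingredient in the tail argument above is Markov's inequality, which yields the $\tfrac{1}{2}$ bound on $\|x_{k+1} - x_k\|$ at twice its expectation. A quick Monte Carlo illustration with an arbitrary nonnegative variable (a folded Gaussian here, purely an illustrative choice):

```python
import random

# Markov's inequality: P{X >= 2*E[X]} <= 1/2 for any nonnegative X --
# the step used above to control ||x_{k+1} - x_k||.
rng = random.Random(0)
samples = [abs(rng.gauss(0.0, 1.0)) for _ in range(100_000)]
mean = sum(samples) / len(samples)
frac_exceeding = sum(s >= 2.0 * mean for s in samples) / len(samples)
assert frac_exceeding <= 0.5
```

For this distribution the empirical exceedance fraction is far below $\tfrac{1}{2}$; Markov's bound is loose but distribution-free, which is what the proof needs.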
We have $\sum_{k=1}^K \mathbb{E}[1 - I_k] \le 2\delta K$, and by Markov's inequality, given $p \in (2\delta, 1)$,
$$\mathbb{P}\Big\{\sum_{k=1}^K I_k \ge K(1 - 2p^{-1}\delta)\Big\} \ge 1 - p.$$
Now we telescope the recursion of Lemma E.2 with the stepsize of Lemma E.3 and deduce that
$$\sum_{k=1}^K \mathbb{E}\Big[\tfrac{2}{\gamma_k}(\psi(x_k) - \psi(x^\star))\Big] \le \|x_1 - x^\star\|^2 - \mathbb{E}[\|x_{K+1} - x^\star\|^2] + \sum_{k=1}^K \frac{(L_f + L_\omega)^2}{\alpha^2 K} \le \|x_1 - x^\star\|^2 + \frac{(L_f + L_\omega)^2}{\alpha^2}.$$
Next we condition on the event $\sum_{k=1}^K I_k \ge K(1 - 2p^{-1}\delta)$, define $G_\delta := \max\{G(\|z\|) : \|z - x^\star\| \le \delta^{-1}\Delta + Z\}$, and successively deduce that
$$\begin{aligned}
\sum_{k=1}^K \mathbb{E}\Big[\tfrac{2}{\gamma_k}(\psi(x_k) - \psi(x^\star))\Big] &= \sum_{k \in \{j : I_j = 0\}} \mathbb{E}\Big[\tfrac{2}{\gamma_k}(\psi(x_k) - \psi(x^\star))\Big] + \sum_{k \in \{j : I_j = 1\}} \mathbb{E}\Big[\tfrac{2}{\gamma_k}(\psi(x_k) - \psi(x^\star))\Big]\\
&\ge \sum_{k \in \{j : I_j = 1\}} \mathbb{E}\Big[\tfrac{2}{\gamma_k}(\psi(x_k) - \psi(x^\star))\Big]\\
&\ge \sum_{k \in \{j : I_j = 1\}} \mathbb{E}\Big[\tfrac{2}{\alpha(G_\delta + 1)\sqrt{K}}(\psi(x_k) - \psi(x^\star))\Big]\\
&\ge \frac{2K(1 - 2p^{-1}\delta)}{\alpha(G_\delta + 1)\sqrt{K}} \min_{k \in \{j : I_j = 1\}} \mathbb{E}[\psi(x_k) - \psi(x^\star)], \qquad (47)
\end{aligned}$$
where (47) follows from the event $\sum_{k=1}^K I_k \ge K(1 - 2p^{-1}\delta)$. Putting the results together, we have
$$\min_{1\le k\le K} \mathbb{E}[\psi(x_k) - \psi(x^\star)] \le \min_{k \in \{j : I_j = 1\}} \mathbb{E}[\psi(x_k) - \psi(x^\star)] \le \frac{p}{p - 2\delta}\cdot\frac{G_\delta + 1}{2\sqrt{K}}\Big[\alpha\|x_1 - x^\star\|^2 + \frac{(L_f + L_\omega)^2}{\alpha}\Big], \qquad (48)$$
and this completes the proof.

E.6. Proof of Results in Subsection E.3

In this section, we again define $L_f^k := \mathrm{Lip}(x_k, \xi_k)$ and $L_f' := \mathrm{Lip}(x_k, \xi_k')$ to simplify the notation.

E.6.1. PROOF OF LEMMA E.4

By the optimality condition, we have
$$f_{x_k}(x_{k+1}, \xi_k) + \omega(x_{k+1}) + \tfrac{\gamma_k}{2}\|x_{k+1} - x_k\|^2 \le f_{x_k}(x^\star, \xi_k) + \omega(x^\star) + \tfrac{\gamma_k}{2}\|x_k - x^\star\|^2 - \tfrac{\gamma_k}{2}\|x_{k+1} - x^\star\|^2.$$
Re-arranging the terms, we get
$$\begin{aligned}
\tfrac{\gamma_k}{2}\|x_{k+1} - x^\star\|^2 &\le \tfrac{\gamma_k}{2}\|x_k - x^\star\|^2 - \tfrac{\gamma_k}{2}\|x_{k+1} - x_k\|^2 + f_{x_k}(x^\star, \xi_k) + \omega(x^\star) - f_{x_k}(x_{k+1}, \xi_k) - \omega(x_{k+1})\\
&\le \tfrac{\gamma_k}{2}\|x_k - x^\star\|^2 - \tfrac{\gamma_k}{2}\|x_{k+1} - x_k\|^2 + (L_f^k + L_\omega)\|x_{k+1} - x_k\| + f(x^\star, \xi_k) + \omega(x^\star) - f(x_k, \xi_k) - \omega(x_k).
\end{aligned}$$
Dividing both sides by $\tfrac{\gamma_k}{2}$,
$$\|x_{k+1} - x^\star\|^2 \le \|x_k - x^\star\|^2 - \|x_{k+1} - x_k\|^2 + \tfrac{2(L_f^k + L_\omega)}{\gamma_k}\|x_{k+1} - x_k\| + \tfrac{2}{\gamma_k}[f(x^\star, \xi_k) + \omega(x^\star) - f(x_k, \xi_k) - \omega(x_k)].$$
Conditioning on $x_1, \ldots, x_k$ and taking expectation with respect to $\xi_k$, we have
$$\begin{aligned}
\mathbb{E}_k[\|x_{k+1} - x^\star\|^2] &\le \|x_k - x^\star\|^2 - \mathbb{E}_k[\|x_{k+1} - x_k\|^2] + \mathbb{E}_k[2\gamma_k^{-1}(L_f^k + L_\omega)\|x_{k+1} - x_k\|]\\
&\quad + \tfrac{2}{\gamma_k}[f(x^\star) + \omega(x^\star) - f(x_k) - \omega(x_k)] \qquad (50)\\
&\le \|x_k - x^\star\|^2 + \mathbb{E}_k\Big[\tfrac{(L_f^k + L_\omega)^2}{\gamma_k^2}\Big] + \tfrac{2}{\gamma_k}[f(x^\star) + \omega(x^\star) - f(x_k) - \omega(x_k)]\\
&= \|x_k - x^\star\|^2 + \mathbb{E}_k\Big[\tfrac{(L_f^k + L_\omega)^2}{\gamma_k^2}\Big] + \tfrac{2}{\gamma_k}[\psi(x^\star) - \psi(x_k)],
\end{aligned}$$
where (50) uses the fact that $\gamma_k$ does not inherit randomness from $\xi_k$. Re-arranging the terms, we get
$$\mathbb{E}_k[\|x_{k+1} - x^\star\|^2] \le \|x_k - x^\star\|^2 - \tfrac{2}{\gamma_k}[\psi(x_k) - \psi(x^\star)] + \mathbb{E}_k\Big[\tfrac{(L_f^k + L_\omega)^2}{\gamma_k^2}\Big],$$
and this completes the proof.

E.6.2.
PROOF OF THEOREM E.4

We start by bounding $\mathbb{E}_k\big[\tfrac{(L_f^k + L_\omega)^2}{\gamma_k^2}\big]$ and notice that
$$\frac{(L_f^k + L_\omega)^2}{\gamma_k^2} = \frac{(L_f^k)^2}{\gamma_k^2} + \frac{2L_f^k L_\omega}{\gamma_k^2} + \frac{L_\omega^2}{\gamma_k^2},$$
so that we can bound
$$\mathbb{E}_\xi\Big[\frac{(L_f^k)^2}{\gamma_k^2}\Big] = \mathbb{E}_\xi\Big[\frac{(L_f^k)^2}{\max\{(L_f')^2, \alpha^2\}k^{2\zeta}}\Big] \le \frac{(\alpha + \sigma)^2}{\alpha^2 k^{2\zeta}},$$
where we invoked Lemma D.1 with $X = L_f^k$, $Y = L_f'$, recalling that by D2, $\mathbb{E}[|L_f^k - L_f'|^2] \le \sigma^2$. Similarly, we can deduce that
$$\mathbb{E}_\xi\Big[\frac{2L_f^k L_\omega}{\gamma_k^2}\Big] \le \mathbb{E}_\xi\Big[\frac{2L_f^k L_\omega}{\alpha\max\{L_f', \alpha\}k^{2\zeta}}\Big] \le \frac{2(\alpha + \sigma)L_\omega}{\alpha^2 k^{2\zeta}}, \qquad \mathbb{E}_\xi\Big[\frac{L_\omega^2}{\gamma_k^2}\Big] \le \frac{L_\omega^2}{\alpha^2 k^{2\zeta}}.$$
Putting the bounds together, $\mathbb{E}_k\big[\tfrac{(L_f^k + L_\omega)^2}{\gamma_k^2}\big] \le \tfrac{(\alpha + \sigma + L_\omega)^2}{\alpha^2 k^{2\zeta}}$. Then we have
$$\mathbb{E}_k[\|x_{k+1} - x^\star\|^2] \le \|x_k - x^\star\|^2 - \tfrac{2}{\gamma_k}[\psi(x_k) - \psi(x^\star)] + \frac{(\alpha + \sigma + L_\omega)^2}{\alpha^2 k^{2\zeta}},$$
and we complete the proof by invoking Lemma C.1.

E.6.3. PROOF OF LEMMA E.5

We start by bounding $\mathbb{E}_k\big[\tfrac{(L_f^k + L_\omega)^2}{\gamma_k^2}\big] \le \tfrac{(\alpha + \sigma + L_\omega)^2}{\alpha^2 K}$. Telescoping the relation
$$\mathbb{E}_k[\|x_{k+1} - x^\star\|^2] \le \|x_k - x^\star\|^2 - \tfrac{2}{\gamma_k}[\psi(x_k) - \psi(x^\star)] + \frac{(\alpha + \sigma + L_\omega)^2}{\alpha^2 K}$$
gives, for all $2 \le k \le K$, that
$$\mathbb{E}[\|x_k - x^\star\|^2] \le \|x_1 - x^\star\|^2 + \frac{(\alpha + \sigma + L_\omega)^2}{\alpha^2} \le \Big[\|x_1 - x^\star\| + \frac{\alpha + \sigma + L_\omega}{\alpha}\Big]^2 = \Delta^2,$$
and hence $\mathbb{E}[\|x_k - x^\star\|] \le \Delta$. Next consider
$$f_{x_k}(x_{k+1}, \xi_k) + \omega(x_{k+1}) + \tfrac{\gamma_k}{2}\|x_{k+1} - x_k\|^2 \le f_{x_k}(x_k, \xi_k) + \omega(x_k) - \tfrac{\gamma_k}{2}\|x_{k+1} - x_k\|^2,$$
and a re-arrangement gives $\gamma_k\|x_{k+1} - x_k\|^2 \le (L_f^k + L_\omega)\|x_{k+1} - x_k\|$. Dividing both sides by $\|x_{k+1} - x_k\|$, we get $\|x_{k+1} - x_k\| \le \tfrac{L_f^k + L_\omega}{\gamma_k}$. Taking expectation, $\mathbb{E}_k[\|x_{k+1} - x_k\|] \le \tfrac{\alpha + \sigma + L_\omega}{\alpha\sqrt{K}}$. Then, by Markov's inequality,
$$\mathbb{P}_{\xi_k, \xi_k' \in \Xi}\Big\{\|x_{k+1} - x_k\| \ge \tfrac{2(\alpha + \sigma + L_\omega)}{\alpha\sqrt{K}} \,\Big|\, \xi_1, \ldots, \xi_{k-1}\Big\} \le \tfrac{1}{2},$$
and without loss of generality we let $Z = \tfrac{2(\alpha + \sigma + L_\omega)}{\alpha\sqrt{K}}$. By the same reasoning as in Lemma E.3, we arrive at
$$\tfrac{z}{2}\,\mathbb{P}\{\|x_k - x^\star\| \ge Z + z\} \le \Delta,$$
and dividing both sides by $\tfrac{z}{2}$ and taking $z = a\Delta$ gives the desired tail bound.

E.6.4. PROOF OF THEOREM E.5

Following the same argument as Theorem E.3, we have, for $\delta \in (0, 1/4)$, that $\mathbb{P}\{\|x_k - x^\star\| \ge Z + \delta^{-1}\Delta\} \le 2\delta$. Define $I_k = \mathbb{I}\{\|x_k - x^\star\| \le \delta^{-1}\Delta + Z\}$. We have, by Markov's inequality, that $\mathbb{P}\{\sum_{k=1}^K I_k \ge K(1 - 2p^{-1}\delta)\} \ge 1 - p$. Then telescoping over Lemma E.4 gives
$$\sum_{k=1}^K \mathbb{E}\Big[\tfrac{2}{\gamma_k}(\psi(x_k) - \psi(x^\star))\Big] \le \|x_1 - x^\star\|^2 - \mathbb{E}[\|x_{K+1} - x^\star\|^2] + \frac{(\alpha + \sigma + L_\omega)^2}{\alpha^2} \le \|x_1 - x^\star\|^2 + \frac{(\alpha + \sigma + L_\omega)^2}{\alpha^2}.$$
Conditioned on $\sum_{k=1}^K I_k \ge K(1 - 2p^{-1}\delta)$, we get
$$\sum_{k=1}^K \mathbb{E}\Big[\tfrac{2}{\gamma_k}(\psi(x_k) - \psi(x^\star))\Big] \ge \frac{2(1 - 2p^{-1}\delta)\sqrt{K}}{\alpha + G_\delta} \min_{k \in \{j : I_j = 1\}} \mathbb{E}[\psi(x_k) - \psi(x^\star)],$$
where $G_\delta := \max_x\{\sup_{\xi \in \Xi} \mathrm{Lip}(x, \xi) : \|x - x^\star\| \le \delta^{-1}\Delta + Z\}$. Combining the two inequalities completes the proof.
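As an illustration of the sampled-Lipschitz stepsize analyzed in this section, the following minimal sketch (not the authors' implementation) runs the scheme $\gamma_k = \max\{\mathrm{Lip}(x_k, \xi_k'), \alpha\}\sqrt{K}$ on the motivating example $\psi(x) = |e^x + e^{-x} - 3|$, whose subgradient grows exponentially. The functions `lip_estimate` and `subgrad`, the Gaussian noise model, and all constants are illustrative assumptions; the stepsize uses a noise draw independent of the one driving the update, mirroring the role of $\xi_k'$.

```python
import math
import random

def psi(x):
    # the paper's motivating example: psi(x) = |e^x + e^{-x} - 3|
    return abs(math.exp(x) + math.exp(-x) - 3.0)

def subgrad(x, noise):
    # noisy subgradient of psi; `noise` models sampling error
    s = math.exp(x) - math.exp(-x)
    g = s if math.exp(x) + math.exp(-x) - 3.0 >= 0.0 else -s
    return g + noise

def lip_estimate(x, noise):
    # Lip(x, xi'): noisy local Lipschitz estimate from an independent sample
    return abs(math.exp(x) - math.exp(-x)) + abs(noise)

def robust_sgd(x1, K, alpha=1.0, seed=0):
    rng = random.Random(seed)
    x = x1
    for _ in range(K):
        # stepsize gamma_k = max{Lip(x_k, xi'_k), alpha} * sqrt(K),
        # with xi'_k drawn independently of the update sample xi_k
        gamma = max(lip_estimate(x, rng.gauss(0.0, 0.1)), alpha) * math.sqrt(K)
        x -= subgrad(x, rng.gauss(0.0, 0.1)) / gamma
    return x

x_final = robust_sgd(x1=1.5, K=20000)
```

Because the stepsize shrinks in proportion to the locally estimated Lipschitz constant, the iterates remain bounded and settle near a minimizer of $\psi$ instead of diverging the way a fixed-Lipschitz stepsize would on this function.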