# Accelerated Stochastic Optimization Methods under Quasar-convexity

Qiang Fu¹, Dongchu Xu², Ashia Wilson³

¹Sun Yat-sen University, Guangzhou, China. ²Harvard University, Cambridge, MA, USA. ³MIT, Cambridge, MA, USA. Correspondence to: Qiang Fu, Ashia Wilson. Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

Non-convex optimization plays a key role in a growing number of machine learning applications. This motivates the identification of specialized structure that enables sharper theoretical analysis. One such identified structure is quasar-convexity, a non-convex generalization of convexity that subsumes convex functions. Existing algorithms for minimizing quasar-convex functions in the stochastic setting have either high complexity or slow convergence, which prompts us to derive a new class of stochastic methods for optimizing smooth quasar-convex functions. We demonstrate that our algorithms have fast convergence and outperform existing algorithms on several examples, including the classical problem of learning linear dynamical systems. We also present a unified analysis of our newly proposed algorithms and a previously studied deterministic algorithm.

1. Introduction

Momentum is one of the most widely used techniques for speeding up the convergence rate of optimization methods. Many deterministic and stochastic momentum-based algorithms have been proposed for optimizing (strongly) convex functions, e.g. accelerated gradient descent (AGD) (Nesterov, 1983; 2003; Beck & Teboulle, 2009), accelerated stochastic gradient descent (ASGD) (Ghadimi & Lan, 2012; 2016; Kulunchakov & Mairal, 2020), and accelerated stochastic variance reduced gradient (ASVRG) methods and their variants (Nitanda, 2016; Allen-Zhu, 2017; Kulunchakov & Mairal, 2020).

While much of our understanding of modern optimization algorithms relies on the ability to leverage the convexity of the objective function, a growing number of modern machine learning applications rely on non-convex optimization. Unfortunately, the theoretical improvements that accelerated algorithms guarantee for convex functions do not apply to many real-world scenarios. For many smooth non-convex optimization problems, we only have guarantees for finding stationary points rather than global minimizers. However, some non-convex functions arising in popular optimization problems, such as low-rank matrix problems, deep learning and reinforcement learning, have special structure and exhibit convex-like properties (Ge et al., 2016; Bartlett et al., 2018; Mei et al., 2020), which makes it possible to find approximate global minimizers of these structured non-convex functions.

In this paper, we develop two accelerated stochastic optimization methods for optimizing quasar-convex functions. A quasar-convex function is parameterized by a constant $\gamma \in (0, 1]$; $\gamma = 1$ corresponds to star-convexity, which is a relaxation of convexity (Nesterov & Polyak, 2006). Quasar-convexity was first proposed in Hardt et al. (2016), who prove that the objective of learning linear dynamical systems is quasar-convex under several mild assumptions. Zhou et al. (2019) and Kleinberg et al. (2018) also provide evidence suggesting that the loss functions of neural networks may conform to star-convexity in large neighborhoods of the minimizers.
Several recent papers propose effective deterministic methods for minimizing $L$-smooth and $\gamma$-quasar-convex functions. While gradient descent (GD) and stochastic gradient descent (SGD) need $O(\gamma^{-1}\epsilon^{-1})$ and $O(\gamma^{-2}\epsilon^{-2})$ iterations to yield an $\epsilon$-approximate solution (Guminov & Gasnikov, 2017; Gower et al., 2021), the algorithms developed by Guminov & Gasnikov (2017) and Hinder et al. (2020) need $O(\gamma^{-1}\epsilon^{-1/2})$ iterations and the algorithm developed by Nesterov et al. (2018) needs $O(\gamma^{-3/2}\epsilon^{-1/2})$ iterations. Hinder et al. (2020) also introduce a new metric: the total number of function and gradient evaluations. In order to compute an $\epsilon$-approximate solution, the method of Hinder et al. (2020) requires $O(\gamma^{-1}\epsilon^{-1/2}\log(\gamma^{-1}\epsilon^{-1}))$ total evaluations for $\gamma$-quasar-convex functions and $O(\gamma^{-1}\kappa^{1/2}\log(\gamma^{-1}\kappa)\log(\gamma^{-1}\epsilon^{-1}))$ total evaluations for $\mu$-strongly $\gamma$-quasar-convex functions, where $\kappa \triangleq L/\mu$ is the condition number.

Many optimization problems in machine learning can be expressed in the following finite-sum format, which makes them particularly well-suited for stochastic optimization methods:

$$ \min_{x\in\mathbb{R}^d} \; f(x) \triangleq \frac{1}{n}\sum_{i=1}^{n} f_i(x). \qquad (1) $$

When $n$ is large, applying deterministic algorithms leads to high computational cost due to the full gradient and function value access required per iteration. Therefore, motivated by ASGD and ASVRG in the convex setting as well as the contributions of Hinder et al. (2020), we propose both a quasar-accelerated stochastic gradient descent (QASGD) method and a quasar-accelerated stochastic variance reduced gradient (QASVRG) method for solving (1), where the objective function $f$ is $L$-smooth and (strongly) quasar-convex. We also present a unified energy-based framework to analyze the convergence of these newly proposed accelerated algorithms, drawing inspiration from the unified analyses developed by Wilson et al. (2021) and Kulunchakov & Mairal (2020). Our principal contributions are three-fold.

QASGD: We introduce QASGD with momentum as the acceleration technique. Under a bounded gradient assumption (Assumption 2.4), we prove that QASGD achieves convergence rates of $O\big(L/t^2 + \sigma/(\gamma\sqrt{t}) + \epsilon\big)$ for general quasar-convex functions and $O\big((1+\gamma^2/16)^{-t} + \sigma^2/(\gamma^2 t)\big)$ for strongly quasar-convex functions, where $\epsilon$ comes from a binary line search. We empirically demonstrate that on learning time-invariant dynamical systems, QASGD outperforms several existing methods.

QASVRG: We introduce QASVRG in a mini-batch setting, which extends Nitanda (2016) to quasar-convexity with momentum as the acceleration technique. Variance reduction and mini-batches are employed to compute the stochastic gradient per iteration. Under an interpolation assumption (Assumption 2.7) and a compactness assumption (Assumption 2.5), QASVRG achieves an overall complexity (the total number of function and gradient evaluations) of $\tilde O\big(n + \min\{\kappa/\gamma^2, \sqrt{n\kappa}/\gamma\}\big)$ for strongly quasar-convex functions and $\tilde O\big(n + \min\{LR^2/(\gamma^2\epsilon), (\sqrt{n}/\gamma)\sqrt{LR^2/\epsilon}\}\big)$ for general quasar-convex functions. We also propose an alternative scheme for strongly quasar-convex functions with a different parameter choice (Option II in Table 4), whose precise convergence rates are postponed to Theorem 3.5. These two schemes have different dependencies on $\kappa$ and $\epsilon$ and thus suit different application scenarios. When $n$ is large, our complexity is significantly lower than the complexity of AGD in Hinder et al. (2020).

Lyapunov analysis: We present a unified analysis of our proposed algorithms under quasar-convexity and smoothness using a standard Lyapunov argument.
Additionally, we incorporate the AGD method proposed in Hinder et al. (2020) into our energy-based framework and rename it QAGD (quasar-accelerated gradient descent). Different from AGD in Hinder et al. (2020), QAGD admits the Bregman divergence, which is more general than the Euclidean distance.

The remainder of this paper is organized as follows. Section 2 presents more details about quasar-convexity, related assumptions, and previously proposed methods. Section 3 presents the main algorithms QAGD, QASGD and QASVRG for (strongly) quasar-convex functions and their convergence analysis. Section 4 describes our simulations verifying the effectiveness of our proposed algorithms.

Notation. The following notation is used throughout the paper. $D_h(x, y) \triangleq h(x) - h(y) - \langle \nabla h(y), x - y\rangle$ denotes the Bregman divergence between $x, y \in \mathbb{R}^d$, where $h$ is an arbitrary $\bar\mu$-strongly convex function; $[n] \triangleq \{1, 2, \ldots, n\}$; $\log_+(\cdot) \triangleq \max\{\log(\cdot), 1\}$; $\|\cdot\|$ denotes the $\ell_2$ norm; $a_k \triangleq A_{k+1} - A_k$, $b_k \triangleq B_{k+1} - B_k$; $\kappa \triangleq L/(\bar\mu\mu)$; $E_k \triangleq f(y_k) - f(x^\ast)$. $a \lesssim b$ signifies $a = O(b)$, and $\langle\cdot,\cdot\rangle$ represents the inner product. $\mathcal{X}^\ast$ is the solution set of (1), which we assume is not empty, and a point $x$ is an $\epsilon$-approximate solution if $f(x) - f(x^\ast) \le \epsilon$ for $x^\ast \in \mathcal{X}^\ast$. $R$ denotes an upper bound on the initial distance, i.e. $D_h(x^\ast, x_0) \le R^2$. We assume $f(x^\ast) \ge 0$ without loss of generality. $f$ is $L$-smooth if $\|\nabla f(x) - \nabla f(y)\| \le L\|x - y\|$ for all $x, y \in \mathbb{R}^d$. $\mathcal{Q}^\mu_\gamma$ and $\mathcal{F}_L$ respectively denote the set of $\mu$-strongly $\gamma$-quasar-convex functions and the set of $L$-smooth functions, and $\mathcal{Q}^\mu_\gamma$ reduces to the set of $\gamma$-quasar-convex functions when $\mu = 0$. We use $O(\cdot)$ to hide constants and $\tilde O(\cdot)$ to hide logarithmic factors and constants.

2. Background

There has been growing interest in exploiting structure present in large classes of non-convex functions. One such structure is quasar-convexity and its strong variant, defined as follows.

Definition 2.1 (Quasar-convexity). Let $\gamma \in (0, 1]$ and let $x^\ast$ be a minimizer of the differentiable function $f : \mathbb{R}^d \to \mathbb{R}$. The function $f$ is $\gamma$-quasar-convex with respect to $x^\ast$ if for all $x \in \mathbb{R}^d$,

$$ f(x^\ast) \ge f(x) + \frac{1}{\gamma}\langle \nabla f(x), x^\ast - x\rangle. \qquad (2) $$

For $\mu > 0$, $f$ is $\mu$-strongly $\gamma$-quasar-convex with respect to $x^\ast$ if for all $x \in \mathbb{R}^d$,

$$ f(x^\ast) \ge f(x) + \frac{1}{\gamma}\langle \nabla f(x), x^\ast - x\rangle + \frac{\mu}{2}\|x^\ast - x\|^2. \qquad (3) $$

2.1. Examples

We introduce a classical example from Hardt et al. (2016): learning linear dynamical systems (LDS). Consider the following time-invariant linear dynamical system

$$ h_{t+1} = A h_t + B x_t, \qquad (4a) $$
$$ y_t = C h_t + D x_t + \xi_t, \qquad (4b) $$

where $x_t \in \mathbb{R}$ and $y_t \in \mathbb{R}$ are the input and output at time $t$; $\xi_t$ is a random perturbation sampled i.i.d. from a distribution; $h_t \in \mathbb{R}^d$ is the hidden state; and $\Theta \triangleq (A, B, C, D) \in \mathbb{R}^{d\times d} \times \mathbb{R}^{d\times 1} \times \mathbb{R}^{1\times d} \times \mathbb{R}$ is the true parameter that we aim to learn. Assuming we have $N$ pairs of training examples $S = \{(x^{(1)}, y^{(1)}), \ldots, (x^{(N)}, y^{(N)})\}$, where each input sequence $x \in \mathbb{R}^T$ is sampled from a distribution and $y$ is the corresponding output of the system above, we fit these training examples to the following model

$$ \hat h_{t+1} = \hat A \hat h_t + \hat B x_t, \qquad (5a) $$
$$ \hat y_t = \hat C \hat h_t + \hat D x_t, \qquad (5b) $$

which is governed by $\hat\Theta \triangleq (\hat A, \hat B, \hat C, \hat D)$. Given the training examples and the model system, we consider the following optimization problem:

$$ \min_{\hat\Theta}\; F(\hat\Theta) \triangleq \mathbb{E}_{\{x_t\},\{\xi_t\}}\Big[\frac{1}{T}\sum_{t=1}^{T} \big\|\hat y_t - y_t\big\|^2\Big]. \qquad (6) $$

Hardt et al. (2016) demonstrate that the objective function $F(\hat\Theta)$ is weakly smooth and quasar-convex with respect to $\Theta$ under some mild conditions.

We introduce another example: generalized linear models (GLM). Consider the following square loss minimization problem

$$ \min_{w\in\mathbb{R}^d}\; f(w) \triangleq \mathbb{E}_{x\sim\mathcal{D}}\Big[\frac{1}{2}\big(\sigma(w^\top x) - y\big)^2\Big], \qquad (7) $$

where $\sigma(\cdot): \mathbb{R}\to\mathbb{R}$ is the link function, $x \in \mathbb{R}^d$ is drawn i.i.d. from $\mathcal{D}$, and there exists $w^\ast \in \mathbb{R}^d$ such that $y = \sigma(w^{\ast\top} x)$.
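As a quick illustration (ours, not from the paper), the following minimal numpy sketch estimates the quasar-convexity parameter of an empirical version of example (7) with a logistic link by evaluating the ratio $\langle \nabla f(w), w - w^\ast\rangle / (f(w) - f(w^\ast))$ at random points; the inequality in Definition 2.1 holds at a point $w$ for any $\gamma$ no larger than this ratio. The problem sizes and sampling scales below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 2000
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

X = rng.normal(size=(n, d))          # x ~ D, here standard normal
w_star = rng.normal(size=d)
y = sigmoid(X @ w_star)              # realizable labels: y = sigma(w_star^T x)

def loss_and_grad(w):
    p = sigmoid(X @ w)
    r = p - y
    f = 0.5 * np.mean(r ** 2)                 # empirical version of (7)
    g = X.T @ (r * p * (1.0 - p)) / n         # chain rule through the logistic link
    return f, g

f_star, _ = loss_and_grad(w_star)
ratios = []
for _ in range(200):
    w = w_star + rng.normal(scale=2.0, size=d)
    f_w, g_w = loss_and_grad(w)
    if f_w - f_star > 1e-12:
        # Definition 2.1 holds at w for any gamma <= <grad f(w), w - w*> / (f(w) - f(w*)).
        ratios.append(g_w @ (w - w_star) / (f_w - f_star))
print("empirical quasar-convexity parameter over sampled points:", min(ratios))
```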
The quasar-convex structure of $f(w)$ has been exploited in several works (Foster et al., 2018; Ma, 2020; Wang & Wibisono, 2023).

2.2. Prior Deterministic Methods

Several deterministic first-order methods have been developed to minimize $L$-smooth $\gamma$-quasar-convex functions. Guminov & Gasnikov (2017) prove that gradient descent achieves a convergence rate of $O(L/(\gamma t))$. They also propose an accelerated algorithm achieving a convergence rate of $O(L/(\gamma^2 t^2))$. This algorithm, however, depends on a low-dimensional subspace optimization method at each iteration, which can be prohibitively expensive to perform. Hinder et al. (2020) propose a novel accelerated gradient method achieving a convergence rate of $O\big((1 - \gamma/\sqrt{2\kappa})^t\big)$ for strongly quasar-convex functions. Notably, when $\gamma = 1$, this rate matches the convergence rate achieved by Nesterov's AGD for strongly convex functions. The method introduced by Hinder et al. (2020) also achieves a convergence rate of $O(L/(\gamma^2 t^2) + \epsilon/2)$ for general quasar-convex functions, which nearly matches the convergence rate achieved by Nesterov's AGD for convex functions when $\gamma = 1$. The additional term $\epsilon$ (which can be made arbitrarily small) appears in the convergence rate due to a binary line search subroutine introduced to find appropriate momentum parameters; the momentum parameters classically chosen in accelerated gradient descent (Nesterov, 1983) are not guaranteed to perform well under quasar-convexity. Compared with the low-dimensional subspace method in Guminov & Gasnikov (2017), the binary line search in Hinder et al. (2020)'s AGD requires at most $O(\log(\gamma^{-1}\epsilon^{-1}))$ function and gradient evaluations, which can be considerably cheaper. Analogously, Bu & Mesbahi (2020) propose momentum-based accelerated algorithms relying on a subroutine, but without a complexity analysis of that subroutine. Hinder et al. (2020) also establish a worst-case complexity lower bound of $\Omega(\gamma^{-1}\epsilon^{-1/2})$ for any deterministic first-order method applied to quasar-convex functions, and their methods are optimal up to a logarithmic factor. The complexity of the methods in Guminov & Gasnikov (2017) conditionally matches this lower bound.

2.3. Prior Stochastic Methods

While deterministic accelerated methods for quasar-convex functions achieve fast convergence rates and near-optimal complexity, we focus on the development of stochastic methods to reduce the computational cost of solving (1). When the objective $f$ is $\gamma$-quasar-convex, Hardt et al. (2016) show that SGD achieves a convergence rate of $O(\Gamma/(\gamma^2 t) + \sigma/(\gamma\sqrt{t}))$ under the assumptions of $\sigma^2$-bounded variance and $\Gamma$-weak-smoothness.

Assumption 2.2 (Bounded Variance). Suppose $i$ is sampled i.i.d. from $[n]$. For some constant $\sigma$, we have $\mathbb{E}_i\|\nabla f_i(x) - \nabla f(x)\|^2 \le \sigma^2$.

The weak-smoothness assumption is milder than $L$-smoothness. Gower et al. (2021) propose a stochastic gradient method achieving a convergence rate of $O(\lambda^2/(\gamma\sqrt{t}))$ under $L$-smoothness and Assumption 2.3.

Assumption 2.3 (ER Condition). Suppose $i$ is sampled i.i.d. from $[n]$. For some constants $\rho$ and $\lambda$, we have $\mathbb{E}_i\|\nabla f_i(x)\|^2 \le 4\rho\big(f(x) - f(x^\ast)\big) + \|\nabla f(x)\|^2 + 2\lambda^2$.

| Method | Assumptions | Complexity |
| --- | --- | --- |
| GD (Guminov & Gasnikov, 2017) | $f \in \mathcal{F}_L$ | $\tilde O\big(LR^2/(\gamma\epsilon)\big)$ |
| SGD (Gower et al., 2021) | $f \in \mathcal{F}_L$ & ER Condition | $\tilde O\big((R^2+\gamma^2\lambda^2)^2/(\gamma^4\epsilon^2)\big)$ |
| SGD (Gower et al., 2021) | $f_i \in \mathcal{F}_L$ & Interpolation | $\tilde O\big(LR^2/(\gamma^2\epsilon)\big)$ |
| SGD (Jin, 2020) | $f \in \mathcal{F}_L$ & Bounded Variance | $\tilde O\big(LR^2/(\gamma^2\epsilon) + \sigma^2R^2/(\gamma^2\epsilon^2)\big)$ |
| QAGD (Hinder et al., 2020) | $f \in \mathcal{F}_L$ | $\tilde O\big((R/\gamma)\sqrt{L/\epsilon}\big)$ |
| QASGD (Ours) | $f_i \in \mathcal{F}_L$ & Bounded Gradient | $\tilde O\big((R/\gamma)\sqrt{L/\epsilon} + \sigma^2R^2/(\gamma^2\epsilon^2)\big)$ |
| QASVRG (Ours) | $f_i \in \mathcal{F}_L$ & Interpolation & Compactness | $\tilde O\big(n + \min\{LR^2/(\gamma^2\epsilon), (\sqrt{n}/\gamma)\sqrt{LR^2/\epsilon}\}\big)$ |

Table 1. Comparison between some existing methods and our methods when $f \in \mathcal{Q}^0_\gamma$.
Compared with Hardt et al. (2016), the smoothness assumption in Gower et al. (2021) is stronger, but the assumption on the gradient estimate is weaker in a sense. Moreover, Gower et al. (2021) demonstrate that this rate can be improved to $O(L/(\gamma^2 t))$ under Assumption 2.7. Under smoothness and bounded variance, Jin (2020) provides a sharper analysis of SGD than Gower et al. (2021) and extends the analysis to the non-smooth setting.

There are several accelerated stochastic methods that can theoretically achieve better worst-case convergence rates than SGD when the objective is convex. In the convex setting, the objective function usually includes a regularizer, which is convex and lower semi-continuous but not necessarily smooth. Ghadimi & Lan (2016) and Kulunchakov & Mairal (2020) propose proximal ASGD, which achieves convergence rates of $O(L/t^2 + \sigma/\sqrt{t})$ and $O((1 - 1/\sqrt{\kappa})^t + \sigma^2/t)$ for general convex and strongly convex functions respectively, under $L$-smoothness and the $\sigma^2$-bounded variance assumption. Variance reduction is a powerful technique for achieving a better convergence rate. Allen-Zhu (2017) and Kulunchakov & Mairal (2020) propose accelerated proximal SVRG with convergence rate guarantees of $O\big((1 - \min\{\sqrt{\mu/(nL)}, 1/(2n)\})^t\big)$ and $O(Ln/t^2)$ for $L$-smooth strongly convex and convex functions respectively. Nitanda (2016) proposes accelerated mini-batch SVRG methods for minimizing (strongly) convex finite sums without a regularizer. This algorithm is a multi-stage scheme achieving overall complexities of $\tilde O(n + \min\{\kappa, \sqrt{n\kappa}\})$ and $\tilde O(n + \min\{L/\epsilon, \sqrt{nL/\epsilon}\})$ for $L$-smooth strongly convex and convex functions. By contrast, the methods these works use to update the fixed anchor point of SVRG and to control the variance are different.

2.4. Motivation

Inspired by the accelerated stochastic methods discussed above, we extend the ASGD of Kulunchakov & Mairal (2020) and the AMSVRG of Nitanda (2016) to the (strongly) quasar-convex setting under different assumptions. In this subsection we discuss these assumptions and how they compare to prior sets of assumptions. Different from Kulunchakov & Mairal (2020), we do not consider random perturbations of the function value and gradient, since $x^\ast$ may no longer be the global minimizer after perturbation. Furthermore, the binary line search of Hinder et al. (2020) is incorporated into each of our proposed methods to find appropriate momentum parameters.

For QASGD, our key assumption is the bounded gradient assumption, which is frequently used in the standard convergence analysis of SGD in the non-convex setting (Hazan & Kale, 2014; Rakhlin et al., 2011; Recht et al., 2011; Nemirovski et al., 2009). Because the gradient of a strongly quasar-convex function is not bounded, we generalize this assumption as follows.

Assumption 2.4 (Bounded Gradient). Suppose $f \in \mathcal{Q}^\mu_\gamma$ and $i$ is sampled i.i.d. from $[n]$. For some $\sigma \ge 0$ and $x^\ast \in \mathcal{X}^\ast$, we have $\mathbb{E}_i\|\nabla f_i(x)\|^2 \le \sigma^2 + 2\mu^2\|x - x^\ast\|^2$.

This assumption reduces to the standard bounded gradient assumption under general quasar-convexity ($\mu = 0$). Compared with the ER condition, the bounded gradient assumption is stronger. Example (7) satisfies this assumption (with $\mu = 0$) if we choose the link function to be logistic. For $\mu > 0$, consider a quasar-convex finite sum $f = \frac{1}{n}\sum_{i=1}^n f_i$, where $x^\ast$ is the minimizer of $f$ and $\mathbb{E}_i\|\nabla f_i\|^2 \le \sigma^2$. Then $g(x) = f(x) + \frac{\mu}{2}\|x - x^\ast\|^2$ is strongly quasar-convex and satisfies this assumption.
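Indeed, writing $g_i(x) \triangleq f_i(x) + \frac{\mu}{2}\|x - x^\ast\|^2$, the bounded-gradient claim for this construction follows from the elementary inequality $\|a + b\|^2 \le 2\|a\|^2 + 2\|b\|^2$ (a one-line check of ours, with $2\sigma^2$ playing the role of $\sigma^2$ in Assumption 2.4):

$$
\mathbb{E}_i\|\nabla g_i(x)\|^2
= \mathbb{E}_i\|\nabla f_i(x) + \mu(x - x^\ast)\|^2
\le 2\,\mathbb{E}_i\|\nabla f_i(x)\|^2 + 2\mu^2\|x - x^\ast\|^2
\le 2\sigma^2 + 2\mu^2\|x - x^\ast\|^2 .
$$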
While some quasar-convex functions intrinsically do not satisfy Assumption 2.4, it holds in practice under Assumption 2.5, which was also used in Bottou & Le Cun (2005), Gürbüzbalaban et al. (2015) and Nitanda (2016) to analyze incremental and stochastic methods. We summarize the relations among the four assumptions above in Remark 2.6.

Assumption 2.5 (Compactness). There exists a compact set $\mathcal{C} \subset \mathbb{R}^d$ containing the iterates generated by the optimization algorithm.

Remark 2.6. The relations among Assumption 2.3 (ER), Assumption 2.2 (BV), Assumption 2.4 (BG) and Assumption 2.5 (Compactness) can be summarized as: Compactness $\Rightarrow$ BG $\Rightarrow$ BV $\Rightarrow$ ER.

For QASVRG, we will prove in the next section that the upper bound on the gradient variance introduced by Nitanda (2016) also upper bounds the gradient variance for quasar-convex functions (Proposition 3.4), provided that each $f_i$ in problem (1) is $L_i$-smooth and satisfies the following interpolation assumption.

Assumption 2.7 (Interpolation). There exists $x^\ast \in \mathcal{X}^\ast$ such that for all $i \in [n]$, $\min_{x\in\mathbb{R}^d} f_i(x) = f_i(x^\ast)$.

The interpolation assumption is commonly observed in over-parameterized machine learning models and has attracted much attention recently (Zhou et al., 2019; Ma et al., 2018; Vaswani et al., 2019; Gower et al., 2021). If the model is sufficiently over-parameterized, it can interpolate the labelled training data completely. In particular, example (6) satisfies the interpolation assumption when $\xi_t = 0$, since $\Theta$ is then also the global minimizer of the objective generated by each training example. Example (7) also satisfies this assumption given that $y = \sigma(w^{\ast\top}x)$ for each $x \sim \mathcal{D}$. Let $L = \max_i\{L_i\}$; we assume throughout that each $f_i$ is $L$-smooth for brevity. Moreover, Nitanda (2016) presents only one parameter choice for both convex and strongly convex functions. In this paper, we provide two parameter choices (Options I and II) under strong quasar-convexity. Option II in Table 4 is identical to the parameter choice for general quasar-convex functions, whereas Option I is a slightly different method from the direct extension of Nitanda (2016)'s AMSVRG. A more technical comparison between these two parameter choices is provided in Subsection 3.2.

3. Algorithms

In order to solve (1), QAGD, QASGD and QASVRG need to access the gradient or a gradient estimate from the oracle, which we denote $\nabla_k$. In this paper, we consider the following gradient (estimates):

Full gradient: $\nabla_k = \nabla f(x_{k+1})$. Problem (1) becomes deterministic.

Stochastic gradient: $\nabla_k = \nabla f_i(x_{k+1})$ with the index $i$ sampled i.i.d. from $[n]$. We have $\mathbb{E}_i[\nabla_k] = \nabla f(x_{k+1})$, where $\mathbb{E}_i$ denotes the expectation with respect to the index $i$.

Mini-batch SVRG: $\nabla_k = \nabla f_{I_k}(x_{k+1}) - \nabla f_{I_k}(\bar x) + \nabla f(\bar x)$, where $\bar x$ is the anchor point fixed per stage; $I_k = \{i_1, i_2, \ldots, i_{b_k}\}$ is sampled i.i.d. from $[n]$ with $f_{I_k} \triangleq \frac{1}{b_k}\sum_{j=1}^{b_k} f_{i_j}$ and batch size $|I_k| = b_k$. We have $\mathbb{E}_{I_k}[\nabla_k] = \nabla f(x_{k+1})$, where $\mathbb{E}_{I_k}$ denotes the expectation with respect to the mini-batch $I_k$.

3.1. Quasar-accelerated Algorithms

We introduce Algorithm 1 as a general framework incorporating QAGD, QASGD and a single stage of QASVRG with different parameter choices. Based on Algorithm 1, we also introduce the multi-stage QASVRG described in Algorithm 2. Notably, we provide more information about Bisearch (line 3 of Algorithm 1) in the Appendix, including the whole algorithm and the corresponding complexity analysis obtained from Hinder et al. (2020) (Algorithm 3, Lemma A.5).
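The three gradient estimates above can be sketched in a few lines of numpy (our sketch, not the paper's code), where `grad_fi(i, x)` is a hypothetical oracle returning $\nabla f_i(x)$:

```python
import numpy as np

def full_gradient(grad_fi, n, x):
    # deterministic oracle: grad f(x) = (1/n) * sum_i grad f_i(x)
    return np.mean([grad_fi(i, x) for i in range(n)], axis=0)

def stochastic_gradient(grad_fi, n, x, rng):
    i = rng.integers(n)                       # index i sampled i.i.d. from [n]
    return grad_fi(i, x)

def minibatch_svrg_gradient(grad_fi, n, x, x_anchor, full_grad_anchor, b, rng):
    # SVRG-style estimate: grad f_I(x) - grad f_I(x_anchor) + grad f(x_anchor),
    # where the anchor point and its full gradient are fixed within a stage.
    idx = rng.integers(n, size=b)             # mini-batch I_k of size b (with replacement)
    g_x = np.mean([grad_fi(i, x) for i in idx], axis=0)
    g_a = np.mean([grad_fi(i, x_anchor) for i in idx], axis=0)
    return g_x - g_a + full_grad_anchor
```

All three estimates are unbiased for $\nabla f(x)$; they differ only in cost and variance.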
The guaranteed performance of our methods relies on the internal Assumption 3.1. Relation (8) requires $h$ to be $\bar\mu$-strongly convex; relation (9) is a generalization of $\mu$-strong $\gamma$-quasar-convexity using the Bregman divergence as the distance, and we substitute it for (3) in the following analysis. When $h = \frac{1}{2}\|\cdot\|^2$, this relation is identical to (3). Relation (10) holds in the Euclidean setting given that $f$ is $\mu$-strongly $\gamma$-quasar-convex with respect to $x^\ast$ (Hinder et al. (2020), Corollary 1).

Assumption 3.1. Suppose $f_i$ is differentiable for each $i \in [n]$. For some $\bar\mu > 0$, $0 < \gamma \le 1$ and $\mu \ge 0$, and for all $x, y \in \mathbb{R}^d$, require

$$ D_h(x, y) \ge \frac{\bar\mu}{2}\|x - y\|^2, \qquad (8) $$
$$ f(x^\ast) \ge f(x) + \frac{1}{\gamma}\langle \nabla f(x), x^\ast - x\rangle + \mu D_h(x^\ast, x), \qquad (9) $$
$$ f(x) \ge f(x^\ast) + \frac{\gamma\mu}{2 - \gamma} D_h(x^\ast, x). \qquad (10) $$

3.2. Convergence Analysis

We develop a unified analysis of QAGD, QASGD and QASVRG using the following Lyapunov function:

$$ E_k \triangleq A_k\big(f(y_k) - f(x^\ast)\big) + B_k D_h(x^\ast, z_k), \qquad (11) $$

where $A_k$ and $B_k$ are positive non-decreasing sequences to which the parameter choices shown in Tables 2, 3 and 4 are closely tied. Based on the convergence rates derived from the Lyapunov analysis, we deduce the complexity upper bound of each method in the Euclidean setting ($h = \frac{1}{2}\|\cdot\|^2$).

Algorithm 1 $(A_k, B_k, \bar y_0, t, \epsilon)$
Require: $h$ satisfies $D_h(x, y) \ge \frac{\bar\mu}{2}\|x - y\|^2$; $f \in \mathcal{F}_L$; $f \in \mathcal{Q}^\mu_\gamma$.
1: Initialize $x_0 = z_0 = y_0 = \bar y_0$ and specify $\theta_k \triangleq (\nabla_k, \alpha_k, \beta_k, \rho_k, \tilde f, b, c, \epsilon)$
2: for $k = 0, \ldots, t-1$ do
3:   $\tau_k \leftarrow$ Bisearch$(\tilde f, y_k, z_k, b, c, \epsilon)$ (search for $\tau_k \in [0, 1]$ satisfying $\tau_k\langle \nabla\tilde f(d_k), y_k - z_k\rangle \le b\|d_k - z_k\|^2 + c\big(\tilde f(y_k) - \tilde f(d_k)\big) + \epsilon$ with $d_k := \tau_k y_k + (1 - \tau_k) z_k$; see Algorithm 3 for more details)
4:   $x_{k+1} \leftarrow (1 - \tau_k) z_k + \tau_k y_k$ (coupling step)
5:   $z_{k+1} \leftarrow \arg\min_{z\in\mathbb{R}^d}\{\langle \nabla_k, z - z_k\rangle + \beta_k D_h(z, z_k) + \alpha_k D_h(z, x_{k+1})\}$ (mirror descent step)
6:   $y_{k+1} \leftarrow \arg\min_{y\in\mathbb{R}^d}\{\rho_k\langle \nabla_k, y - x_{k+1}\rangle + \frac{1}{2}\|y - x_{k+1}\|^2\}$ (gradient descent step)
7: end for
output $y_t$

Algorithm 2 $(A_k, B_k, b_k, \bar y_0, p, q, \epsilon)$
Require: $D_h(x, y) \ge \frac{\bar\mu}{2}\|x - y\|^2$; $f_i \in \mathcal{F}_L$; $f \in \mathcal{Q}^\mu_\gamma$; $q \le 1/16$.
1: Initialize $y_0 = \bar y_0$
2: for $s = 0, 1, \ldots$ do
3:   $t_s \leftarrow \Big\lceil \sqrt{\tfrac{17 L D_h(x^\ast, y_0)}{\gamma^2\bar\mu\, q f(y_0)}} \Big\rceil$ if $\mu = 0$ or $\mu > 0$ (Option II); $\;t_s \leftarrow \big\lceil \log_{1+\gamma/\sqrt{8\kappa}}\tfrac{2}{\gamma q} \big\rceil$ if $\mu > 0$ (Option I)
4:   $y_s \leftarrow$ Algorithm 1 $(A_k, B_k, y_0, t_s, \epsilon)$ (specify $\nabla_k = \nabla f_{I_k}(x_{k+1}) - \nabla f_{I_k}(y_0) + \nabla f(y_0)$, where $|I_k| = b_k$)
5:   $y_0 \leftarrow y_s$
6: end for
output $y_s$

Since the convergence results of QAGD have already been established in Hinder et al. (2020), we do not restate a separate convergence analysis of QAGD in this subsection. Instead, we carry out the convergence analysis of QAGD within the Lyapunov framework and obtain results matching Hinder et al. (2020); the relevant proofs and parameter choices are provided in Appendices C and D.

Theorem 3.2 (QASGD). Suppose Assumption 3.1 and Assumption 2.4 hold, $D_h(x^\ast, z_0) \le R^2$, $f_i \in \mathcal{F}_L$ for all $i \in [n]$, and choose any $\bar y_0 \in \mathbb{R}^d$. Then Algorithm 1 with the choice $\nabla_k = \nabla f_i(x_{k+1})$ and $A_k, B_k, \theta_k$ specified in Table 3 satisfies

$$ \mathbb{E}[f(y_t) - f(x^\ast)] \lesssim \frac{L D_h(x^\ast, z_0)}{\bar\mu t^2} + \frac{\sigma R}{\gamma\sqrt{\bar\mu t}} + \epsilon, \qquad \mu = 0, $$
$$ \mathbb{E}[f(y_t) - f(x^\ast)] \lesssim \Big(1 + \min\Big\{\frac{\gamma^2\bar\mu^2}{16}, \frac{1}{2}\Big\}\Big)^{-t}\big(f(\bar y_0) - f(x^\ast) + \mu D_h(x^\ast, z_0)\big) + \frac{\sigma^2}{\gamma^2\bar\mu\mu\, t}, \qquad \mu > 0. $$

Corollary 3.3. Consider QASGD under the same assumptions as in Theorem 3.2. Then the overall complexity of QASGD to achieve $\mathbb{E}[f(y_t) - f(x^\ast)] \le \epsilon$ is upper bounded by

$$ \tilde O\Big(\frac{\sigma^2 R^2}{\gamma^2\epsilon^2} + \frac{R}{\gamma}\sqrt{\frac{L}{\epsilon}}\Big) \quad\text{and}\quad \tilde O\Big(\frac{\sigma^2}{\gamma^2\bar\mu\mu\,\epsilon} + \frac{1}{\gamma}\sqrt{\frac{1}{\bar\mu\mu\,\epsilon}}\Big) $$

for $\mu = 0$ and $\mu > 0$ respectively, where $\tilde O(\cdot)$ hides the logarithmic factors contributed by Bisearch.
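For concreteness, in the Euclidean case $h = \frac{1}{2}\|\cdot\|^2$ the mirror descent step of Algorithm 1 has a closed-form solution, and one iteration can be sketched as follows (our sketch; `bisearch` stands in for Algorithm 3 and `grad_est` for whichever gradient estimate $\nabla_k$ is being used):

```python
import numpy as np

def qacc_step(grad_est, bisearch, y_k, z_k, alpha_k, beta_k, rho_k):
    """One Euclidean-case iteration of Algorithm 1 (h = 0.5 * ||.||^2).

    grad_est(x)    -- returns the gradient estimate at x (full, stochastic, or SVRG-style)
    bisearch(y, z) -- returns a momentum parameter tau in [0, 1] (Algorithm 3)
    """
    tau_k = bisearch(y_k, z_k)
    x_next = (1.0 - tau_k) * z_k + tau_k * y_k          # coupling step
    g = grad_est(x_next)
    # mirror descent step: argmin_z <g, z - z_k> + (beta_k/2)||z - z_k||^2
    #                                            + (alpha_k/2)||z - x_next||^2
    z_next = (beta_k * z_k + alpha_k * x_next - g) / (alpha_k + beta_k)
    # gradient descent step: argmin_y rho_k <g, y - x_next> + 0.5 ||y - x_next||^2
    y_next = x_next - rho_k * g
    return x_next, y_next, z_next
```

Note that with $\rho_k = 0$ (the QASGD choice) the gradient descent step returns $y_{k+1} = x_{k+1}$, matching the remark in the proofs that $y_k$ coincides with $x_k$ for QASGD.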
The following proposition shows that the variance of the gradient estimate used by QASVRG decreases as fast as the objective, which is the key technique in our proof for controlling the stochastic gradient variance. This proposition is also proposed in Nitanda (2016), where $f_i$ is assumed to be convex and smooth; in this paper, we circumvent the convexity of $f_i$ by using Assumption 2.7. The proof of Proposition 3.4 is postponed to Appendix B.

Proposition 3.4 (Variance upper bound). Suppose Assumption 2.7 holds, $\nabla_k = \nabla f_{I_k}(x_{k+1}) - \nabla f_{I_k}(\bar x) + \nabla f(\bar x)$, and $f_i \in \mathcal{F}_L$ for each $i \in [n]$. Then

$$ \mathbb{E}_{I_k}\|\nabla_k - \nabla f(x_{k+1})\|^2 \le \frac{4L(n - b_k)}{b_k(n - 1)}\big(f(x_{k+1}) - f(x^\ast) + f(\bar x) - f(x^\ast)\big), $$

where $|I_k| = b_k$ and $\mathbb{E}_{I_k}$ denotes the expectation with respect to the mini-batch $I_k$.

Theorem 3.5 (QASVRG (Single-stage)). Suppose Assumption 3.1 and Assumption 2.7 hold, $D_h(x^\ast, z_0) \le R^2$, $f_i \in \mathcal{F}_L$ for each $i \in [n]$, and choose any $\bar y_0 \in \mathbb{R}^d$. Then Algorithm 1 with the choice $\nabla_k = \nabla f_{I_k}(x_{k+1}) - \nabla f_{I_k}(y_0) + \nabla f(y_0)$ and $A_k, B_k, b_k, \theta_k$ specified in Table 4 satisfies

$$ \mathbb{E}[f(y_t) - f(x^\ast)] \le \frac{17 L D_h(x^\ast, z_0)}{\gamma^2\bar\mu (t+1)^2} + \Big(\frac{8p}{\gamma\sqrt{\bar\mu}} + \epsilon\Big) f(y_0), \qquad \mu = 0, $$
$$ \mathbb{E}[f(y_t) - f(x^\ast)] \le \Big(1 + \frac{\gamma}{\sqrt{8\kappa}}\Big)^{-t}\big(f(y_0) - f(x^\ast) + \mu D_h(x^\ast, z_0)\big) + \frac{1}{2}\big(f(y_0) - f(x^\ast)\big), \qquad \mu > 0 \text{ (Opt I)}, $$
$$ \mathbb{E}[f(y_t) - f(x^\ast)] \le \frac{17 L D_h(x^\ast, z_0)}{\gamma^2\bar\mu (t+1)^2} + \Big(\frac{8p}{\gamma\sqrt{\bar\mu}} + \epsilon\Big) f(y_0), \qquad \mu > 0 \text{ (Opt II)}, $$

where $p \le \gamma\sqrt{\bar\mu}/16$ is user specified.

Under Assumption 2.5, $\{D_h(x^\ast, y_s)\}$ can be uniformly bounded by a constant when the iterates $\{y_s\}$ generated by Algorithm 2 are restricted to a compact set, so we can hide the Bregman divergence inside $O(\cdot)$ and $\tilde O(\cdot)$; note that we only need this assumption when $\mu = 0$. According to Theorem 3.5, single-stage QASVRG is biased, meaning that an $\epsilon$-approximate solution cannot be produced by a single stage. For instance, the bias is upper bounded by $O\big((p/\gamma + \epsilon) f(y_0)\big)$ when $\mu = 0$, but the expectation of the optimality gap $E_t$ shrinks at each stage for small $p$ and $q$ once $t = \Omega\big(\sqrt{LR^2/(\gamma^2 f(y_0))}\big)$. Consequently we need $O(\log(1/\epsilon))$ stages to generate an $\epsilon$-approximate solution.

Corollary 3.6. Under Assumption 2.5 and the same assumptions as in Theorem 3.5, the overall complexity required for Algorithm 2 to achieve $\mathbb{E}[f(y_s) - f(x^\ast)] \le \epsilon$ is upper bounded by

$$ O\Big(\Big(n + \min\Big\{\frac{LR^2}{\gamma^2\epsilon}, \frac{\sqrt{n}}{\gamma}\sqrt{\frac{LR^2}{\epsilon}}\Big\}\Big)\log_+\frac{L^{1/2}R}{q^{1/2}\gamma\epsilon^{9/14}}\Big), $$
$$ O\Big(\Big(n + \frac{n\kappa}{\gamma^2 n + \gamma\sqrt{\kappa}}\Big)\log\frac{2}{\gamma q}\Big), \qquad O\Big(\Big(n + \frac{n\kappa}{\gamma^2 n + \gamma^{3/2}\sqrt{\kappa}}\Big)\log_+\frac{L^{1/2}R}{q^{1/2}\gamma\epsilon^{9/14}}\Big) $$

for $\mu = 0$ and $\mu > 0$ (the last two bounds, Options I and II) respectively, where $q \in (0, 1/4]$.

Proof sketch. The method we use to derive the convergence rates and complexity for each algorithm is unified. We take the difference of the Lyapunov function between consecutive iterations, then take the conditional expectation to obtain an upper bound on $\mathbb{E}[E_{k+1} - E_k]$. Finally we sum over $k$ to conclude the convergence rate and a subsequent iteration complexity for each algorithm. Combining this complexity with that of Bisearch yields the overall complexity in each corollary. For more details, see Appendices C and D.

The last two bounds for QASVRG correspond to the two parameter choices in Table 4 (Options I and II). The complexity bound derived with Option I has a more unfavorable dependency on $\kappa$, while the bound derived with Option II has a more unfavorable dependency on $\epsilon$. This suggests that Option I performs better on well-conditioned problems (e.g. $\kappa\epsilon < 1$) and Option II performs better on ill-conditioned problems (e.g. $\kappa\epsilon \ge 1$). The complexity bounds of QASGD and QASVRG include extra logarithmic factors, which come from Bisearch. Hinder et al. (2020) prove that the complexity of Bisearch is at most a logarithmic factor provided the function involved in the subroutine is $L$-smooth. In the stochastic setting, where the functions involved are a single $f_i$ or a mini-batch average $f_{I_k}$, we need to assume $f_i \in \mathcal{F}_L$ for all $i$ due to the uniform sampling. We summarize our methods and some existing methods in Table 1, including their corresponding assumptions and complexity upper bounds. To summarize, both QASGD and QASVRG achieve better complexity upper bounds than QAGD when $n$ is large, and QASGD enjoys a faster convergence rate and lower complexity than SGD under a stronger assumption. While QASVRG is potentially more computationally expensive than SGD due to the full gradient and function value access once per stage, it enjoys a theoretically faster convergence rate than SGD.
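Schematically, the unified argument above is the telescoping bound (our shorthand for the pattern used in Appendices C and D): if each step satisfies $\mathbb{E}[E_{k+1} - E_k] \le \delta_k$ for an error term $\delta_k$ collecting the noise and line-search contributions, then

$$ A_t\,\mathbb{E}[f(y_t) - f(x^\ast)] \le \mathbb{E}[E_t] \le E_0 + \sum_{k=0}^{t-1}\delta_k = A_0\big(f(y_0) - f(x^\ast)\big) + B_0 D_h(x^\ast, z_0) + \sum_{k=0}^{t-1}\delta_k , $$

so the growth of $A_t$ dictates the convergence rate, while the parameter choices in Tables 2–4 keep $\sum_k \delta_k$ under control.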
4. Simulations

In this section, we evaluate our methods on example (6) in the Euclidean setting using a synthetic dataset where each input sequence $x^{(i)} \sim \mathcal{N}(0, 1)$ coordinate-wise. Different from Hardt et al. (2016), we generate the $N$ training examples and random perturbations $\xi_t$ before training, instead of generating fresh data and random perturbations at each iteration. Thus example (6) can be reformulated as

$$ \min_{\hat\Theta}\; \hat F(\hat\Theta) \triangleq \frac{1}{N}\sum_{i=1}^{N}\frac{1}{T}\sum_{t=1}^{T}\big\|\hat y^{(i)}_t - y^{(i)}_t\big\|^2, $$

where the superscript $(i)$ indicates that the output is generated using the $i$th training example $(x^{(i)}, y^{(i)})$. Similar to Hardt et al. (2016), the actual objective in our experiments is $F(\hat\Theta) = \frac{1}{N}\sum_{i=1}^{N}\big[\frac{1}{T - T_1}\sum_{t > T_1}\|\hat y^{(i)}_t - y^{(i)}_t\|^2\big]$, where $T_1 = T/4$. We generate the true dynamical system and the data in the same way as Hardt et al. (2016), using parameters $N = 5000$, $d = 20$, $T = 500$. Following Hinder et al. (2020), we generate the initial iterate $(\hat A_0, \hat C_0, \hat D_0)$ by perturbing the parameters of the true system while keeping the spectral radius of $\hat A_0$ strictly less than 1. We choose the random seed from $\{0, 12, 24, 36, 48\}$ to generate five true LDS instances and their initializations. We only present the simulation results for seeds $\{0, 24, 48\}$ in this section; the remaining results are provided in Appendix F. Note that $\hat B$ is not a trainable parameter since $B$ is known. As described in Hardt et al. (2016), it is intractable to calculate the precise value of the quasar-convexity parameter $\gamma$ of the LDS objective, or even to estimate it. Thus we evaluate our methods with $\gamma \in \{0.5, 0.8\}$, and we observe that the imprecise $\gamma$ does not affect the performance of any of the methods involved.

Figure 1. Evaluation on three different LDS instances. We choose $\epsilon = 10^{-2}$; the stepsize is $5\times 10^{-5}$, $1\times 10^{-6}$, $1\times 10^{-4}$ for SGD; $L = 1\times 10^{6}$, $1\times 10^{8}$, $1\times 10^{5}$ for QASGD; and $L = 3\times 10^{4}$, $1\times 10^{6}$, $1\times 10^{4}$ for QASVRG in LDS1, LDS2 and LDS3 respectively. The flat line in the third column means the loss blows up to infinity with this choice of stepsize.

As shown in Table 3, $\eta$ is involved in the parameter choice of QASGD; in our simulations we choose $\eta = \min\{1/L,\, 2\gamma\|z_0 - x^\ast\|/(\sigma(t+1)^{3/2})\}$. While $\sigma$ in Assumption 2.4 depends on the compact set and the initialization, we observe that $\sigma = 1$ is a robust choice for all of our simulations, based on evaluating QASGD on the three LDS instances with $\sigma \in \{1, 10, 10^2, 10^3\}$ (see Figure 3 in Appendix F for more details). According to the analysis in Hardt et al. (2016), $F(\hat\Theta)$ is $L$-weakly smooth, and it is still unknown whether $F(\hat\Theta)$ is $L$-smooth. Given that the parameter choices of our methods involve $L$, we fine-tune the value of $L$ for QASGD and QASVRG and choose the best stepsize for SGD by extensive grid search in each instance. We use the adaptive stepsize for QAGD and GD in the same way as Hinder et al. (2020). We consider random noise $\xi_t \sim \mathcal{N}(0, 10^{-2})$ or $\xi_t = 0$ perturbing the output of the true systems. If $\xi_t \sim \mathcal{N}(0, 10^{-2})$, the interpolation assumption is violated, since $\Theta$ is no longer the global minimizer of the objective generated by each training example; thus we only evaluate SGD and QASGD in this case. In Algorithm 2, it may be difficult to calculate $t_s$, especially when $L$ is unknown. Therefore we specify $t_s$ to be relatively large (we choose $t_s = 10^4$) and use an appropriate restart scheme in Algorithm 1 to boost the performance of QASVRG. Following Nitanda (2016), when the relation $\langle \nabla_k, y_{k+1} - y_k\rangle > 0$ holds, we break Algorithm 1 to return $y_s$ and start the next stage.
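A one-line version of this restart test (our sketch; `grad_k` stands for the current gradient estimate $\nabla_k$ with all parameters flattened into vectors):

```python
import numpy as np

# Heuristic restart test following Nitanda (2016): end the current QASVRG stage
# early once the gradient estimate and the latest y-update point the same way.
def should_restart(grad_k, y_next, y_prev):
    return float(np.dot(grad_k, y_next - y_prev)) > 0.0

# e.g. inside the loop of Algorithm 1:
#   if should_restart(g, y_next, y_k):
#       break  # return y_s and start the next stage
```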
Since we generate the initial iterate with $\rho(\hat A_0) < 1$, we do not use the gradient clipping or projection proposed in Hardt et al. (2016) during training. We generate the error bars in Figure 1 by running each stochastic algorithm three times, averaging the results, and taking the pointwise maximum and minimum values as the upper and lower bars. The simulation results in Figure 1 validate our methods and show their superiority in terms of convergence speed and overall complexity. In addition, both QASGD and QASVRG are robust to the random sampling of the stochastic gradient. Code is available at https://github.com/QiangFu09/Stochastic-quasar-convex-acceleration.

There is an interesting phenomenon in our simulations. While the convergence rate of QASGD matches the rate of SGD when $t$ is large, Figure 1 still shows the substantial superiority of QASGD over QASVRG. In fact, QASGD enjoys a convergence rate of $O(1/t^2 + 1/\sqrt{t})$, indicating a rapid initial phase in which the $O(1/t^2)$ term dominates the convergence. We speculate that in most of our simulations QASGD does not escape this initial phase and thus enjoys fast convergence. These results also prompt a reconsideration of whether the bounded gradient assumption is needed for QASGD: the LDS objective does not satisfy this assumption, yet the performance of QASGD on LDS is still favorable and robust. We believe this assumption is an artifact of the theoretical analysis and may not be necessary.

5. Discussion

In this paper, we propose QASGD and QASVRG, which achieve fast convergence and low complexity under their corresponding assumptions. We present our algorithms in a unified framework, using a single energy-based analysis to establish the convergence rates and complexity of QAGD, QASGD and QASVRG. We close with a brief discussion of possible future work.

First, we introduced the bounded gradient assumption for QASGD, but it remains to be seen whether this assumption can be weakened. Given that Gower et al. (2021) and Jin (2020) establish the convergence of SGD for quasar-convex functions under the ER condition and the bounded variance assumption respectively, it would be of interest to see whether these weaker assumptions can be applied to QASGD.

Second, QAGD is proven near-optimal in Hinder et al. (2020) based on the lower bound they establish for first-order deterministic methods. In future work we hope to establish a worst-case complexity lower bound for first-order stochastic methods applied to quasar-convex functions under different assumptions; we expect such bounds will show that our methods are nearly optimal as well. Moreover, higher-order methods usually lead to better convergence rates. Nesterov & Polyak (2006) propose the cubic regularized Newton method to optimize star-convex functions ($\gamma = 1$), so we are also interested in whether higher-order methods can improve the convergence of our methods.

Finally, we hope to exploit more applications of quasar-convex functions in machine learning.
Acknowledgements We would like to thank our anonymous reviewers for constructive comments, thank Kyurae Kim (University of Pennsylvania) for helpful discussions on citations and thank Oliver Hinder (University of Pittsburgh) and Nimit Sohoni (Stanford University) for sharing codes of LDS. Allen-Zhu, Z. Katyusha: The first direct acceleration of stochastic gradient methods. The Journal of Machine Learning Research, 18(1):8194 8244, 2017. Bartlett, P., Helmbold, D., and Long, P. Gradient descent with identity initialization efficiently learns positive definite linear transformations by deep residual networks. In International conference on machine learning, pp. 521 530. PMLR, 2018. Beck, A. and Teboulle, M. A fast iterative shrinkagethresholding algorithm for linear inverse problems. SIAM journal on imaging sciences, 2(1):183 202, 2009. Bottou, L. and Le Cun, Y. On-line learning for very large data sets. Applied Stochastic Models in Business and Industry, 21(2):137 151, 2005. doi: https://doi.org/10. 1002/asmb.538. URL https://onlinelibrary. wiley.com/doi/abs/10.1002/asmb.538. Bu, J. and Mesbahi, M. A note on nesterov s accelerated method in nonconvex optimization: a weak estimate sequence approach. ar Xiv preprint ar Xiv:2006.08548, 2020. Dua, D. and Graff, C. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml. Foster, D. J., Sekhari, A., and Sridharan, K. Uniform convergence of gradients for non-convex learning and optimization. Advances in Neural Information Processing Systems, 31, 2018. Ge, R., Lee, J. D., and Ma, T. Matrix completion has no spurious local minimum. Advances in neural information processing systems, 29, 2016. Ghadimi, S. and Lan, G. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization i: A generic algorithmic framework. SIAM Journal on Optimization, 22(4):1469 1492, 2012. Ghadimi, S. and Lan, G. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathematical Programming, 156(1):59 99, 2016. Gower, R., Sebbouh, O., and Loizou, N. Sgd for structured nonconvex functions: Learning rates, minibatching and interpolation. In International Conference on Artificial Intelligence and Statistics, pp. 1315 1323. PMLR, 2021. Guminov, S. and Gasnikov, A. Accelerated methods for α-weakly-quasi-convex problems. ar Xiv preprint ar Xiv:1710.00797, 2017. G urb uzbalaban, M., Ozdaglar, A., and Parrilo, P. A globally convergent incremental newton method. Mathematical Programming, 151(1):283 313, 2015. Accelerated Stochastic Optimization Methods under Quasar-convexity Hardt, M., Ma, T., and Recht, B. Gradient descent learns linear dynamical systems. ar Xiv preprint ar Xiv:1609.05191, 2016. Hazan, E. and Kale, S. Beyond the regret minimization barrier: optimal algorithms for stochastic strongly-convex optimization. The Journal of Machine Learning Research, 15(1):2489 2512, 2014. Hinder, O., Sidford, A., and Sohoni, N. Near-optimal methods for minimizing star-convex functions and beyond. In Conference on learning theory, pp. 1894 1938. PMLR, 2020. Jin, J. On the convergence of first order methods for quasarconvex optimization. ar Xiv preprint ar Xiv:2010.04937, 2020. Kleinberg, B., Li, Y., and Yuan, Y. An alternative view: When does sgd escape local minima? In International Conference on Machine Learning, pp. 2698 2707. PMLR, 2018. Kulunchakov, A. and Mairal, J. Estimate sequences for stochastic composite optimization: Variance reduction, acceleration, and robustness to noise. 
The Journal of Machine Learning Research, 21(1):6184 6235, 2020. Ma, S., Bassily, R., and Belkin, M. The power of interpolation: Understanding the effectiveness of sgd in modern over-parametrized learning. In International Conference on Machine Learning, pp. 3325 3334. PMLR, 2018. Ma, T. Why do local methods solve nonconvex problems?, 2020. Mei, J., Xiao, C., Szepesvari, C., and Schuurmans, D. On the global convergence rates of softmax policy gradient methods. In International Conference on Machine Learning, pp. 6820 6829. PMLR, 2020. Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. Robust stochastic approximation approach to stochastic programming. SIAM Journal on optimization, 19(4):1574 1609, 2009. Nesterov, Y. A method for solving the convex programming problem with convergence rate o(1/k2). Proceedings of the USSR Academy of Sciences, 269:543 547, 1983. Nesterov, Y. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2003. Nesterov, Y. and Polyak, B. T. Cubic regularization of newton method and its global performance. Mathematical Programming, 108(1):177 205, 2006. Nesterov, Y., Gasnikov, A., Guminov, S., and Dvurechensky, P. Primal-dual accelerated gradient descent with line search for convex and nonconvex optimization problems. ar Xiv preprint ar Xiv:1809.05895, 2018. Nitanda, A. Accelerated stochastic gradient descent for minimizing finite sums. In Artificial Intelligence and Statistics, pp. 195 203. PMLR, 2016. Rakhlin, A., Shamir, O., and Sridharan, K. Making gradient descent optimal for strongly convex stochastic optimization. ar Xiv preprint ar Xiv:1109.5647, 2011. Recht, B., Re, C., Wright, S., and Niu, F. Hogwild!: A lockfree approach to parallelizing stochastic gradient descent. Advances in neural information processing systems, 24, 2011. Schmidt, M. and Roux, N. L. Fast convergence of stochastic gradient descent under a strong growth condition. ar Xiv preprint ar Xiv:1308.6370, 2013. Vaswani, S., Bach, F., and Schmidt, M. Fast and faster convergence of sgd for over-parameterized models and an accelerated perceptron. In The 22nd international conference on artificial intelligence and statistics, pp. 1195 1204. PMLR, 2019. Wang, J.-K. and Wibisono, A. Continuized acceleration for quasar convex functions in non-convex optimization. ar Xiv preprint ar Xiv:2302.07851, 2023. Wilson, A. C., Recht, B., and Jordan, M. I. A lyapunov analysis of accelerated methods in optimization. J. Mach. Learn. Res., 22:113 1, 2021. Zhou, Y., Yang, J., Zhang, H., Liang, Y., and Tarokh, V. Sgd converges to global minimum in deep learning via star-convex path. ar Xiv preprint ar Xiv:1901.00451, 2019. Accelerated Stochastic Optimization Methods under Quasar-convexity A. Helpful Lemmas and Assumptions Lemma A.1 (Three-point identity). For all x dom h and y, z int(dom h) Dh(x, y) Dh(x, z) = h(y) h(z), x y Dh(y, z), (12) where Dh(x, y) = h(x) h(y) h(y), x y . Lemma A.2 (Fenchel-Young inequality). For all x, y Rd and p = 0, we have p p 1 , (13) where denotes the conjugate norm of . Lemma A.3 (Hinder et al. (2020), Lemma 2). Let f : Rd R be differentiable and let y, z Rd. For τ R define xτ τy + (1 τ)z. For any c 0 there exists τ [0, 1] such that τ f(xτ), y z c (f(y) f(xτ)) . (14) In fact, we can slacken condition (14) to some extent, and we can find an appropriate momentum parameter τ satisfying τ f(xτ), y z τ 2b y z 2 c (f(y) f(xτ)) + ϵ, (15) for b, c, ϵ 0. 
Existence of τ satisfying condition (14) implies the existence of τ satisfying condition (15). Lemma A.4. If f : Rd R is L-smooth, then for any x, y Rd f(y) f(x) + f(x), y x + L 2 y x 2. (16) Now We introduce an algorithm to effectively search the good τ for any L-smooth functions. Lemma A.5 (Hinder et al. (2020), C.2). For L-smooth f : Rd R, x, v Rn and scalars b, c, ϵ 0, Algorithm (3) computes α [0, 1] satisfying (15) with at most 6 + 3 log+ 2 (4 + c) min 2L3 b3 , L y z 2 function and gradient evaluations, where log+( ) = max{log( ), 1}. Algorithm 3 Bisearch(f, y, z, b, c, ϵ,[guess]) Require: f is L-smooth; z, y Rd; c 0; guess [0, 1] if provided. guess can be the momentum parameter under convexity or other. 1: Define g(α) = f(τy + (1 τ)z) and p = b z y 2 2: if guess provided and cg(guess) + guess (g (guess) guess p) cg(1) then return guess 3: if g (1) p + ϵ then return 1; 4: else if c = 0 or g(0) g(1) + ϵ/c then return 0; 5: δ 1 g (1)/L z y 2 6: lo 0, hi δ, τ δ 7: while cg(τ) + τ(g (τ) τp) > cg(1) + ϵ do 8: τ (lo + hi)/2 9: if g(τ) g(δ) then hi τ 10: else lo τ 11: end while output τ Proposition A.6. Based on Assumption 3.1, we have κ γ 2 γ , where κ = L µµ and µ > 0. Accelerated Stochastic Optimization Methods under Quasar-convexity γµ 2 γ Dh(x , x) f(x) f(x ) L µ Dh(x , x) Thus, we have γµ 2 γ L µ . Furthermore, we have κ q Lemma A.7. Suppose Assumption 2.7 holds, and each fi FL. If k = fi(xk+1) fi(yk) + 1 n Pn i=1 fi(yk), we can obtain Ei k f(xk+1) 2 4L (f(xk+1) f(x ) + f(yk) f(x )) (17) Proof. Since f is L-smooth, for any x, y Rd we have f(x) f(y) + f(y), x y + L 2 x y 2 g(x) Let g( x) = 0, and x = y 1 L f(y) is the minimizer of g(x). And we have f(x ) f(x) g( x) = f(y) 1 2L f(y) 2 f(y) 2 2L(f(y) f(x )) Since f(x ) = 0, we have f(y) f(x ) 2L(f(y) f(x )) (18) By (18), we can upper bound E k f(xk+1) 2 as follows Ei k f(xk+1) 2 = Ei fi(xk+1) fi(yk) Ei[ fi(xk+1) fi(yk)] 2 Ei fi(xk+1) fi(yk) 2 = Ei fi(xk+1) f(x ) + f(x ) fi(yk) 2 2Ei fi(xk+1) f(x ) 2 + 2Ei f(x ) fi(yk) 2 4LEi[fi(xk+1) fi(x )] + 4LEi[fi(yk) fi(x )] = 4L (f(xk+1) f(x ) + f(yk) f(x )) , where the first inequality uses the relation E X E[X] 2 E X 2 for all random variable X; the second inequaity uses the relation a + b 2 2 a 2 + 2 b 2. Lemma A.8. Let {ξi}n i=1 be a set of vectors in Rd and ξ denote an average of {ξi}n i=1. Let I denote a uniform random variable representing a subset of {1, 2, ..., n} with its size equal to b. Then, it follows that, = n b b(n 1)Ei ξi ξ 2. We orient our readers to the supplementary materials of Nitanda (2016) for proof details of the above lemma. B. Proof of Proposition 3.4 Based on Lemma A.7 and Lemma A.8, we prove 3.4. Proof. Let i k = fi(xk+1) fi( x) + f( x) and k = 1 bk P i Ik i k. Using Lemma (A.8), we have EIk k f(xk+1) 2 = 1 n 1 Ei i k f(xk+1) 2 4L n bk bk(n 1) (f(xk+1) f(x ) + f( x) f(x )) , (19) where the inequality follows from Lemma (A.7). Accelerated Stochastic Optimization Methods under Quasar-convexity C. Proofs of Convergence Rates C.1. Proof of QAGD γ-quasar-convex (µ = 0) 4L (k + 1)2, Bk = 1 (αk, βk, ρk, f, b, c, ϵ) 0, γ L, f, 0, γAk µ-strongly γ-quasar-convex (µ > 0) Ak = (1 + γ/2 κ)k, Bk = µAk (αk, βk, ρk, f, b, c, ϵ) γµ, γµBk bk , 1 Table 2. 
Parameter choices for QAGD we begin with Algorithm 1, k = f(xk+1) and parameters specified in Table 2 using Lyapunov function (11): Case 1: µ = 0 For k = f(xk+1), we begin with Algorithm 1 using Lyapunov function (11): Ek+1 Ek (12) = h(zk+1) h(zk), x zk+1 Dh(zk+1, zk) + Ak+1(f(yk+1) f(x )) Ak(f(yk) f(x )) βk f(xk+1), x zk+1 Dh(zk+1, zk) + Ak+1(f(yk+1) f(x )) Ak(f(yk) f(x )) βk f(xk+1), x zk + 1 βk f(xk+1), zk zk+1 µ 2 zk+1 zk 2 + Ak+1(f(yk+1) f(x )) Ak(f(yk) f(x )) (13) 1 βk f(xk+1), x zk + 1 2 µβ2 k f(xk+1) 2 + Ak+1(f(yk+1) f(x )) Ak(f(yk) f(x )) βk f(xk+1), x xk+1 + τk βk f(xk+1), yk zk + 1 2 µβ2 k f(xk+1) 2 + Ak+1(f(yk+1) f(xk+1)) + (Ak+1 Ak)(f(xk+1) f(x )) + Ak(f(xk+1) f(yk)) βk (f(x ) f(xk+1)) + 1 βk (c(f(yk) f(xk+1)) + ϵ) + 1 2 µβ2 k f(xk+1) 2 + Ak+1(f(yk+1) f(xk+1)) + (Ak+1 Ak)(f(xk+1) f(x )) + Ak(f(xk+1) f(yk)) The second inequality follows from the Fenchel-Young inequality, and the last inequality follows from the quasar-convexity of f and (15) (Lemma A.3). With the choice of parameter summarized in Table 2, we obtain the following bound: βk (f(x ) f(xk+1)) + 1 βk (c(f(yk) f(xk+1)) + ϵ) + 1 2 µβ2 k f(xk+1) 2 + Ak+1(f(yk+1) f(xk+1)) + (Ak+1 Ak)(f(xk+1) f(x )) + Ak(f(xk+1) f(yk)) = 1 2 µβ2 k f(xk+1) 2 + Ak+1(f(yk+1) f(xk+1)) + (Ak+1 Ak) (16) 1 2 µβ2 k Ak+1 f(xk+1) 2 + (Ak+1 Ak) The last inequality follows from (16). In fact, (20) is implied by the gradient descent with ρk = 1/L and L-smoothness. f(yk+1) f(xk+1) f(xk+1), yk+1 xk+1 + L 2 yk+1 xk+1 2 = 1 2L f(xk+1) 2 (20) Accelerated Stochastic Optimization Methods under Quasar-convexity With the choice of βk and Ak, we have 1 2 µβ2 k = a2 k 2 µγ2 = µγ2 32L2 (2k + 3)2 µγ2 8L2 (k + 2)2 = Ak+1 Thus, we obtain the final bound: Ek+1 Ek 1 2 µβ2 k Ak+1 f(xk+1) 2 + (Ak+1 Ak) 2 ϵ (Ak+1 Ak) By summing both sides of the inequality above, we obtain f(yt) f(x ) A 1 t (A0(f(y0) f(x )) + Dh(x , z0)) + ϵ 2 (16) A 1 t 8 y0 x 2 + Dh(x , z0) + ϵ 4 + 1 Dh(x , z0) + ϵ 8LDh(x , z0) γ2 µ(t + 1)2 + ϵ γ2 µ(t + 1)2 + ϵ Case 2: µ > 0 Ek+1 Ek (12) = bk Dh(x , zk+1) Bk h(zk+1) h(zk), x zk+1 Bk Dh(zk+1, zk) + Ak+1(f(yk+1) f(x )) Ak(f(yk) f(x )) = bk Dh(x , zk+1) + αk Bk βk h(zk+1) h(xk+1), x zk+1 + Bk βk f(xk+1), x zk+1 Bk Dh(zk+1, zk) + Ak+1(f(yk+1) f(x )) Ak(f(yk) f(x )) (12) = bk αk Bk Dh(x , zk+1) + αk Bk βk (Dh(x , xk+1) Dh(zk+1, xk+1)) + Bk βk f(xk+1), x zk βk f(xk+1), zk zk+1 Bk Dh(zk+1, zk) + Ak+1(f(yk+1) f(x )) Ak(f(yk) f(x )) βk Dh(x , xk+1) µαk Bk 2βk zk+1 xk+1 2 + Bk βk f(xk+1), x zk βk f(xk+1), zk zk+1 µBk 2 zk+1 zk 2 + Ak+1(f(yk+1) f(x )) Ak(f(yk) f(x )) βk Dh(x , xk+1) µαk Bk 2βk zk+1 xk+1 2 + Bk βk f(xk+1), x xk+1 c(f(yk) f(xk+1)) + b xk+1 zk 2 + Bk βk f(xk+1), zk zk+1 µBk 2 zk+1 zk 2 + Ak+1(f(yk+1) f(x )) Ak(f(yk) f(x )) βk Dh(x , xk+1) + Bk βk f(xk+1), zk zk+1 + µαk Bk βk f(xk+1), x xk+1 + Ak(f(yk) f(xk+1)) + Ak+1(f(yk+1) f(x )) Ak(f(yk) f(x )) The first equality and the third equality follows from Lemma A.1; the second equality follows from mirror descent. The first inequality follows from the strong convexity of h, and the second inequality follows from (15) (Lemma A.3). With the choice of αk and βk, we have αk Bk/βk = bk, which explains the first inequality. 
Moreover, with the choice of Bk and Observation A.6, we have αk βk = bk Bk = γ 2 κ γ Accelerated Stochastic Optimization Methods under Quasar-convexity Combined with the initial bound and the relation above, we obtain the following bound: Ek+1 Ek αk Bk βk Dh(x , xk+1) + Bk βk f(xk+1), zk zk+1 + µαk Bk βk f(xk+1), x xk+1 + Ak(f(yk) f(xk+1)) + Ak+1(f(yk+1) f(x )) Ak(f(yk) f(x )) βk Dh(x , xk+1) + Bk βk f(xk+1), zk zk+1 µBk 4 zk+1 zk 2 + Bk βk f(xk+1), x xk+1 + Ak+1(f(yk+1) f(xk+1)) + ak(f(xk+1) f(x )) (13)(9) αk Bk βk Dh(x , xk+1) + Bk µβ2 k f(xk+1) 2 + γBk βk (f(x ) f(xk+1) µDh(x , xk+1)) + Ak+1(f(yk+1) f(xk+1)) + ak(f(xk+1) f(x )) µβ2 k f(xk+1) 2 + Ak+1(f(yk+1) f(xk+1)) The third inequality follows from the Fenchel-Young inequality of h and the strong quasar-convexity of f. The fifth inequality follows from L-smoothness of f. With the choice of Bk, βk and Ak, we have Bk µβ2 k = (γ/2 κ)2Ak which explains the last inequality. Therefore, we obtain f(yt) f(x ) 1 + γ 2 κ t (f(y0) f(x ) + µDh(x , z0)) = 1 + γ 2 κ C.2. Proof of Theorem 3.2 γ-quasar-convex (µ = 0) Ak = η(k + 1)2, η = min µ L, q σ2 γ (t+1)3/2 (αk, βk, ρk, f, b, c, ϵ) 0, γ ak , 0, fi, 0, γAk µ-strongly γ-quasar-convex (µ > 0) Step 1: Choose Ak, Bk, θk as follows until convergence Ak = 1 + min γ2 µ2/16, 1/2 k, Bk = µAk (αk, βk, ρk, f, b, c, ϵ) γµ 2 bk , 0, fi, γ µµ Step 2: Restart and choose Ak, Bk, θk as follows 36 k + max{48/γ2 µ2, 5} 2, Bk = µAk (αk, βk, ρk, f, b, c, ϵ) γµ 2 bk , 0, fi, γ µµ Table 3. Parameter choices for QASGD Note that yk is identical to xk since ρk = 0. Thus we can substitute yk in Lyapunov functions (11) with xk. We begin with Algorithm 1, k = fi(xk+1) and parameters specified in Table 3 using Lyapunov function (11): Accelerated Stochastic Optimization Methods under Quasar-convexity Case 1: µ = 0 Ek+1 Ek = h(zk+1) h(zk), x zk+1 Dh(zk+1, zk) + Ak+1(f(xk+1) f(x )) Ak(f(xk) f(x )) βk k, x zk+1 Dh(zk+1, zk) + Ak+1(f(xk+1) f(x )) Ak(f(xk) f(x )) βk k, x zk + 1 βk k, zk zk+1 µ 2 zk+1 zk 2 + Ak+1(f(xk+1) f(x )) Ak(f(xk) f(x )) βk k, x zk + 1 2 µβ2 k k 2 + Ak+1(f(xk+1) f(x )) Ak(f(xk) f(x )) βk k, x xk+1 + τk βk k, xk zk + 1 2 µβ2 k k 2 + Ak βk k, x xk+1 + 1 βk (c(fi(xk) fi(xk+1)) + ϵ) + 1 2 µβ2 k k 2 + Ak where Ak ak(f(xk+1) f(x )) + Ak(f(xk+1) f(xk)). The second equality follows from mirror descent; the first inequality follows from the strong convexity of h; the second inequality follows from the Fenchel-Young inequality, and the last inequality follows from (15) (Lemma A.3). We take the expectation of both sides of the initial bound and obtain E[Ek+1 Ek] 1 βk f(xk+1), x xk+1 + c βk (f(xk) f(xk+1)) + 1 2 µβ2 k E k 2 + ak + ak(f(xk+1) f(x )) + Ak(f(xk+1) f(xk)) βk (f(x ) f(xk+1)) + Ak(f(xk) f(xk+1)) + 1 2 µβ2 k E k 2 + ak + ak(f(xk+1) f(x )) + Ak(f(xk+1) f(xk)) 1 2 µβ2 k E k 2 + ak By summing both sides of the inequality above, we obtain E[f(xt) f(x )] A 1 t A0(f(x0) f(x )) + Dh(x , z0) + 1 2 µβ2 k E k 2 ! 2Dh(x , z0) + 1 2 µβ2 k E k 2 ! 
2Dh(x , z0) η(t + 1)2 + σ2η γ2 µ(t + 1) + ϵ 2LDh(x , z0) µ(t + 1)2 + 2σ µ(t + 1)2 + 2σR γ µ t + 1 + ϵ Case 2: µ > 0 Ek+1 Ek = bk Dh(x , zk+1) Bk h(zk+1) h(zk), x zk+1 Bk Dh(zk+1, zk) + ak(f(xk+1) f(x )) + Ak(f(xk+1) f(xk)) = bk Dh(x , zk+1) + αk Bk βk h(zk+1) h(xk+1), x zk+1 + Bk βk k, x zk+1 Bk Dh(zk+1, zk) + ak(f(xk+1) f(x )) + Ak(f(xk+1) f(xk)) (12) = bk αk Bk Dh(x , zk+1) + αk Bk βk (Dh(x , xk+1) Dh(zk+1, xk+1)) + Bk βk k, zk zk+1 Bk Dh(zk+1, zk) + ak(f(xk+1) f(x )) + Ak(f(xk+1) f(xk)) Accelerated Stochastic Optimization Methods under Quasar-convexity The second equality follows from mirror descent. With the choice of βk, αk and Observation A.6, we have bk = αk Bk/βk and αk/βk 1/2. Therefore, We obtain the following bound: Ek+1 Ek αk Bk βk Dh(x , xk+1) µαk Bk 2βk zk+1 xk+1 2 + Bk βk k, x xk+1 + τk Bk βk k, xk zk βk k, zk zk+1 µBk 2 zk+1 zk 2 + Ak βk Dh(x , xk+1) µαk Bk 2βk zk+1 xk+1 2 + Bk βk k, x xk+1 + Bk βk (c(fi(xk) fi(xk+1)) + b zk xk+1 2) + Bk βk k, zk zk+1 µBk 2 zk+1 zk 2 + Ak βk Dh(x , xk+1) µBk 4 zk+1 zk 2 + Bk βk k, zk zk+1 + Bk βk k, x xk+1 + Ak(fi(xk) fi(xk+1)) + Ak βk Dh(x , xk+1) + Bk µβ2 k k 2 + Bk βk k, x xk+1 + Ak + Ak(fi(xk) fi(xk+1)) where Ak ak(f(xk+1) f(x )) + Ak(f(xk+1) f(xk)). The second inequality follows from (15) (Lemma A.3); the second inequality follows from the triangle inequality and the last inequality follows from the Fenchel-Young inequality. We take the expectation of both sides of the inequality above and obtain E[Ek+1 Ek] αk Bk βk Dh(x , xk+1) + Bk µβ2 k E k 2 + Bk βk f(xk+1), x xk+1 + ak(f(xk+1) f(x )) βk Dh(x , xk+1) + Bk µβ2 k E k 2 + γBk βk (f(x ) f(xk+1) µDh(x , xk+1)) + ak(f(xk+1) f(x )) 2βk Dh(x , xk+1) + 2µ2Bk µβ2 k x xk+1 2 + σ2Bk µ2β2 k γµBk Dh(x , xk+1) + σ2Bk Next We prove that 4µ2Bk/ µ2β2 k γµBk/2βk. It suffices to prove that 4µ/ µ2βk γ/2. 4µ µ2βk = 8 ak γ µ2Ak 8 γ µ2 γ2 µ2 Thus we obtain the final bound: E[Ek+1 Ek] σ2Bk µβ2 k By summing both sides of the inequality above and using the notation min{γ2 µ2/16, 1/2}, we obtain E[f(xt) f(x )] A 1 t f(x0) f(x ) + µDh(x , z0) + f(x0) f(x ) + µDh(x , z0) + 4σ2(Ak+1 Ak)2 f(x0) f(x ) + µDh(x , z0) + f(x0) f(x ) + µDh(x , z0) + 4σ2q At = (1 + ) t E0 + µσ2 Accelerated Stochastic Optimization Methods under Quasar-convexity In Step 1, the algorithm is run until convergence and we let x0 be the last iterate of the Step 1. Assume E[f(x0) f(x ) + µDh(x , z0)] µσ2 4µ , and we restart the algorithm using the parameters in Step 2 and the notation m = max n 48 γ2 µ2 , 5 o . Then We obtain E[f(xt) f(x )] A 1 t f(x0) f(x ) + µDh(x , z0) + 4σ2(Ak+1 Ak)2 µσ2(2k + 2m + 1)2 9γ2µ(k + m)2 γ2 µµ(t + m)2 + 36σ2 γ2 µµ(t + m) Additionally, we need to verify that the parameters in Step 2 satisfy two essential relations, αk/βk 1/2 and 4µ/ µ2β2 k γ/2, which are key to obtaining the final bound of Ek+1 Ek. αk βk = Ak+1 Ak Ak = 2k + 2m + 1 (k + m)2 2m + 1 4µ µ2β2 k 8(Ak+1 Ak) γ µ2Ak 8 γ µ2 2m + 1 m2 8 γ µ2 3 m 8 γ µ2 γ2 µ2 C.3. Proof of Theorem 3.5 γ-quasar-convex (µ = 0) 16L(k + 1)2, Bk = 1 Batchsize bk = l γ µn(2k+3) 2(n 1)p+γ µ(2k+3) m , p γ µ (αk, βk, ρk, f, b, c, ϵ) 0, γ 2 ak , 1 L, f Ik, 0, γAk 2 ak , γϵf(y0) µ-strongly γ-quasar-convex (µ > 0) Ak = (1 + γ/ 8κ)k, Bk = µAk Batchsize bk = l 8n( 8κ+γ) γ(n 1)+8( (αk, βk, ρk, f, b, c, ϵ) γµ L, f Ik, γ µµ 16L(k + 1)2, Bk = 1 Batchsize bk = l γ µn(2k+3) 2(n 1)p+γ µ(2k+3) m , p γ µ (αk, βk, ρk, f, b, c, ϵ) 0, γ 2 ak , 1 L, f Ik, 0, γAk 2 ak , γϵf(y0) Table 4. 
Parameter choices for QASVRG We begin with Algorithm 1, k = f Ik(xk+1) f Ik(y0) + f(y0) and parameters specified in Table 4 using Lyapunov function (11): Accelerated Stochastic Optimization Methods under Quasar-convexity case 1: µ = 0 Ek+1 Ek = h(zk+1) h(zk), x zk+1 Dh(zk+1, zk) + Ak+1(f(yk+1) f(x )) Ak(f(yk) f(x )) βk k, x zk+1 Dh(zk+1, zk) + Ak+1(f(yk+1) f(x )) Ak(f(yk) f(x )) βk k, x zk + 1 2 µβ2 k k 2 + Ak+1(f(yk+1) f(x )) Ak(f(yk) f(x )) βk k, x xk+1 + τk βk k, yk zk + 1 2 µβ2 k k 2 + Ak+1(f(yk+1) f(xk+1)) + (Ak+1 Ak)(f(xk+1) f(x )) + Ak(f(xk+1) f(yk)) βk k, x xk+1 + 1 βk (c(f Ik(yk) f Ik(xk+1)) + ϵ) + 1 2 µβ2 k k 2 + Ak+1(f(yk+1) f(xk+1)) + (Ak+1 Ak)(f(xk+1) f(x )) + Ak(f(xk+1) f(yk)) + τk βk f(y0) f Ik(y0), yk zk The first equality follows from the mirror descent; the third equality follows from the momentum step; the first inequality follows from the Fenchel-Young inequality, and the last inequality follows from (15) (Lemma A.3). Then we take the expectation with respect to the history of random variable Iij with its size equal to bi, where i = 0, 1, ..., t 1 and j = 1, 2, ..., n bi , and we obtain E[Ek+1 Ek] 1 βk f(xk+1), x xk+1 + c βk (f(yk) f(xk+1)) + 1 2 µβ2 k EIk k 2 + ak(f(xk+1) f(x )) + Ak+1E [f(yk+1) f(xk+1)] + Ak(f(xk+1) f(yk)) + akf(y0)ϵ ak(f(x ) f(xk+1)) + 1 2 µβ2 k EIk k f(xk+1) 2 + Ak+1E [f(yk+1) f(xk+1)] + 1 2 µβ2 k f(xk+1) 2 + akf(y0)ϵ The first inequality follows from E k 2 = EIk k 2 , as k corresponds to the mini-batch in the kth iteration; the second inequality follows from the quasar-convexity of f and the relation EIk k 2 = EIk k f(xk+1) 2 + f(xk+1) 2. By L-smoothness and the gradient descent in Algorithm 1, we have the following relation: E[f(yk+1) f(xk+1)] E[ f(xk+1), yk+1 xk+1 ] + L 2 E yk+1 xk+1 2 L f(xk+1) 2 + 1 2L f(xk+1) 2 + 1 2LE k f(xk+1) 2 Using the relation above, we obtain the following bound: E[Ek+1 Ek] ak(f(x ) f(xk+1)) + 1 2 µβ2 k EIk k f(xk+1) 2 + Ak+1E [f(yk+1) f(xk+1)] + 1 2 µβ2 k f(xk+1) 2 + akf(y0)ϵ ak(f(x ) f(xk+1)) + 1 2 µβ2 k + Ak+1 EIk k f(xk+1) 2 + 1 2 µβ2 k Ak+1 f(xk+1) 2 + akf(y0)ϵ ak(f(x ) f(xk+1)) + Ak+1 L EIk k f(xk+1) 2 + akf(y0)ϵ (19) ak(f(x ) f(xk+1)) + 4Ak+1δk(f(xk+1) f(x )) + 4Ak+1δk(f(y0) f(x )) + akf(y0)ϵ Accelerated Stochastic Optimization Methods under Quasar-convexity We use the notation δk n bk bk(n 1). The third inequality follows from the relation 1/ µβ2 k Ak+1/L. Actually, with the choice of βk and Ak, we have 1 µβ2 k = 4 a2 k µγ2 = 4η2(2k + 3)2 µγ2 16η2(k + 2)2 where η = γ2 µ/16L. The last inequality follows from Proposition 3.1 (19). Next we prove the relation 4Lδk/βk p. It suffices to prove δk γp/8L ak. δk = n bk bk(n 1) 2np 2(n 1)p + γ µ(2k + 3) 2(n 1)p + γ µ(2k + 3) γ µn(2k + 3) = 2p γ µ(2k + 3) = γp 8L ak Thus we have 4Ak+1δk p Ak+1βk/L, by which we obtain the final bound of E[Ek+1 Ek]. E[Ek+1 Ek] ak(f(x ) f(xk+1)) + 4Ak+1δk(f(xk+1) f(x )) + 4Ak+1δk(f(y0) f(x )) + akf(y0)ϵ ak p Ak+1βk (f(x ) f(xk+1)) + p Ak+1βk L (f(y0) f(x )) + akf(y0)ϵ γ µ (f(y0) f(x )) + akf(y0)ϵ γ µ f(y0) + akf(y0)ϵ The third inequality follows from two relations, ak p Ak+1βk/L and Ak+1βk/L 8 ak/γ µ and the last inequality follows from f(x ) 0. 
With the choice of Ak and βk, we have L = γp Ak+1 2L ak γ2 µ(k + 2)2 32L(2k + 3) γ2 µ(2k + 3) 2L ak = γ(k + 2)2 2L(2k + 3) γ(2k + 3) Summing both sides of the inequality about the final bound, we obtain E[f(yt) f(x )] A 1 t (A0(f(y0) f(x )) + Dh(x , z0)) + A 1 t γ µ f(y0) + f(y0)ϵ 17LDh(x , z0) γ2 µ(t + 1)2 + 8p γ µ + ϵ f(y0) γ2 µ(t + 1)2 + 8p γ µ + ϵ f(y0) 17LDh(x ,z0) γ2 µqf(y0) m , we have E[f(yt) f(x )] q + 8p γ µ + ϵ f(y0), where q + 8p/γ µ + ϵ < 1 with the choice of q and p. Next we conclude the convergence rate of global stages. Suppose ys is the output of stage s. If t lq 17LDh(x ,ys) γ2 µqf(ys) m at each stage, we have E[f(ys) f(x )] q + 8p γ µ + ϵ s+1 f(y0) Accelerated Stochastic Optimization Methods under Quasar-convexity Case 2: µ > 0 (Option I) Using the notation dk+1 = Ak+1(f(yk+1) f(x )) Ak(f(yk) f(x )), we obtain Ek+1 Ek = bk Dh(x , zk+1) Bk h(zk+1) h(zk), x zk+1 Bk Dh(zk+1, zk) + dk+1 bk Dh(x , zk+1) + αk Bk βk h(zk+1) h(xk+1), x zk+1 + Bk βk k, x zk+1 2 zk+1 zk 2 + dk+1 Dh(x , zk+1) + αk Bk βk (Dh(x , xk+1) Dh(zk+1, xk+1)) + Bk βk k, x xk+1 βk k, yk zk + Bk βk k, zk zk+1 µBk 2 zk+1 zk 2 + dk+1 βk Dh(x , xk+1) µαk Bk 2βk zk+1 xk+1 2 + Bk βk k, x xk+1 + Bk βk k, zk zk+1 2 zk+1 zk 2 + Bk c(f Ik(yk) f Ik(xk+1)) + b xk+1 zk 2 + dk+1 + ξk βk Dh(x , xk+1) + Bk βk k, x xk+1 + Bk µβ2 k k 2 + c Bk βk (f Ik(yk) f Ik(xk+1)) + dk+1 + ξk Note that ξk = Bk βk f(y0) f Ik(y0), xk+1 zk , and E [ξk] = 0. The first inequality follows from the µ-strong convexity of function h; the second inequality follows from (15) (Lemma A.3), and the last inequality follows from the triangle inequality with the relation αk/βk 1/2. Then we take the expectation with respect to the history of random variable Iij with its size equal to bi, where i = 0, 1, ..., t 1 and j = 1, 2, ..., n bi . E [Ek+1 Ek] = αk Bk βk Dh(x , xk+1) + Bk βk f(xk+1), x xk+1 + Bk µβ2 k EIk k 2 + c Bk βk (f(yk) f(xk+1)) + ak(f(xk+1) f(x )) + Ak(f(xk+1) f(yk)) + Ak+1E [f(yk+1) f(xk+1)] βk Dh(x , xk+1) + γBk βk (f(x ) f(xk+1)) + c Bk βk (f(yk) f(xk+1)) µβ2 k EIk k 2 + ak(f(xk+1) f(x )) + Ak(f(xk+1) f(yk)) + Ak+1E [f(yk+1) f(xk+1)] (f(x ) f(xk+1)) + Bk µβ2 k EIk k f(xk+1) 2 + Bk µβ2 k f(xk+1) 2 2L EIk k f(xk+1) 2 Ak+1 2L f(xk+1) 2 (f(x ) f(xk+1)) + Ak+1 L EIk k f(xk+1) 2 ak(f(x ) f(xk+1)) + 4Ak+1δk(f(xk+1) f(x )) + 4Ak+1δk(f(y0) f(x )) The first inequality follows from the uniform quasar-convexity of function f; the second inequality follows from (21); the third inequality follows from the relation Bk/ µβ2 k Ak+1/2L, and the last inequality follows from Proposition 3.1 (19). We testify the correctness of relations above. With the choice of Ak, Bk, αk, βk and Observation A.6, we have αk βk = bk Bk = γ Bk µβ2 k = 4(Ak+1 Ak)2 γ2 µµAk = 4Ak(γ/ Next we prove 4Ak+1δk ak/2. It suffices to prove δk ak/8Ak+1. δk = n bk bk(n 1) γ 8( 8κ + γ) = ak 8Ak+1 Accelerated Stochastic Optimization Methods under Quasar-convexity Thus we obtain the final bound of E[Ek+1 Ek]. E[Ek+1 Ek] ak(f(x ) f(xk+1)) + 4Ak+1δk(f(xk+1) f(x )) + 4Ak+1δk(f(y0) f(x )) 2 (f(y0) f(x )) Summing both sides of the inequality above, we obtain E[f(yt) f(x )] 1 + γ t (f(y0) f(x ) + µDh(x , z0)) + 1 2(f(y0) f(x )) γ (f(y0) f(x )) + 1 2(f(y0) f(x )) When t log1+γ/ 8κ(2/γq), we have E[f(yt) f(x )] q + 1 (f(y0) f(x )), where q + 1/2 < 1 with the choice of q. Next we conclude the convergence rate of global stages. Suppose ys is the output of stage s. 
If t log1+γ/ 8κ(2/γq) at each stage, we have E[f(ys) f(x )] q + 1 s+1 (f(y0) f(x )) = q + 1 For µ > 0 (Option II), our analysis is identical to case 1. D. Proofs of Complexity In this section, we analyze the overall complexity of QASGD and QASVRG. As the complexity of QAGD has been analyzed in Hinder et al. (2020), we will not provide the related proof. Analogously, we apply the same metric as Hinder et al. (2020) propose, which is the total number of function and gradient evaluations. Roughly speaking, the overall complexity is the multiplication of the number of iterations and the number of function and gradient evaluations per iteration. We consider the worst case of Bisearch where 0, 1 and guess do not meet our conditions, and we need to do line search at each iteration. In this situation, we access the gradient (estimate) in Bisearch that can be directly used in the subsequent updates at each iteration, i.e., no additional access to k is required in the mirror descent step and the gradient descent step. The number of fi involved in Bisearch affects the complexity. While QAGD needs all fi in Bisearch, QASGD and QASVRG only need a single fi and a mini-batch of fi respectively. Next we present the analysis of the complexity of Bisearch per iteration for QASGD and QASVRG in the Euclidean setting where Dh(x, y) = 1 2 x y 2, and present the proof of Theorem 3.5 and Theorem 3.6. Lemma A.5 implies that Bisearch needs O log+ (1 + c) min n L yk zk 2 b3 o function and gradient evaluations per iteration for a single function. As ϵ and b can not be simultaneously non-zero, we have two different situations corresponding to different complexity. O log+ (1 + c) min L yk zk 2 O log+ (1 + c) L3 O log+ (1 + c) L yk zk 2 When µ = 0, the key to our proof is to bound yk zk . D.1. Proof of Corollary 3.3 Case 1: µ = 0 Accelerated Stochastic Optimization Methods under Quasar-convexity By the proof of 3.2, we have E[Ek+1 Ek] σ2 2β2 k + akϵ 2 . Assuming x z0 R, we have the following relation. 1 2E x zk 2 A0(f(x0) f(x )) + 1 2β2 j + Akϵ 2γ2 (2j + 3)2 + ηϵ 5R2 + (k + 1)2 The third inequality follows from η = min 1 L, q σ2 γ (t+1)3/2 . Combining the analysis above, we have E x zk 2 10R2 + (k+1)2 L ϵ and E x zk p 10R2 + (k+1)2 L ϵ by Jensen s inequality. Thus we obtain E zk zk 1 = E 1 βk 1 k 1 E x zk + E x zk 1 2 10R2 + (k + 1)2 As the stepsize of gradient descent is 0, yk is identical to xk, and we solely need to bound xk zk . By the definition of zk and xk, we have E xk zk = E (1 τk 1)zk 1 + τk 1xk 1 zk 1 + 1 βk 1 k 1 = E τk 1(xk 1 zk 1) + 1 βk 1 k 1 τk 1E xk 1 zk 1 + E 1 βk 1 k 1 E xk 1 zk 1 + 2 10R2 + (k + 1)2 E xk 1 zk 1 + 2 10R + 2(k + 1) r ϵ The last inequality follows from the relation a2 + b2 a + b for a, b 0. By the proof of Theorem 3.2, we have E[f(xt) f(x )] LR2 (t + 1)2 + 2σ 2(t + 1) LR2 + Suppose k kmax = , and we obtain 10Rk + 4 r ϵ ϵ2 + 64 LR2 + γ2ϵ2 + 64 LR2 + Accelerated Stochastic Optimization Methods under Quasar-convexity The first inequality follows from the relation k + 3 4k for k 1. By Markov s inequality, Pr( xk zk kmax E xk zk ) 1 kmax , which implies xk zk kmax E xk zk O L11/2R12 γ6ϵ11/2 with probability at least 1 1 kmax . Thus we have L xk zk 2 γ13ϵ12 . Besides, we have c + 1 = γ(k+1)2 2k+3 + 1 γk + 2 4(LR2+ γϵ2 + 2 = O L2R4 Then we obtain the following upper bound of the term inside log+(): (1 + c) min L xk zk 2 = (1 + c)L xk zk 2 Thus we have O log+ (1 + c) min n LE yk zk 2 b3 o O(log+(LR2γ 1ϵ 1)). 
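The $\log_+$ factors above count the evaluations made inside the binary line search. The following generic sketch is not the exact Bisearch of Hinder et al. (2020); the predicate `accept` and the tolerance `tol` are placeholders for its actual acceptance conditions. It only illustrates why the per-iteration cost scales logarithmically in the ratio of the initial interval to the tolerance.

```python
def bisection_search(accept, lo=0.0, hi=1.0, tol=1e-6):
    """Bisection over a scalar parameter (e.g., the momentum interpolation weight).

    accept(theta) performs one function/gradient evaluation at the point
    parameterized by theta and returns True if the left half-interval should
    be kept.  The loop makes O(log((hi - lo) / tol)) oracle calls.
    """
    evals = 0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        evals += 1
        if accept(mid):
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi), evals
```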
In the case of µ = 0, QASGD iterations to generate an ϵ-approximate solution under expectation. Therefore, the overall complexity of QASGD is upper bounded by (µ = 0) is O q γϵ with high probability. Case 2: µ > 0 As ϵ = 0 and b > 0, we have O log+ (1 + c) min n L yk zk 2 b3 o = O log+ (1 + c) L3 b3 . In Step 1, b = γµ 4 and c = 8 γ . Then we have γ3µ3 = 576κ3 Thus we have O log+ (1 + c) min n LE yk zk 2 b3 o = O log+ κ3/4 γ . By the proof of Theo- rem 3.2, QASGD needs O 1 γ2 log f(x0) f(x )) γϵ iterations in Step 1, and the complexity of Step 1 is O 1 γ2 log f(x0) f(x )) γϵ log+ κ3/4 γ . In step 2, b = γµ 4 and c = γ(k+48/γ2)2 2(2k+96/γ2+1) γ(k+48/γ2) 2γ . We can upper bound the convergence rate of Step 2: E[f(xt) f(x )] 9σ2 γ2µ(t + 48/γ2)2 + 36σ2 γ2µ(t + 48/γ2) 45σ2 γ2µ(t + 48/γ2). Suppose k j 90σ2 γ2µϵ k . We have c + 1 C3 γ3ϵ, where C3 = 45σ2 γ3 = 64C3κ3 Thus O log+ (1 + c) min n L yk zk 2 b3 o = O log+ κ2/3 γϵ1/6 . QASGD needs O σ2 γ2ϵ iterations in Step 2, and the complexity of Step 2 is O σ2 γ2ϵ log+ κ2/3 γϵ1/6 . In summary, the overall complexity of QASGD (µ > 0) is upper bounded by O 1 γ2 log f(x0) f(x )) γϵ log+ κ3/4 γ2ϵ log+ κ2/3 D.2. Proof of Corollary 3.6 Case 1: µ = 0 By the proof of Theorem 3.3, we have E[Ek+1 Ek] ak 1 2 + ϵ f(y0). Assuming x z0 R, we have the following relation. 1 2E x zk 2 A0(f(y0) f(x )) + 1 2 x z0 2 + Ak 2 + ϵ f(y0) z0 x 2 + 3γ2 32L(k + 1)2f(y0) + γ2 16L(k + 1)2f(y0)ϵ 32L(k + 1)2f(y0) + γ2 16L(k + 1)2f(y0)ϵ Accelerated Stochastic Optimization Methods under Quasar-convexity Thus we have E x zk 2 2R2 + 3γ2 16L(k + 1)2f(y0) + γ2 8L(k + 1)2f(y0)ϵ and by Jensen s inequality E k 1 = βk 1E zk zk 1 βk 1E( x zk + x zk 1 ) 16L γ(2k + 1) 16L(k + 1)2f(y0) + γ2 8L(k + 1)2f(y0)ϵ. By the definition of yk and zk, we have E yk zk = E xk 1 L k 1 zk 1 + 1 βk 1 k 1 = E (1 τk 1)zk 1 + τk 1yk 1 1 L k 1 zk 1 + 1 βk 1 k 1 τk 1E yk 1 zk 1 + 1 βk 1 1 E yk 1 zk 1 + 1 βk 1 + 1 E yk 1 zk 1 + 2k + 9 E yk 1 zk 1 + 8 16L(k + 1)2f(y0) + γ2 8L(k + 1)2f(y0)ϵ E yk 1 zk 1 + 8 Suppose f(y0) ϵ and k jq 17LR2 2γ2qf(y0) k kmax = jq 2γ2qϵ k , and we obtain L f(y0) + 34 L f(y0) + 34 γ2ϵ1/2q + 34 γ2ϵ1/2q = O L1/2R2 By Markov s inequality, Pr( yk zk kmax E yk zk ) 1 kmax , which implies yk zk kmax E xk zk q3/2γ3ϵ with probability at least 1 1 kmax . Thus we have L yk zk 2 ϵ O L3R6 q3γ7ϵ4 . Besides, we have c + 1 = 2(2k+3) + 1 γk 2 = O L1/2R q1/2ϵ1/2 . Then we obtain the following upper bound of the term inside (1 + c) min L yk zk 2 = (1 + c)L yk zk 2 Thus we have O log+ (1 + c) min n L yk zk 2 b3 o O log+ L1/2R q1/2γϵ9/14 with high probability. As we need to access the full gradient and function value evaluated at y0 per stage and the gradient and function value of mini-batch to Accelerated Stochastic Optimization Methods under Quasar-convexity calculate SVRG and ϵ, the overall complexity of QASVRG (µ = 0) to generate an ϵ-approximate solution is k=0 bk log+ L1/2R q1/2γϵ9/14 γn(2k + 3) 2(n 1)p + γ(2k + 3) log+ L1/2R q1/2γϵ9/14 O 2n + γnt(2t + 1) 2(n 1)p + γ(2t + 1) log+ L1/2R q1/2γϵ9/14 O n + n LR2 ϵLR2 log+ L1/2R q1/2γϵ9/14 where t = lq 2γ2qf(y0) m q 17LR2 2γ2qf(y0) is the maximum number of iterations per stage, and O(log(ϵ 1)) is the number of stages. Note that Dh(x , ys) is uniformly bounded by R2 under Assumption 2.5, which is in the bound above. Case 2: µ > 0 For Option II, we have t = lq 2γ2qf(y0) m q 34κ γ3q using the last relation in Assumption 3.1. 
Thus the overall complexity of QASVRG (Option II) to generate an ϵ-approximate solution is k=0 bk log+ L1/2R q1/2γϵ9/14 γn(2k + 3) 2(n 1)p + γ(2k + 3) log+ L1/2R q1/2γϵ9/14 O 2n + γnt(2t + 1) 2(n 1)p + γ(2t + 1) log+ L1/2R q1/2γϵ9/14 O n + nκ γ2n + γ3/2 κ log+ L1/2R q1/2γϵ9/14 where O(log(ϵ 1)) is the number of stages. For Option I, ϵ = 0, b = γµ 2κ. Then we obtain the following relation: (1 + c) min L yk zk 2 = (1 + c)L3 γ3µ3 = (1 + Thus we have O log+ (1 + c) min n L yk zk 2 b3 o = O(log+(κ7/6γ 1)). At each stage, QASVRG (Option II) is run until (1 + γ/ 2 . Let (1 + γ/ 2 , and we have t γ log 2 γq , and t = l log1+ γ γ log 2 γq 5 κ γ log 2 γq . Thus the overall complexity of QASVRG (Option I) to generate an ϵ-approximate solution is k=0 bk log+ κ7/6 8κ + γ) γ(n 1) + 8( 8κ + γ) log+ κ7/6 = O n + 40nt κ γ(n 1) + 8( 8κ + γ) log+ κ7/6 O n + nκ γ2n + γ κ log 2 where O(log(ϵ 1)) is the number of stages. Accelerated Stochastic Optimization Methods under Quasar-convexity E. Theoretical Extension We analyze QASGD under a restrictive condition: strong growth condition (SGC), which is formally formulated in the following. This condition has been proposed in Schmidt & Roux (2013), Vaswani et al. (2019), and Gower et al. (2021). Schmidt & Roux (2013) derive optimal convergence rates for SGD under SGC for convex and strongly convex functions. Assumption E.1 (SGC). Suppose i is sampled i.i.d from [n]. For some constant ρ and x X , we have Ei fi(x) 2 ρ f(x) 2. If f(x) = 0, then fi(x) = 0 under SGC, which implies the interpolation assumption. We derive better convergence rates for QASGD under SGC for f Qµγ. Theorem E.2 (QASGD under SGC). Suppose Assumption 3.1 and Assumption E.1 hold, Dh(x , z0) R2, f FL, and choose any y0 Rd. Then Algorithm 1 with the choices of k = fi(xk+1) and Ak, Bk, θk specified in Table 5 satisfies 0, µ > 0. (22) Summing both sides of (22), we conclude the convergence rate as follows: E[f(yt) f(x )] t E0, µ > 0. (23) QASGD under SGC γ-quasar-convex (µ = 0) 4ρL(k + 1)2, Bk = 1 (αk, βk, ρk, f, b, c, ϵ) 0, γ ak , 1 ρL, f, 0, γAk µ-strongly γ-quasar-convex (µ > 0) Ak = (1 + γ/2ρ κ)k, Bk = µAk (αk, βk, ρk, f, b, c, ϵ) γµ, γµBk bk , 1 ρL, f, γ µµ Table 5. 
Parameter choices for QASGD under SGC Ek+1 Ek (12) = h(zk+1) h(zk), x zk+1 Dh(zk+1, zk) + Ak+1(f(yk+1) f(x )) Ak(f(yk) f(x )) βk k, x zk+1 Dh(zk+1, zk) + Ak+1(f(yk+1) f(x )) Ak(f(yk) f(x )) βk k, x zk + 1 βk k, zk zk+1 µ 2 zk+1 zk 2 + Ak+1(f(yk+1) f(x )) Ak(f(yk) f(x )) (13) 1 βk k, x zk + 1 2 µβ2 k k 2 + Ak+1(f(yk+1) f(x )) Ak(f(yk) f(x )) βk k, x xk+1 + τk βk k, yk zk + 1 2 µβ2 k k 2 + Ak+1(f(yk+1) f(xk+1)) + (Ak+1 Ak)(f(xk+1) f(x )) + Ak(f(xk+1) f(yk)) Accelerated Stochastic Optimization Methods under Quasar-convexity E[Ek+1 Ek] γ βk (f(x ) f(xk+1)) + 1 βk (c(f(yk) f(xk+1)) + ϵ) + 1 2 µβ2 k E k 2 + Ak+1E[f(yk+1) f(xk+1)] + (Ak+1 Ak)(f(xk+1) f(x )) + Ak(f(xk+1) f(yk)) ρ 2 µβ2 k f(xk+1) 2 + Ak+1E[f(yk+1) f(xk+1)] + (Ak+1 Ak) (16) ρ 2 µβ2 k Ak+1 f(xk+1) 2 + (Ak+1 Ak) 2 ϵ Ak+1 Ak Case 2: µ > 0 Ek+1 Ek (12) = bk Dh(x , zk+1) Bk h(zk+1) h(zk), x zk+1 Bk Dh(zk+1, zk) + Ak+1(f(yk+1) f(x )) Ak(f(yk) f(x )) = bk Dh(x , zk+1) + αk Bk βk h(zk+1) h(xk+1), x zk+1 + Bk βk k, x zk+1 Bk Dh(zk+1, zk) + Ak+1(f(yk+1) f(x )) Ak(f(yk) f(x )) (12) = bk αk Bk Dh(x , zk+1) + αk Bk βk (Dh(x , xk+1) Dh(zk+1, xk+1)) + Bk βk k, zk zk+1 Bk Dh(zk+1, zk) + Ak+1(f(yk+1) f(x )) Ak(f(yk) f(x )) (8) bk αk Bk Dh(x , zk+1) + αk Bk βk Dh(x , xk+1) µαk Bk 2βk zk+1 xk+1 2 + Bk βk k, zk zk+1 µBk 2 zk+1 zk 2 + Ak+1(f(yk+1) f(x )) Ak(f(yk) f(x )) Dh(x , zk+1) + αk Bk βk Dh(x , xk+1) µαk Bk 2βk zk+1 xk+1 2 + Bk βk k, x xk+1 c(fi(yk) fi(xk+1)) + b xk+1 zk 2 + Bk βk k, zk zk+1 µBk 2 zk+1 zk 2 + Ak+1(f(yk+1) f(x )) Ak(f(yk) f(x )) βk Dh(x , xk+1) + Bk βk k, zk zk+1 + µαk Bk zk+1 zk 2 + Bk βk k, x xk+1 + Ak(f(yk) f(xk+1)) + Ak+1(f(yk+1) f(x )) Ak(f(yk) f(x )) The first equality and the third equality follows from Lemma A.1; the second equality follows from mirror descent. The first inequality follows from the strong convexity of h, and the second inequality follows from (15) (Lemma A.3). With the choice of αk and βk, we have αk Bk/βk = bk, which explains the last inequality. Moreover, with the choice of Bk and Observation A.6, we have αk βk = bk Bk = γ 2 κ γ Combined with the initial bound and the relation above, we obtain the following bound: Ek+1 Ek αk Bk βk Dh(x , xk+1) + Bk βk k, zk zk+1 + µαk Bk zk+1 zk 2 + Bk βk k, x xk+1 + Ak(f(yk) f(xk+1)) + Ak+1(f(yk+1) f(x )) Ak(f(yk) f(x )) βk Dh(x , xk+1) + Bk βk k, zk zk+1 µBk 4 zk+1 zk 2 + Bk βk k, x xk+1 + Ak+1(f(yk+1) f(xk+1)) + ak(f(xk+1) f(x )) (13)(9) αk Bk βk Dh(x , xk+1) + Bk µβ2 k k 2 + γBk βk (fi(x ) fi(xk+1) µDh(x , xk+1)) + Ak+1(f(yk+1) f(xk+1)) + ak(f(xk+1) f(x )) Accelerated Stochastic Optimization Methods under Quasar-convexity Taking the expectation, we obtain E[Ek+1 Ek] αk Bk βk Dh(x , xk+1) + Bk µβ2 k E k 2 + γBk βk (f(x ) f(xk+1) µDh(x , xk+1))) + Ak+1E[f(yk+1) f(xk+1)] + ak(f(xk+1) f(x )) µβ2 k E k 2 + Ak+1E[f(yk+1) f(xk+1)] F. Additional Simulation Results Figure 2. Evaluation on two different LDS instances with random seed in {12, 36}. We choose ϵ = 10 2, the stepsize to be 1 10 6, 1 10 5 for SGD, L = 1 107, 5 106 for QASGD and L = 3 106, 1 105 for QASVRG in LDS4 and LDS5. The flat line in the third column means the loss blows up to infinity with this choice of stepsize. We provide a contrived experiment by constructing an objective satisfying all the assumptions required. Consider the following optimization problem i=1 gγ(bia T i x) + µ 2 , 0 x 1, 0, x 0, where (ai, bi)i=1,...,n is training data with ai Rd and bi {+1, 1}; µ 0, and f(x) satisfies Assumption 2.4. 
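As a concrete illustration of how this finite-sum objective is assembled, the sketch below treats the scalar link $g_\gamma$ and its derivative as user-supplied placeholders (their exact piecewise form from Hinder et al. (2020) is not reproduced here) and assumes the $\mu/2$ term is the quadratic regularizer $\tfrac{\mu}{2}\|x\|^2$.

```python
import numpy as np

def make_quasar_objective(A, b, g_gamma, g_gamma_prime, mu):
    """f(x) = (1/n) * sum_i g_gamma(b_i * <a_i, x>) + (mu/2) * ||x||^2  (regularizer assumed).

    A: (n, d) array of features a_i; b: (n,) labels in {+1, -1};
    g_gamma, g_gamma_prime: scalar quasar-convex link and its derivative (placeholders);
    mu >= 0.
    """
    def f(x):
        z = b * (A @ x)
        return np.mean([g_gamma(zi) for zi in z]) + 0.5 * mu * np.dot(x, x)

    def grad_fi(i, x):
        # Gradient of the i-th component f_i(x) = g_gamma(b_i * <a_i, x>) + (mu/2)||x||^2.
        zi = b[i] * (A[i] @ x)
        return g_gamma_prime(zi) * b[i] * A[i] + mu * x

    return f, grad_fi
```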
The function $f(x)$ is $\mu$-strongly $\gamma$-quasar-convex and $L$-smooth by the properties of quasar-convex functions introduced in Hinder et al. (2020, D.3), where $L = \frac{1}{n}\sum_{i=1}^{n}\|a_i\| + 0.5\mu$. We choose $\gamma \in \{0.5, 0.8\}$ and normalize each $a_i$ for simplicity, so that $L = 1 + 0.5\mu$. Note that the component functions $g_\gamma(b_i a_i^\top x)$ share at least one common minimizer, so Assumption 2.7 is also satisfied by $f$. We use a multi-class classification dataset from Dua & Graff (2017), which we treat as a binary classification dataset; it has $n = 1372$ samples and $d = 4$ features. We set $\epsilon = 10^{-2}$ when $\mu = 0$ and $\epsilon = 10^{-3}$ when $\mu > 0$, and we generate error bars in the same way as the simulations in Section 4. Figure 4 shows that QASGD converges faster than SGD, while QASVRG converges quickly and at lower complexity than QAGD and GD. When $\mu = 0.002$, Figure 4 also shows the superiority of QASVRG (Option I) in terms of both convergence speed and complexity, given $\kappa\epsilon^{0.5} < 1$.

Figure 3. Evaluation of QASGD with different $\sigma$ on three LDS instances.
Figure 4. Evaluation of each algorithm on problem (24).

We also compare our methods with GD, QAGD and SGD on minimizing the empirical risk of a GLM with logistic link function $\sigma(z) = (1 + \exp(-z))^{-1}$. Consider the following optimization problem:
$$\min_{w \in \mathbb{R}^d} \; \frac{1}{n}\sum_{i=1}^{n}\left[\left(\sigma(w^\top x_i) - y_i\right)^2\right], \qquad (25)$$
where $x_i \sim N(0, I)$, $w^* \sim N(0, I)$ and $y_i = \sigma(w^{*\top} x_i)$ for each $i \in [n]$. In our experiment, we choose $n = 5000$, $d = 50$ and draw the initial iterate $w_0 \sim N(0, 100I)$. Since the quasar-convexity parameter $\gamma$ and the smoothness constant $L$ are intractable to compute for this problem, we run our methods with $\gamma = 0.5$ and $L = 10^5$, chosen by extensive grid search.

Figure 5. Evaluation of each algorithm on problem (25).
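A minimal sketch of the synthetic setup for problem (25), assuming the empirical risk averages the squared residuals over the $n$ samples:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5000, 50

X = rng.standard_normal((n, d))            # x_i ~ N(0, I)
w_star = rng.standard_normal(d)            # w* ~ N(0, I)

def sigma(z):
    """Logistic link sigma(z) = (1 + exp(-z))^{-1}."""
    return 1.0 / (1.0 + np.exp(-z))

y = sigma(X @ w_star)                      # noiseless targets y_i = sigma(w*^T x_i)
w0 = 10.0 * rng.standard_normal(d)         # initial iterate w_0 ~ N(0, 100 I)

def risk(w):
    """Empirical risk (assumed): (1/n) * sum_i (sigma(w^T x_i) - y_i)^2."""
    return np.mean((sigma(X @ w) - y) ** 2)

def grad_fi(i, w):
    """Gradient of the i-th component (sigma(w^T x_i) - y_i)^2."""
    s = sigma(X[i] @ w)
    return 2.0 * (s - y[i]) * s * (1.0 - s) * X[i]
```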