Escaping Saddle Points in Constrained Optimization

Aryan Mokhtari, MIT, Cambridge, MA 02139, aryanm@mit.edu
Asuman Ozdaglar, MIT, Cambridge, MA 02139, asuman@mit.edu
Ali Jadbabaie, MIT, Cambridge, MA 02139, jadbabai@mit.edu

Abstract

In this paper, we study the problem of escaping from saddle points in smooth nonconvex optimization problems subject to a convex set $C$. We propose a generic framework that yields convergence to a second-order stationary point of the problem, if the convex set $C$ is simple for a quadratic objective function. Specifically, our results hold if one can find a $\rho$-approximate solution of a quadratic program subject to $C$ in polynomial time, where $\rho < 1$ is a positive constant that depends on the structure of the set $C$. Under this condition, we show that the sequence of iterates generated by the proposed framework reaches an $(\epsilon, \gamma)$-second-order stationary point (SOSP) in at most $\mathcal{O}(\max\{\epsilon^{-2}, \rho^{-3}\gamma^{-3}\})$ iterations. We further characterize the overall complexity of reaching an SOSP when the convex set $C$ can be written as a set of quadratic constraints and the objective function Hessian has a specific structure over the convex set $C$. Finally, we extend our results to the stochastic setting and characterize the number of stochastic gradient and Hessian evaluations to reach an $(\epsilon, \gamma)$-SOSP.

1 Introduction

There has been a recent revival of interest in nonconvex optimization, due to obvious applications in machine learning. While the modern history of the subject goes back six or seven decades, the recent attention to the topic stems from new applications as well as the availability of modern analytical and computational tools, providing a new perspective on classical problems. Following this trend, in this paper we focus on the problem of minimizing a smooth nonconvex function over a convex set:

$\text{minimize } f(x), \quad \text{subject to } x \in C$,   (1)

where $x \in \mathbb{R}^d$ is the decision variable, $C \subseteq \mathbb{R}^d$ is a closed convex set, and $f : \mathbb{R}^d \to \mathbb{R}$ is a twice continuously differentiable function over $C$. It is well known that finding a global minimum of Problem (1) is hard. Equally well known is the fact that for certain nonconvex problems, all local minimizers are global. These include, for example, matrix completion [24], phase retrieval [42], and dictionary learning [43]. For such problems, finding a global minimum of Problem (1) reduces to the problem of finding one of its local minima.

Given the well-known hardness results for finding stationary points, recent focus has shifted to characterizing approximate stationary points. When the objective function $f$ is convex, finding an $\epsilon$-first-order stationary point is often sufficient, since it leads to finding an approximate local (and hence global) minimum. However, in the nonconvex setting, even when the problem is unconstrained, i.e., $C = \mathbb{R}^d$, convergence to a first-order stationary point (FOSP) is not enough, as the critical point to which convergence is established might be a saddle point. It is therefore natural to look at higher-order derivatives and search for a second-order stationary point. Indeed, under the assumption that all saddle points are strict (formally defined later), in both the unconstrained and constrained settings, convergence to a second-order stationary point (SOSP) implies convergence to a local minimum.
While convergence to an SOSP has been thoroughly investigated in the recent literature for the unconstrained setting, the overall complexity for the constrained setting has not been studied yet.

Contributions. Our main contribution is to propose a generic framework that generates a sequence of iterates converging to an approximate second-order stationary point of the constrained nonconvex problem in (1), when the convex set $C$ has a specific structure that allows for approximate minimization of a quadratic loss over the feasible set. The proposed framework consists of two main stages: first, it utilizes first-order information to reach a first-order stationary point; next, it incorporates second-order information to escape from a stationary point if it is a local maximizer or a strict saddle point. We show that the proposed approach leads to an $(\epsilon, \gamma)$-second-order stationary point (SOSP) of Problem (1) (see Definition 1). The proposed approach utilizes advances in constant-factor optimization of nonconvex quadratic programs [46, 22, 44] that find a $\rho$-approximate solution over $C$ in polynomial time, where $\rho$ is a positive constant smaller than 1 that depends on the structure of $C$. When such an approximate solution exists, the sequence of iterates generated by the proposed framework reaches an $(\epsilon, \gamma)$-SOSP of Problem (1) in at most $\mathcal{O}(\max\{\epsilon^{-2}, \rho^{-3}\gamma^{-3}\})$ iterations. We show that quadratic constraints satisfy the required condition on the convex set $C$ if the objective function Hessian $\nabla^2 f$ has a specific structure over $C$ (formally described later). For this case, we show that it is possible to achieve an $(\epsilon, \gamma)$-SOSP after at most $\mathcal{O}(\max\{\epsilon^{-2}\varrho, d^3 m^7 \gamma^{-3}\})$ arithmetic operations, where $d$ is the dimension of the problem, $m$ is the number of quadratic constraints, and $\varrho$ is the number of arithmetic operations required to solve a linear program over $C$ or to project a point onto $C$. We further extend our results to the stochastic setting and show that we can reach an $(\epsilon, \gamma)$-SOSP after computing at most $\mathcal{O}(\max\{\epsilon^{-4}, \epsilon^{-2}\rho^{-4}\gamma^{-4}, \rho^{-7}\gamma^{-7}\})$ stochastic gradients and $\mathcal{O}(\max\{\epsilon^{-2}\rho^{-3}\gamma^{-3}, \rho^{-5}\gamma^{-5}\})$ stochastic Hessians.

1.1 Related work

Unconstrained case. The rich literature on nonconvex optimization provides a plethora of algorithms for reaching stationary points of a smooth unconstrained minimization problem. Convergence to first-order stationary points (FOSP) has been widely studied for both deterministic [35, 1, 7-10] and stochastic settings [39, 38, 3, 32]. Stronger results, which establish convergence to an SOSP, are also available. Numerical optimization methods such as trust-region methods [13, 19, 33] and cubic regularization algorithms [36, 11, 12] can reach an approximate second-order stationary point in a finite number of iterations; however, the computational complexity of each iteration is typically large due to the cost of solving trust-region or regularized cubic subproblems. Recently, a new line of research has emerged that focuses on the overall computational cost of achieving an SOSP. These results build on the idea of escaping from strict saddle points either by perturbing the iterates with properly chosen injected noise [23, 29, 30], or by updating the iterates using the eigenvector corresponding to the smallest eigenvalue of the Hessian [7, 2, 45, 41, 1, 40, 37].

Constrained case. Asymptotic convergence to first-order and second-order stationary points of the constrained optimization problem in (1) has been studied in the numerical optimization community [6, 18, 21, 20].
Recently, finite-time analysis of convergence to an FOSP of the generic smooth constrained problem in (1) has received a lot of attention. In particular, [31] shows that the sequence of iterates generated by the Frank-Wolfe update converges to an $\epsilon$-FOSP after $\mathcal{O}(\epsilon^{-2})$ iterations. The authors of [26] consider the norm of the gradient mapping as a measure of non-stationarity and show that the projected gradient method has the same complexity of $\mathcal{O}(\epsilon^{-2})$. A similar result for the accelerated projected gradient method is shown in [25]. Adaptive cubic regularization methods [14-16] improve these results using second-order information and obtain an $\epsilon$-FOSP of Problem (1) after at most $\mathcal{O}(\epsilon^{-3/2})$ iterations.

Finite-time analysis of convergence to an SOSP has also been studied for linear constraints. In particular, [5] studies convergence to an SOSP of (1) when the set $C$ is given by the linear constraint $x \ge 0$ and proposes a trust-region interior-point method that obtains an $(\epsilon, \sqrt{\epsilon})$-SOSP in $\mathcal{O}(\epsilon^{-3/2})$ iterations. The work in [27] extends these results to the case where the objective function is potentially not differentiable or not twice differentiable on the boundary of the feasible region. The authors of [17] focus on the general convex-constraint case and introduce a trust-region algorithm that requires $\mathcal{O}(\epsilon^{-3})$ iterations to obtain an SOSP; however, each iteration of their proposed method requires access to the exact solution of a nonconvex quadratic program (finding its global minimum), which, in general, could be computationally prohibitive. To the best of our knowledge, our paper provides the first finite-time overall computational complexity analysis for reaching an SOSP of Problem (1).

2 Preliminaries and Definitions

In the case of unconstrained minimization of the objective function $f$, the first-order and second-order necessary conditions for a point $x^*$ to be a local minimum are $\nabla f(x^*) = 0_d$ and $\nabla^2 f(x^*) \succeq 0_{d \times d}$, respectively. A point that satisfies these conditions is called a second-order stationary point (SOSP). If the second condition becomes strict, i.e., $\nabla^2 f(x^*) \succ 0$, then we recover the sufficient conditions for a local minimum. However, to derive finite-time convergence bounds for achieving an SOSP, these conditions should be relaxed; in other words, the goal should be to find an approximate SOSP whose approximation error can be made arbitrarily small. For the case of unconstrained minimization, a point $x^*$ is called an $(\epsilon, \gamma)$-second-order stationary point if it satisfies $\|\nabla f(x^*)\| \le \epsilon$ and $\nabla^2 f(x^*) \succeq -\gamma I_d$, where $\epsilon$ and $\gamma$ are arbitrary positive constants. To study the constrained setting, we first state the necessary conditions for a local minimum of Problem (1).

Proposition 1 ([4]). If $x^* \in C$ is a local minimum of the function $f$ over the convex set $C$, then

$\nabla f(x^*)^\top (x - x^*) \ge 0$, for all $x \in C$,   (2)
$(x - x^*)^\top \nabla^2 f(x^*)(x - x^*) \ge 0$, for all $x \in C$ s.t. $\nabla f(x^*)^\top (x - x^*) = 0$.   (3)

The conditions in (2) and (3) are the first-order and second-order necessary optimality conditions, respectively. By making the inequality in (3) strict, i.e., $(x - x^*)^\top \nabla^2 f(x^*)(x - x^*) > 0$, we recover the sufficient conditions for a local minimum when $C$ is polyhedral [4]. Further, if the inequality in (3) is replaced by $(x - x^*)^\top \nabla^2 f(x^*)(x - x^*) \ge \delta \|x - x^*\|^2$ for some $\delta > 0$, we obtain the sufficient conditions for a local minimum of Problem (1) for any convex constraint set $C$; see [4]. If a point $x^*$ satisfies the conditions in (2) and (3), it is an SOSP of Problem (1).
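To make the conditions of Proposition 1 concrete, the following sketch is illustrative only: the toy objective $f(x) = -\|x\|^2$, the unit-ball constraint set, and all helper names are ours, not the paper's. It checks (2) and (3) at the candidate $x^* = 0$ by sampling feasible points, and finds an FOSP that fails the second-order condition (a strict saddle, here a local maximum).

```python
# Illustrative sampled check of conditions (2)-(3) for f(x) = -||x||^2 over the
# unit ball; grad_f, hess_f, and sample_C are ad hoc names, not from the paper.
import numpy as np

rng = np.random.default_rng(0)
d = 3
grad_f = lambda x: -2 * x                 # gradient of f(x) = -||x||^2
hess_f = lambda x: -2 * np.eye(d)         # Hessian of f

def sample_C(n):
    """Draw n points uniformly from the unit ball (the set C in this toy example)."""
    u = rng.normal(size=(n, d))
    u /= np.linalg.norm(u, axis=1, keepdims=True)
    return u * rng.uniform(0, 1, size=(n, 1)) ** (1.0 / d)

x_star = np.zeros(d)                      # candidate stationary point
g, H = grad_f(x_star), hess_f(x_star)
X = sample_C(10_000)
Dirs = X - x_star
first_order = (Dirs @ g >= -1e-12).all()  # condition (2): grad f(x*)^T (x - x*) >= 0
tangent = np.abs(Dirs @ g) <= 1e-12       # directions with grad f(x*)^T (x - x*) = 0
quad = np.einsum("ij,jk,ik->i", Dirs[tangent], H, Dirs[tangent])
second_order = (quad >= -1e-12).all()     # condition (3)
print(first_order, second_order)          # True, False: an FOSP but not an SOSP
```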
As in the unconstrained setting, the first-order and second-order optimality conditions may not be satisfied in a finite number of iterations, so we focus on finding an approximate SOSP.

Definition 1. Recall the twice continuously differentiable function $f : \mathbb{R}^d \to \mathbb{R}$ and the closed convex set $C \subseteq \mathbb{R}^d$ introduced in Problem (1). We call $x^* \in C$ an $(\epsilon, \gamma)$-second-order stationary point of Problem (1) if the following conditions are satisfied:

$\nabla f(x^*)^\top (x - x^*) \ge -\epsilon$, for all $x \in C$,   (4)
$(x - x^*)^\top \nabla^2 f(x^*)(x - x^*) \ge -\gamma$, for all $x \in C$ s.t. $\nabla f(x^*)^\top (x - x^*) = 0$.   (5)

If a point only satisfies the first condition, we call it an $\epsilon$-first-order stationary point. We further formally define strict saddle points of the constrained optimization problem in (1).

Definition 2. A point $x^* \in C$ is a $\delta$-strict saddle point of Problem (1) if (i) for all $x \in C$ the condition $\nabla f(x^*)^\top (x - x^*) \ge 0$ holds, and (ii) there exists a point $y$ such that

$(y - x^*)^\top \nabla^2 f(x^*)(y - x^*) < -\delta$, with $y \in C$ and $\nabla f(x^*)^\top (y - x^*) = 0$.   (6)

According to Definitions 1 and 2, if all saddle points are $\delta$-strict and $\gamma \le \delta$, then any $(\epsilon, \gamma)$-SOSP of Problem (1) is an approximate local minimum. We emphasize that in this paper we do not assume that all saddles are strict in order to prove convergence to an SOSP. We formally define strict saddles only to clarify that if all saddles are strict, then convergence to an approximate SOSP is equivalent to convergence to an approximate local minimum. Our goal throughout the rest of the paper is to design an algorithm that finds an $(\epsilon, \gamma)$-SOSP of Problem (1). To do so, we first assume the following conditions are satisfied.

Assumption 1. The gradients $\nabla f$ are $L$-Lipschitz continuous over the set $C$, i.e., for any $x, \hat{x} \in C$,

$\|\nabla f(x) - \nabla f(\hat{x})\| \le L \|x - \hat{x}\|$.   (7)

Assumption 2. The Hessians $\nabla^2 f$ are $M$-Lipschitz continuous over the set $C$, i.e., for any $x, \hat{x} \in C$,

$\|\nabla^2 f(x) - \nabla^2 f(\hat{x})\| \le M \|x - \hat{x}\|$.   (8)

Assumption 3. The diameter of the compact convex set $C$ is bounded above by a constant $D$, i.e.,

$\max_{x, \hat{x} \in C} \|x - \hat{x}\| \le D$.   (9)

3 Main Result

In this section, we introduce a generic framework to reach an $(\epsilon, \gamma)$-SOSP of the nonconvex function $f$ over the convex set $C$, when $C$ has a specific structure, as we describe below. In particular, we focus on the case where we can solve a quadratic program (QP) of the form

$\text{minimize } x^\top A x + b^\top x + c, \quad \text{subject to } x \in C$,   (10)

up to a constant factor $\rho \le 1$ in a finite number of arithmetic operations. Here, $A \in \mathbb{R}^{d \times d}$ is a symmetric matrix, $b \in \mathbb{R}^d$ is a vector, and $c \in \mathbb{R}$ is a scalar. To clarify the notion of solving a problem up to a constant factor $\rho$, let $x^*$ be a global minimizer of (10). Then, we say Problem (10) is solved up to a constant factor $\rho \in (0, 1]$ if we have found a feasible solution $\bar{x} \in C$ such that

$x^{*\top} A x^* + b^\top x^* + c \;\le\; \bar{x}^\top A \bar{x} + b^\top \bar{x} + c \;\le\; \rho\,(x^{*\top} A x^* + b^\top x^* + c)$.   (11)

Note that here, w.l.o.g., we have assumed that the optimal objective value $x^{*\top} A x^* + b^\top x^* + c$ is non-positive. A larger constant $\rho$ implies that the approximate solution is more accurate. If $\bar{x}$ satisfies the condition in (11), we call it a $\rho$-approximate solution of Problem (10); indeed, if $\rho = 1$ then $\bar{x}$ is a global minimizer of Problem (10).

In Algorithm 1, we introduce a generic framework that achieves an $(\epsilon, \gamma)$-SOSP of Problem (1) whose running time is polynomial in $\epsilon^{-1}$, $\gamma^{-1}$, $\rho^{-1}$, and $d$, whenever we can find a $\rho$-approximate solution of a quadratic problem of the form (10) in time polynomial in $d$. The proposed scheme consists of two major stages.
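Before detailing the two stages, here is a minimal sketch of the $\rho$-approximation test in (11); the helper name `is_rho_approximate` and its inputs are ours, introduced only for illustration.

```python
def is_rho_approximate(q, x_bar, x_star, rho):
    """Check condition (11): q maps a point to the objective value
    x^T A x + b^T x + c of (10); x_star is a global minimizer and
    x_bar a candidate feasible point."""
    q_opt, q_bar = q(x_star), q(x_bar)
    assert q_opt <= 0, "(11) is stated assuming a non-positive optimal value"
    return q_opt <= q_bar <= rho * q_opt   # condition (11)
```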
In the first stage, as outlined in Steps 2-4, we use a first-order update, i.e., a gradient-based update, to find an $\epsilon$-FOSP; that is, we update the decision variable $x$ according to a first-order update until we reach a point $x_t$ that satisfies the condition

$\nabla f(x_t)^\top (x - x_t) \ge -\epsilon$, for all $x \in C$.   (12)

In Section 4, we study in detail projected gradient descent and conditional gradient algorithms for the first-order stage of the proposed framework. Interestingly, both of these algorithms require at most $\mathcal{O}(\epsilon^{-2})$ iterations to reach an $\epsilon$-first-order stationary point.

The second stage of the proposed scheme uses second-order information about the objective function $f$ to escape from the stationary point if it is a local maximum or a strict saddle point. To be more precise, if $x_t$ is a feasible point satisfying the condition (12), we aim to find a descent direction by solving the quadratic program

$\text{minimize } q(u) := (u - x_t)^\top \nabla^2 f(x_t)(u - x_t), \quad \text{subject to } u \in C, \; \nabla f(x_t)^\top (u - x_t) = 0$,   (13)

up to a constant factor $\rho \in (0, 1]$. To be more specific, if we define $q(u^*)$ as the optimal objective value of the program in (13), we focus on cases where we can obtain a feasible point $u_t$ that is a $\rho$-approximate solution of Problem (13), i.e., $u_t \in C$ satisfies the constraints in (13) and

$q(u^*) \le q(u_t) \le \rho\, q(u^*)$.   (14)

The problem formulation in (13) can be transformed into the quadratic program in (10); see Section 5 for more details. Note that the constant $\rho$ is independent of $\epsilon$, $\gamma$, and $d$, and only depends on the structure of the convex set $C$. For instance, if $C$ is defined in terms of $m$ quadratic constraints, one can find a $\rho = m^{-2}$ approximate solution of (13) after at most $\mathcal{O}(m d^3)$ arithmetic operations (Section 5).

After computing a feasible point $u_t$ satisfying the condition in (14), we check the quadratic objective value at $u_t$, and if the inequality $q(u_t) < -\rho\gamma$ holds, we follow the update

$x_{t+1} = (1 - \sigma) x_t + \sigma u_t$,   (15)

where $\sigma$ is a positive stepsize. Otherwise, we stop the process and return $x_t$ as an $(\epsilon, \gamma)$-second-order stationary point of Problem (1). To verify this claim, note that Algorithm 1 stops when we reach a point $x_t$ that satisfies the first-order stationarity condition $\nabla f(x_t)^\top (x - x_t) \ge -\epsilon$ and for which the objective value of the $\rho$-approximate solution of the quadratic subproblem is at least $-\rho\gamma$, i.e., $q(u_t) \ge -\rho\gamma$. The second condition, together with the fact that $q(u_t)$ satisfies (14), implies that $q(u^*) \ge -\gamma$. Therefore, for any $x \in C$ with $\nabla f(x_t)^\top (x - x_t) = 0$, it holds that

$(x - x_t)^\top \nabla^2 f(x_t)(x - x_t) \ge -\gamma$.   (16)

Algorithm 1 Generic framework for escaping saddles in constrained optimization
Require: Stepsize σ > 0. Initialize x_0 ∈ C
1: for t = 1, 2, . . . do
2:   if x_t is not an ε-first-order stationary point then
3:     Compute x_{t+1} using first-order information (Frank-Wolfe or projected gradient descent)
4:   else
5:     Find u_t: a ρ-approximate solution of (13)
6:     if q(u_t) < −ργ then
7:       Compute the updated variable x_{t+1} = (1 − σ)x_t + σu_t
8:     else
9:       Return x_t and stop
10:    end if
11:  end if
12: end for

These two observations show that the outcome of the proposed framework in Algorithm 1 is an $(\epsilon, \gamma)$-SOSP of Problem (1).
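The loop structure of Algorithm 1 can be summarized in the following Python sketch. The three oracles, `fosp_gap` (the linear program of Remark 1 returning $\min_{v \in C} \nabla f(x)^\top (v - x)$), `first_order_step` (a Frank-Wolfe or projected gradient update, Section 4), and `solve_qp` (a $\rho$-approximate solver for (13), Section 5), are placeholders of our own naming; the paper leaves them abstract at this level of generality.

```python
def escape_saddles(x, fosp_gap, first_order_step, solve_qp,
                   eps, gamma, rho, sigma, max_iter=10**6):
    """Schematic rendering of Algorithm 1 (a sketch, not the paper's code)."""
    for _ in range(max_iter):
        if fosp_gap(x) < -eps:                 # x is not yet an eps-FOSP
            x = first_order_step(x)            # Steps 2-4: first-order update
        else:
            u, q_u = solve_qp(x)               # rho-approximate solution of (13)
            if q_u < -rho * gamma:             # descent direction found
                x = (1 - sigma) * x + sigma * u    # update (15)
            else:
                return x                       # x is an (eps, gamma)-SOSP
    return x
```

Per Proposition 4 below, the second-order stepsize would be set to $\sigma = \rho\gamma/(M D^3)$.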
Now it remains to characterize the number of iterations that Algorithm 1 performs before reaching an $(\epsilon, \gamma)$-SOSP, which we formally state in the following theorem.

Theorem 1. Consider the optimization problem defined in (1), and suppose that the conditions in Assumptions 1-3 are satisfied. If the first-order stage, i.e., Steps 2-4, uses the update of Frank-Wolfe or projected gradient descent, then the generic framework proposed in Algorithm 1 finds an $(\epsilon, \gamma)$-second-order stationary point of Problem (1) after at most $\mathcal{O}(\max\{\epsilon^{-2}, \rho^{-3}\gamma^{-3}\})$ iterations.

The result in Theorem 1 shows that if the convex constraint set $C$ is such that one can solve the quadratic subproblem in (13) $\rho$-approximately, then the proposed generic framework finds an $(\epsilon, \gamma)$-SOSP of Problem (1) after at most $\mathcal{O}(\epsilon^{-2})$ first-order and $\mathcal{O}(\rho^{-3}\gamma^{-3})$ second-order updates. To prove the claim in Theorem 1, we first review the first-order conditional gradient and projected gradient algorithms and show that if the current iterate is not a first-order stationary point, following either of these updates decreases the objective value by a constant of $\mathcal{O}(\epsilon^2)$ (Section 4). We then focus on the second stage of Algorithm 1, which corresponds to the case where the current iterate is an $\epsilon$-FOSP and we need to solve the quadratic program in (13) approximately (Section 5). In this case, we show that if the iterate is not an $(\epsilon, \gamma)$-SOSP, following the update in (15) decreases the objective value by at least a constant of $\mathcal{O}(\rho^3\gamma^3)$. Combining these two results shows that Algorithm 1 finds an $(\epsilon, \gamma)$-SOSP after at most $\mathcal{O}(\max\{\epsilon^{-2}, \rho^{-3}\gamma^{-3}\})$ iterations.

4 First-Order Step: Convergence to a First-Order Stationary Point

In this section, we study two different first-order methods for the first stage of Algorithm 1. The results in this section can also be used independently for convergence to an FOSP of Problem (1) satisfying

$\nabla f(x^*)^\top (x - x^*) \ge -\epsilon$, for all $x \in C$,   (17)

where $\epsilon > 0$ is a positive constant. Although for Algorithm 1 we assume that $C$ has the specific structure mentioned in (10), the results in this section hold for any closed and compact convex set $C$. To keep our results as general as possible, in this section we study both conditional gradient and projection-based methods for the first stage of the proposed generic framework.

4.1 Conditional gradient update

The conditional gradient (Frank-Wolfe) update has two steps. We first solve the linear program

$v_t = \operatorname{argmax}_{v \in C} \{-\nabla f(x_t)^\top v\}$.   (18)

Then, we compute the updated variable $x_{t+1}$ according to the update

$x_{t+1} = (1 - \eta) x_t + \eta v_t$,   (19)

where $\eta$ is a stepsize. In the following proposition, we show that if the current iterate is not an $\epsilon$-first-order stationary point, then updating the variable according to (18)-(19) decreases the objective value. The proof of the following proposition is adapted from [31].

Proposition 2. Consider the optimization problem in (1), and suppose Assumptions 1 and 3 hold. Set the stepsize in (19) to $\eta = \epsilon/(D^2 L)$. Then, if the iterate $x_t$ at step $t$ is not an $\epsilon$-first-order stationary point, the objective value at the updated variable $x_{t+1}$ satisfies the inequality

$f(x_{t+1}) \le f(x_t) - \dfrac{\epsilon^2}{2 L D^2}$.   (20)

The result in Proposition 2 shows that following the update of the conditional gradient method decreases the objective value by $\mathcal{O}(\epsilon^2)$ if an $\epsilon$-FOSP has not been reached.

Remark 1. In Step 3 of Algorithm 1, we first check whether $x_t$ is an $\epsilon$-FOSP. This can be done by evaluating

$\max_{x \in C} \{-\nabla f(x_t)^\top (x - x_t)\} = \max_{x \in C} \{-\nabla f(x_t)^\top x\} + \nabla f(x_t)^\top x_t$   (21)

and comparing the optimal value with $\epsilon$. Note that the linear program in (21) is the same as the one in (18). Therefore, by checking the first-order optimality condition at $x_t$, the variable $v_t$ is already computed, and we need to solve only one linear program per iteration.
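For concreteness, the following sketch instantiates the update (18)-(19) and the stationarity check of Remark 1 for the special case $C = \{v : \|v\| \le R\}$, where the linear program (18) has the closed-form solution $v = -R\,\nabla f(x)/\|\nabla f(x)\|$; the function name and this choice of $C$ are our own illustration, not from the paper.

```python
import numpy as np

def frank_wolfe_step(x, grad, R, eps, L, D):
    """One conditional-gradient step (18)-(19) over the ball C = {||v|| <= R},
    assuming grad(x) != 0. Returns (new_x, gap); the gap doubles as the
    eps-FOSP test of Remark 1."""
    g = grad(x)
    v = -R * g / np.linalg.norm(g)       # closed-form solution of (18) for the ball
    gap = g @ (v - x)                    # min over C of grad^T (x' - x)
    if gap >= -eps:                      # x already satisfies (12): stop
        return x, gap
    eta = eps / (D**2 * L)               # stepsize of Proposition 2
    return (1 - eta) * x + eta * v, gap
```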
4.2 Projected gradient update

The projected gradient descent (PGD) update consists of two steps: (i) descending along the gradient direction and (ii) projecting the updated variable onto the convex constraint set. These two steps can be combined, and the update can be written explicitly as

$x_{t+1} = \Pi_C\{x_t - \eta \nabla f(x_t)\}$,   (22)

where $\Pi_C(\cdot)$ denotes Euclidean projection onto the convex set $C$ and $\eta$ is a positive stepsize. In the following proposition, we first show that following the PGD update decreases the objective value by a constant until we reach an $\epsilon$-FOSP. Further, we show that the number of iterations PGD requires to reach an $\epsilon$-FOSP is $\mathcal{O}(\epsilon^{-2})$.

Proposition 3. Consider Problem (1). Suppose Assumptions 1 and 3 are satisfied, and further assume that the gradients $\nabla f(x)$ are uniformly bounded by $K$ for all $x \in C$. If the stepsize of the projected gradient descent method defined in (22) is set to $\eta = 1/L$, the objective value decreases by

$f(x_{t+1}) \le f(x_t) - \dfrac{L \epsilon^2}{2 (K + L D)^2}$.   (23)

Moreover, the iterates reach a first-order stationary point satisfying (17) after at most $\mathcal{O}(\epsilon^{-2})$ iterations.

Proposition 3 shows that following the PGD update decreases the function value by $\mathcal{O}(\epsilon^2)$ until we reach an $\epsilon$-FOSP. It further shows that PGD obtains an $\epsilon$-FOSP satisfying (17) after at most $\mathcal{O}(\epsilon^{-2})$ iterations. To the best of our knowledge, this result is also novel, since the only convergence guarantee for PGD in [26] is in terms of the number of iterations needed to reach a point with gradient mapping norm less than $\epsilon$, while our result characterizes the number of iterations needed to satisfy (17).

Remark 2. To use the PGD update in the first stage of Algorithm 1, one needs a criterion for checking whether $x_t$ is an $\epsilon$-FOSP. However, in PGD we do not solve the linear program $\min_{x \in C}\{\nabla f(x_t)^\top (x - x_t)\}$. This issue can be resolved by checking the condition $\|x_t - x_{t+1}\| \le \epsilon/(K + LD)$, which is a sufficient condition for the condition in (17). In other words, if this condition holds, we stop and $x_t$ is an $\epsilon$-FOSP; otherwise, the result in (23) holds and the function value decreases. For more details, see the proof of Proposition 3.

5 Second-Order Step: Escape from Saddle Points

In this section, we study the second stage of the framework in Algorithm 1, which corresponds to the case where the current iterate is an $\epsilon$-FOSP. Note that when we reach such a critical point, the goal is to find a feasible point $u \in C$ in the tangent space $\nabla f(x_t)^\top (u - x_t) = 0$ that makes the inner product $(u - x_t)^\top \nabla^2 f(x_t)(u - x_t)$ smaller than $-\rho\gamma$. To achieve this goal, we need to check the minimum value of this inner product over the constraints, i.e., we need to solve the quadratic program in (13) up to a constant factor $\rho \in (0, 1]$. In the following proposition, we show that the update in (15) decreases the objective value whenever the condition $q(u_t) < -\rho\gamma$ holds.

Proposition 4. Consider the quadratic program in (13), and let $u_t$ be a $\rho$-approximate solution of the quadratic subproblem in (13). Suppose that Assumptions 2 and 3 hold, and set the stepsize to $\sigma = \rho\gamma/(M D^3)$. If the quadratic objective value $q$ evaluated at $u_t$ satisfies the condition $q(u_t) < -\rho\gamma$, then the updated variable in (15) satisfies the inequality

$f(x_{t+1}) \le f(x_t) - \dfrac{\rho^3 \gamma^3}{3 M^2 D^6}$.   (24)
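In code, the escape step of Proposition 4 is a one-liner around the update (15); the function name below is ours, and the inputs are the quantities defined above.

```python
def second_order_step(x, u, q_u, rho, gamma, M, D):
    """Escape update (15) with the stepsize sigma = rho*gamma/(M*D^3) of
    Proposition 4. Returns None when q(u) >= -rho*gamma, i.e., when x already
    satisfies the approximate second-order condition (16)."""
    if q_u >= -rho * gamma:
        return None                       # stop: x is an (eps, gamma)-SOSP
    sigma = rho * gamma / (M * D**3)
    # Proposition 4: this step decreases f by at least rho^3 gamma^3 / (3 M^2 D^6)
    return (1 - sigma) * x + sigma * u
```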
The only unanswered question is how to solve the quadratic subproblem in (13) up to a constant factor $\rho \in (0, 1]$. For general $C$, the quadratic subproblem could be NP-hard [34]; however, for some special choices of the convex constraint set $C$, this quadratic program (QP) can be solved either exactly or approximately up to a constant factor. In the following section, we focus on the quadratically constrained case, but there are indeed other classes of constraints that satisfy our required condition.

5.1 Quadratic constraints case

In this section, we focus on the case where the constraint set $C$ is defined as the intersection of $m$ ellipsoids centered at the origin.¹ In particular, assume that the set $C$ is given by

$C := \{x \in \mathbb{R}^d \mid x^\top Q_i x \le 1, \text{ for all } i = 1, \dots, m\}$,   (25)

where $Q_i \in \mathbb{S}^d_{+}$. Under this assumption, the QP in (13) can be written as

$\min_u \; (u - x_t)^\top \nabla^2 f(x_t)(u - x_t) \quad \text{s.t.} \quad u^\top Q_i u \le 1 \text{ for } i = 1, \dots, m, \;\; \nabla f(x_t)^\top (u - x_t) = 0$.   (26)

Note that the equality constraint $\nabla f(x_t)^\top (u - x_t) = 0$ does not change the hardness of the problem and can easily be eliminated. To do so, first define a new optimization variable $z := u - x_t$ to obtain

$\min_z \; z^\top \nabla^2 f(x_t) z \quad \text{s.t.} \quad (z + x_t)^\top Q_i (z + x_t) \le 1 \text{ for } i = 1, \dots, m, \;\; \nabla f(x_t)^\top z = 0$.   (27)

Then, find a basis for the tangent space $\nabla f(x_t)^\top z = 0$. Indeed, using the Gram-Schmidt procedure, we can find an orthonormal basis for the space $\mathbb{R}^d$ of the form $\{v_1, \dots, v_{d-1}, \nabla f(x_t)/\|\nabla f(x_t)\|\}$ at a complexity of $\mathcal{O}(d^3)$. If we define $A = [v_1, \dots, v_{d-1}] \in \mathbb{R}^{d \times (d-1)}$ as the concatenation of the vectors $\{v_1, \dots, v_{d-1}\}$, then any vector $z$ satisfying $\nabla f(x_t)^\top z = 0$ can be written as $z = A y$, where $y \in \mathbb{R}^{d-1}$. Hence, (27) is equivalent to

$\min_y \; y^\top A^\top \nabla^2 f(x_t) A y \quad \text{s.t.} \quad (A y + x_t)^\top Q_i (A y + x_t) \le 1 \text{ for } i = 1, \dots, m$.   (28)

This procedure reduces the dimension of the problem from $d$ to $d - 1$. It is not hard to check that the center of the ellipsoids in (28) is $-A^\top x_t$. By the simple change of variable $A\hat{y} := A y + x_t$ we obtain

$\min_{\hat{y}} \; \hat{y}^\top A^\top \nabla^2 f(x_t) A \hat{y} - 2 x_t^\top \nabla^2 f(x_t) A \hat{y} + x_t^\top \nabla^2 f(x_t) x_t \quad \text{s.t.} \quad \hat{y}^\top A^\top Q_i A \hat{y} \le 1 \text{ for } i = 1, \dots, m$.   (29)

Define the matrices $\bar{Q}_i := A^\top Q_i A$ and $B_t := A^\top \nabla^2 f(x_t) A$, the vector $s_t := -2 A^\top \nabla^2 f(x_t) x_t$, and the scalar $c_t := x_t^\top \nabla^2 f(x_t) x_t$. Using these definitions, the problem reduces to

$\min_{\hat{y}} \; q(\hat{y}) := \hat{y}^\top B_t \hat{y} + s_t^\top \hat{y} + c_t \quad \text{s.t.} \quad \hat{y}^\top \bar{Q}_i \hat{y} \le 1 \text{ for } i = 1, \dots, m$.   (30)

Note that the matrices $\bar{Q}_i \in \mathbb{S}^{d-1}_{+}$ are positive semidefinite, while the matrix $B_t \in \mathbb{S}^{d-1}$ might be indefinite. Indeed, the optimal objective value of the program in (30) equals the optimal objective value of (26). Further, note that if we find a $\rho$-approximate solution $\hat{y}^*$ of (30), we can recover a $\rho$-approximate solution $u^*$ of (26) via the transformation $u^* = A\hat{y}^*$.

¹To simplify the constant-factor approximation, we assume the ellipsoids are centered at the origin. If we drop this assumption, then $\rho$ will depend on the maximum distance between the origin and the boundary of each of the ellipsoids; see, e.g., equation (6) in [44].

The program in (30) is a specific quadratically constrained quadratic program (QCQP) in which all the constraints are centered at 0. For the specific case of $m = 1$, the duality gap of this problem is zero, and simply by transferring the problem to the dual domain one can solve Problem (30) exactly.
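The reduction from (26) to (30) is mechanical; the following numpy/scipy sketch mirrors the transformation above at face value (the function name and input conventions are ours).

```python
import numpy as np
from scipy.linalg import null_space

def reduce_qcqp(H, g, x_t, Qs):
    """Eliminate the equality constraint grad^T (u - x_t) = 0 from (26) and
    return the data (B_t, s_t, c_t, Q_bars) of the lower-dimensional QCQP (30).
    H = Hessian at x_t, g = gradient at x_t, Qs = list of ellipsoid matrices."""
    A = null_space(g.reshape(1, -1))      # d x (d-1) orthonormal basis of g^T z = 0
    B = A.T @ H @ A                       # B_t = A^T Hessian A
    s = -2 * A.T @ H @ x_t                # s_t from the expansion in (29)
    c = x_t @ H @ x_t                     # c_t = x_t^T Hessian x_t
    Q_bars = [A.T @ Q @ A for Q in Qs]    # constraint matrices of (30)
    return B, s, c, Q_bars                # a solution y_hat maps back via u = A @ y_hat
```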
In the following proposition, we focus on the general case of $m \ge 1$ and explain how to find a $\rho$-approximate solution of (30).

Proposition 5. Consider Problem (30), and define $q_{\min}$ as its minimum objective value. Based on the result in [22], there exists a polynomial-time method that obtains a point $\hat{y}^*$ satisfying

$q(\hat{y}^*) \;\le\; \dfrac{1}{m^2 (1 + \bar{\epsilon})^2}\, q_{\min} + \left(1 - \dfrac{1}{m^2 (1 + \bar{\epsilon})^2}\right) x_t^\top \nabla^2 f(x_t) x_t$   (31)

after at most $\mathcal{O}(d^3 (m \log(1/\delta) + \log(1/\bar{\epsilon}) + \log d))$ arithmetic operations, where $\delta$ is the ratio of the radius of the largest inscribed sphere to that of the smallest circumscribing sphere of the feasible set. Further, based on [44], using an SDP relaxation of (30) one can find a point $\hat{y}^*$ such that

$q(\hat{y}^*) \;\le\; \dfrac{1}{m}\, q_{\min} + \left(1 - \dfrac{1}{m}\right) x_t^\top \nabla^2 f(x_t) x_t$.   (32)

Proof. If we define the function $\bar{q}$ as $\bar{q}(x) := q(x) - c_t$, then using the approaches in [22] and [44] we can find a $\rho$-approximate solution of $\min_{\hat{y}} \bar{q}(\hat{y})$ subject to $\hat{y}^\top \bar{Q}_i \hat{y} \le 1$ for $i = 1, \dots, m$. In other words, we can find a point $\hat{y}^*$ such that $\bar{q}(\hat{y}^*) \le \rho\, \bar{q}_{\min}$, where $0 < \rho < 1$ and $\bar{q}_{\min}$ is the minimum objective value of $\bar{q}$ over the constraint set, which satisfies $\bar{q}_{\min} = q_{\min} - c_t$. Replacing $\bar{q}(\hat{y}^*)$ and $\bar{q}_{\min}$ by their definitions and regrouping the terms implies that $\hat{y}^*$ satisfies the condition $q(\hat{y}^*) \le \rho\, q_{\min} + (1 - \rho) c_t$. Replacing $\rho$ by $1/(m^2(1 + \bar{\epsilon})^2)$ (the constant-factor approximation shown in [22]) leads to the claim in (31), and substituting $\rho$ by $1/m$ (the approximation bound in [44]) implies the result in (32).

The result in Proposition 5 indicates that if $x_t^\top \nabla^2 f(x_t) x_t$ is non-positive, then one can find a $\rho$-approximate solution of Problem (30), and consequently of Problem (26). This condition is satisfied if we assume that $\max_{x \in C} x^\top \nabla^2 f(x) x \le 0$; for instance, it holds for a concave minimization problem over the convex set $C$. In fact, it can be shown that our analysis still stands even if $\max_{x \in C} x^\top \nabla^2 f(x) x$ is at most $\mathcal{O}(\gamma)$. Note that this condition is significantly weaker than requiring the function to be concave when restricted to the feasible set: it essentially requires that the quadratic term in the Taylor expansion of the function evaluated at the origin be negative (or not too positive).

Corollary 1. Consider a convex set $C$ defined as the intersection of $m \ge 1$ ellipsoids centered at the origin, and assume that the objective function Hessian $\nabla^2 f$ satisfies the condition $\max_{x \in C} x^\top \nabla^2 f(x) x \le 0$. Then, for $\rho = 1/m^2$ (via [22]) and $\rho = 1/m$ (via [44]), it is possible to find a $\rho$-approximate solution of Problem (13) in time polynomial in $m$ and $d$.

By using the approach in [22], we can solve the QCQP in (29) with the approximation factor $\rho = 1/m^2$ for $m \ge 1$ at an overall complexity of $\mathcal{O}(m d^3)$ when the constraint set $C$ is defined by $m$ convex quadratic constraints. As the total number of calls to the second-order stage is at most $\mathcal{O}(\rho^{-3}\gamma^{-3}) = \mathcal{O}(m^6 \gamma^{-3})$, we obtain that the total number of arithmetic operations for the second-order stage is at most $\mathcal{O}(m^7 d^3 \gamma^{-3})$. The constant factor $\rho$ can be improved to $1/m$ if we solve the SDP relaxation problem suggested in [44].

6 Stochastic Extension

In this section, we focus on stochastic constrained minimization problems. Consider the optimization problem in (1) when the objective function $f$ is defined as an expectation of a set of stochastic functions $F : \mathbb{R}^d \times \mathbb{R}^r \to \mathbb{R}$ with inputs $x \in \mathbb{R}^d$ and $\theta \in \mathbb{R}^r$, where $\theta$ is a random variable with probability distribution $P$. To be more precise, we consider the optimization problem

$\text{minimize } f(x) := \mathbb{E}_\theta[F(x, \theta)], \quad \text{subject to } x \in C$.   (33)

Our goal is to find a point that satisfies the necessary optimality conditions with high probability.
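Algorithm 2 below replaces the exact gradient and Hessian with mini-batch estimates $d_t$ and $H_t$; a sketch of such estimators follows, where the sampler `draw_theta` and the per-sample oracles `grad_F`, `hess_F` are assumed interfaces of our own naming.

```python
import numpy as np

def stochastic_oracles(x, grad_F, hess_F, draw_theta, b_g, b_H):
    """Return the mini-batch estimates d_t = (1/b_g) sum_i grad F(x, theta_i)
    and H_t = (1/b_H) sum_i hess F(x, theta_i) used in Algorithm 2."""
    d_t = np.mean([grad_F(x, draw_theta()) for _ in range(b_g)], axis=0)
    H_t = np.mean([hess_F(x, draw_theta()) for _ in range(b_H)], axis=0)
    return d_t, H_t
```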
Algorithm 2 Stochastic variant of the proposed framework
Require: Stepsize σ > 0. Initialize x_0 ∈ C
1: for t = 1, 2, . . . do
2:   Compute v_t = argmax_{v∈C} {−d_t^⊤ v}
3:   if d_t^⊤(v_t − x_t) < −ε/2 then
4:     Compute x_{t+1} = (1 − η)x_t + ηv_t
5:   else
6:     Find u_t: a ρ-approximate solution of min_u (u − x_t)^⊤ H_t (u − x_t) s.t. u ∈ C, d_t^⊤(u − x_t) ≤ r
7:     if q(u_t) < −ργ/2 then
8:       Compute the updated variable x_{t+1} = (1 − σ)x_t + σu_t
9:     else
10:      Return x_t and stop
11:    end if
12:  end if
13: end for

Consider the vector $d_t = (1/b_g) \sum_{i=1}^{b_g} \nabla F(x_t, \theta_i)$ and the matrix $H_t = (1/b_H) \sum_{i=1}^{b_H} \nabla^2 F(x_t, \theta_i)$ as stochastic approximations of the gradient $\nabla f(x_t)$ and Hessian $\nabla^2 f(x_t)$, respectively. Here, $b_g$ and $b_H$ are the gradient and Hessian batch sizes, respectively, and the vectors $\theta_i$ are realizations of the random variable $\theta$. In Algorithm 2, we present the stochastic variant of our proposed scheme for finding an $(\epsilon, \gamma)$-SOSP of Problem (33). Algorithm 2 differs from Algorithm 1 in using the stochastic gradients $d_t$ and Hessians $H_t$ in lieu of the exact gradients $\nabla f(x_t)$ and Hessians $\nabla^2 f(x_t)$. The second major difference is the inequality constraint in Step 6: instead of the constraint $d_t^\top (u - x_t) = 0$, we use $d_t^\top (u - x_t) \le r$, where $r > 0$ is a properly chosen constant. This modification is needed to ensure that if a point satisfies this constraint, then with high probability it also satisfies the constraint $\nabla f(x_t)^\top (u - x_t) = 0$. It also implies that we need to handle a linear inequality constraint instead of a linear equality constraint, which is computationally manageable for some constraint sets, including the case where $C$ is a single ball constraint [28]. To prove our main result, we assume that the following conditions also hold.

Assumption 4. The variances of the stochastic gradients and Hessians are uniformly bounded by constants $\sigma^2$ and $\zeta^2$, respectively, i.e., for any $x \in C$ and $\theta$ we can write

$\mathbb{E}\big[\|\nabla F(x, \theta) - \nabla f(x)\|^2\big] \le \sigma^2, \qquad \mathbb{E}\big[\|\nabla^2 F(x, \theta) - \nabla^2 f(x)\|^2\big] \le \zeta^2$.

The required conditions in Assumption 4 ensure that the variances of the stochastic gradients and Hessians are uniformly bounded above, which is customary in stochastic optimization. In the following theorem, we characterize the iteration complexity of Algorithm 2 for reaching an $(\epsilon, \gamma)$-SOSP of Problem (33) with high probability.

Theorem 2. Consider the optimization problem in (33), and suppose the conditions in Assumptions 1-4 are satisfied. If the batch sizes are $b_g = \mathcal{O}(\max\{\rho^{-4}\gamma^{-4}, \epsilon^{-2}\})$ and $b_H = \mathcal{O}(\rho^{-2}\gamma^{-2})$, and we set the parameter $r = \mathcal{O}(\rho^2\gamma^2)$, then the outcome of the proposed framework outlined in Algorithm 2 is an $(\epsilon, \gamma)$-second-order stationary point of Problem (33) with high probability. Further, the total number of iterations to reach such a point is at most $\mathcal{O}(\max\{\epsilon^{-2}, \rho^{-3}\gamma^{-3}\})$ with high probability.

The result in Theorem 2 indicates that the total number of iterations to reach an $(\epsilon, \gamma)$-SOSP is at most $\mathcal{O}(\max\{\epsilon^{-2}, \rho^{-3}\gamma^{-3}\})$. As each iteration requires at most $\mathcal{O}(\max\{\rho^{-4}\gamma^{-4}, \epsilon^{-2}\})$ stochastic gradient and $\mathcal{O}(\rho^{-2}\gamma^{-2})$ stochastic Hessian evaluations, the total numbers of stochastic gradient and Hessian computations to reach an $(\epsilon, \gamma)$-SOSP are $\mathcal{O}(\max\{\epsilon^{-2}\rho^{-4}\gamma^{-4}, \epsilon^{-4}, \rho^{-7}\gamma^{-7}\})$ and $\mathcal{O}(\max\{\epsilon^{-2}\rho^{-3}\gamma^{-3}, \rho^{-5}\gamma^{-5}\})$, respectively.

Acknowledgment

This work was supported by the DARPA Lagrange and ONR BRC programs. The authors would like to thank Yue Sun for pointing out a missing condition in the first draft of the paper.

References

[1] N. Agarwal, Z. Allen-Zhu, B. Bullins, E. Hazan, and T. Ma. Finding approximate local minima faster than gradient descent. In STOC, pages 1195-1199, 2017.
[2] Z. Allen-Zhu. Natasha 2: Faster non-convex optimization than SGD. CoRR, abs/1708.08694, 2017.
[3] Z. Allen-Zhu and E. Hazan. Variance reduction for faster non-convex optimization. In ICML, pages 699-707, 2016.
[4] D. P. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, 1999.
[5] W. Bian, X. Chen, and Y. Ye.
Complexity analysis of interior point algorithms for non-Lipschitz and nonconvex minimization. Math. Program., 149(1-2):301-327, 2015.
[6] J. V. Burke, J. J. Moré, and G. Toraldo. Convergence properties of trust region methods for linear and convex constraints. Math. Program., 47(1-3):305-336, 1990.
[7] Y. Carmon, J. C. Duchi, O. Hinder, and A. Sidford. Accelerated methods for non-convex optimization. CoRR, abs/1611.00756, 2016.
[8] Y. Carmon, J. C. Duchi, O. Hinder, and A. Sidford. "Convex until proven guilty": Dimension-free acceleration of gradient descent on non-convex functions. In ICML, pages 654-663, 2017.
[9] Y. Carmon, J. C. Duchi, O. Hinder, and A. Sidford. Lower bounds for finding stationary points I. arXiv preprint arXiv:1710.11606, 2017.
[10] Y. Carmon, J. C. Duchi, O. Hinder, and A. Sidford. Lower bounds for finding stationary points II: First-order methods. arXiv preprint arXiv:1711.00841, 2017.
[11] C. Cartis, N. Gould, and P. Toint. Adaptive cubic regularisation methods for unconstrained optimization. Part I: Motivation, convergence and numerical results. Math. Program., 127(2):245-295, 2011.
[12] C. Cartis, N. Gould, and P. Toint. Adaptive cubic regularisation methods for unconstrained optimization. Part II: Worst-case function- and derivative-evaluation complexity. Math. Program., 130(2):295-319, 2011.
[13] C. Cartis, N. Gould, and P. Toint. Complexity bounds for second-order optimality in unconstrained optimization. J. Complexity, 28(1):93-108, 2012.
[14] C. Cartis, N. Gould, and P. Toint. An adaptive cubic regularization algorithm for nonconvex optimization with convex constraints and its function-evaluation complexity. IMA Journal of Numerical Analysis, 32(4):1662-1695, 2012.
[15] C. Cartis, N. Gould, and P. Toint. On the evaluation complexity of cubic regularization methods for potentially rank-deficient nonlinear least-squares problems and its relevance to constrained nonlinear optimization. SIAM J. Opt., 23(3):1553-1574, 2013.
[16] C. Cartis, N. Gould, and P. Toint. On the evaluation complexity of constrained nonlinear least-squares and general constrained nonlinear optimization using second-order methods. SIAM Journal on Numerical Analysis, 53(2):836-851, 2015.
[17] C. Cartis, N. Gould, and P. Toint. Second-order optimality and beyond: Characterization and evaluation complexity in convexly constrained nonlinear optimization. Foundations of Computational Mathematics, pages 1-35, 2017.
[18] A. R. Conn, N. Gould, A. Sartenaer, and P. Toint. Global convergence of a class of trust region algorithms for optimization using inexact projections on convex constraints. SIAM J. on Opt., 3(1):164-221, 1993.
[19] F. E. Curtis, D. P. Robinson, and M. Samadi. A trust region algorithm with a worst-case iteration complexity of $\mathcal{O}(\epsilon^{-3/2})$ for nonconvex optimization. Math. Program., 162(1-2):1-32, 2017.
[20] G. Di Pillo, S. Lucidi, and L. Palagi. Convergence to second-order stationary points of a primal-dual algorithm model for nonlinear programming. Mathematics of Operations Research, 30(4):897-915, 2005.
[21] F. Facchinei and S. Lucidi. Convergence to second order stationary points in inequality constrained optimization. Mathematics of Operations Research, 23(3):746-766, 1998.
[22] M. Fu, Z.-Q. Luo, and Y. Ye. Approximation algorithms for quadratic programming. Journal of Combinatorial Optimization, 2(1):29-50, 1998.
[23] R. Ge, F. Huang, C. Jin, and Y. Yuan. Escaping from saddle points: online stochastic gradient for tensor decomposition.
In COLT, pages 797-842, 2015.
[24] R. Ge, J. Lee, and T. Ma. Matrix completion has no spurious local minimum. In NIPS, pages 2973-2981, 2016.
[25] S. Ghadimi and G. Lan. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Math. Program., 156(1-2):59-99, 2016.
[26] S. Ghadimi, G. Lan, and H. Zhang. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Math. Program., 155(1-2):267-305, 2016.
[27] G. Haeser, H. Liu, and Y. Ye. Optimality condition and complexity analysis for linearly-constrained optimization without differentiability on the boundary. Math. Program., pages 1-37, 2017.
[28] V. Jeyakumar and G. Li. Trust-region problems with linear inequality constraints: exact SDP relaxation, global optimality and robust optimization. Mathematical Programming, 147(1-2):171-206, 2014.
[29] C. Jin, R. Ge, P. Netrapalli, S. M. Kakade, and M. I. Jordan. How to escape saddle points efficiently. In ICML, pages 1724-1732, 2017.
[30] C. Jin, P. Netrapalli, and M. I. Jordan. Accelerated gradient descent escapes saddle points faster than gradient descent. CoRR, abs/1711.10456, 2017.
[31] S. Lacoste-Julien. Convergence rate of Frank-Wolfe for non-convex objectives. arXiv preprint arXiv:1607.00345, 2016.
[32] L. Lei, C. Ju, J. Chen, and M. I. Jordan. Non-convex finite-sum optimization via SCSG methods. In Advances in Neural Information Processing Systems 30, pages 2345-2355, 2017.
[33] J. M. Martínez and M. Raydan. Cubic-regularization counterpart of a variable-norm trust-region method for unconstrained minimization. J. Global Optimization, 68(2):367-385, 2017.
[34] K. G. Murty and S. N. Kabadi. Some NP-complete problems in quadratic and nonlinear programming. Math. Program., 39(2):117-129, 1987.
[35] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer Science & Business Media, 2013.
[36] Y. Nesterov and B. T. Polyak. Cubic regularization of Newton method and its global performance. Math. Program., 108(1):177-205, 2006.
[37] S. Paternain, A. Mokhtari, and A. Ribeiro. A second order method for nonconvex optimization. arXiv preprint arXiv:1707.08028, 2017.
[38] S. J. Reddi, A. Hefny, S. Sra, B. Póczos, and A. J. Smola. Stochastic variance reduction for nonconvex optimization. In ICML, pages 314-323, 2016.
[39] S. J. Reddi, S. Sra, B. Póczos, and A. J. Smola. Fast incremental method for smooth nonconvex optimization. In IEEE Conference on Decision and Control, CDC, pages 1971-1977, 2016.
[40] S. J. Reddi, M. Zaheer, S. Sra, B. Póczos, F. Bach, R. Salakhutdinov, and A. J. Smola. A generic approach for escaping saddle points. In AISTATS, pages 1233-1242, 2018.
[41] C. W. Royer and S. J. Wright. Complexity analysis of second-order line-search algorithms for smooth nonconvex optimization. arXiv preprint arXiv:1706.03131, 2017.
[42] J. Sun, Q. Qu, and J. Wright. A geometric analysis of phase retrieval. In IEEE International Symposium on Information Theory, ISIT 2016, pages 2379-2383, 2016.
[43] J. Sun, Q. Qu, and J. Wright. Complete dictionary recovery over the sphere I: Overview and the geometric picture. IEEE Trans. Information Theory, 63(2):853-884, 2017.
[44] P. Tseng. Further results on approximating nonconvex quadratic optimization by semidefinite programming relaxation. SIAM Journal on Optimization, 14(1):268-283, 2003.
[45] Y. Xu and T. Yang. First-order stochastic algorithms for escaping from saddle points in almost linear time. arXiv preprint arXiv:1711.01944, 2017.
[46] Y. Ye.
On affine scaling algorithms for nonconvex quadratic programming. Math. Program., 56(1-3):285-300, 1992.