Escaping Saddle Points in Constrained Optimization

Aryan Mokhtari, MIT, Cambridge, MA 02139, aryanm@mit.edu
Asuman Ozdaglar, MIT, Cambridge, MA 02139, asuman@mit.edu
Ali Jadbabaie, MIT, Cambridge, MA 02139, jadbabai@mit.edu

Abstract

In this paper, we study the problem of escaping from saddle points in smooth nonconvex optimization problems subject to a convex set $C$. We propose a generic framework that yields convergence to a second-order stationary point of the problem, if the convex set $C$ is simple for a quadratic objective function. Specifically, our results hold if one can find a $\rho$-approximate solution of a quadratic program subject to $C$ in polynomial time, where $\rho < 1$ is a positive constant that depends on the structure of the set $C$. Under this condition, we show that the sequence of iterates generated by the proposed framework reaches an $(\epsilon, \gamma)$-second-order stationary point (SOSP) in at most $\mathcal{O}(\max\{\epsilon^{-2}, \rho^{-3}\gamma^{-3}\})$ iterations. We further characterize the overall complexity of reaching an SOSP when the convex set $C$ can be written as a set of quadratic constraints and the objective function Hessian has a specific structure over the convex set $C$. Finally, we extend our results to the stochastic setting and characterize the number of stochastic gradient and Hessian evaluations to reach an $(\epsilon, \gamma)$-SOSP.

1 Introduction

There has been a recent revival of interest in nonconvex optimization, due to obvious applications in machine learning. While the modern history of the subject goes back six or seven decades, the recent attention to the topic stems from new applications as well as the availability of modern analytical and computational tools, providing a new perspective on classical problems. Following this trend, in this paper we focus on the problem of minimizing a smooth nonconvex function over a convex set:

$\text{minimize } f(x), \quad \text{subject to } x \in C$,   (1)

where $x \in \mathbb{R}^d$ is the decision variable, $C \subseteq \mathbb{R}^d$ is a closed convex set, and $f : \mathbb{R}^d \to \mathbb{R}$ is a twice continuously differentiable function over $C$. It is well known that finding a global minimum of Problem (1) is hard. Equally well known is the fact that for certain nonconvex problems, all local minimizers are global. These include, for example, matrix completion [24], phase retrieval [42], and dictionary learning [43]. For such problems, finding a global minimum of Problem (1) reduces to the problem of finding one of its local minima.

Given the well-known hardness results for finding stationary points, recent focus has shifted to characterizing approximate stationary points. When the objective function $f$ is convex, finding an $\epsilon$-first-order stationary point is often sufficient, since it leads to finding an approximate local (and hence global) minimum. However, in the nonconvex setting, even when the problem is unconstrained, i.e., $C = \mathbb{R}^d$, convergence to a first-order stationary point (FOSP) is not enough, as the critical point to which convergence is established might be a saddle point. It is therefore natural to look at higher-order derivatives and search for a second-order stationary point. Indeed, under the assumption that all saddle points are strict (formally defined later), in both the unconstrained and constrained settings, convergence to a second-order stationary point (SOSP) implies convergence to a local minimum.
While convergence to an SOSP has been thoroughly investigated in the recent literature for the unconstrained setting, the overall complexity for the constrained setting has not been studied yet.

Contributions. Our main contribution is to propose a generic framework that generates a sequence of iterates converging to an approximate second-order stationary point of the constrained nonconvex problem in (1), when the convex set $C$ has a specific structure that allows for approximate minimization of a quadratic loss over the feasible set. The proposed framework consists of two main stages: first, it utilizes first-order information to reach a first-order stationary point; next, it incorporates second-order information to escape from a stationary point if it is a local maximizer or a strict saddle point. We show that the proposed approach leads to an $(\epsilon, \gamma)$-second-order stationary point (SOSP) of Problem (1) (see Definition 1). The proposed approach utilizes advances in constant-factor optimization of nonconvex quadratic programs [46, 22, 44] that find a $\rho$-approximate solution over $C$ in polynomial time, where $\rho$ is a positive constant smaller than 1 that depends on the structure of $C$. When such an approximate solution exists, the sequence of iterates generated by the proposed framework reaches an $(\epsilon, \gamma)$-SOSP of Problem (1) in at most $\mathcal{O}(\max\{\epsilon^{-2}, \rho^{-3}\gamma^{-3}\})$ iterations. We show that quadratic constraints satisfy the required condition on the convex set $C$ if the objective function Hessian $\nabla^2 f$ has a specific structure over $C$ (formally described later). For this case, we show that it is possible to achieve an $(\epsilon, \gamma)$-SOSP after at most $\mathcal{O}(\max\{\epsilon^{-2}\varrho, d^3 m^7 \gamma^{-3}\})$ arithmetic operations, where $d$ is the dimension of the problem, $m$ is the number of quadratic constraints, and $\varrho$ is the number of arithmetic operations required to solve a linear program over $C$ or to project a point onto $C$. We further extend our results to the stochastic setting and show that we can reach an $(\epsilon, \gamma)$-SOSP after computing at most $\mathcal{O}(\max\{\epsilon^{-4}, \epsilon^{-2}\rho^{-4}\gamma^{-4}, \rho^{-7}\gamma^{-7}\})$ stochastic gradients and $\mathcal{O}(\max\{\epsilon^{-2}\rho^{-3}\gamma^{-3}, \rho^{-5}\gamma^{-5}\})$ stochastic Hessians.

1.1 Related work

Unconstrained case. The rich literature on nonconvex optimization provides a plethora of algorithms for reaching stationary points of a smooth unconstrained minimization problem. Convergence to first-order stationary points (FOSP) has been widely studied for both deterministic [35, 1, 7-10] and stochastic settings [39, 38, 3, 32]. Stronger results, which establish convergence to an SOSP, are also available. Numerical optimization methods such as trust-region methods [13, 19, 33] and cubic regularization algorithms [36, 11, 12] can reach an approximate second-order stationary point in a finite number of iterations; however, the computational complexity of each iteration is typically large due to the cost of solving trust-region or regularized cubic subproblems. Recently, a new line of research has emerged that focuses on the overall computational cost of achieving an SOSP. These results build on the idea of escaping from strict saddle points either by perturbing the iterates with properly chosen injected noise [23, 29, 30], or by updating the iterates using the eigenvector corresponding to the smallest eigenvalue of the Hessian [7, 2, 45, 41, 1, 40, 37].

Constrained case. Asymptotic convergence to first-order and second-order stationary points of the constrained optimization problem in (1) has been studied in the numerical optimization community [6, 18, 21, 20].
Recently, finite-time analysis of convergence to an FOSP of the generic smooth constrained problem in (1) has received a lot of attention. In particular, [31] shows that the sequence of iterates generated by the Frank-Wolfe update converges to an $\epsilon$-FOSP after $\mathcal{O}(\epsilon^{-2})$ iterations. The authors of [26] consider the norm of the gradient mapping as a measure of non-stationarity and show that the projected gradient method has the same complexity of $\mathcal{O}(\epsilon^{-2})$. A similar result for the accelerated projected gradient method is shown in [25]. Adaptive cubic regularization methods [14-16] improve these results using second-order information and obtain an $\epsilon$-FOSP of Problem (1) after at most $\mathcal{O}(\epsilon^{-3/2})$ iterations.

Finite-time analysis of convergence to an SOSP has also been studied for linear constraints. In particular, [5] studies convergence to an SOSP of (1) when the set $C$ is given by the linear constraint $x \ge 0$ and proposes a trust-region interior-point method that obtains an $(\epsilon, \sqrt{\epsilon})$-SOSP in $\mathcal{O}(\epsilon^{-3/2})$ iterations. The work in [27] extends these results to the case where the objective function is potentially not differentiable or not twice differentiable on the boundary of the feasible region. The authors of [17] focus on the general convex-constraint case and introduce a trust-region algorithm that requires $\mathcal{O}(\epsilon^{-3})$ iterations to obtain an SOSP; however, each iteration of their proposed method requires access to the exact solution of a nonconvex quadratic program (finding its global minimum), which, in general, could be computationally prohibitive. To the best of our knowledge, our paper provides the first finite-time overall computational complexity analysis for reaching an SOSP of Problem (1).

2 Preliminaries and Definitions

In the case of unconstrained minimization of the objective function $f$, the first-order and second-order necessary conditions for a point $x^*$ to be a local minimum are $\nabla f(x^*) = 0_d$ and $\nabla^2 f(x^*) \succeq 0_{d \times d}$, respectively. A point that satisfies these conditions is called a second-order stationary point (SOSP). If the second condition becomes strict, i.e., $\nabla^2 f(x^*) \succ 0$, then we recover the sufficient conditions for a local minimum. However, to derive finite-time convergence bounds for achieving an SOSP, these conditions should be relaxed; in other words, the goal should be to find an approximate SOSP whose approximation error can be made arbitrarily small. For the case of unconstrained minimization, a point $x^*$ is called an $(\epsilon, \gamma)$-second-order stationary point if it satisfies $\|\nabla f(x^*)\| \le \epsilon$ and $\nabla^2 f(x^*) \succeq -\gamma I_d$, where $\epsilon$ and $\gamma$ are arbitrary positive constants. To study the constrained setting, we first state the necessary conditions for a local minimum of Problem (1).

Proposition 1 ([4]). If $x^* \in C$ is a local minimum of the function $f$ over the convex set $C$, then

$\nabla f(x^*)^\top (x - x^*) \ge 0$, for all $x \in C$,   (2)
$(x - x^*)^\top \nabla^2 f(x^*)(x - x^*) \ge 0$, for all $x \in C$ s.t. $\nabla f(x^*)^\top (x - x^*) = 0$.   (3)

The conditions in (2) and (3) are the first-order and second-order necessary optimality conditions, respectively. By making the inequality in (3) strict, i.e., $(x - x^*)^\top \nabla^2 f(x^*)(x - x^*) > 0$, we recover the sufficient conditions for a local minimum when $C$ is polyhedral [4]. Further, if the inequality in (3) is replaced by $(x - x^*)^\top \nabla^2 f(x^*)(x - x^*) \ge \delta \|x - x^*\|^2$ for some $\delta > 0$, we obtain the sufficient conditions for a local minimum of Problem (1) for any convex constraint set $C$; see [4]. If a point $x^*$ satisfies the conditions in (2) and (3), it is an SOSP of Problem (1).
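To make the conditions of Proposition 1 concrete, the following sketch is illustrative only: the toy objective $f(x) = -\|x\|^2$, the unit-ball constraint set, and all helper names are ours, not the paper's. It checks (2) and (3) at the candidate $x^* = 0$ by sampling feasible points, and finds an FOSP that fails the second-order condition (a strict saddle, here a local maximum).

```python
# Illustrative sampled check of conditions (2)-(3) for f(x) = -||x||^2 over the
# unit ball; grad_f, hess_f, and sample_C are ad hoc names, not from the paper.
import numpy as np

rng = np.random.default_rng(0)
d = 3
grad_f = lambda x: -2 * x                 # gradient of f(x) = -||x||^2
hess_f = lambda x: -2 * np.eye(d)         # Hessian of f

def sample_C(n):
    """Draw n points uniformly from the unit ball (the set C in this toy example)."""
    u = rng.normal(size=(n, d))
    u /= np.linalg.norm(u, axis=1, keepdims=True)
    return u * rng.uniform(0, 1, size=(n, 1)) ** (1.0 / d)

x_star = np.zeros(d)                      # candidate stationary point
g, H = grad_f(x_star), hess_f(x_star)
X = sample_C(10_000)
Dirs = X - x_star
first_order = (Dirs @ g >= -1e-12).all()  # condition (2): grad f(x*)^T (x - x*) >= 0
tangent = np.abs(Dirs @ g) <= 1e-12       # directions with grad f(x*)^T (x - x*) = 0
quad = np.einsum("ij,jk,ik->i", Dirs[tangent], H, Dirs[tangent])
second_order = (quad >= -1e-12).all()     # condition (3)
print(first_order, second_order)          # True, False: an FOSP but not an SOSP
```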
As in the unconstrained setting, the first-order and second-order optimality conditions may not be satisfied in a finite number of iterations, so we focus on finding an approximate SOSP.

Definition 1. Recall the twice continuously differentiable function $f : \mathbb{R}^d \to \mathbb{R}$ and the closed convex set $C \subseteq \mathbb{R}^d$ introduced in Problem (1). We call $x^* \in C$ an $(\epsilon, \gamma)$-second-order stationary point of Problem (1) if the following conditions are satisfied:

$\nabla f(x^*)^\top (x - x^*) \ge -\epsilon$, for all $x \in C$,   (4)
$(x - x^*)^\top \nabla^2 f(x^*)(x - x^*) \ge -\gamma$, for all $x \in C$ s.t. $\nabla f(x^*)^\top (x - x^*) = 0$.   (5)

If a point only satisfies the first condition, we call it an $\epsilon$-first-order stationary point. We further formally define strict saddle points of the constrained optimization problem in (1).

Definition 2. A point $x^* \in C$ is a $\delta$-strict saddle point of Problem (1) if (i) for all $x \in C$ the condition $\nabla f(x^*)^\top (x - x^*) \ge 0$ holds, and (ii) there exists a point $y$ such that

$(y - x^*)^\top \nabla^2 f(x^*)(y - x^*) < -\delta$, with $y \in C$ and $\nabla f(x^*)^\top (y - x^*) = 0$.   (6)

According to Definitions 1 and 2, if all saddle points are $\delta$-strict and $\gamma \le \delta$, then any $(\epsilon, \gamma)$-SOSP of Problem (1) is an approximate local minimum. We emphasize that in this paper we do not assume that all saddles are strict in order to prove convergence to an SOSP. We formally define strict saddles only to clarify that if all saddles are strict, then convergence to an approximate SOSP is equivalent to convergence to an approximate local minimum. Our goal throughout the rest of the paper is to design an algorithm that finds an $(\epsilon, \gamma)$-SOSP of Problem (1). To do so, we first assume the following conditions are satisfied.

Assumption 1. The gradients $\nabla f$ are $L$-Lipschitz continuous over the set $C$, i.e., for any $x, \hat{x} \in C$,

$\|\nabla f(x) - \nabla f(\hat{x})\| \le L \|x - \hat{x}\|$.   (7)

Assumption 2. The Hessians $\nabla^2 f$ are $M$-Lipschitz continuous over the set $C$, i.e., for any $x, \hat{x} \in C$,

$\|\nabla^2 f(x) - \nabla^2 f(\hat{x})\| \le M \|x - \hat{x}\|$.   (8)

Assumption 3. The diameter of the compact convex set $C$ is bounded above by a constant $D$, i.e.,

$\max_{x, \hat{x} \in C} \|x - \hat{x}\| \le D$.   (9)

3 Main Result

In this section, we introduce a generic framework to reach an $(\epsilon, \gamma)$-SOSP of the nonconvex function $f$ over the convex set $C$, when $C$ has a specific structure, as we describe below. In particular, we focus on the case where we can solve a quadratic program (QP) of the form

$\text{minimize } x^\top A x + b^\top x + c, \quad \text{subject to } x \in C$,   (10)

up to a constant factor $\rho \le 1$ in a finite number of arithmetic operations. Here, $A \in \mathbb{R}^{d \times d}$ is a symmetric matrix, $b \in \mathbb{R}^d$ is a vector, and $c \in \mathbb{R}$ is a scalar. To clarify the notion of solving a problem up to a constant factor $\rho$, let $x^*$ be a global minimizer of (10). Then, we say Problem (10) is solved up to a constant factor $\rho \in (0, 1]$ if we have found a feasible solution $\bar{x} \in C$ such that

$x^{*\top} A x^* + b^\top x^* + c \;\le\; \bar{x}^\top A \bar{x} + b^\top \bar{x} + c \;\le\; \rho\,(x^{*\top} A x^* + b^\top x^* + c)$.   (11)

Note that here, w.l.o.g., we have assumed that the optimal objective value $x^{*\top} A x^* + b^\top x^* + c$ is non-positive. A larger constant $\rho$ implies that the approximate solution is more accurate. If $\bar{x}$ satisfies the condition in (11), we call it a $\rho$-approximate solution of Problem (10); indeed, if $\rho = 1$ then $\bar{x}$ is a global minimizer of Problem (10).

In Algorithm 1, we introduce a generic framework that achieves an $(\epsilon, \gamma)$-SOSP of Problem (1) whose running time is polynomial in $\epsilon^{-1}$, $\gamma^{-1}$, $\rho^{-1}$, and $d$, whenever we can find a $\rho$-approximate solution of a quadratic problem of the form (10) in time polynomial in $d$. The proposed scheme consists of two major stages.
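Before detailing the two stages, here is a minimal sketch of the $\rho$-approximation test in (11); the helper name `is_rho_approximate` and its inputs are ours, introduced only for illustration.

```python
def is_rho_approximate(q, x_bar, x_star, rho):
    """Check condition (11): q maps a point to the objective value
    x^T A x + b^T x + c of (10); x_star is a global minimizer and
    x_bar a candidate feasible point."""
    q_opt, q_bar = q(x_star), q(x_bar)
    assert q_opt <= 0, "(11) is stated assuming a non-positive optimal value"
    return q_opt <= q_bar <= rho * q_opt   # condition (11)
```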
In the first stage, as outlined in Steps 2-4, we use a first-order update, i.e., a gradient-based update, to find an $\epsilon$-FOSP; that is, we update the decision variable $x$ according to a first-order update until we reach a point $x_t$ that satisfies the condition

$\nabla f(x_t)^\top (x - x_t) \ge -\epsilon$, for all $x \in C$.   (12)

In Section 4, we study in detail projected gradient descent and conditional gradient algorithms for the first-order stage of the proposed framework. Interestingly, both of these algorithms require at most $\mathcal{O}(\epsilon^{-2})$ iterations to reach an $\epsilon$-first-order stationary point.

The second stage of the proposed scheme uses second-order information about the objective function $f$ to escape from the stationary point if it is a local maximum or a strict saddle point. To be more precise, if $x_t$ is a feasible point satisfying the condition (12), we aim to find a descent direction by solving the quadratic program

$\text{minimize } q(u) := (u - x_t)^\top \nabla^2 f(x_t)(u - x_t), \quad \text{subject to } u \in C, \; \nabla f(x_t)^\top (u - x_t) = 0$,   (13)

up to a constant factor $\rho \in (0, 1]$. To be more specific, if we define $q(u^*)$ as the optimal objective value of the program in (13), we focus on cases where we can obtain a feasible point $u_t$ that is a $\rho$-approximate solution of Problem (13), i.e., $u_t \in C$ satisfies the constraints in (13) and

$q(u^*) \le q(u_t) \le \rho\, q(u^*)$.   (14)

The problem formulation in (13) can be transformed into the quadratic program in (10); see Section 5 for more details. Note that the constant $\rho$ is independent of $\epsilon$, $\gamma$, and $d$, and only depends on the structure of the convex set $C$. For instance, if $C$ is defined in terms of $m$ quadratic constraints, one can find a $\rho = m^{-2}$ approximate solution of (13) after at most $\mathcal{O}(m d^3)$ arithmetic operations (Section 5).

After computing a feasible point $u_t$ satisfying the condition in (14), we check the quadratic objective value at $u_t$, and if the inequality $q(u_t) < -\rho\gamma$ holds, we follow the update

$x_{t+1} = (1 - \sigma) x_t + \sigma u_t$,   (15)

where $\sigma$ is a positive stepsize. Otherwise, we stop the process and return $x_t$ as an $(\epsilon, \gamma)$-second-order stationary point of Problem (1). To verify this claim, note that Algorithm 1 stops when we reach a point $x_t$ that satisfies the first-order stationarity condition $\nabla f(x_t)^\top (x - x_t) \ge -\epsilon$ and for which the objective value of the $\rho$-approximate solution of the quadratic subproblem is at least $-\rho\gamma$, i.e., $q(u_t) \ge -\rho\gamma$. The second condition, together with the fact that $q(u_t)$ satisfies (14), implies that $q(u^*) \ge -\gamma$. Therefore, for any $x \in C$ with $\nabla f(x_t)^\top (x - x_t) = 0$, it holds that

$(x - x_t)^\top \nabla^2 f(x_t)(x - x_t) \ge -\gamma$.   (16)

Algorithm 1 Generic framework for escaping saddles in constrained optimization
Require: Stepsize σ > 0. Initialize x_0 ∈ C
1: for t = 1, 2, . . . do
2:   if x_t is not an ε-first-order stationary point then
3:     Compute x_{t+1} using first-order information (Frank-Wolfe or projected gradient descent)
4:   else
5:     Find u_t: a ρ-approximate solution of (13)
6:     if q(u_t) < −ργ then
7:       Compute the updated variable x_{t+1} = (1 − σ)x_t + σu_t
8:     else
9:       Return x_t and stop
10:    end if
11:  end if
12: end for

These two observations show that the outcome of the proposed framework in Algorithm 1 is an $(\epsilon, \gamma)$-SOSP of Problem (1).
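The loop structure of Algorithm 1 can be summarized in the following Python sketch. The three oracles, `fosp_gap` (the linear program of Remark 1 returning $\min_{v \in C} \nabla f(x)^\top (v - x)$), `first_order_step` (a Frank-Wolfe or projected gradient update, Section 4), and `solve_qp` (a $\rho$-approximate solver for (13), Section 5), are placeholders of our own naming; the paper leaves them abstract at this level of generality.

```python
def escape_saddles(x, fosp_gap, first_order_step, solve_qp,
                   eps, gamma, rho, sigma, max_iter=10**6):
    """Schematic rendering of Algorithm 1 (a sketch, not the paper's code)."""
    for _ in range(max_iter):
        if fosp_gap(x) < -eps:                 # x is not yet an eps-FOSP
            x = first_order_step(x)            # Steps 2-4: first-order update
        else:
            u, q_u = solve_qp(x)               # rho-approximate solution of (13)
            if q_u < -rho * gamma:             # descent direction found
                x = (1 - sigma) * x + sigma * u    # update (15)
            else:
                return x                       # x is an (eps, gamma)-SOSP
    return x
```

Per Proposition 4 below, the second-order stepsize would be set to $\sigma = \rho\gamma/(M D^3)$.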
Now it remains to characterize the number of iterations that Algorithm 1 performs before reaching an $(\epsilon, \gamma)$-SOSP, which we formally state in the following theorem.

Theorem 1. Consider the optimization problem defined in (1), and suppose that the conditions in Assumptions 1-3 are satisfied. If the first-order stage, i.e., Steps 2-4, uses the update of Frank-Wolfe or projected gradient descent, then the generic framework proposed in Algorithm 1 finds an $(\epsilon, \gamma)$-second-order stationary point of Problem (1) after at most $\mathcal{O}(\max\{\epsilon^{-2}, \rho^{-3}\gamma^{-3}\})$ iterations.

The result in Theorem 1 shows that if the convex constraint set $C$ is such that one can solve the quadratic subproblem in (13) $\rho$-approximately, then the proposed generic framework finds an $(\epsilon, \gamma)$-SOSP of Problem (1) after at most $\mathcal{O}(\epsilon^{-2})$ first-order and $\mathcal{O}(\rho^{-3}\gamma^{-3})$ second-order updates. To prove the claim in Theorem 1, we first review the first-order conditional gradient and projected gradient algorithms and show that if the current iterate is not a first-order stationary point, following either of these updates decreases the objective value by a constant of $\mathcal{O}(\epsilon^2)$ (Section 4). We then focus on the second stage of Algorithm 1, which corresponds to the case where the current iterate is an $\epsilon$-FOSP and we need to solve the quadratic program in (13) approximately (Section 5). In this case, we show that if the iterate is not an $(\epsilon, \gamma)$-SOSP, following the update in (15) decreases the objective value by at least a constant of $\mathcal{O}(\rho^3\gamma^3)$. Combining these two results shows that Algorithm 1 finds an $(\epsilon, \gamma)$-SOSP after at most $\mathcal{O}(\max\{\epsilon^{-2}, \rho^{-3}\gamma^{-3}\})$ iterations.

4 First-Order Step: Convergence to a First-Order Stationary Point

In this section, we study two different first-order methods for the first stage of Algorithm 1. The results in this section can also be used independently for convergence to an FOSP of Problem (1) satisfying

$\nabla f(x^*)^\top (x - x^*) \ge -\epsilon$, for all $x \in C$,   (17)

where $\epsilon > 0$ is a positive constant. Although for Algorithm 1 we assume that $C$ has the specific structure mentioned in (10), the results in this section hold for any closed and compact convex set $C$. To keep our results as general as possible, in this section we study both conditional gradient and projection-based methods for the first stage of the proposed generic framework.

4.1 Conditional gradient update

The conditional gradient (Frank-Wolfe) update has two steps. We first solve the linear program

$v_t = \operatorname{argmax}_{v \in C} \{-\nabla f(x_t)^\top v\}$.   (18)

Then, we compute the updated variable $x_{t+1}$ according to the update

$x_{t+1} = (1 - \eta) x_t + \eta v_t$,   (19)

where $\eta$ is a stepsize. In the following proposition, we show that if the current iterate is not an $\epsilon$-first-order stationary point, then updating the variable according to (18)-(19) decreases the objective value. The proof of the following proposition is adapted from [31].

Proposition 2. Consider the optimization problem in (1), and suppose Assumptions 1 and 3 hold. Set the stepsize in (19) to $\eta = \epsilon/(D^2 L)$. Then, if the iterate $x_t$ at step $t$ is not an $\epsilon$-first-order stationary point, the objective value at the updated variable $x_{t+1}$ satisfies the inequality

$f(x_{t+1}) \le f(x_t) - \dfrac{\epsilon^2}{2 L D^2}$.   (20)

The result in Proposition 2 shows that following the update of the conditional gradient method decreases the objective value by $\mathcal{O}(\epsilon^2)$ if an $\epsilon$-FOSP has not been reached.

Remark 1. In Step 3 of Algorithm 1, we first check whether $x_t$ is an $\epsilon$-FOSP. This can be done by evaluating

$\max_{x \in C} \{-\nabla f(x_t)^\top (x - x_t)\} = \max_{x \in C} \{-\nabla f(x_t)^\top x\} + \nabla f(x_t)^\top x_t$   (21)

and comparing the optimal value with $\epsilon$. Note that the linear program in (21) is the same as the one in (18). Therefore, by checking the first-order optimality condition at $x_t$, the variable $v_t$ is already computed, and we need to solve only one linear program per iteration.
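For concreteness, the following sketch instantiates the update (18)-(19) and the stationarity check of Remark 1 for the special case $C = \{v : \|v\| \le R\}$, where the linear program (18) has the closed-form solution $v = -R\,\nabla f(x)/\|\nabla f(x)\|$; the function name and this choice of $C$ are our own illustration, not from the paper.

```python
import numpy as np

def frank_wolfe_step(x, grad, R, eps, L, D):
    """One conditional-gradient step (18)-(19) over the ball C = {||v|| <= R},
    assuming grad(x) != 0. Returns (new_x, gap); the gap doubles as the
    eps-FOSP test of Remark 1."""
    g = grad(x)
    v = -R * g / np.linalg.norm(g)       # closed-form solution of (18) for the ball
    gap = g @ (v - x)                    # min over C of grad^T (x' - x)
    if gap >= -eps:                      # x already satisfies (12): stop
        return x, gap
    eta = eps / (D**2 * L)               # stepsize of Proposition 2
    return (1 - eta) * x + eta * v, gap
```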
4.2 Projected gradient update

The projected gradient descent (PGD) update consists of two steps: (i) descending along the gradient direction and (ii) projecting the updated variable onto the convex constraint set. These two steps can be combined, and the update can be written explicitly as

$x_{t+1} = \Pi_C\{x_t - \eta \nabla f(x_t)\}$,   (22)

where $\Pi_C(\cdot)$ denotes Euclidean projection onto the convex set $C$ and $\eta$ is a positive stepsize. In the following proposition, we first show that following the PGD update decreases the objective value by a constant until we reach an $\epsilon$-FOSP. Further, we show that the number of iterations PGD requires to reach an $\epsilon$-FOSP is $\mathcal{O}(\epsilon^{-2})$.

Proposition 3. Consider Problem (1). Suppose Assumptions 1 and 3 are satisfied, and further assume that the gradients $\nabla f(x)$ are uniformly bounded by $K$ for all $x \in C$. If the stepsize of the projected gradient descent method defined in (22) is set to $\eta = 1/L$, the objective value decreases by

$f(x_{t+1}) \le f(x_t) - \dfrac{L \epsilon^2}{2 (K + L D)^2}$.   (23)

Moreover, the iterates reach a first-order stationary point satisfying (17) after at most $\mathcal{O}(\epsilon^{-2})$ iterations.

Proposition 3 shows that following the PGD update decreases the function value by $\mathcal{O}(\epsilon^2)$ until we reach an $\epsilon$-FOSP. It further shows that PGD obtains an $\epsilon$-FOSP satisfying (17) after at most $\mathcal{O}(\epsilon^{-2})$ iterations. To the best of our knowledge, this result is also novel, since the only convergence guarantee for PGD in [26] is in terms of the number of iterations needed to reach a point with gradient mapping norm less than $\epsilon$, while our result characterizes the number of iterations needed to satisfy (17).

Remark 2. To use the PGD update in the first stage of Algorithm 1, one needs a criterion for checking whether $x_t$ is an $\epsilon$-FOSP. However, in PGD we do not solve the linear program $\min_{x \in C}\{\nabla f(x_t)^\top (x - x_t)\}$. This issue can be resolved by checking the condition $\|x_t - x_{t+1}\| \le \epsilon/(K + LD)$, which is a sufficient condition for the condition in (17). In other words, if this condition holds, we stop and $x_t$ is an $\epsilon$-FOSP; otherwise, the result in (23) holds and the function value decreases. For more details, see the proof of Proposition 3.

5 Second-Order Step: Escape from Saddle Points

In this section, we study the second stage of the framework in Algorithm 1, which corresponds to the case where the current iterate is an $\epsilon$-FOSP. Note that when we reach such a critical point, the goal is to find a feasible point $u \in C$ in the tangent space $\nabla f(x_t)^\top (u - x_t) = 0$ that makes the inner product $(u - x_t)^\top \nabla^2 f(x_t)(u - x_t)$ smaller than $-\rho\gamma$. To achieve this goal, we need to check the minimum value of this inner product over the constraints, i.e., we need to solve the quadratic program in (13) up to a constant factor $\rho \in (0, 1]$. In the following proposition, we show that the update in (15) decreases the objective value whenever the condition $q(u_t) < -\rho\gamma$ holds.

Proposition 4. Consider the quadratic program in (13), and let $u_t$ be a $\rho$-approximate solution of the quadratic subproblem in (13). Suppose that Assumptions 2 and 3 hold, and set the stepsize to $\sigma = \rho\gamma/(M D^3)$. If the quadratic objective value $q$ evaluated at $u_t$ satisfies the condition $q(u_t) < -\rho\gamma$, then the updated variable in (15) satisfies the inequality

$f(x_{t+1}) \le f(x_t) - \dfrac{\rho^3 \gamma^3}{3 M^2 D^6}$.   (24)
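In code, the escape step of Proposition 4 is a one-liner around the update (15); the function name below is ours, and the inputs are the quantities defined above.

```python
def second_order_step(x, u, q_u, rho, gamma, M, D):
    """Escape update (15) with the stepsize sigma = rho*gamma/(M*D^3) of
    Proposition 4. Returns None when q(u) >= -rho*gamma, i.e., when x already
    satisfies the approximate second-order condition (16)."""
    if q_u >= -rho * gamma:
        return None                       # stop: x is an (eps, gamma)-SOSP
    sigma = rho * gamma / (M * D**3)
    # Proposition 4: this step decreases f by at least rho^3 gamma^3 / (3 M^2 D^6)
    return (1 - sigma) * x + sigma * u
```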
The only unanswered question is how to solve the quadratic subproblem in (13) up to a constant factor $\rho \in (0, 1]$. For general $C$, the quadratic subproblem could be NP-hard [34]; however, for some special choices of the convex constraint set $C$, this quadratic program (QP) can be solved either exactly or approximately up to a constant factor. In the following section, we focus on the quadratically constrained case, but there are indeed other classes of constraints that satisfy our required condition.

5.1 Quadratic constraints case

In this section, we focus on the case where the constraint set $C$ is defined as the intersection of $m$ ellipsoids centered at the origin.¹ In particular, assume that the set $C$ is given by

$C := \{x \in \mathbb{R}^d \mid x^\top Q_i x \le 1, \text{ for all } i = 1, \dots, m\}$,   (25)

where $Q_i \in \mathbb{S}^d_{+}$. Under this assumption, the QP in (13) can be written as

$\min_u \; (u - x_t)^\top \nabla^2 f(x_t)(u - x_t) \quad \text{s.t.} \quad u^\top Q_i u \le 1 \text{ for } i = 1, \dots, m, \;\; \nabla f(x_t)^\top (u - x_t) = 0$.   (26)

Note that the equality constraint $\nabla f(x_t)^\top (u - x_t) = 0$ does not change the hardness of the problem and can easily be eliminated. To do so, first define a new optimization variable $z := u - x_t$ to obtain

$\min_z \; z^\top \nabla^2 f(x_t) z \quad \text{s.t.} \quad (z + x_t)^\top Q_i (z + x_t) \le 1 \text{ for } i = 1, \dots, m, \;\; \nabla f(x_t)^\top z = 0$.   (27)

Then, find a basis for the tangent space $\nabla f(x_t)^\top z = 0$. Indeed, using the Gram-Schmidt procedure, we can find an orthonormal basis for the space $\mathbb{R}^d$ of the form $\{v_1, \dots, v_{d-1}, \nabla f(x_t)/\|\nabla f(x_t)\|\}$ at a complexity of $\mathcal{O}(d^3)$. If we define $A = [v_1, \dots, v_{d-1}] \in \mathbb{R}^{d \times (d-1)}$ as the concatenation of the vectors $\{v_1, \dots, v_{d-1}\}$, then any vector $z$ satisfying $\nabla f(x_t)^\top z = 0$ can be written as $z = A y$, where $y \in \mathbb{R}^{d-1}$. Hence, (27) is equivalent to

$\min_y \; y^\top A^\top \nabla^2 f(x_t) A y \quad \text{s.t.} \quad (A y + x_t)^\top Q_i (A y + x_t) \le 1 \text{ for } i = 1, \dots, m$.   (28)

This procedure reduces the dimension of the problem from $d$ to $d - 1$. It is not hard to check that the center of the ellipsoids in (28) is $-A^\top x_t$. By the simple change of variable $A\hat{y} := A y + x_t$ we obtain

$\min_{\hat{y}} \; \hat{y}^\top A^\top \nabla^2 f(x_t) A \hat{y} - 2 x_t^\top \nabla^2 f(x_t) A \hat{y} + x_t^\top \nabla^2 f(x_t) x_t \quad \text{s.t.} \quad \hat{y}^\top A^\top Q_i A \hat{y} \le 1 \text{ for } i = 1, \dots, m$.   (29)

Define the matrices $\bar{Q}_i := A^\top Q_i A$ and $B_t := A^\top \nabla^2 f(x_t) A$, the vector $s_t := -2 A^\top \nabla^2 f(x_t) x_t$, and the scalar $c_t := x_t^\top \nabla^2 f(x_t) x_t$. Using these definitions, the problem reduces to

$\min_{\hat{y}} \; q(\hat{y}) := \hat{y}^\top B_t \hat{y} + s_t^\top \hat{y} + c_t \quad \text{s.t.} \quad \hat{y}^\top \bar{Q}_i \hat{y} \le 1 \text{ for } i = 1, \dots, m$.   (30)

Note that the matrices $\bar{Q}_i \in \mathbb{S}^{d-1}_{+}$ are positive semidefinite, while the matrix $B_t \in \mathbb{S}^{d-1}$ might be indefinite. Indeed, the optimal objective value of the program in (30) equals the optimal objective value of (26). Further, note that if we find a $\rho$-approximate solution $\hat{y}^*$ of (30), we can recover a $\rho$-approximate solution $u^*$ of (26) via the transformation $u^* = A\hat{y}^*$.

¹To simplify the constant-factor approximation, we assume the ellipsoids are centered at the origin. If we drop this assumption, then $\rho$ will depend on the maximum distance between the origin and the boundary of each of the ellipsoids; see, e.g., equation (6) in [44].

The program in (30) is a specific quadratically constrained quadratic program (QCQP) in which all the constraints are centered at 0. For the specific case of $m = 1$, the duality gap of this problem is zero, and simply by transferring the problem to the dual domain one can solve Problem (30) exactly.
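The reduction from (26) to (30) is mechanical; the following numpy/scipy sketch mirrors the transformation above at face value (the function name and input conventions are ours).

```python
import numpy as np
from scipy.linalg import null_space

def reduce_qcqp(H, g, x_t, Qs):
    """Eliminate the equality constraint grad^T (u - x_t) = 0 from (26) and
    return the data (B_t, s_t, c_t, Q_bars) of the lower-dimensional QCQP (30).
    H = Hessian at x_t, g = gradient at x_t, Qs = list of ellipsoid matrices."""
    A = null_space(g.reshape(1, -1))      # d x (d-1) orthonormal basis of g^T z = 0
    B = A.T @ H @ A                       # B_t = A^T Hessian A
    s = -2 * A.T @ H @ x_t                # s_t from the expansion in (29)
    c = x_t @ H @ x_t                     # c_t = x_t^T Hessian x_t
    Q_bars = [A.T @ Q @ A for Q in Qs]    # constraint matrices of (30)
    return B, s, c, Q_bars                # a solution y_hat maps back via u = A @ y_hat
```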
In the following proposition, we focus on the general case of $m \ge 1$ and explain how to find a $\rho$-approximate solution of (30).

Proposition 5. Consider Problem (30), and define $q_{\min}$ as its minimum objective value. Based on the result in [22], there exists a polynomial-time method that obtains a point $\hat{y}^*$ satisfying

$q(\hat{y}^*) \;\le\; \dfrac{1}{m^2 (1 + \bar{\epsilon})^2}\, q_{\min} + \left(1 - \dfrac{1}{m^2 (1 + \bar{\epsilon})^2}\right) x_t^\top \nabla^2 f(x_t) x_t$   (31)

after at most $\mathcal{O}(d^3 (m \log(1/\delta) + \log(1/\bar{\epsilon}) + \log d))$ arithmetic operations, where $\delta$ is the ratio of the radius of the largest inscribed sphere to that of the smallest circumscribing sphere of the feasible set. Further, based on [44], using an SDP relaxation of (30) one can find a point $\hat{y}^*$ such that

$q(\hat{y}^*) \;\le\; \dfrac{1}{m}\, q_{\min} + \left(1 - \dfrac{1}{m}\right) x_t^\top \nabla^2 f(x_t) x_t$.   (32)

Proof. If we define the function $\bar{q}$ as $\bar{q}(x) := q(x) - c_t$, then using the approaches in [22] and [44] we can find a $\rho$-approximate solution of $\min_{\hat{y}} \bar{q}(\hat{y})$ subject to $\hat{y}^\top \bar{Q}_i \hat{y} \le 1$ for $i = 1, \dots, m$. In other words, we can find a point $\hat{y}^*$ such that $\bar{q}(\hat{y}^*) \le \rho\, \bar{q}_{\min}$, where $0 < \rho < 1$ and $\bar{q}_{\min}$ is the minimum objective value of $\bar{q}$ over the constraint set, which satisfies $\bar{q}_{\min} = q_{\min} - c_t$. Replacing $\bar{q}(\hat{y}^*)$ and $\bar{q}_{\min}$ by their definitions and regrouping the terms implies that $\hat{y}^*$ satisfies the condition $q(\hat{y}^*) \le \rho\, q_{\min} + (1 - \rho) c_t$. Replacing $\rho$ by $1/(m^2(1 + \bar{\epsilon})^2)$ (the constant-factor approximation shown in [22]) leads to the claim in (31), and substituting $\rho$ by $1/m$ (the approximation bound in [44]) implies the result in (32).

The result in Proposition 5 indicates that if $x_t^\top \nabla^2 f(x_t) x_t$ is non-positive, then one can find a $\rho$-approximate solution of Problem (30), and consequently of Problem (26). This condition is satisfied if we assume that $\max_{x \in C} x^\top \nabla^2 f(x) x \le 0$; for instance, it holds for a concave minimization problem over the convex set $C$. In fact, it can be shown that our analysis still stands even if $\max_{x \in C} x^\top \nabla^2 f(x) x$ is at most $\mathcal{O}(\gamma)$. Note that this condition is significantly weaker than requiring the function to be concave when restricted to the feasible set: it essentially requires that the quadratic term in the Taylor expansion of the function evaluated at the origin be negative (or not too positive).

Corollary 1. Consider a convex set $C$ defined as the intersection of $m \ge 1$ ellipsoids centered at the origin, and assume that the objective function Hessian $\nabla^2 f$ satisfies the condition $\max_{x \in C} x^\top \nabla^2 f(x) x \le 0$. Then, for $\rho = 1/m^2$ (via [22]) and $\rho = 1/m$ (via [44]), it is possible to find a $\rho$-approximate solution of Problem (13) in time polynomial in $m$ and $d$.

By using the approach in [22], we can solve the QCQP in (29) with the approximation factor $\rho = 1/m^2$ for $m \ge 1$ at an overall complexity of $\mathcal{O}(m d^3)$ when the constraint set $C$ is defined by $m$ convex quadratic constraints. As the total number of calls to the second-order stage is at most $\mathcal{O}(\rho^{-3}\gamma^{-3}) = \mathcal{O}(m^6 \gamma^{-3})$, we obtain that the total number of arithmetic operations for the second-order stage is at most $\mathcal{O}(m^7 d^3 \gamma^{-3})$. The constant factor $\rho$ can be improved to $1/m$ if we solve the SDP relaxation problem suggested in [44].

6 Stochastic Extension

In this section, we focus on stochastic constrained minimization problems. Consider the optimization problem in (1) when the objective function $f$ is defined as an expectation of a set of stochastic functions $F : \mathbb{R}^d \times \mathbb{R}^r \to \mathbb{R}$ with inputs $x \in \mathbb{R}^d$ and $\theta \in \mathbb{R}^r$, where $\theta$ is a random variable with probability distribution $P$. To be more precise, we consider the optimization problem

$\text{minimize } f(x) := \mathbb{E}_\theta[F(x, \theta)], \quad \text{subject to } x \in C$.   (33)

Our goal is to find a point that satisfies the necessary optimality conditions with high probability.
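Algorithm 2 below replaces the exact gradient and Hessian with mini-batch estimates $d_t$ and $H_t$; a sketch of such estimators follows, where the sampler `draw_theta` and the per-sample oracles `grad_F`, `hess_F` are assumed interfaces of our own naming.

```python
import numpy as np

def stochastic_oracles(x, grad_F, hess_F, draw_theta, b_g, b_H):
    """Return the mini-batch estimates d_t = (1/b_g) sum_i grad F(x, theta_i)
    and H_t = (1/b_H) sum_i hess F(x, theta_i) used in Algorithm 2."""
    d_t = np.mean([grad_F(x, draw_theta()) for _ in range(b_g)], axis=0)
    H_t = np.mean([hess_F(x, draw_theta()) for _ in range(b_H)], axis=0)
    return d_t, H_t
```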
Algorithm 2 Stochastic variant of the proposed framework
Require: Stepsize σ > 0. Initialize x_0 ∈ C
1: for t = 1, 2, . . . do
2:   Compute v_t = argmax_{v∈C} {−d_t^⊤ v}
3:   if d_t^⊤(v_t − x_t) < −ε/2 then
4:     Compute x_{t+1} = (1 − η)x_t + ηv_t
5:   else
6:     Find u_t: a ρ-approximate solution of min_u (u − x_t)^⊤ H_t (u − x_t) s.t. u ∈ C, d_t^⊤(u − x_t) ≤ r
7:     if q(u_t) < −ργ/2 then
8:       Compute the updated variable x_{t+1} = (1 − σ)x_t + σu_t
9:     else
10:      Return x_t and stop
11:    end if
12:  end if
13: end for

Consider the vector $d_t = (1/b_g) \sum_{i=1}^{b_g} \nabla F(x_t, \theta_i)$ and the matrix $H_t = (1/b_H) \sum_{i=1}^{b_H} \nabla^2 F(x_t, \theta_i)$ as stochastic approximations of the gradient $\nabla f(x_t)$ and Hessian $\nabla^2 f(x_t)$, respectively. Here, $b_g$ and $b_H$ are the gradient and Hessian batch sizes, respectively, and the vectors $\theta_i$ are realizations of the random variable $\theta$. In Algorithm 2, we present the stochastic variant of our proposed scheme for finding an $(\epsilon, \gamma)$-SOSP of Problem (33). Algorithm 2 differs from Algorithm 1 in using the stochastic gradients $d_t$ and Hessians $H_t$ in lieu of the exact gradients $\nabla f(x_t)$ and Hessians $\nabla^2 f(x_t)$. The second major difference is the inequality constraint in Step 6: instead of the constraint $d_t^\top (u - x_t) = 0$, we use $d_t^\top (u - x_t) \le r$, where $r > 0$ is a properly chosen constant. This modification is needed to ensure that if a point satisfies this constraint, then with high probability it also satisfies the constraint $\nabla f(x_t)^\top (u - x_t) = 0$. It also implies that we need to handle a linear inequality constraint instead of a linear equality constraint, which is computationally manageable for some constraint sets, including the case where $C$ is a single ball constraint [28]. To prove our main result, we assume that the following conditions also hold.

Assumption 4. The variances of the stochastic gradients and Hessians are uniformly bounded by constants $\sigma^2$ and $\zeta^2$, respectively, i.e., for any $x \in C$ and $\theta$ we can write

$\mathbb{E}\big[\|\nabla F(x, \theta) - \nabla f(x)\|^2\big] \le \sigma^2, \qquad \mathbb{E}\big[\|\nabla^2 F(x, \theta) - \nabla^2 f(x)\|^2\big] \le \zeta^2$.

The required conditions in Assumption 4 ensure that the variances of the stochastic gradients and Hessians are uniformly bounded above, which is customary in stochastic optimization. In the following theorem, we characterize the iteration complexity of Algorithm 2 for reaching an $(\epsilon, \gamma)$-SOSP of Problem (33) with high probability.

Theorem 2. Consider the optimization problem in (33), and suppose the conditions in Assumptions 1-4 are satisfied. If the batch sizes are $b_g = \mathcal{O}(\max\{\rho^{-4}\gamma^{-4}, \epsilon^{-2}\})$ and $b_H = \mathcal{O}(\rho^{-2}\gamma^{-2})$, and we set the parameter $r = \mathcal{O}(\rho^2\gamma^2)$, then the outcome of the proposed framework outlined in Algorithm 2 is an $(\epsilon, \gamma)$-second-order stationary point of Problem (33) with high probability. Further, the total number of iterations to reach such a point is at most $\mathcal{O}(\max\{\epsilon^{-2}, \rho^{-3}\gamma^{-3}\})$ with high probability.

The result in Theorem 2 indicates that the total number of iterations to reach an $(\epsilon, \gamma)$-SOSP is at most $\mathcal{O}(\max\{\epsilon^{-2}, \rho^{-3}\gamma^{-3}\})$. As each iteration requires at most $\mathcal{O}(\max\{\rho^{-4}\gamma^{-4}, \epsilon^{-2}\})$ stochastic gradient and $\mathcal{O}(\rho^{-2}\gamma^{-2})$ stochastic Hessian evaluations, the total numbers of stochastic gradient and Hessian computations to reach an $(\epsilon, \gamma)$-SOSP are $\mathcal{O}(\max\{\epsilon^{-2}\rho^{-4}\gamma^{-4}, \epsilon^{-4}, \rho^{-7}\gamma^{-7}\})$ and $\mathcal{O}(\max\{\epsilon^{-2}\rho^{-3}\gamma^{-3}, \rho^{-5}\gamma^{-5}\})$, respectively.

Acknowledgment

This work was supported by the DARPA Lagrange and ONR BRC programs. The authors would like to thank Yue Sun for pointing out a missing condition in the first draft of the paper.

References

[1] N. Agarwal, Z. Allen-Zhu, B. Bullins, E. Hazan, and T. Ma. Finding approximate local minima faster than gradient descent. In STOC, pages 1195-1199, 2017.
[2] Z. Allen-Zhu. Natasha 2: Faster non-convex optimization than SGD. CoRR, abs/1708.08694, 2017.
[3] Z. Allen-Zhu and E. Hazan. Variance reduction for faster non-convex optimization. In ICML, pages 699-707, 2016.
[4] D. P. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, 1999.
[5] W. Bian, X. Chen, and Y. Ye.
Complexity analysis of interior point algorithms for non-Lipschitz and nonconvex minimization. Math. Program., 149(1-2):301-327, 2015.
[6] J. V. Burke, J. J. Moré, and G. Toraldo. Convergence properties of trust region methods for linear and convex constraints. Math. Program., 47(1-3):305-336, 1990.
[7] Y. Carmon, J. C. Duchi, O. Hinder, and A. Sidford. Accelerated methods for non-convex optimization. CoRR, abs/1611.00756, 2016.
[8] Y. Carmon, J. C. Duchi, O. Hinder, and A. Sidford. "Convex until proven guilty": Dimension-free acceleration of gradient descent on non-convex functions. In ICML, pages 654-663, 2017.
[9] Y. Carmon, J. C. Duchi, O. Hinder, and A. Sidford. Lower bounds for finding stationary points I. arXiv preprint arXiv:1710.11606, 2017.
[10] Y. Carmon, J. C. Duchi, O. Hinder, and A. Sidford. Lower bounds for finding stationary points II: First-order methods. arXiv preprint arXiv:1711.00841, 2017.
[11] C. Cartis, N. Gould, and P. Toint. Adaptive cubic regularisation methods for unconstrained optimization. Part I: Motivation, convergence and numerical results. Math. Program., 127(2):245-295, 2011.
[12] C. Cartis, N. Gould, and P. Toint. Adaptive cubic regularisation methods for unconstrained optimization. Part II: Worst-case function- and derivative-evaluation complexity. Math. Program., 130(2):295-319, 2011.
[13] C. Cartis, N. Gould, and P. Toint. Complexity bounds for second-order optimality in unconstrained optimization. J. Complexity, 28(1):93-108, 2012.
[14] C. Cartis, N. Gould, and P. Toint. An adaptive cubic regularization algorithm for nonconvex optimization with convex constraints and its function-evaluation complexity. IMA Journal of Numerical Analysis, 32(4):1662-1695, 2012.
[15] C. Cartis, N. Gould, and P. Toint. On the evaluation complexity of cubic regularization methods for potentially rank-deficient nonlinear least-squares problems and its relevance to constrained nonlinear optimization. SIAM J. Opt., 23(3):1553-1574, 2013.
[16] C. Cartis, N. Gould, and P. Toint. On the evaluation complexity of constrained nonlinear least-squares and general constrained nonlinear optimization using second-order methods. SIAM Journal on Numerical Analysis, 53(2):836-851, 2015.
[17] C. Cartis, N. Gould, and P. Toint. Second-order optimality and beyond: Characterization and evaluation complexity in convexly constrained nonlinear optimization. Foundations of Computational Mathematics, pages 1-35, 2017.
[18] A. R. Conn, N. Gould, A. Sartenaer, and P. Toint. Global convergence of a class of trust region algorithms for optimization using inexact projections on convex constraints. SIAM J. on Opt., 3(1):164-221, 1993.
[19] F. E. Curtis, D. P. Robinson, and M. Samadi. A trust region algorithm with a worst-case iteration complexity of $\mathcal{O}(\epsilon^{-3/2})$ for nonconvex optimization. Math. Program., 162(1-2):1-32, 2017.
[20] G. Di Pillo, S. Lucidi, and L. Palagi. Convergence to second-order stationary points of a primal-dual algorithm model for nonlinear programming. Mathematics of Operations Research, 30(4):897-915, 2005.
[21] F. Facchinei and S. Lucidi. Convergence to second order stationary points in inequality constrained optimization. Mathematics of Operations Research, 23(3):746-766, 1998.
[22] M. Fu, Z.-Q. Luo, and Y. Ye. Approximation algorithms for quadratic programming. Journal of Combinatorial Optimization, 2(1):29-50, 1998.
[23] R. Ge, F. Huang, C. Jin, and Y. Yuan. Escaping from saddle points: online stochastic gradient for tensor decomposition.
In COLT, pages 797-842, 2015.
[24] R. Ge, J. Lee, and T. Ma. Matrix completion has no spurious local minimum. In NIPS, pages 2973-2981, 2016.
[25] S. Ghadimi and G. Lan. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Math. Program., 156(1-2):59-99, 2016.
[26] S. Ghadimi, G. Lan, and H. Zhang. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Math. Program., 155(1-2):267-305, 2016.
[27] G. Haeser, H. Liu, and Y. Ye. Optimality condition and complexity analysis for linearly-constrained optimization without differentiability on the boundary. Math. Program., pages 1-37, 2017.
[28] V. Jeyakumar and G. Li. Trust-region problems with linear inequality constraints: exact SDP relaxation, global optimality and robust optimization. Mathematical Programming, 147(1-2):171-206, 2014.
[29] C. Jin, R. Ge, P. Netrapalli, S. M. Kakade, and M. I. Jordan. How to escape saddle points efficiently. In ICML, pages 1724-1732, 2017.
[30] C. Jin, P. Netrapalli, and M. I. Jordan. Accelerated gradient descent escapes saddle points faster than gradient descent. CoRR, abs/1711.10456, 2017.
[31] S. Lacoste-Julien. Convergence rate of Frank-Wolfe for non-convex objectives. arXiv preprint arXiv:1607.00345, 2016.
[32] L. Lei, C. Ju, J. Chen, and M. I. Jordan. Non-convex finite-sum optimization via SCSG methods. In Advances in Neural Information Processing Systems 30, pages 2345-2355, 2017.
[33] J. M. Martínez and M. Raydan. Cubic-regularization counterpart of a variable-norm trust-region method for unconstrained minimization. J. Global Optimization, 68(2):367-385, 2017.
[34] K. G. Murty and S. N. Kabadi. Some NP-complete problems in quadratic and nonlinear programming. Math. Program., 39(2):117-129, 1987.
[35] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer Science & Business Media, 2013.
[36] Y. Nesterov and B. T. Polyak. Cubic regularization of Newton method and its global performance. Math. Program., 108(1):177-205, 2006.
[37] S. Paternain, A. Mokhtari, and A. Ribeiro. A second order method for nonconvex optimization. arXiv preprint arXiv:1707.08028, 2017.
[38] S. J. Reddi, A. Hefny, S. Sra, B. Póczos, and A. J. Smola. Stochastic variance reduction for nonconvex optimization. In ICML, pages 314-323, 2016.
[39] S. J. Reddi, S. Sra, B. Póczos, and A. J. Smola. Fast incremental method for smooth nonconvex optimization. In IEEE Conference on Decision and Control, CDC, pages 1971-1977, 2016.
[40] S. J. Reddi, M. Zaheer, S. Sra, B. Póczos, F. Bach, R. Salakhutdinov, and A. J. Smola. A generic approach for escaping saddle points. In AISTATS, pages 1233-1242, 2018.
[41] C. W. Royer and S. J. Wright. Complexity analysis of second-order line-search algorithms for smooth nonconvex optimization. arXiv preprint arXiv:1706.03131, 2017.
[42] J. Sun, Q. Qu, and J. Wright. A geometric analysis of phase retrieval. In IEEE International Symposium on Information Theory, ISIT 2016, pages 2379-2383, 2016.
[43] J. Sun, Q. Qu, and J. Wright. Complete dictionary recovery over the sphere I: Overview and the geometric picture. IEEE Trans. Information Theory, 63(2):853-884, 2017.
[44] P. Tseng. Further results on approximating nonconvex quadratic optimization by semidefinite programming relaxation. SIAM Journal on Optimization, 14(1):268-283, 2003.
[45] Y. Xu and T. Yang. First-order stochastic algorithms for escaping from saddle points in almost linear time. arXiv preprint arXiv:1711.01944, 2017.
[46] Y. Ye.
On affine scaling algorithms for nonconvex quadratic programming. Math. Program., 56(1-3):285-300, 1992.