HOUDINI: Escaping from Moderately Constrained Saddles

Dmitrii Avdiukhin¹ and Grigory Yaroslavtsev²
¹Indiana University, Bloomington, IN
²George Mason University, Fairfax, VA
davdyukh@iu.edu, grigory@grigory.us

We give polynomial-time algorithms for escaping from high-dimensional saddle points under a moderate number of constraints. Given gradient access to a smooth function $f : \mathbb{R}^d \to \mathbb{R}$, we show that (noisy) gradient descent methods can escape from saddle points under a logarithmic number of inequality constraints. While analogous results exist for unconstrained and equality-constrained problems, we make progress on the major open question of convergence to second-order stationary points in the case of inequality constraints, without reliance on NP-oracles or altering the definitions to only account for certain constraints. Our results hold for both regular and stochastic gradient descent.

1 Introduction

Achieving convergence of gradient descent to an (approximate) local minimum is a central question in non-convex optimization for machine learning. In recent years, breakthrough progress starting with the work of [Ge et al., 2015] has led to a flurry of results in this area (see e.g. [Jin et al., 2017; Du et al., 2017; Mokhtari et al., 2018; Jin et al., 2018; Carmon et al., 2017; Carmon and Duchi, 2018; Staib et al., 2019; Carmon and Duchi, 2020; Jin et al., 2021]), culminating in almost optimal bounds [Zhang and Li, 2021]. However, despite this success, a key open question of [Ge et al., 2015] still remains unanswered: can gradient methods efficiently escape from saddle points in constrained non-convex optimization? In fact, even basic linear inequality constraints remain an obstacle: "Dealing with inequality constraints is left as future work" [Ge et al., 2015].¹ This is due to the NP-hardness of the related copositivity problem [Murty and Kabadi, 1987], which corresponds to the case when the number of constraints is linear in the dimension. In this paper, we make progress on this open question in the case when the number of constraints depends moderately on the dimension.

Consider a feasible set defined by $k$ linear inequality constraints: $S = \{x \in \mathbb{R}^d \mid Ax \le b\}$, where $A \in \mathbb{R}^{k \times d}$ and $b \in \mathbb{R}^k$. Let $B_d(x, r)$ be a $d$-dimensional closed ball of radius $r$ centred at $x$. We write $B(x, r)$ when the dimension is clear from the context and drop $x$ when $x = 0$. Our goal is to find $\min_{x \in S} f(x)$ for the objective function $f : \mathbb{R}^d \to \mathbb{R}$ over $S$. We first introduce the standard smoothness assumption.

¹Using Lagrangian multipliers, equality constraints can be seen as reducing the dimension of the otherwise unconstrained problem.

Assumption 1 (Smoothness). The objective function $f$ satisfies the following properties:
1. (First order) $f$ has an $L$-Lipschitz gradient ($f$ is $L$-smooth): $\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\|$ for all $x, y \in \mathbb{R}^d$.
2. (Second order) $f$ has a $\rho$-Lipschitz Hessian: $\|\nabla^2 f(x) - \nabla^2 f(y)\| \le \rho \|x - y\|$ for all $x, y \in \mathbb{R}^d$.

Definition 1 (Local minimum). For $S \subseteq \mathbb{R}^d$, let $f : \mathbb{R}^d \to \mathbb{R}$. A point $x^*$ is a local minimum of $f$ in $S$ if and only if there exists $r > 0$ such that $f(x^*) \le f(x)$ for all $x \in S \cap B(x^*, r)$.

Since finding a local minimum is NP-hard even in the unconstrained case (see e.g. [Anandkumar and Ge, 2016] and the references within), the notion of a local minimum is typically relaxed as follows.

Definition 2 (Approximate local minimum). For $S \subseteq \mathbb{R}^d$ and $f : \mathbb{R}^d \to \mathbb{R}$, a point $x^*$ is a $(\delta, r)$-approximate local minimum if $f(x) \ge f(x^*) - \delta$ for all $x \in S \cap B(x^*, r)$.
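As a concrete illustration of Definition 2 (our example, not from the paper), consider a quadratic saddle: escaping is possible exactly when the most negative curvature direction drops the objective by more than $\delta$ within the ball.

```latex
% Illustration (ours): when a quadratic saddle violates Definition 2.
% Let f(x) = (1/2) x^T M x with \lambda_min(M) = -\gamma < 0, and let e be a
% corresponding unit eigenvector. Taking x = re on the boundary of B(0, r):
\[
  f(re) \;=\; \tfrac{1}{2}\,(re)^\top M\,(re) \;=\; -\tfrac{\gamma r^2}{2},
\]
% so x^* = 0 fails to be a (\delta, r)-approximate local minimum as soon as
% \gamma r^2 / 2 > \delta, provided re (or -re) lies in the feasible set S.
```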
For smooth functions, one can define stationary points in terms of the gradient and the eigenvalues of the Hessian instead:

Definition 3 ([Nesterov and Polyak, 2006; Jin et al., 2021]). A point $x$ is an $\varepsilon$-second-order stationary point ($\varepsilon$-SOSP) if $\|\nabla f(x)\| < \varepsilon$ and $\lambda_{\min}(\nabla^2 f(x)) > -\sqrt{\rho\varepsilon}$, where $\lambda_{\min}$ denotes the smallest eigenvalue.

When applying this definition to the constrained case, eigenvectors and eigenvalues are not well-suited, since there might be no eigenvector inside the feasible set while an escaping direction might exist. Moreover, for $f(x) = -\frac12 \|x\|^2$ and any compact feasible set, the Hessian is $-I$ at any point, with $\lambda_{\min}(-I) = -1$; hence an $\varepsilon$-SOSP doesn't exist according to Definition 3, even though a local minimum exists.

In fact, Definition 3 arises from the Taylor expansion, which justifies the choice of $-\sqrt{\rho\varepsilon}$ as the bound on the smallest eigenvalue. If the function has a $\rho$-Lipschitz Hessian:
$$\Big| f(x + h) - f(x) - h^\top \nabla f(x) - \tfrac12 h^\top \nabla^2 f(x)\, h \Big| \le \tfrac{\rho}{6} \|h\|^3.$$
To guarantee that the discrepancy between the function and its quadratic approximation is small relative to $\delta$ (from Definition 2), a natural choice of $r$ is $\sqrt[3]{\delta/\rho}$, which bounds the discrepancy by $\Theta(\delta)$: indeed, $\frac{\rho}{6} \|h\|^3 \le \frac{\rho}{6} r^3 = \frac{\delta}{6}$ for $h \in B(r)$. Therefore, based on the quadratic approximation, one can distinguish a $(\delta, r)$-approximate local minimum from a point which is not a $(c\delta, r)$-approximate local minimum, for a constant $c < 1$. By setting this $r$ and selecting $\varepsilon = \sqrt[3]{\delta^2 \rho}$, we have $\sqrt{\rho\varepsilon} = \sqrt[3]{\delta\rho^2}$, and for any $h \in B(x, r)$ (see the full version):
$$f(x + h) \ge f(x) + h^\top \nabla f(x) + \tfrac12 h^\top \nabla^2 f(x)\, h - \tfrac{\rho}{6} \|h\|^3.$$

Using the ball radius discussed above, we arrive at the following version of Definition 2:

Definition 4 (Approximate SOSP). For $S \subseteq \mathbb{R}^d$, let $f : \mathbb{R}^d \to \mathbb{R}$ be a twice-differentiable function with a $\rho$-Lipschitz Hessian. A point $x^*$ is a $\delta$-second-order stationary point ($\delta$-SOSP) if, for $r = \sqrt[3]{\delta/\rho}$,
$$\inf_{x \in S \cap B(x^*, r)} f(x) \ge f(x^*) - \delta.$$

Note that, while Definition 4 seemingly doesn't use second-order information, our choice of the ball radius guarantees that the function is close to its quadratic approximation. In particular, we can determine whether a point is a $c\delta$-SOSP for a constant $c$ by only using second-order information.

1.1 Our Contribution

Our results hold for stochastic gradient descent (SGD):

Assumption 2. We have access to a stochastic gradient oracle $g(x)$ satisfying:
1. (Unbiased expectation) $\mathbb{E}[g(x)] = \nabla f(x)$.
2. (Variance) $\mathbb{E}[\|g(x) - \nabla f(x)\|^2] \le \sigma^2$.

Our main result is the following theorem, which quantifies the complexity of finding an approximate SOSP under a moderate number of linear inequality constraints, showing that this problem is solvable in polynomial time for $k = O(\log d)$. We refer to a function as $(L, \rho)$-smooth if it satisfies Assumption 1, and simply as second-order smooth if both smoothness parameters are constant.

Theorem 5. Let $S$ be a set defined by an intersection of $k$ linear inequality constraints. Let $f$ be a second-order smooth bounded function. Given access to a stochastic gradient oracle satisfying Assumption 2, there exists an algorithm which for any $\delta > 0$ finds a $\delta$-SOSP in $\tilde O\big(\frac{1}{\delta}\big(d^3 2^k + \frac{d^2 \sigma^2}{\delta^{4/3}}\big)\big)$ time using $\tilde O\big(\frac{1}{\delta}\big(d + \frac{d \sigma^2}{\delta^{4/3}}\big)\big)$ stochastic gradient oracle calls. In the deterministic gradient case ($\sigma = 0$), the time complexity is $\tilde O\big(\frac{d^3 2^k}{\delta}\big)$ and the number of gradient oracle calls is $\tilde O\big(\frac{d}{\delta}\big)$.
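To make Assumption 2 concrete, here is a minimal sketch (ours, purely illustrative; the helper name and the Gaussian noise model are our own choices) of a stochastic gradient oracle that is unbiased and has variance at most $\sigma^2$:

```python
import numpy as np

def make_stochastic_gradient_oracle(grad_f, sigma, rng=None):
    """Sketch of an oracle satisfying Assumption 2 (ours, for illustration):
    g(x) = grad_f(x) + xi, with E[xi] = 0 and E[||xi||^2] <= sigma^2."""
    rng = rng or np.random.default_rng()
    def g(x):
        d = x.shape[0]
        # Isotropic Gaussian noise: each coordinate has variance sigma^2 / d,
        # so the total expected squared norm of the noise is exactly sigma^2.
        noise = rng.normal(scale=sigma / np.sqrt(d), size=d)
        return grad_f(x) + noise
    return g

# Usage: f(x) = 0.5 x^T M x, so grad_f(x) = M x.
M = np.diag([1.0, -0.5])
oracle = make_stochastic_gradient_oracle(lambda x: M @ x, sigma=0.1)
print(oracle(np.array([1.0, 2.0])))
```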
The exponential dependence of the time complexity on $k$ in our results (not present in the number of oracle calls) is most likely unavoidable due to the following hardness result, which implies that when $k = d$, the complexity of this problem cannot be polynomial in $d$ under standard hardness assumptions.

Remark 6 (Matrix copositivity [Murty and Kabadi, 1987]). For a quadratic function $f(x) = x^\top M x$ subject to constraints $x_i \ge 0$ for all $i$, it is NP-hard to decide whether there exists a solution with $f(x) < 0$.

Related results in convex optimization are covered in [Boyd and Vandenberghe, 2004; Bubeck et al., 2015]. Among related results in non-convex optimization, here we focus only on algorithms using gradient information.

1.2 Related Work

Unconstrained optimization. Recall that an $\varepsilon$-first-order stationary point ($\varepsilon$-FOSP) is defined so that $\|\nabla f(x)\| \le \varepsilon$. Analyses of convergence to an $\varepsilon$-FOSP are a cornerstone of non-convex optimization (see e.g. the classic texts [Bertsekas, 1997; Nocedal and Wright, 1999]). Quantitative analysis of convergence to an $\varepsilon$-SOSP (Definition 3) started with the breakthrough work of [Ge et al., 2015], further refined in [Jin et al., 2017; Carmon and Duchi, 2018; Jin et al., 2018; Carmon and Duchi, 2020; Jin et al., 2021] and most recently in [Zhang and Li, 2021], who show an almost optimal bound. Due to the prevalence of SGD in deep learning, stochastic methods have attracted the most attention (see [Allen-Zhu, 2018; Allen-Zhu and Li, 2018; Fang et al., 2018; Tripuraneni et al., 2018; Xu et al., 2018; Zhou and Gu, 2020; Zhou et al., 2020] for the case of Lipschitz gradients and [Ge et al., 2015; Daneshmand et al., 2018] for non-Lipschitz gradients). For an in-depth summary of the previous work on unconstrained non-convex optimization, we refer the reader to [Jin et al., 2021].

Constrained optimization. The case of equality constraints is typically reducible to the unconstrained case by using Lagrangian multipliers (see e.g. [Ge et al., 2015]). However, the general constrained case is substantially more challenging, since even the definitions of stationarity require a substantial revision. For first-order convergence a rich literature exists, covering projected gradient, Frank-Wolfe, cubic regularization, etc. (see e.g. [Mokhtari et al., 2018] and the references within). For second-order convergence, the landscape of existing work is substantially sparser due to NP-hardness (Remark 6, [Murty and Kabadi, 1987]). A large body of work focuses on achieving convergence using various forms of NP-oracles (see e.g. [Bian et al., 2015a; Cartis et al., 2018; Mokhtari et al., 2018; Haeser et al., 2019; Nouiehed and Razaviyayn, 2020]), while another approach is to define stationarity in terms of tight constraints only [Avdiukhin et al., 2019; Lu et al., 2020].

Relationship with other definitions of SOSP. As discussed in Remark 6, second-order constrained optimization is NP-hard due to the hardness of the matrix copositivity problem. Definitions of a constrained SOSP in previous work fall into three categories: 1) definitions based on scaled stationary points; 2) definitions that only consider active constraints ("active constraints only" definitions); 3) definitions that preserve the NP-hardness of the problem and rely on NP-oracles to achieve polynomial-time convergence:

1. (Scaled) For the constraints $x \ge 0$, [Bian et al., 2015b; O'Neill and Wright, 2020] consider the definition of a scaled SOSP. The idea is to scale the $i$-th coordinate by $x_i$: instead of bounding $\|\nabla f(x)\|$, it bounds $\|X \nabla f(x)\|$, where $X = \mathrm{diag}(x_1, \ldots, x_d)$, and instead of the eigenvalues of $\nabla^2 f(x)$, it considers the eigenvalues of $X \nabla^2 f(x) X$. For this definition, the restriction to constraints of the form $x \ge 0$ limits possible applications.
2. ("Active constraints only") In [Avdiukhin et al., 2019; Lu et al., 2020], definitions analogous to Definition 3 are used, with the second-order conditions given with respect to the set of active (i.e. tight at the current iterate) constraints. This bypasses the NP-hardness, since the point at which the hardness of the copositivity problem applies now becomes a stationary point by definition.

3. (NP-hard) In the results relying on NP-oracles (e.g. [Bian et al., 2015a; Mokhtari et al., 2018; Haeser et al., 2019; Nouiehed and Razaviyayn, 2020]), the complexity is shifted to solving black-box quadratic optimization problems of a certain type. A key advantage of these approaches is that they can handle an arbitrary number of constraints, and hence they are promising in certain machine learning applications.

What is currently lacking in the state of the art is a quantitative analysis of the complexity of convergence to a second-order stationary point which shows full dependence on both the dimension and the accuracy, while defining stationarity with respect to the full set of constraints instead of the active constraints only.² Our goal in Theorem 5 is to address this gap and give such an analysis.

1.3 Technical Overview

We address the NP-hardness of the copositivity problem by focusing on the case of a moderate number of constraints and arguing that it can be handled using gradient-based methods. In order to streamline the presentation, we first focus on the key challenge of escaping from a saddle point in a corner defined by the constraints when the underlying function is simply quadratic (Section 2). This is already enough to capture some of the key contributions, while more technical details and the full algorithm are given in Section 3.

Quadratic corner saddle point (Section 2). In this simplified scenario, the NP-hardness comes from the fact that the point we aim to find can lie in the intersection of an arbitrary subset of constraints. By doing an exhaustive search over this set of constraints (Algorithm 1) and enforcing them throughout the search process, we are able to reduce to a setting similar to the equality-constrained case (Algorithm 2). We show different subsets of constraints enforced by the algorithm in an example in Figure 1. The key challenge is making this argument formal and arguing that this process converges to a constrained approximate SOSP as in Definition 4. This relies on performing a robust analysis of the properties of the points which lead to the hardness of copositivity. In particular, we show that after we guess the set of constraints correctly, the problem reduces to finding the eigenvector corresponding to the smallest eigenvalue (Lemma 9). Exact error analysis of the eigenvector process (Lemma 10, Lemma 11) is then required to complete the proof of the main theorem (Theorem 8).

²While unrefereed manuscripts ([Hsia and Sheu, 2018] and [Nouiehed et al., 2020], relying on [Hsia and Sheu, 2018] as a subroutine) do contain related results, our work differs in a number of important aspects. [Hsia and Sheu, 2018] and [Nouiehed et al., 2020] require access to the exact gradient and Hessian; we only assume access to a stochastic gradient oracle. Furthermore, [Hsia and Sheu, 2018] assumes access to matrix diagonalization. However, the diagonalization can only be found approximately, which compromises the stability of this approach, especially with respect to the linear term transformations. Our approach handles this issue by appropriate perturbation and by using known results for matrix diagonalization. Finally, compared with [Hsia and Sheu, 2018], our main contribution is focused on solving a substantially different problem.
While [Hsia and Sheu, 2018] find a global minimum of a quadratic problem, we find an approximate local minimum of an arbitrary smooth non-convex function.

[Figure 1: The function $f(x, y)$ given by the composition of $x^2 - y^2$ with a rotation by $\pi/6$. (a) Feasible set with two inequalities: $x \ge 0$, $y \ge 0$. (b) Active constraint $y = 0$ (green) with no escape direction. (c) Active constraint $x = 0$ (green) with an escape direction $(0, 1)$. As shown in Lemma 10, out of the two escape directions $(0, 1)$ and $(0, -1)$ within this constraint, at least one, $(0, 1)$ (blue), lies in the feasible set and is found by the algorithm. (d) When no constraints are active, the algorithm finds red directions (negative eigenvectors) outside the feasible set. Note that escape directions with no active constraints exist (blue); Lemma 10 guarantees that we find them in some constrained space (Figure 1c).]

General case (Section 3). In the algorithm for the general case, we address the three assumptions made in the quadratic corner saddle point case, while also handling the stochasticity in the gradient. The latter part is standard and is handled via variance reduction (see the full version). The full algorithm repeatedly invokes the escape subroutine (Algorithm 3), moving to the escaping point it finds, until no such point exists. The escape subroutine first approximates the Hessian matrix using the gradient oracle and then performs an exhaustive search over the set of active constraints at the escaping point, in a way similar to the quadratic corner case. After the correct subset of constraints is fixed, the current iterate needs to be projected on this set of constraints, which also necessitates a recomputation of various related parameters. When this is done, the problem is solved by a subroutine (Algorithm 4) analogous to Algorithm 2 from the quadratic corner case. However, since the function is no longer quadratic and the gradient can be large, several modifications are required. The algorithm tries to find an escaping direction within a ball of radius $r$, within which the function is well approximated by a quadratic due to the smoothness assumption (as discussed above). First, the algorithm tries to escape using the gradient term. If that doesn't work, we consider two cases: 1) the solution is inside the ball of radius $r = \sqrt[3]{\delta/\rho}$ (from Definition 4); 2) the solution is on the boundary of this ball. In the first case, the solution is a critical point and hence can be found using a Newton step. In the second case, we diagonalize the Hessian using an orthogonal transformation. This gives a quadratic function with a linear term, whose critical points on the boundary of the ball can be found explicitly (up to the required precision) as roots of the corresponding polynomial. We note that the diagonalization performed in this case is the most computationally expensive step in the algorithm, resulting in the polynomial dependence on the dimension.

Notation. For a set $S$, let $\mathrm{Int}\, S$ be its interior and $\partial S$ be its boundary. For $x \in \mathbb{R}^d$ and $S \subseteq \mathbb{R}^d$, $\mathrm{Proj}_S(x) = \mathrm{argmin}_{y \in S} \|x - y\|$ is the projection of $x$ on $S$.
For a square matrix $M$ with eigenvalues $\lambda_1 \le \ldots \le \lambda_d$, we denote $\lambda_{\min}(M) = \lambda_1$ and $\lambda_{|\max|}(M) = \max(|\lambda_1|, |\lambda_d|)$. For $S = \{x \in \mathbb{R}^d \mid Ax \le b\}$ and $x \in S$, we say that the $i$-th constraint is active at $x$ if $A_i^\top x = b_i$, where $A_i$ is the $i$-th row of $A$. The $\tilde O$ notation hides polylogarithmic dependence on all parameters, including the error probability.

2 Quadratic Corner Saddle Point Case

We introduce the key ideas of the analysis in a simplified setting where: 1) the function $f$ is quadratic; 2) the gradient is small; 3) the current iterate is located in a corner of the constraint space. Intuitively, this represents the key challenge of the constrained saddle escape problem, since its NP-hardness comes from the hardness of the matrix copositivity problem in Remark 6 (i.e. the function is exactly quadratic and has no gradient at the current iterate, which lies in the intersection of all constraints). We refer to this setting as the Quadratic Corner Saddle Point problem, defined formally below. By shifting the coordinate system, w.l.o.g. we can assume that the saddle point is $0$ and $f(0) = 0$.³ If $0$ is a $(\delta, r)$-QCSP (as defined below), the objective can be decreased by $\delta$ within a ball of radius $r$.

³Algorithms 1 and 2 don't require the saddle point $x$ to be $0$. All the statements are trivially adapted for the case when $x$ is not $0$.

Definition 7 (Quadratic Corner Saddle Point). Let $S = \{x \mid Ax \le 0\}$. For $\delta, r > 0$ and the function $f(x) = \frac12 x^\top M x$, we say that the point $0$ is a:
- $(\delta, r)$-Quadratic Corner Saddle Point ($(\delta, r)$-QCSP) if $\min_{x \in B(r) \cap S} f(x) < -\delta$;
- $(\delta, r)$-boundary QCSP if $\min_{x \in B(r) \cap \partial S} f(x) < -\delta$.

In this section, we show how to escape from a $(\delta, r)$-QCSP (see the full version for the full proof):

Theorem 8 (Quadratic Corner Saddle Point Escape). Let $\delta, r > 0$. Let $f(x) = \frac12 x^\top M x$ with $\lambda_{|\max|}(M) \le L$, and let $S = \{x \mid Ax \le 0\}$ be defined by $k$ linear inequalities. If $0$ is a $(\delta, r)$-QCSP, then Algorithm 1 with probability at least $1 - \xi$ finds a point $x \in S \cap B(r)$ with $f(x) < -\Omega(\delta)$ using $\tilde O\big(\frac{L r^2 k\, 2^k}{\delta} \log \frac{1}{\xi}\big)$ deterministic gradient oracle calls.

Algorithm 1: HOUDINIESCAPECORNER: escaping from a corner for a quadratic function
input: starting point $x$, feasible set $S = \{y \mid A(y - x) \le 0\}$ with $A \in \mathbb{R}^{k \times d}$
parameters: $\delta$ and $r$ from the definition of a $(\delta, r)$-QCSP
for $I \in 2^{[k]}$, every subset of constraints, do
  1: $\mathcal{A} \leftarrow \{y \mid A_i^\top (y - x) = 0 \text{ for } i \in I\}$, where $A_i$ is the $i$-th row of $A$  // optimize in $\mathcal{A}$
  2: $y \leftarrow$ FINDINSIDECORNER($x$, $\mathcal{A}$)
  3: if $y \in S$ and $f(y) < f(x) - \delta$ then return $y$

Algorithm 2: FINDINSIDECORNER($x$, $\mathcal{A}$)
input: corner $x$, affine subspace $\mathcal{A}$ with $x \in \mathcal{A}$
parameters: $\delta$ and $r$ from Definition 7, step size $\eta = \frac{1}{L}$, number of iterations $T = \tilde O\big(\frac{L r^2}{\delta}\big)$
  1: sample $\xi \sim N(0, I)$; $x_0 \leftarrow \mathrm{Proj}_{\mathcal{A}}(x + \xi)$
  2: for $t = 0, \ldots, T - 1$ do
  3:   // power method step
  4:   $x_{t+1} \leftarrow \mathrm{Proj}_{\mathcal{A}}(x_t - \eta(\nabla f(x_t) - \nabla f(x)))$
  5: $e \leftarrow r \cdot \frac{x_T - x}{\|x_T - x\|}$
  6: return $x + e$

For the rest of the section, we assume that $0$ is a $(\delta, r)$-QCSP, i.e. $\min_{x \in S \cap B(r)} f(x) < -\delta$. We consider two cases, depending on whether $0$ is a $(\delta, r)$-boundary QCSP.

Case 1: $0$ is a $(\delta, r)$-boundary QCSP. For a subset of inequality constraints $I \subseteq [k]$, we define the subspace where these constraints are active: $\mathcal{A}_I = \{x \mid A_i^\top x = 0 \text{ for all } i \in I\}$. Let $I^*$ be a maximal⁴ subset of constraints such that $\min_{x \in \mathcal{A}_{I^*} \cap B(r)} f(x) < -\delta$. If $P$ is the projection operator on $\mathcal{A}_{I^*}$, it suffices to optimize $g(x) := f(Px) = \frac12 x^\top (PMP)\, x$. Therefore, we have reduced the original problem to minimizing a different quadratic form over the same feasible set. For any $i \in I^*$, $A_i^\top P x \le 0$ holds trivially, since $A_i^\top y = 0$ for any $y \in \mathcal{A}_{I^*}$, and hence the constraints from $I^*$ can be ignored. If a constraint not from $I^*$ is active at $Px$, then $f(Px) \ge -\delta$, since $I^*$ is a maximal subset of constraints with $\min_{x \in \mathcal{A}_{I} \cap B(r)} f(x) < -\delta$.

⁴As we don't know $I^*$, Algorithm 1 tries all subsets of constraints.
Therefore, this reduces Case 1 to the next case.

Case 2: $0$ is not a $(\delta, r)$-boundary QCSP. In this case, we show that for any $x \in B(r)$ with $f(x) < -\delta$, either $x$ or $-x$ must lie in $S$, and that for $f(x) = \frac12 x^\top M x$ it suffices to find the eigenvector corresponding to the smallest eigenvalue of $M$. We first show that there exists an eigenvector improving the objective.

Lemma 9. Let $f(x) = \frac12 x^\top M x$ and $S = \{x \mid Ax \le 0\}$. If $0$ is not a $(\delta, r)$-boundary QCSP for $\delta, r > 0$, then the following statements are equivalent:
1. $0$ is a $(\delta, r)$-QCSP, i.e. $\min_{x \in S \cap B(r)} f(x) < -\delta$.
2. There exists an eigenvector $e$ of $M$ such that $e \in \mathrm{Int}\, S \cap B(r)$ and $f(e) < -\delta$.

Finding an exact eigenvector might be impossible. Hence, we show that finding $\hat x \in B(r)$ with $f(\hat x) < -\delta$ suffices, since either $\hat x$ or $-\hat x$ is in $S$.

Lemma 10. Let $f(x) = \frac12 x^\top M x$ and $S = \{x \mid Ax \le 0\}$. For $\delta, r > 0$ and $\hat x \in B(r)$, if $f(\hat x) < -\delta$ and the following conditions hold, then either $\hat x \in S$ or $-\hat x \in S$:
1. $\min_{x \in S \cap B(r)} f(x) < -\delta$, i.e. $0$ is a $(\delta, r)$-QCSP;
2. $\min_{x \in \partial S \cap B(r)} f(x) \ge -\delta$, i.e. $0$ is not a $(\delta, r)$-boundary QCSP.

To show Lemma 10, we use the fact that, by Lemma 9, there exists an eigenvector $e$ with $e \in \mathrm{Int}\, S \cap B(r)$ and $f(e) < -\delta$. For the sake of contradiction, if both $\hat x$ and $-\hat x$ are not in $S$, at least one of them has a non-negative inner product with $e$; w.l.o.g. we assume $\hat x^\top e \ge 0$, and we consider an arc on $\partial B(r)$ connecting $\hat x$ and $e$. Since $e \in S$ and $\hat x \notin S$, the arc intersects $\partial S$ at some point. We show that any point $x$ on the arc has $f(x) < -\delta$, and hence this also holds for the point on the boundary, contradicting the assumption that $0$ is not a $(\delta, r)$-boundary QCSP and finishing the proof.

The main idea behind the algorithm is that Algorithm 2 emulates the power method on the matrix $I - \frac{M}{L}$. Hence, for any $\varepsilon$ it allows us to find a vector $x$ such that
$$x^\top \Big(I - \frac{M}{L}\Big)\, x \ge (1 - \varepsilon)\, \lambda_{|\max|}\Big(I - \frac{M}{L}\Big)\, \|x\|^2.$$
Since $\lambda_{|\max|}(M) \le L$, all eigenvalues of $I - \frac{M}{L}$ are non-negative, and hence the power method approximates the eigenvector corresponding to the largest eigenvalue of $I - \frac{M}{L}$, and hence to the smallest eigenvalue of $M$. We know that $\min_{x \in B(r)} f(x) < -\delta$, and we aim to find $x \in B(r)$ with $f(x) < -(1 - \varepsilon)\delta$ for a constant $\varepsilon$. Since there exists an eigenvector $e \in B(r)$ of $M$ with $\frac12 e^\top M e < -\delta$, we have $\lambda_{\min}(M) < -\frac{2\delta}{r^2}$, and hence the largest eigenvalue of $I - \frac{M}{L}$ is at least $1 - \frac{\lambda_{\min}(M)}{L} \ge 1 + \frac{2\delta}{L r^2}$. Finding $x$ with $\frac12 x^\top M x < -(1 - \varepsilon)\delta$ is equivalent to finding $x$ with $x^\top (I - \frac{M}{L})\, x \ge \big(1 + \frac{2(1 - \varepsilon)\delta}{L r^2}\big) \|x\|^2$, and the power method achieves this in $O\big(\log d + \frac{L r^2}{\varepsilon \delta}\big)$ iterations (see the proof of Lemma 11 in the full version).

Lemma 11. Let $\delta, r > 0$, $x \in \mathbb{R}^d$ and $\varepsilon \in (0, 1)$. Let $f(x) = \frac12 x^\top M x$ with $\lambda_{|\max|}(M) \le L$. Let $\mathcal{A}$ be a linear subspace of $\mathbb{R}^d$. If $\min_{x \in \mathcal{A} \cap B(r)} f(x) < -\delta$, then Algorithm 2 finds $x \in \mathcal{A} \cap B(r)$ with $f(x) \le -(1 - \varepsilon)\delta$ after $T = \tilde O\big(\frac{L r^2}{\varepsilon \delta}\big)$ iterations w.h.p.

Finally, we now prove Theorem 8. First, by exhaustive search, we guess a maximal subset of active constraints $I^*$ such that the subspace $\mathcal{A}_{I^*}$ formed by these linear constraints contains a point $x \in B(r) \cap S$ with $f(x) < -\delta$. Using Algorithm 2, we find $y \in B(r) \cap \mathcal{A}_{I^*}$ with $f(y) < -(1 - \varepsilon)\delta$. Then $y \in S$ or $-y \in S$ by Lemma 10, since $I^*$ is a maximal subset of constraints with an escape direction.
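To make the corner-escape procedure concrete, here is a minimal NumPy sketch (ours, not the authors' code) of Algorithms 1 and 2: it enumerates subsets of constraints, runs the projected power-method step $x_{t+1} = \mathrm{Proj}_{\mathcal{A}}(x_t - \eta(\nabla f(x_t) - \nabla f(x)))$ on each subspace, and returns the first feasible point that decreases the objective by $\delta$. Constants, tolerances, and the enumeration order are simplifications; the rigorous parameter choices are in the paper.

```python
import itertools
import numpy as np

def find_inside_corner(M, A_rows, I, x, r, L, T, rng):
    """Algorithm 2 (FINDINSIDECORNER), simplified: projected power method on
    I - M/L, restricted to the subspace {y : A_i (y - x) = 0 for i in I}."""
    d = len(x)
    if I:  # orthonormal basis of the span of the active rows
        Q, _ = np.linalg.qr(np.array([A_rows[i] for i in I]).T)
        proj = lambda v: v - Q @ (Q.T @ v)  # project onto null space of active rows
    else:
        proj = lambda v: v
    grad = lambda y: M @ y                  # f(y) = 0.5 y^T M y
    z = proj(rng.normal(size=d))            # x_0 - x: random start in the subspace
    for _ in range(T):
        z = proj(z - (1.0 / L) * (grad(x + z) - grad(x)))  # power-method step
    nz = np.linalg.norm(z)
    return x if nz == 0 else x + r * z / nz  # e = r (x_T - x)/||x_T - x||

def houdini_escape_corner(M, A_rows, x, delta, r, L, T=500, seed=0):
    """Algorithm 1 (HOUDINIESCAPECORNER), simplified: try every subset of
    constraints; accept the first feasible point improving f by delta.
    Both y and its reflection through x are tested, as justified by Lemma 10."""
    rng = np.random.default_rng(seed)
    f = lambda y: 0.5 * y @ M @ y
    feasible = lambda y: np.all(np.array(A_rows) @ (y - x) <= 1e-9)
    k = len(A_rows)
    for size in range(k + 1):
        for I in itertools.combinations(range(k), size):
            y = find_inside_corner(M, A_rows, I, x, r, L, T, rng)
            for cand in (y, x - (y - x)):   # either direction may be feasible
                if feasible(cand) and f(cand) < f(x) - delta:
                    return cand
    return None  # no escape direction found: x behaves as an approximate SOSP
```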
Algorithm 3: HOUDINIESCAPE($x$, $S$, $\delta$): escaping from a saddle point
input: saddle point $x$, $\delta$ from the definition of a $\delta$-SOSP, feasible set $S = \{y \mid Ay \le b\}$ with $b \in \mathbb{R}^k$
output: either reports that $x$ is a $\delta$-SOSP, or finds $u \in S \cap B(x, r)$ with $f(u) < f(x) - \Omega(\delta)$
  1: construct $\tilde f(x + h)$, a quadratic approximation of $f(x + h)$, using stochastic gradient oracle calls
  2: for every subset $I$ of constraints do
  3:   $\mathcal{A} \leftarrow \{y \mid A_i^\top y = b_i \text{ for } i \in I\}$, where $A_i$ is the $i$-th row of $A$  // optimize in $\mathcal{A}$
  4:   let $p$ be the projection of $x$ on $\mathcal{A}$
  5:   let $O \in \mathbb{R}^{d \times \dim \mathcal{A}}$ be an orthonormal basis of $\mathcal{A}$
  6:   define $g(y) := \tilde f(p + O y)$
  7:   Algorithm 4 tries to find an escape direction for $g$
  8:   if Algorithm 4 finds a direction $y$ then
  9:     return $p + O y$
  10: no escape direction found: report that $x$ is a $\delta$-SOSP

Algorithm 4: FINDINSIDE($x$, $\delta$, $(M', v')$, $(r', S')$)
input: $\delta$ from the definition of a $\delta$-SOSP; objective $g(y) = \frac12 y^\top M' y + y^\top v'$ in $\mathbb{R}^{\dim \mathcal{A}}$; feasible set $S' \cap B(r')$ in $\mathbb{R}^{\dim \mathcal{A}}$
output: an escaping direction, if one exists
  1: if any of the following candidates lies in the feasible set and decreases the objective by $\Omega(\delta)$, return it:
  2: Case 1 (large gradient): $\mathrm{argmin}_{y \in S' \cap B(r')}\, y^\top v'$
  3: Case 2 (solution in the interior): $y = -M'^{-1} v'$
  4: Case 3 (solution on the boundary):
  5:   find orthogonal $Q$ and diagonal $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_{\dim \mathcal{A}})$ such that $M' \approx Q^\top \Lambda Q$
  6:   let $\hat v \leftarrow Q v'$
  7:   consider points with coordinates $\hat y_i = \frac{\hat v_i}{\mu - \lambda_i}$ for some $\mu$
  8:   find the values of $\mu$ for which these points have norm $r'$
  9:   the candidates are the points corresponding to these values of $\mu$
  10: if no candidate satisfies the condition, return $\bot$

3 Main Result

Our main result is the following theorem, which shows how to find a $\delta$-SOSP.

Theorem 12. Let $S = \{x \mid Ax \le b\}$ be a set defined by an intersection of $k$ linear inequality constraints. Let $f$ satisfy Assumptions 1 and 2 and let $\min_{x \in S} f(x) = f^*$. Then there exists an algorithm which for any $\delta > 0$ finds a $\delta$-SOSP in $\tilde O\big(\frac{f(x_0) - f^*}{\delta}\, d^3 \big(2^k + \frac{\sigma^2}{\delta^{4/3}}\big)\big)$ time, using $\tilde O\big(\frac{f(x_0) - f^*}{\delta} \big(d + \frac{d \sigma^2}{\delta^{4/3}}\big)\big)$ stochastic gradient oracle calls, where $x_0$ is the initial point.

Our approach is outlined in Algorithms 3 and 4.⁵ If $x$ is not a $\delta$-SOSP, Algorithm 3 finds an escape direction, i.e. a point $y \in S \cap B(x, r)$ which significantly decreases the objective value: $f(y) < f(x) - \Omega(\delta)$. Therefore, if $x_0$ is the initial point, our algorithm requires $O\big(\frac{f(x_0) - f^*}{\delta}\big)$ calls of Algorithm 3. Algorithm 3 enumerates all possible sets of active constraints.

⁵Algorithms 3 and 4 are simplified versions of the rigorous algorithms in the full version.

Recall that we aim to find a $\delta$-SOSP, i.e. $x \in S$ such that $f(y) \ge f(x) - \delta$ for all $y \in S \cap B(x, r)$, where $r = \sqrt[3]{\delta/\rho}$. Similarly to Section 2, we use a quadratic approximation and consider two cases, depending on whether the minimizer $y^*$ lies in the interior or on the boundary of $S$. The second case can be reduced to the first in a way similar to Section 2. To find the minimizer in the interior, we consider the following cases in Algorithm 4:

Case 1. When the gradient is large, we ignore the quadratic term and optimize the function based on the gradient alone. This covers the situation when the objective is sensitive to changes in the argument.

Case 2. When the optimum lies in $\mathrm{Int}\, B(x, r)$, none of the constraints are active, and hence we can find the unique unconstrained critical point directly.

Case 3. When the optimum lies on $\partial B(x, r)$, the only active constraint is $c(y) = \|y - x\|^2 - r^2 = 0$. By the KKT conditions, for $y$ to be the minimizer there must exist $\mu$ such that $\nabla f(y) = \mu \nabla c(y)$. We show that for each $\mu$ there exists a unique $y(\mu)$ satisfying the condition above, and that only for $O(d)$ values of $\mu$ do we have $y(\mu) \in \partial B(x, r)$, resulting in $O(d)$ candidate solutions.
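In the centered coordinates used by Algorithm 4 (objective $g(y) = \frac12 y^\top M' y + y^\top v'$ and ball constraint $c(y) = \frac12(\|y\|^2 - r'^2) = 0$, as in Section 3.2 below), the KKT computation behind Case 3 works out as follows; this is our rendering of the step described above, not an additional result.

```latex
% Case 3, worked out (our rendering). Stationarity of g on the sphere:
\[
  \nabla g(y) = \mu \nabla c(y)
  \;\Longleftrightarrow\;
  M' y + v' = \mu y
  \;\Longleftrightarrow\;
  (M' - \mu I)\, y = -v'.
\]
% Whenever \mu is not an eigenvalue of M', this system has the unique solution
\[
  y(\mu) = -(M' - \mu I)^{-1} v',
\]
% and the boundary condition \|y(\mu)\| = r' selects the O(d) candidate
% values of the multiplier \mu.
```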
We now outline the proof of Theorem 12 (the full proof is in the full version). As shown in Section 3.1, we can consider a quadratic approximation of the objective. By guessing which constraints are active at the minimizer $x^*$ and enforcing these constraints, we restrict the function to some affine subspace $\mathcal{A}$. By parameterizing $\mathcal{A}$, we eliminate the enforced constraints, and, since the rest of the constraints are not active at $x^*$, we need to optimize a quadratic function in the intersection of a ball and linear inequality constraints (see the full version).

3.1 Algorithm 3: Quadratic Approximation

The goal is to build a quadratic approximation of the objective with some additional properties (see below). To simplify the presentation, w.l.o.g. (by shifting the coordinate system) we assume that the current saddle point is $0$ and consider the quadratic approximation of the function:
$$f(x) \approx f(0) + x^\top \nabla f(0) + \tfrac12 x^\top \nabla^2 f(0)\, x.$$
Since $\nabla^2 f$ is $\rho$-Lipschitz and $r = \sqrt[3]{\delta/\rho}$, in $B(r)$ the quadratic approximation deviates from $f$ by at most $\frac{\delta}{6}$ (see the derivation before Definition 4). For a small value $\xi$ (to be specified later), we instead analyze a noisy function $\tilde f(x) = f(0) + x^\top v + \frac12 x^\top M x$, where:

1. $v$ is a perturbed approximation of $\nabla f(0)$. The perturbation guarantees that w.h.p. all coordinates of $v$ are sufficiently separated from $0$, and that linear systems of the form $(M - \mu I)\, x = -v$ with $\mathrm{rank}(M - \mu I) < d$ don't have any solutions, simplifying the analysis. To approximate the gradient, we use the algorithm VRSG (see the full version), which w.h.p. estimates the gradient with precision $\tilde\sigma$ using $O(\sigma^2 / \tilde\sigma^2)$ stochastic gradient oracle calls.

2. $M$ is a perturbed approximation of $\nabla^2 f(0)$. The perturbation guarantees that $M$ is non-degenerate with probability $1$. Since the $i$-th column of $\nabla^2 f(0)$ is by definition $\lim_{\tau \to 0} \frac{\nabla f(\tau e_i) - \nabla f(0)}{\tau}$, using a sufficiently small $\tau$ and approximating $\nabla f(\tau e_i)$ and $\nabla f(0)$ using VRSG, we find a good approximation of the Hessian.

Combining this with the derivation before Definition 4, we show the following:

Lemma 13. Let $f$ satisfy Assumptions 1 and 2. Let $\tilde f(x) = f(0) + x^\top v + \frac12 x^\top M x$, where $v$ and $M$ are as in Algorithm 3. For $\delta > 0$ and $r = \sqrt[3]{\delta/\rho}$, we have $|\tilde f(x) - f(x)| < \frac{\delta}{2}$ for all $x \in B(r)$ w.h.p.

Reducing the case $x^* \in \partial S$ to the case $x^* \in \mathrm{Int}\, S$. Similarly to Section 2, we reduce the case $x^* \in \partial S$ to the case $x^* \in \mathrm{Int}\, S$. If $x^* \in \partial S$, then there exists a non-empty set $I$ of constraints active at $x^*$. Consider the iteration of Algorithm 3 where the constraints from $I$ are active. These active constraints define an affine subspace $\mathcal{A}$, which we parameterize: if $p = \mathrm{Proj}_{\mathcal{A}}(x)$ (recall that $x = 0$ after the shift) and $O \in \mathbb{R}^{d \times \dim \mathcal{A}}$ is an orthonormal basis of $\mathcal{A}$, then any point in $\mathcal{A}$ can be represented as $p + O y$ for $y \in \mathbb{R}^{\dim \mathcal{A}}$. Defining $g(y) = \tilde f(p + O y)$, minimizing $\tilde f$ in $\mathcal{A} \cap S \cap B(r)$ is equivalent to minimizing $g$ in $S' \cap B(r')$, where:

1. $S'$ is the set of points $y \in \mathbb{R}^{\dim \mathcal{A}}$ such that $p + O y \in S$, namely $S' = \{y \mid A_i^\top (p + O y) \le b_i \text{ for } i \notin I\}$. Hence, $S'$ is defined by linear inequalities, similarly to $S$.

2. $r'$ is a radius such that the condition $y \in B(r')$ is equivalent to $p + O y \in B(r)$. Since $p$ is the projection of $0$ on $\mathcal{A}$, we have $O^\top p = 0$, and hence $\|p + O y\|^2 = \|p\|^2 + \|O y\|^2$. Since $O$ is an orthonormal basis of $\mathcal{A}$, $\|O y\| = \|y\|$, and hence $r' = \sqrt{r^2 - \|p\|^2}$.

For $y^*$ such that $x^* = p + O y^*$, no constraints from $S'$ are active, and hence $y^* \in \mathrm{Int}\, S'$.
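The reduction above is mechanical linear algebra; here is a small NumPy sketch (ours) of the parameterization step for one guessed active set $I$, producing $p$, $O$, $r'$, and the reduced constraints defining $S'$. Using an SVD to obtain the null-space basis is one standard choice, not necessarily the authors'.

```python
import numpy as np

def parameterize_subspace(A, b, I, r):
    """For the affine subspace {y : A_i y = b_i, i in I} of A y <= b, return
    (p, O, r_prime, A_res, b_res): p = Proj_A(0), O = orthonormal basis of the
    subspace directions, r_prime = sqrt(r^2 - ||p||^2), and the inactive
    constraints rewritten in the new coordinates as A_res z <= b_res."""
    I = list(I)
    rest = [i for i in range(len(A)) if i not in I]
    d = A.shape[1]
    if I:
        A_act, b_act = A[I], b[I]
        # Minimum-norm solution of A_act p = b_act, i.e. the projection of 0 on A.
        p = np.linalg.lstsq(A_act, b_act, rcond=None)[0]
        # Orthonormal basis of the null space of the active rows, via SVD.
        _, s, Vt = np.linalg.svd(A_act)
        rank = int(np.sum(s > 1e-12))
        O = Vt[rank:].T                      # shape (d, dim A)
    else:
        p, O = np.zeros(d), np.eye(d)
    if r**2 < p @ p:
        return None                          # the subspace misses the ball B(r)
    r_prime = np.sqrt(r**2 - p @ p)
    # Inactive constraint i becomes A_i (p + O z) <= b_i, i.e. (A_i O) z <= b_i - A_i p.
    A_res = A[rest] @ O
    b_res = b[rest] - A[rest] @ p
    return p, O, r_prime, A_res, b_res
```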
3.2 Algorithm 4: Escaping When $y^* \in \mathrm{Int}\, S'$

In this section, we find the minimizer $y^*$ of the function $g(y) = \frac12 y^\top M' y + y^\top v' + C$ in $S' \cap B(r')$, assuming that $y^* \in \mathrm{Int}\, S'$. Since the solutions we find can be approximate, we have to guarantee that the objective is not too sensitive to changes in its argument.

[Figure 2: Cases of Algorithm 4. (a) Case 0: the minimizer is on $\partial S'$; we handle this case when processing the corresponding set of active constraints in Algorithm 3. (b) Case 1: large linear term; we optimize the objective using only the linear term. This case is kept separate to avoid numerical issues in the other cases. (c) Case 2: the minimizer is in the interior; we can find the critical point analytically, by solving a linear equation. (d) Case 3: the minimizer is on the boundary; then there are at most $O(d)$ candidates with norm $r'$.]

It suffices to consider the case when $\|v'\|$ is bounded, since for any $y \in B(r')$ and perturbation $h$ there exists $\tau \in [0, 1]$ such that
$$|g(y) - g(y + h)| = |(\nabla g(y + \tau h))^\top h| \le (\|\nabla g(0)\| + L \|y + \tau h\|)\, \|h\| \le (\|v'\| + L (r' + \|h\|))\, \|h\|,$$
where we used that the objective is $L$-smooth and hence $\|\nabla g(y + \tau h)\| \le \|\nabla g(0)\| + L \|y + \tau h\|$. We consider the situation when $\|v'\|$ is large as a separate case. Otherwise, for $y^*$ there are only two options: either $y^* \in \mathrm{Int}\, B(r')$ or $y^* \in \partial B(r')$. Algorithm 4 handles these two cases, as well as the case when $\|v'\|$ is large, separately.

Case 1: $\|v'\|$ is large. If $\|v'\|$ is large and we can find $y$ with small $y^\top v'$, the linear term alone suffices to improve the objective. We show that, if such a $y$ doesn't exist, then the minimizer of $g$ must lie on $\partial S'$, which contradicts $y^* \in \mathrm{Int}\, S'$. Below we assume that $\|v'\|$ is bounded.

Case 2: $y^* \in \mathrm{Int}\, B(r')$. In this case, $y^*$ is an unconstrained critical point of $g$, and hence it must satisfy $\nabla g(y) = 0$, implying $M' y + v' = 0$, which gives the unique solution $y = -M'^{-1} v'$. Since $M$ is a perturbed matrix, so is $M'$, and hence $M'$ is non-degenerate with probability $1$. It remains to verify that $y \in B(r') \cap S'$ and that $y$ decreases the objective by $\Omega(\delta)$.

Case 3: $y^* \in \partial B(r')$. Since the only active constraint at $y^*$ is $c(y) = \frac12 (\|y\|^2 - r'^2) = 0$, by the KKT conditions any critical point must satisfy $\nabla g(y) = \mu \nabla c(y)$ for some $\mu \in \mathbb{R}$, which is equivalent to $M' y + v' = \mu y$. Hence, for any fixed $\mu$, $y$ must be a solution of the linear system $(M' - \mu I)\, y = -v'$. When $M' - \mu I$ is degenerate (i.e. $\mu$ is an eigenvalue of $M'$), due to the perturbation of $v'$ the system has no solutions with probability $1$. This leaves us with the case when $M' - \mu I$ is non-degenerate, where there exists a unique solution $y(\mu) := -(M' - \mu I)^{-1} v'$.

Diagonalization. The dependence of $(M' - \mu I)^{-1}$ on $\mu$ is non-trivial, but for a diagonal $M'$ the inverse can be found explicitly. Hence, we diagonalize $M'$: we find an orthogonal $Q$ and a diagonal $\Lambda$ such that $\|M' - Q^\top \Lambda Q\| < \varepsilon$ in time $O(d^3 \log 1/\varepsilon)$ [Pan and Chen, 1999]. Setting $\varepsilon = O(\delta / r'^2)$, we guarantee that the function changes by at most $O(\delta)$ in $B(r')$. The function $y(\mu) = -(Q^\top \Lambda Q - \mu I)^{-1} v'$ can be written as $Q y(\mu) = -(\Lambda - \mu I)^{-1} Q v'$, and hence we work with the rotated vectors $\hat y(\mu) := Q y(\mu)$ and $\hat v := Q v'$.

Finding candidate $\mu$. Since $\hat y(\mu) = -(\Lambda - \mu I)^{-1} \hat v$, for the $i$-th coordinate of $\hat y(\mu)$ we have $\hat y_i(\mu) = \frac{\hat v_i}{\mu - \lambda_i}$. Since we are only interested in $y(\mu) \in \partial B(r')$ and $Q$ is an orthogonal matrix, we must have
$$r'^2 = \|y(\mu)\|^2 = \|\hat y(\mu)\|^2 = \sum_{i} \hat y_i(\mu)^2 = \sum_{i} \frac{\hat v_i^2}{(\mu - \lambda_i)^2}.$$
After multiplying the equation by $\prod_i (\mu - \lambda_i)^2$, we get an equation of the form $p(\mu) = 0$, where $p$ is a polynomial of degree $2d$. We find the roots $\mu_1, \ldots, \mu_{2d}$ of the polynomial in time $O(d^2 \log d \log \log 1/\varepsilon)$ [Pan, 1987], where $\varepsilon$ is the required root precision. For each $i$, we compute $y(\mu_i)$ and verify whether it lies in $B(r') \cap S'$ and improves the objective by $\Omega(\delta)$.
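The candidate-$\mu$ computation reduces to the scalar secular equation above. A compact NumPy sketch (ours) follows: it builds the polynomial $\sum_i \hat v_i^2 \prod_{j \ne i} (\mu - \lambda_j)^2 - r'^2 \prod_i (\mu - \lambda_i)^2$, takes its real roots as candidate multipliers, and maps each back to $y(\mu)$. We use `np.roots` for brevity instead of the faster solver of [Pan, 1987], and assume (as the paper's perturbation guarantees) that all coordinates of $\hat v$ are separated from zero.

```python
import numpy as np

def boundary_candidates(M_prime, v_prime, r_prime):
    """Case 3 of Algorithm 4 (a sketch): candidate minimizers of
    g(y) = 0.5 y^T M' y + y^T v' on the sphere ||y|| = r'."""
    lam, Q = np.linalg.eigh(M_prime)        # M' = Q diag(lam) Q^T, columns = eigenvectors
    v_hat = Q.T @ v_prime                   # rotated linear term
    d = len(lam)
    # Clear denominators in sum_i v_hat_i^2 / (mu - lam_i)^2 = r'^2:
    prod_all = np.poly(np.repeat(lam, 2))   # prod_i (mu - lam_i)^2
    p = -r_prime**2 * prod_all
    for i in range(d):
        others = np.repeat(np.delete(lam, i), 2)
        p = np.polyadd(p, v_hat[i]**2 * np.poly(others))
    candidates = []
    for mu in np.roots(p):
        if abs(mu.imag) > 1e-9:
            continue                        # only real multipliers matter
        mu = mu.real
        if np.min(np.abs(mu - lam)) < 1e-9:
            continue                        # skip (near-)degenerate M' - mu I
        y_hat = v_hat / (mu - lam)          # y_hat_i = v_hat_i / (mu - lam_i)
        candidates.append(Q @ y_hat)        # rotate back to original coordinates
    return candidates

# Tiny usage example: a 2-D indefinite quadratic.
M = np.array([[1.0, 0.0], [0.0, -2.0]])
v = np.array([0.3, 0.1])
for y in boundary_candidates(M, v, r_prime=1.0):
    print(np.linalg.norm(y), 0.5 * y @ M @ y + y @ v)
```

The caller then keeps only candidates that lie in $S'$ and improve the objective by $\Omega(\delta)$, exactly as on line 1 of Algorithm 4.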
Precision. We find the roots of the polynomial only approximately, and when $\mu$ is close to $\lambda_i$ for some $i$, even a small perturbation of $\mu$ can strongly affect $\hat y_i(\mu) = \frac{\hat v_i}{\mu - \lambda_i}$. We resolve this as follows: since $\|\hat y(\mu)\| = r'$, for all $i$ we have $|\hat y_i(\mu)| \le r'$, implying $|\mu - \lambda_i| \ge \frac{|\hat v_i|}{r'}$. Therefore, $\mu$ must be sufficiently far from any $\lambda_i$, where the lower bound on the distance depends on $r' \le \sqrt[3]{\delta/\rho}$ and on $|\hat v_i|$. The noise added to $v$ is preserved in $\hat v$, and each coordinate is sufficiently separated from $0$ w.h.p. This reasoning is formalized in the full version.

4 Conclusion

In this paper, we have shown that it is possible to escape from a saddle point under a logarithmic number of inequality constraints in polynomial time, using only a polynomial number of stochastic gradient oracle calls. We provide experimental results in the full version.

An open question is to determine conditions that, on the one hand, guarantee escaping from a saddle point in polynomial time even for a linear number of constraints and, on the other hand, hold in practice. One such condition could be strict complementarity. Another open question is handling non-linear constraints; we believe this can be achieved straightforwardly using techniques from [Ge et al., 2015], under assumptions on the curvature and the linear independence of the constraints. Finally, an interesting question would be a simpler algorithm for the general case, e.g. an algorithm resembling the approach from Section 2.

References

[Allen-Zhu and Li, 2018] Zeyuan Allen-Zhu and Yuanzhi Li. NEON2: finding local minima via first-order oracles. In Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi, and Roman Garnett, editors, Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pages 3720–3730, 2018.

[Allen-Zhu, 2018] Zeyuan Allen-Zhu. Natasha 2: Faster non-convex optimization than SGD. In Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi, and Roman Garnett, editors, Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pages 2680–2691, 2018.

[Anandkumar and Ge, 2016] Animashree Anandkumar and Rong Ge. Efficient approaches for escaping higher order saddle points in non-convex optimization. In Vitaly Feldman, Alexander Rakhlin, and Ohad Shamir, editors, Proceedings of the 29th Conference on Learning Theory, COLT 2016, New York, USA, June 23-26, 2016, volume 49 of JMLR Workshop and Conference Proceedings, pages 81–102. JMLR.org, 2016.

[Avdiukhin et al., 2019] Dmitrii Avdiukhin, Chi Jin, and Grigory Yaroslavtsev. Escaping saddle points with inequality constraints via noisy sticky projected gradient descent. In Optimization for Machine Learning Workshop, 2019.

[Bertsekas, 1997] Dimitri P. Bertsekas. Nonlinear programming. Journal of the Operational Research Society, 48(3):334–334, 1997.

[Bian et al., 2015a] Wei Bian, Xiaojun Chen, and Yinyu Ye. Complexity analysis of interior point algorithms for non-Lipschitz and nonconvex minimization. Mathematical Programming, 149(1):301–327, 2015.

[Bian et al., 2015b] Wei Bian, Xiaojun Chen, and Yinyu Ye. Complexity analysis of interior point algorithms for non-Lipschitz and nonconvex minimization. Mathematical Programming, 149(1):301–327, 2015.

[Boyd and Vandenberghe, 2004] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[Bubeck et al., 2015] Sébastien Bubeck et al. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231–357, 2015.

[Carmon and Duchi, 2018] Yair Carmon and John C. Duchi. Analysis of Krylov subspace solutions of regularized non-convex quadratic problems. In Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi, and Roman Garnett, editors, Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pages 10728–10738, 2018.

[Carmon and Duchi, 2020] Yair Carmon and John C. Duchi. First-order methods for nonconvex quadratic minimization. SIAM Rev., 62(2):395–436, 2020.

[Carmon et al., 2017] Yair Carmon, John C. Duchi, Oliver Hinder, and Aaron Sidford. "Convex until proven guilty": Dimension-free acceleration of gradient descent on non-convex functions. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research, pages 654–663. PMLR, 2017.

[Cartis et al., 2018] Coralia Cartis, Nick I. M. Gould, and Philippe L. Toint. Second-order optimality and beyond: Characterization and evaluation complexity in convexly constrained nonlinear optimization. Foundations of Computational Mathematics, 18(5):1073–1107, 2018.

[Daneshmand et al., 2018] Hadi Daneshmand, Jonas Moritz Kohler, Aurélien Lucchi, and Thomas Hofmann. Escaping saddles with stochastic gradients. In Jennifer G. Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pages 1163–1172. PMLR, 2018.

[Du et al., 2017] Simon S. Du, Chi Jin, Jason D. Lee, Michael I. Jordan, Aarti Singh, and Barnabás Póczos. Gradient descent can take exponential time to escape saddle points. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, editors, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 1067–1077, 2017.

[Fang et al., 2018] Cong Fang, Chris Junchi Li, Zhouchen Lin, and Tong Zhang. SPIDER: near-optimal non-convex optimization via stochastic path-integrated differential estimator. In Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi, and Roman Garnett, editors, Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pages 687–697, 2018.

[Ge et al., 2015] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points - online stochastic gradient for tensor decomposition. In Peter Grünwald, Elad Hazan, and Satyen Kale, editors, Proceedings of The 28th Conference on Learning Theory, COLT 2015, Paris, France, July 3-6, 2015, volume 40 of JMLR Workshop and Conference Proceedings, pages 797–842. JMLR.org, 2015.

[Haeser et al., 2019] Gabriel Haeser, Hongcheng Liu, and Yinyu Ye. Optimality condition and complexity analysis for linearly-constrained optimization without differentiability on the boundary. Mathematical Programming, 178(1):263–299, 2019.

[Hsia and Sheu, 2018] Yong Hsia and Ruey-Lin Sheu.
Trust region subproblem with a fixed number of additional linear inequality constraints has polynomial complexity. arXiv preprint arXiv:1312.1398, 2018.

[Jin et al., 2017] Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, and Michael I. Jordan. How to escape saddle points efficiently. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research, pages 1724–1732. PMLR, 2017.

[Jin et al., 2018] Chi Jin, Praneeth Netrapalli, and Michael I. Jordan. Accelerated gradient descent escapes saddle points faster than gradient descent. In Sébastien Bubeck, Vianney Perchet, and Philippe Rigollet, editors, Proceedings of the 31st Conference On Learning Theory, volume 75 of Proceedings of Machine Learning Research, pages 1042–1085. PMLR, 06–09 Jul 2018.

[Jin et al., 2021] Chi Jin, Praneeth Netrapalli, Rong Ge, Sham M. Kakade, and Michael I. Jordan. On nonconvex optimization for machine learning: Gradients, stochasticity, and saddle points. J. ACM, 68(2):11:1–11:29, 2021.

[Lu et al., 2020] Songtao Lu, Meisam Razaviyayn, Bo Yang, Kejun Huang, and Mingyi Hong. Finding second-order stationary points efficiently in smooth nonconvex linearly constrained optimization problems. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.

[Mokhtari et al., 2018] Aryan Mokhtari, Asuman Ozdaglar, and Ali Jadbabaie. Escaping saddle points in constrained optimization. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.

[Murty and Kabadi, 1987] Katta G. Murty and Santosh N. Kabadi. Some NP-complete problems in quadratic and nonlinear programming. Mathematical Programming, 39:117–129, 1987.

[Nesterov and Polyak, 2006] Yurii Nesterov and Boris T. Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006.

[Nocedal and Wright, 1999] Jorge Nocedal and Stephen J. Wright. Numerical Optimization. Springer, 1999.

[Nouiehed and Razaviyayn, 2020] Maher Nouiehed and Meisam Razaviyayn. A trust region method for finding second-order stationarity in linearly constrained nonconvex optimization. SIAM J. Optim., 30(3):2501–2529, 2020.

[Nouiehed et al., 2020] Maher Nouiehed, Jason D. Lee, and Meisam Razaviyayn. Convergence to second-order stationarity for constrained non-convex optimization. arXiv preprint arXiv:1810.02024, 2020.

[O'Neill and Wright, 2020] Michael O'Neill and Stephen J. Wright. A log-barrier Newton-CG method for bound constrained optimization with complexity guarantees. IMA Journal of Numerical Analysis, 41(1):84–121, 2020.

[Pan and Chen, 1999] Victor Y. Pan and Zhao Q. Chen. The complexity of the matrix eigenproblem. In Proceedings of the Thirty-First Annual ACM Symposium on Theory of Computing, STOC '99, pages 507–516, New York, NY, USA, 1999. Association for Computing Machinery.

[Pan, 1987] V. Pan. Sequential and parallel complexity of approximate evaluation of polynomial zeros. Computers & Mathematics with Applications, 14(8):591–622, 1987.

[Staib et al., 2019] Matthew Staib, Sashank Reddi, Satyen Kale, Sanjiv Kumar, and Suvrit Sra.
Escaping saddle points with adaptive gradient methods. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 5956–5965. PMLR, 09–15 Jun 2019.

[Tripuraneni et al., 2018] Nilesh Tripuraneni, Mitchell Stern, Chi Jin, Jeffrey Regier, and Michael I. Jordan. Stochastic cubic regularization for fast nonconvex optimization. In Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi, and Roman Garnett, editors, Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pages 2904–2913, 2018.

[Xu et al., 2018] Yi Xu, Rong Jin, and Tianbao Yang. First-order stochastic algorithms for escaping from saddle points in almost linear time. In Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi, and Roman Garnett, editors, Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pages 5535–5545, 2018.

[Zhang and Li, 2021] Chenyi Zhang and Tongyang Li. Escape saddle points by a simple gradient-descent based algorithm. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P. S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 8545–8556. Curran Associates, Inc., 2021.

[Zhou and Gu, 2020] Dongruo Zhou and Quanquan Gu. Stochastic recursive variance-reduced cubic regularization methods. In Silvia Chiappa and Roberto Calandra, editors, The 23rd International Conference on Artificial Intelligence and Statistics, AISTATS 2020, 26-28 August 2020, Online [Palermo, Sicily, Italy], volume 108 of Proceedings of Machine Learning Research, pages 3980–3990. PMLR, 2020.

[Zhou et al., 2020] Dongruo Zhou, Pan Xu, and Quanquan Gu. Stochastic nested variance reduction for nonconvex optimization. J. Mach. Learn. Res., 21:103:1–103:63, 2020.