# Efficiently escaping saddle points on manifolds

Chris Criscitiello
Department of Mathematics, Princeton University, Princeton, NJ 08544
ccriscitiello6@gmail.com

Nicolas Boumal
Department of Mathematics, Princeton University, Princeton, NJ 08544
nboumal@math.princeton.edu

Abstract

Smooth, non-convex optimization problems on Riemannian manifolds occur in machine learning as a result of orthonormality, rank or positivity constraints. First- and second-order necessary optimality conditions state that the Riemannian gradient must be zero, and the Riemannian Hessian must be positive semidefinite. Generalizing Jin et al.'s recent work on perturbed gradient descent (PGD) for optimization on linear spaces [How to Escape Saddle Points Efficiently (2017) [17], Stochastic Gradient Descent Escapes Saddle Points Efficiently (2019) [18]], we propose a version of perturbed Riemannian gradient descent (PRGD) to show that necessary optimality conditions can be met approximately with high probability, without evaluating the Hessian. Specifically, for an arbitrary Riemannian manifold M of dimension d, a sufficiently smooth (possibly non-convex) objective function f, and under weak conditions on the retraction chosen to move on the manifold, with high probability, our version of PRGD produces a point with gradient smaller than ε and Hessian within √ε of being positive semidefinite in O((log d)⁴/ε²) gradient queries. This matches the complexity of PGD in the Euclidean case. Crucially, the dependence on dimension is low. This matters for large-scale applications including PCA and low-rank matrix completion, which both admit natural formulations on manifolds. The key technical idea is to generalize PRGD with a distinction between two types of gradient steps: steps on the manifold and perturbed steps in a tangent space of the manifold. Ultimately, this distinction makes it possible to extend Jin et al.'s analysis seamlessly.

1 Introduction

Machine learning has stimulated interest in obtaining global convergence rates in non-convex optimization. Consider a possibly non-convex objective function f: ℝ^d → ℝ. We want to solve

  min_{x ∈ ℝ^d} f(x).   (1)

This is hard in general. Instead, we usually settle for approximate first-order critical (or stationary) points, where the gradient is small, or second-order critical (or stationary) points, where the gradient is small and the Hessian is nearly positive semidefinite. One of the simplest algorithms for solving (1) is gradient descent (GD): given x_0, iterate

  x_{t+1} = x_t − η ∇f(x_t).   (2)

It is well known that if ∇f is Lipschitz continuous, then with an appropriate step-size η, GD converges to first-order critical points. However, it may take exponential time to reach an approximate second-order critical point, and thus to escape saddle points [14]. There is increasing evidence that saddle points are a serious obstacle to the practical success of local optimization algorithms such as GD [25, 16]. This calls for algorithms which provably escape saddle points efficiently. We focus on methods which only have access to f and ∇f (but not ∇²f) through a black-box model.

Several methods add noise to GD iterates in order to escape saddle points faster, under the assumption that f has L-Lipschitz continuous gradient and ρ-Lipschitz continuous Hessian. In this setting, an ε-second-order critical point is a point x satisfying ‖∇f(x)‖ ≤ ε and ∇²f(x) ⪰ −√(ρε) I.
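To make the Euclidean criterion concrete, here is a minimal sketch (using NumPy; the function name `is_eps_second_order` and the quadratic test function are ours for illustration, not part of the paper) that checks whether a point satisfies ‖∇f(x)‖ ≤ ε and λ_min(∇²f(x)) ≥ −√(ρε), given its gradient and Hessian explicitly.

```python
import numpy as np

def is_eps_second_order(grad, hess, eps, rho):
    """Check the Euclidean epsilon-second-order criticality condition:
    ||grad|| <= eps and lambda_min(hess) >= -sqrt(rho * eps)."""
    grad_small = np.linalg.norm(grad) <= eps
    lam_min = np.linalg.eigvalsh(hess).min()   # hess assumed symmetric
    hess_ok = lam_min >= -np.sqrt(rho * eps)
    return grad_small and hess_ok

# Example: the saddle of f(x) = x[0]^2 - x[1]^2 at the origin.
x = np.zeros(2)
grad = np.array([2 * x[0], -2 * x[1]])        # exact gradient at x
hess = np.array([[2.0, 0.0], [0.0, -2.0]])    # exact Hessian at x
print(is_eps_second_order(grad, hess, eps=1e-3, rho=1.0))  # False: strict saddle
```

At a strict saddle like this one, the Hessian condition fails for any reasonably small ε; escaping such points efficiently is exactly what the perturbation mechanisms discussed next are designed for.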
Under the strict saddle assumption, with ε small enough, such points are near (local) minimizers [16, 17]. In 2015, Ge et al. [16] gave a variant of stochastic gradient descent (SGD) which adds isotropic noise to iterates, showing that it produces an ε-second-order critical point with high probability in O(poly(d)/ε⁴) stochastic gradient queries. In 2017, Jin et al. [17] presented a variant of GD, perturbed gradient descent (PGD), which reduces this complexity to O((log d)⁴/ε²) full gradient queries. Recently, Jin et al. [18] simplified their own analysis of PGD, and extended it to stochastic gradient descent.

Jin et al.'s PGD [18, Alg. 4] works as follows. If the gradient is large at iterate x_t, ‖∇f(x_t)‖ > ε, then perform a gradient descent step: x_{t+1} = x_t − η ∇f(x_t). If the gradient is small at iterate x_t, ‖∇f(x_t)‖ ≤ ε, perturb x_t by ξ, with ξ sampled uniformly from a ball of fixed radius centered at zero. Starting from this new point x_t + ξ, perform 𝒯 gradient descent steps, arriving at iterate x_{t+𝒯}. From here, repeat this procedure starting at x_{t+𝒯}. Crucially, Jin et al. [18] show that, if x_t is not an ε-second-order critical point, then the function decreases enough from x_t to x_{t+𝒯} with high probability, leading to an escape.

In this paper we generalize PGD to optimization problems on manifolds, i.e., problems of the form

  min_{x ∈ M} f(x)   (3)

where M is an arbitrary Riemannian manifold and f: M → ℝ is sufficiently smooth [3]. Optimization on manifolds notably occurs in machine learning (e.g., PCA [35], low-rank matrix completion [12]), computer vision (e.g., [32]) and signal processing (e.g., [2]); see [4] for more. See [29] and [26] for examples of the strict saddle property on manifolds.

Given x ∈ M, the (Riemannian) gradient of f at x, grad f(x), is a vector in the tangent space at x, T_x M. To perform gradient descent on a manifold, we need a way to move on the manifold along the direction of the gradient at x. This is provided by a retraction Retr_x: a smooth map from T_x M to M. Riemannian gradient descent (RGD) performs steps on M of the form

  x_{t+1} = Retr_{x_t}(−η grad f(x_t)).   (4)

For Euclidean space, M = ℝ^d, the standard retraction is Retr_x(s) = x + s, in which case (4) reduces to (2). For the sphere embedded in Euclidean space, M = S^d ⊂ ℝ^{d+1}, a natural retraction is given by metric projection to the sphere: Retr_x(s) = (x + s)/‖x + s‖.

For x ∈ M, define the pullback f̂_x = f ∘ Retr_x: T_x M → ℝ, conveniently defined on a linear space. If Retr is nice enough (details below), the Riemannian gradient and Hessian of f at x equal the (classical) gradient and Hessian of f̂_x at the origin of T_x M. Since T_x M is a vector space, if we perform GD on f̂_x, we can almost directly apply Jin et al.'s analysis [18]. This motivates the two-phase structure of our perturbed Riemannian gradient descent (PRGD), listed as Algorithm 1.

Our PRGD is a variant of RGD (4) and a generalization of PGD. It works as follows. If the gradient is large at iterate x_t ∈ M, ‖grad f(x_t)‖ > ε, perform an RGD step: x_{t+1} = Retr_{x_t}(−η grad f(x_t)). We call this a step on the manifold. If the gradient at iterate x_t is small, ‖grad f(x_t)‖ ≤ ε, then perturb in the tangent space T_{x_t} M. After this perturbation, execute at most 𝒯 gradient descent steps on the pullback f̂_{x_t}, in the tangent space. We call these tangent space steps. We denote this sequence of 𝒯 tangent space steps by {s_j}_{j≥0}. This sequence of steps is performed by TANGENTSPACESTEPS: a deterministic, vector-space procedure; see Algorithm 1.
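As a concrete illustration of update (4), the following minimal sketch performs Riemannian gradient descent on the unit sphere with the metric-projection retraction described above, for a Rayleigh-quotient-type cost. It assumes NumPy; the helper names (`retract`, `riemannian_grad`, `rgd`) and the specific cost are ours for illustration and are not part of the paper.

```python
import numpy as np

def retract(x, s):
    """Metric-projection retraction on the unit sphere: Retr_x(s) = (x + s)/||x + s||."""
    v = x + s
    return v / np.linalg.norm(v)

def riemannian_grad(x, A):
    """Riemannian gradient of f(x) = -0.5 * x^T A x on the sphere:
    project the ambient (Euclidean) gradient onto the tangent space T_x S^d."""
    egrad = -A @ x                    # Euclidean gradient of the ambient extension
    return egrad - (x @ egrad) * x    # (I - x x^T) egrad

def rgd(A, x0, eta=0.1, num_iters=500):
    """Riemannian gradient descent, x_{t+1} = Retr_{x_t}(-eta * grad f(x_t))."""
    x = x0 / np.linalg.norm(x0)
    for _ in range(num_iters):
        x = retract(x, -eta * riemannian_grad(x, A))
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5)); A = (A + A.T) / 2   # symmetric matrix
x = rgd(A, rng.standard_normal(5))
print(x @ A @ x, np.linalg.eigvalsh(A).max())        # Rayleigh quotient vs. top eigenvalue
```

For a Riemannian submanifold with the induced metric, the Riemannian gradient is the tangent-space projection of the Euclidean gradient of any smooth extension of f, which is what `riemannian_grad` computes here.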
By distinguishing between gradient descent steps on the manifold and those in a tangent space, we can apply Jin et al.'s analysis almost directly [18], allowing us to prove that PRGD reaches an ε-second-order critical point on M in O((log d)⁴/ε²) gradient queries. Regarding regularity of f, we require its pullbacks to satisfy Lipschitz-type conditions, as advocated in [11, 7]. The analysis is far less technical than if one runs all steps on the manifold. We expect that this two-phase approach may prove useful for the generalization of other algorithms and analyses from the Euclidean to the Riemannian realm.

Recently, Sun and Fazel [30] provided the first generalization of PGD to certain manifolds with a polylogarithmic complexity in the dimension, improving earlier results by Ge et al. [16, App. B] which had a polynomial complexity. Both of these works focus on submanifolds of a Euclidean space, with the algorithm in [30] depending on the equality constraints chosen to describe this submanifold. At the same time as the present paper, Sun et al. [31] improved their analysis to cover any complete Riemannian manifold with bounded sectional curvature. In contrast to ours, their algorithm executes all steps on the manifold. Their analysis requires the retraction to be the Riemannian exponential map (i.e., geodesics). Our regularity assumptions are similar but different: while we assume Lipschitz-type conditions on the pullbacks in small balls around the origins of tangent spaces, Sun et al. make Lipschitz assumptions on the cost function directly, using parallel transport and Riemannian distance. As a result, curvature appears in their results. We make no explicit assumptions on M regarding curvature or completeness, though these may be implicit in our regularity assumptions: see Section 4.

Algorithm 1 PRGD(x_0, η, r, 𝒯, ε, T, b)
1:  t ← 0
2:  while t < T do
3:    if ‖grad f(x_t)‖ > ε then
4:      x_{t+1} ← TANGENTSPACESTEPS(x_t, 0, η, b, 1)    ▷ Riemannian gradient descent step
5:      t ← t + 1
6:    else
7:      ξ ~ Uniform(B_{x_t,r}(0))    ▷ perturb
8:      s_0 = ξ
9:      x_{t+𝒯} ← TANGENTSPACESTEPS(x_t, s_0, η, b, 𝒯)    ▷ perform 𝒯 steps in T_{x_t} M
10:     t ← t + 𝒯
11:   end if
12: end while
13:
14: procedure TANGENTSPACESTEPS(x, s_0, η, b, 𝒯)
15:   for j = 0, 1, . . . , 𝒯 − 1 do
16:     s_{j+1} ← s_j − η ∇f̂_x(s_j)
17:     if ‖s_{j+1}‖ ≥ b then    ▷ if the iterate leaves the interior of the ball B_{x,b}(0)
18:       s_𝒯 ← s_j − αη ∇f̂_x(s_j), where α ∈ (0, 1] is such that ‖s_j − αη ∇f̂_x(s_j)‖ = b
19:       break
20:     end if
21:   end for
22:   return Retr_x(s_𝒯)
23: end procedure

1.1 Main result

Here we state our result informally. Formal results are stated in subsequent sections.

Theorem 1.1 (Informal). Let M be a Riemannian manifold of dimension d equipped with a retraction Retr. Assume f: M → ℝ is twice continuously differentiable, and furthermore:

A1. f is lower bounded.
A2. The gradients of the pullbacks f ∘ Retr_x uniformly satisfy a Lipschitz-type condition.
A3. The Hessians of the pullbacks f ∘ Retr_x uniformly satisfy a Lipschitz-type condition.
A4. The retraction Retr uniformly satisfies a second-order condition.

Then, setting T = O((log d)⁴/ε²), PRGD visits several points with gradient smaller than ε and, with high probability, at least two-thirds of those points are ε-second-order critical (Definition 3.1). PRGD uses O((log d)⁴/ε²) gradient queries, and crucially no Hessian queries. The algorithm requires knowledge of the Lipschitz constants defined below, which makes this a mostly theoretical algorithm; but see Appendix D for explicit constants in the case of PCA.
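The following minimal Python sketch mirrors the structure of Algorithm 1, assuming the user supplies the manifold-specific pieces (`grad_f`, `pullback_grad`, `retract`, `sample_tangent_ball`) as callables; these names and the treatment of parameters are ours for illustration, and the parameters η, r, 𝒯, T, b must be balanced as prescribed in Section 3.2 for the guarantees to hold.

```python
import numpy as np

def tangent_space_steps(x, s0, eta, b, num_steps, pullback_grad, retract):
    """Run (at most) num_steps of gradient descent on the pullback f_hat_x = f o Retr_x,
    staying inside the ball of radius b in T_x M, then retract back to the manifold."""
    s = s0
    for _ in range(num_steps):
        g = pullback_grad(x, s)
        s_next = s - eta * g
        if np.linalg.norm(s_next) >= b:
            # Clip to the boundary: find t = alpha*eta > 0 with ||s - t*g|| = b
            # by taking the positive root of ||s - t*g||^2 = b^2 (a quadratic in t).
            a, c, d = g @ g, -2 * (s @ g), s @ s - b**2
            t = (-c + np.sqrt(c**2 - 4 * a * d)) / (2 * a)
            s = s - t * g
            break
        s = s_next
    return retract(x, s)

def prgd(x0, eta, r, script_T, eps, T, b,
         grad_f, pullback_grad, retract, sample_tangent_ball):
    """Skeleton of PRGD (Algorithm 1): RGD steps while the gradient is large,
    perturbation plus tangent-space steps when it is small."""
    x, t = x0, 0
    while t < T:
        if np.linalg.norm(grad_f(x)) > eps:
            x = tangent_space_steps(x, np.zeros_like(grad_f(x)), eta, b, 1,
                                    pullback_grad, retract)   # one RGD step
            t += 1
        else:
            s0 = sample_tangent_ball(x, r)                     # perturbation in T_x M
            x = tangent_space_steps(x, s0, eta, b, script_T,
                                    pullback_grad, retract)
            t += script_T
    return x
```

Plugged into sphere helpers like those in the previous sketch (with `pullback_grad` computed, for instance, by differentiating or finite-differencing f ∘ Retr_x), this reproduces the two-phase behavior described above.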
1.2 Other related work

Algorithms which efficiently escape saddle points can be classified into two families: first-order and second-order methods. First-order methods only use function value and gradient information. SGD and PGD are first-order methods. Second-order methods also access Hessian information. Newton's method, trust regions [24, 11] and adaptive cubic regularization [23, 7, 34] are second-order methods.

As noted above, Ge et al. [16] and Jin et al. [17] escape saddle points (in Euclidean space) by exploiting noise in iterations. There has also been similar work for normalized gradient descent [20]. Expanding on [17], Jin et al. [19] give an accelerated PGD algorithm (PAGD) which reaches an ε-second-order critical point of a non-convex function f with high probability in O((log d)⁶/ε^{7/4}) iterations. In [18], Jin et al. show that a stochastic version of PGD reaches an ε-second-order critical point in O(d/ε⁴) stochastic gradient queries; only O(poly(log d)/ε⁴) queries are needed if the stochastic gradients are well behaved. For an analysis of PGD under convex constraints, see [22].

There is another line of research, inspired by Langevin dynamics, in which judiciously scaled Gaussian noise is added at every iteration. We note that although this differs from the first incarnation of PGD in [17], it resembles a simplified version of PGD in [18]. Sang and Liu [27] develop an algorithm (adaptive stochastic gradient Langevin dynamics, ASGLD) which provably reaches an ε-second-order critical point in O(log d/ε⁴) queries with high probability. With full gradients, ASGLD reaches an ε-second-order critical point in O(log d/ε²) queries with high probability.

One might hope that the noise inherent in vanilla SGD would help it escape saddle points without noise injection. Daneshmand et al. [13] propose the correlated negative curvature assumption (CNC), under which they prove that SGD reaches an ε-second-order critical point in O(ε⁻⁵) queries with high probability. They also show that, under the CNC assumption, a variant of GD (in which iterates are perturbed only by SGD steps) efficiently escapes saddle points. Importantly, these guarantees are completely dimension-free.

A first-order method can include approximations of the Hessian (e.g., with a difference of gradients). For example, Allen-Zhu's Natasha 2 algorithm [8] uses first-order information (function value and stochastic gradients) to search for directions of negative curvature of the Hessian. Natasha 2 reaches an ε-second-order critical point in O(ε^{−13/4}) iterations.

Many classical optimization algorithms have been generalized to optimization on manifolds, including gradient descent, Newton's method, trust regions and adaptive cubic regularization [15, 3, 1, 6, 11, 7, 9, 34]. Bonnabel [10] extends stochastic gradient descent to Riemannian manifolds and proves that Riemannian SGD converges to critical points of the cost function. Zhang et al. [33] and Sato et al. [28] both use variance reduction to speed up SGD on Riemannian manifolds.

2 Preliminaries: Optimization on manifolds

We review the key definitions and tools for optimization on manifolds. For more information, see [3]. Let M be a d-dimensional Riemannian manifold: a real, smooth d-manifold equipped with a Riemannian metric. We associate with each x ∈ M a d-dimensional real vector space T_x M, called the tangent space at x. For embedded submanifolds of ℝ^n, we often visualize the tangent space as being tangent to the manifold at x.
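For intuition, here is a tiny sketch (NumPy; the helper name is ours) of the tangent space of the unit sphere S^d ⊂ ℝ^{d+1} viewed as an embedded submanifold: T_x S^d = {s ∈ ℝ^{d+1} : ⟨x, s⟩ = 0}, with orthogonal projector P_x = I − x xᵀ onto it.

```python
import numpy as np

def project_to_tangent(x, v):
    """Orthogonal projection of an ambient vector v onto T_x S^d = {s : <x, s> = 0}."""
    return v - (x @ v) * x            # (I - x x^T) v

rng = np.random.default_rng(1)
x = rng.standard_normal(4); x /= np.linalg.norm(x)   # a point on S^3
v = rng.standard_normal(4)                           # an arbitrary ambient vector
s = project_to_tangent(x, v)
print(abs(x @ s) < 1e-12)   # True: s is a tangent vector at x
```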
The Riemannian metric defines an inner product ⟨·, ·⟩_x on the tangent space T_x M, with associated norm ‖·‖_x. We denote these by ⟨·, ·⟩ and ‖·‖ when x is clear from context. A vector in the tangent space is a tangent vector. The set of pairs (x, s_x) for x ∈ M, s_x ∈ T_x M is called the tangent bundle TM. Define B_{x,r}(s) = {s̄ ∈ T_x M : ‖s̄ − s‖_x ≤ r}: the closed ball of radius r centered at s ∈ T_x M. We occasionally denote B_{x,r}(s) by B_r(s) when x is clear from context. Let Uniform(B_{x,r}(s)) denote the uniform distribution over the ball B_{x,r}(s).

The Riemannian gradient grad f(x) of a differentiable function f at x ∈ M is the unique vector in T_x M satisfying

  Df(x)[s] = ⟨grad f(x), s⟩_x   for all s ∈ T_x M,

where Df(x)[s] is the directional derivative of f at x along s. The Riemannian metric gives rise to a well-defined notion of derivative of vector fields called the Riemannian (or Levi-Civita) connection ∇. The Hessian of f is the derivative of the gradient vector field: Hess f(x)[u] = ∇_u grad f(x). The Hessian describes how the gradient changes. Hess f(x) is a symmetric linear operator on T_x M. If the manifold is a Euclidean space, M = ℝ^d, with the standard metric ⟨x, y⟩ = xᵀy, the Riemannian gradient grad f and Hessian Hess f coincide with the standard gradient ∇f and Hessian ∇²f (mind the overloaded notation ∇).

As discussed in Section 1, the retraction is a mapping which allows us to move along the manifold from a point x in the direction of a tangent vector s ∈ T_x M. Formally:

Definition 2.1 (Retraction, from [3]). A retraction on a manifold M is a smooth mapping Retr from the tangent bundle TM to M satisfying properties 1 and 2 below. Let Retr_x: T_x M → M denote the restriction of Retr to T_x M.
1. Retr_x(0_x) = x, where 0_x is the zero vector in T_x M.
2. The differential of Retr_x at 0_x, DRetr_x(0_x), is the identity map.
(Our algorithm and theory only require Retr to be defined in balls of a fixed radius around the origins of tangent spaces.)

Recall these special retractions, which are good to keep in mind for intuition: on M = ℝ^d, we typically use Retr_x(s) = x + s, and on the unit sphere we typically use Retr_x(s) = (x + s)/‖x + s‖.

For x in M, define the pullback of f from the manifold to the tangent space by f̂_x = f ∘ Retr_x: T_x M → ℝ. This is a real function on a vector space. Furthermore, for x ∈ M and s ∈ T_x M, let T_{x,s} = DRetr_x(s): T_x M → T_{Retr_x(s)} M denote the differential of Retr_x at s (a linear operator). The gradient and Hessian of the pullback admit the following nice expressions in terms of those of f and the retraction.

Lemma 2.2 (Lemma 5.2 of [7]). For f: M → ℝ twice continuously differentiable, x ∈ M and s ∈ T_x M, with T*_{x,s} denoting the adjoint of T_{x,s},

  ∇f̂_x(s) = T*_{x,s} grad f(Retr_x(s)),   ∇²f̂_x(s) = T*_{x,s} ∘ Hess f(Retr_x(s)) ∘ T_{x,s} + W_s,   (5)

where W_s is a symmetric linear operator on T_x M defined through polarization by

  ⟨W_s[ṡ], ṡ⟩ = ⟨grad f(Retr_x(s)), γ''(0)⟩,   (6)

with γ''(0) ∈ T_{Retr_x(s)} M the intrinsic acceleration on M of γ(t) = Retr_x(s + tṡ) at t = 0.

The velocity of a curve γ: ℝ → M is dγ/dt = γ'(t). The intrinsic acceleration γ'' of γ is the covariant derivative (induced by the Levi-Civita connection) of the velocity of γ: γ'' = (D/dt) γ'. When M is a Riemannian submanifold of ℝ^n, γ''(t) does not necessarily coincide with d²γ/dt²: in this case, γ''(t) is the orthogonal projection of d²γ/dt² onto T_{γ(t)} M.
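To make Lemma 2.2 tangible at s = 0: since DRetr_x(0_x) is the identity (Definition 2.1), ∇f̂_x(0) = grad f(x), and for a second-order retraction W_0 = 0, so that ∇²f̂_x(0) = Hess f(x). The sketch below (NumPy; helper names ours, not from the paper) checks the gradient identity numerically on the sphere with the metric-projection retraction, comparing finite differences of the pullback at the origin with the projected Euclidean gradient.

```python
import numpy as np

def retract(x, s):
    """Metric-projection retraction on the unit sphere."""
    v = x + s
    return v / np.linalg.norm(v)

rng = np.random.default_rng(2)
n = 5
A = rng.standard_normal((n, n)); A = (A + A.T) / 2
f = lambda y: -0.5 * y @ A @ y                       # cost restricted to the sphere

x = rng.standard_normal(n); x /= np.linalg.norm(x)   # a point on S^{n-1}
P = np.eye(n) - np.outer(x, x)                       # orthogonal projector onto T_x S^{n-1}
rgrad = P @ (-A @ x)                                 # Riemannian gradient grad f(x)

# Central finite differences of the pullback g(s) = f(Retr_x(s)) around s = 0,
# taken along the ambient standard basis; since DRetr_x(0) is the tangent projector,
# this recovers grad f(x), illustrating the first identity of Lemma 2.2 at s = 0.
h = 1e-6
fd_grad = np.array([(f(retract(x, h * e)) - f(retract(x, -h * e))) / (2 * h)
                    for e in np.eye(n)])
print(np.allclose(fd_grad, rgrad, atol=1e-5))        # expected: True
```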
3 PRGD efficiently escapes saddle points

We now precisely state the assumptions, the main result, and some important parts of the proof of the main result, including the main obstacles faced in generalizing PGD to manifolds. A full proof of all results is provided in the appendix.

3.1 Assumptions

The first assumption, namely, that f is lower bounded, ensures that there are points on the manifold where the gradient is arbitrarily small.

Assumption 1. f is lower bounded: f(x) ≥ f* for all x ∈ M.

Generalizing from the Euclidean case, we assume Lipschitz-type conditions on the gradients and Hessians of the pullbacks f̂_x = f ∘ Retr_x. For the special case of M = ℝ^d and Retr_x(s) = x + s, these assumptions hold if the gradient ∇f(·) and Hessian ∇²f(·) are each Lipschitz continuous, as in [18, A1] (with the same constants). The Lipschitz-type assumptions below are similar to assumption A2 of [7]. Notice that these assumptions involve both the cost function and the retraction: this dependency is further discussed in [11, 7] for a similar setting.

Assumption 2. There exist b_1 > 0 and L > 0 such that for all x ∈ M and all s ∈ T_x M with ‖s‖ ≤ b_1,

  ‖∇f̂_x(s) − ∇f̂_x(0)‖ ≤ L ‖s‖.

Assumption 3. There exist b_2 > 0 and ρ > 0 such that for all x ∈ M and all s ∈ T_x M with ‖s‖ ≤ b_2,

  ‖∇²f̂_x(s) − ∇²f̂_x(0)‖ ≤ ρ ‖s‖,

where on the left-hand side we use the operator norm. More precisely, we only need these assumptions to hold at the iterates x_0, x_1, . . . Let b = min{b_1, b_2} (to reduce the number of parameters in Algorithm 1).

The next assumption requires the chosen retraction to be well behaved, in the sense that the (intrinsic) acceleration of curves γ_{x,s} on the manifold, defined below, must remain bounded; compare with Lemma 2.2.

Assumption 4. There exists β ≥ 0 such that, for all x ∈ M and s ∈ T_x M satisfying ‖s‖ = 1, the curve γ_{x,s}(t) = Retr_x(ts) has initial acceleration bounded by β: ‖γ''_{x,s}(0)‖ ≤ β.

If Assumption 4 holds with β = 0, Retr is said to be second order [3, p. 107]. Second-order retractions include the so-called exponential map and the standard retractions on ℝ^d and the unit sphere mentioned earlier; see [5] for a large class of such retractions on relevant manifolds.

Definition 3.1. A point x ∈ M is an ε-second-order critical point of the twice-differentiable function f: M → ℝ satisfying Assumption 3 if

  ‖grad f(x)‖ ≤ ε  and  λ_min(Hess f(x)) ≥ −√(ρε),   (7)

where λ_min(H) denotes the smallest eigenvalue of the symmetric operator H.

For compact manifolds, all of these assumptions hold (all proofs are in the appendix):

Lemma 3.2. Let M be a compact Riemannian manifold equipped with a retraction Retr. Assume f: M → ℝ is three times continuously differentiable. Pick an arbitrary b > 0. Then, there exist f*, L > 0, ρ > 0 and β ≥ 0 such that Assumptions 1, 2, 3 and 4 are satisfied.

3.2 Main results

Recall that PRGD (Algorithm 1) works as follows. If ‖grad f(x_t)‖ > ε, perform a Riemannian gradient descent step, x_{t+1} = Retr_{x_t}(−η grad f(x_t)). If ‖grad f(x_t)‖ ≤ ε, then perturb, i.e., sample ξ ~ Uniform(B_{x_t,r}(0)) and let s_0 = ξ. After this perturbation, remain in the tangent space T_{x_t} M and do (at most) 𝒯 gradient descent steps on the pullback f̂_{x_t}, starting from s_0. We denote this sequence of 𝒯 tangent space steps by {s_j}_{j≥0}. This sequence of gradient descent steps is performed by TANGENTSPACESTEPS: a deterministic procedure in the (linear) tangent space.

One difficulty with this approach is that, under our assumptions, for some x = x_t, ∇f̂_x may not be Lipschitz continuous in all of T_x M.
However, it is easy to show that ∇f̂_x is Lipschitz continuous in the ball of radius b, uniformly in x: see Lemma 3.3 below. This is why we limit our algorithm to these balls. If the sequence of iterates {s_j}_{j≥0} escapes the ball B_{x,b}(0) ⊂ T_x M for some s_j, TANGENTSPACESTEPS returns the point between s_{j−1} and s_j on the boundary of that ball.

Following [18], we use a set of carefully balanced parameters. Parameters ε and δ are user defined. The claim in Theorem 3.4 below holds with probability at least 1 − δ. Assumption 1 provides parameter f*. Assumptions 2 and 3 provide parameters L, ρ and b = min{b_1, b_2}. As announced, the latter two assumptions further ensure Lipschitz continuity of the gradients of the pullbacks in balls of the tangent spaces, uniformly: this defines the parameter ℓ, as prescribed below.

Lemma 3.3. Under Assumptions 2 and 3, there exists ℓ ∈ [L, L + ρb] such that, for all x ∈ M, the gradient of the pullback, ∇f̂_x, is ℓ-Lipschitz continuous in the ball B_{x,b}(0).

Then, choose χ > 1/4 (preferably small) exceeding a quantity logarithmic in d(f(x_0) − f*) and the other problem parameters ℓ, ρ, ε, δ, and set the algorithm parameters

  η = 1/ℓ,   r = ε/(400 χ³),   𝒯 = χ ℓ/√(ρε),   (9)

where χ is such that 𝒯 is an integer. We also use additional notation in the proofs, notably a quantity ℱ defined in (10).

Theorem 3.4. Assume f satisfies Assumptions 1, 2 and 3. For any x_0 ∈ M, with ε > 0 sufficiently small with respect to b_2, L, ρ and f(x_0) − f*, and δ ∈ (0, 1), choose η, r, 𝒯 as in (9). Then, setting the budget T as in (11), namely proportional to the larger of (f(x_0) − f*) 𝒯/ℱ and (f(x_0) − f*)/(ηε²), PRGD(x_0, η, r, 𝒯, ε, T, b) visits at least two iterates x_t ∈ M satisfying ‖grad f(x_t)‖ ≤ ε. With probability at least 1 − δ, at least two-thirds of those iterates satisfy ‖grad f(x_t)‖ ≤ ε and λ_min(∇²f̂_{x_t}(0)) ≥ −√(ρε). The algorithm uses at most T + 𝒯 ≤ 2T gradient queries (and no function or Hessian queries).

By Assumption 4 and Lemma 2.2, ∇²f̂_{x_t}(0) is close to Hess f(x_t), which allows us to conclude:

Corollary 3.5. Assume f satisfies Assumptions 1, 2, 3 and 4. For an arbitrary x_0 ∈ M, with ε > 0 sufficiently small with respect to β, b_2, L, ρ and f(x_0) − f*, and δ ∈ (0, 1), choose η, r, 𝒯 as in (9). Then, setting T as in (11), PRGD(x_0, η, r, 𝒯, ε, T, b) visits at least two iterates x_t ∈ M satisfying ‖grad f(x_t)‖ ≤ ε. With probability at least 1 − δ, at least two-thirds of those iterates are 4ε-second-order critical points. If β = 0 (that is, the retraction is second order), then the same claim holds for ε-second-order critical points instead of 4ε. The algorithm uses at most T + 𝒯 ≤ 2T gradient queries.

Assume M = ℝ^d with the standard inner product and the standard retraction Retr_x(s) = x + s. As in [18], assume f: ℝ^d → ℝ is lower bounded, ∇f is L-Lipschitz in ℝ^d, and ∇²f is ρ-Lipschitz in ℝ^d. Then, Assumptions 1, 2 and 3 hold with b = +∞. Furthermore, Assumption 4 holds with β = 0, so that ∇²f̂_x(0) = Hess f(x) = ∇²f(x) (Lemma 2.2). For all x ∈ M, ∇f̂_x has Lipschitz constant ℓ = L since f̂_x(s) = f(x + s). Therefore, using b = +∞, ℓ = L and choosing η, r, 𝒯 as in (9), PRGD reduces to PGD, and Theorem 3.4 recovers the result of Jin et al. [18]: this confirms that the present result is a bona fide generalization.

For the important special case of compact manifolds, Lemmas 3.2 and 3.3 yield:

Corollary 3.6. Assume M is a compact Riemannian manifold equipped with a retraction Retr, and f: M → ℝ is three times continuously differentiable. Pick an arbitrary b > 0. Then, Assumptions 1, 2, 3 and 4 hold for some L > 0, ρ > 0, β ≥ 0, so that Corollary 3.5 applies with some ℓ ∈ [L, L + ρb].

Remark 3.7. PRGD, like PGD (Algorithm 4 in [18]), does not specify which iterate is an ε-second-order critical point.
However, it is straightforward to include a termination condition in PRGD which halts the algorithm and returns a suspected ε-second-order critical point. Indeed, Jin et al. include such a termination condition in their original PGD algorithm [17], which here would go as follows: after performing a perturbation and 𝒯 (tangent space) steps in T_{x_t} M, return x_t if f̂_{x_t}(s_𝒯) − f̂_{x_t}(0) > −f_thres, i.e., if the function value does not decrease enough. The termination condition requires a threshold f_thres which is balanced like the other parameters of PRGD in (9).

3.3 Main proof ideas

Theorem 3.4 follows from the two lemmas below, which we prove in the appendix. These lemmas state that, in each round of the while-loop in PRGD, if x_t is not an ε-second-order critical point, PRGD makes progress, that is, decreases the cost function value (the first lemma is deterministic, the second one is probabilistic). Yet, the value of f on the iterates can only decrease so much because f is bounded below by f*. Therefore, the probability that PRGD does not visit an ε-second-order critical point is low.

Lemma 3.8. Under Assumptions 2 and 3, set η = 1/ℓ for some ℓ ≥ L. If x ∈ M satisfies ‖grad f(x)‖ > ε, with ε sufficiently small with respect to b_2, L and ρ, then

  f(TANGENTSPACESTEPS(x, 0, η, b, 1)) − f(x) ≤ −ηε²/2.

Lemma 3.9. Under Assumptions 2 and 3, let x ∈ M satisfy both ‖grad f(x)‖ ≤ ε and λ_min(∇²f̂_x(0)) ≤ −√(ρε), with ε sufficiently small with respect to b_2, L and ρ. Set η, r, 𝒯, ℱ as in (9) and (10). Let s_0 = ξ with ξ ~ Uniform(B_{x,r}(0)). Then, with high probability (the failure probability is proportional to √d and exponentially small in χ),

  f(TANGENTSPACESTEPS(x, s_0, η, b, 𝒯)) − f(x) ≤ −ℱ/2.

Lemma 3.8 states that we are guaranteed to make progress if the gradient is large. This follows from the sufficient decrease of RGD steps. Lemma 3.9 states that, with perturbation, GD on the pullback escapes a saddle point with high probability. Lemma 3.9 is analogous to Lemma 11 in [18]. Let X_stuck be the set of tangent vectors s_0 in B_{x,r}(0) for which GD on the pullback starting from s_0 does not escape the saddle point, i.e., the function value does not decrease enough after 𝒯 iterations. Following Jin et al.'s analysis [18], we bound the width of this stuck region (in the direction of the eigenvector e_1 associated with the minimum eigenvalue of the Hessian of the pullback, ∇²f̂_x(0)). Like Jin et al., we do this with a coupling argument, showing that given two GD sequences with starting points sufficiently far apart, one of these sequences must escape. This is formalized in Lemma C.4 of the appendix. A crucial observation to prove Lemma C.4 is that, if the function value of GD iterates does not decrease much, then these iterates must be localized; this is formalized in Lemma C.3 of the appendix, which Jin et al. call "improve or localize".

We stress that the stuck-region concept, the coupling argument, the improve-or-localize paradigm, and the details of the analysis are due to Jin et al. [18]: our main contribution is to show a clean way to generalize the algorithm to manifolds in such a way that the analysis extends with little friction. We believe that the general idea of separating iterations between the manifold and the tangent spaces to achieve different objectives may prove useful to generalize other algorithms as well.

4 About the role of curvature of the manifold

As pointed out in the introduction, concurrently with our work, Sun et al. [31] have proposed another generalization of PGD to manifolds. Their algorithm executes all steps on the manifold directly (as opposed to our own, which makes certain steps in the tangent spaces), and moves around the manifold using the exponential map.
To carry out their analysis, Sun et al. assume f is regular in the following way. The Riemannian gradient is Lipschitz continuous in a Riemannian sense, namely,

  ‖grad f(y) − Γ_x^y grad f(x)‖ ≤ L dist(x, y)   for all x, y ∈ M,

where Γ_x^y: T_x M → T_y M denotes parallel transport from x to y along any minimizing geodesic, and dist is the Riemannian distance. These notions are well defined if M is a connected, complete manifold. Similarly, they assume the Riemannian Hessian of f is Lipschitz continuous in a Riemannian sense:

  ‖Hess f(y) − Γ_x^y ∘ Hess f(x) ∘ Γ_y^x‖ ≤ ρ dist(x, y)   for all x, y ∈ M,

in the operator norm. Using (and improving) sophisticated inequalities from Riemannian geometry, they map the perturbed sequences back to tangent spaces for analysis, where they run an adapted version of Jin et al.'s argument. In so doing, it appears to be crucial to use the exponential map, owing to its favorable interplay with parallel transport along geodesics and Riemannian distance, providing a good fit with the regularity conditions above. As they map sequences back from the manifold to a common tangent space through the inverse of the exponential map, the Riemannian curvature of the manifold comes into play. Consequently, they assume M has bounded sectional curvature (both from below and from above), and these bounds on curvature come up in their final complexity result: constants degrade if the manifold is more curved.

Since Riemannian curvature does not occur in our own complexity result for PRGD, it is legitimate to ask: is curvature supposed to occur? If so, it must be hidden in our analysis, for example in the regularity assumptions we make, which are expressed in terms of pullbacks rather than with parallel transports. And indeed, in several attempts to deduce our own assumptions from those of Sun et al., invariably, we had to degrade L and ρ as a function of curvature (minding that these are only bounds). On the other hand, under the assumptions of Sun et al., one can deduce that the regularity assumptions required in [11, 7] for the analysis of Riemannian gradient descent, trust regions and adaptive regularization by cubics hold with the exponential map, leading to curvature-free complexity bounds for all three algorithms. Thus, it is not clear that curvature should occur. We believe this poses an interesting question regarding the complexity of optimization on manifolds: to what extent should it be influenced by curvature of the manifold? We intend to study this.

5 Perspectives

To perform PGD (Algorithm 4 of [18]), one must know the step size η, perturbation radius r and the number of steps 𝒯 to perform after perturbation. These parameters are carefully balanced, and their values depend on the smoothness parameters L and ρ. In most situations, we do not know L or ρ (though see Appendix D for PCA). An algorithm which does not require knowledge of L or ρ but still has the same guarantees as PGD would be useful. However, that certain regularity parameters must be known seems inevitable, in particular for the Hessian's ρ. Indeed, the main theorems make statements about the spectrum of the Hessian, yet the algorithm is not allowed to query the Hessian. GD equipped with a backtracking line-search method achieves an ε-first-order critical point in O(ε⁻²) gradient queries without knowledge of the Lipschitz constant L. At each iterate x_t of GD, backtracking line-search essentially uses function and gradient queries to estimate the gradient Lipschitz parameter near x_t. Perhaps PGD can perform some kind of line-search to locally estimate L and ρ.
We note that if ρ is known and we use line-search-type methods to estimate L, there are still difficulties in applying Jin et al.'s coupling argument.

Jin et al. [18] develop a stochastic version of PGD known as PSGD. Instead of perturbing when the gradient is small and performing 𝒯 GD steps, PSGD simply performs a stochastic gradient step and a perturbation at each step. Distinguishing between manifold steps and tangent space steps, we suspect it is possible to develop a Riemannian version of perturbed stochastic gradient descent which achieves an ε-second-order critical point in O(d/ε⁴) stochastic gradient queries, like PSGD. This Riemannian version would perform a certain number of steps in the tangent space, like PRGD.

More broadly, we anticipate that it should be possible to extend several classical optimization methods from the Euclidean case to the Riemannian case through this approach of running many steps in a given tangent space before retracting. This ought to be particularly beneficial for algorithms whose computations or analysis rely intimately on linear structures, such as coordinate descent algorithms, certain parallelized schemes, and possibly also accelerated schemes. In preparing the final version of this paper, we found that this idea is also the subject of another paper at NeurIPS 2019, where it is called dynamic trivialization [21].

Acknowledgments

We thank Yue Sun, Nicolas Flammarion and Maryam Fazel, authors of [31], for numerous relevant discussions. NB is partially supported by NSF grant DMS-1719558.

References

[1] P.-A. Absil, C. G. Baker, and K. A. Gallivan. Trust-region methods on Riemannian manifolds. Foundations of Computational Mathematics, 7(3):303–330, 2007.
[2] P.-A. Absil and K. A. Gallivan. Joint diagonalization on the oblique manifold for independent component analysis. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 5:945–948, 2006.
[3] P.-A. Absil, R. Mahony, and R. Sepulchre. Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2008.
[4] P.-A. Absil, R. Mahony, and R. Sepulchre. Optimization on manifolds: Methods and applications. Recent Advances in Optimization and its Applications in Engineering, Springer, pages 125–144, 2010.
[5] P.-A. Absil and J. Malick. Projection-like retractions on matrix manifolds. SIAM Journal on Optimization, 22(1):135–158, 2012.
[6] R. Adler, J. Dedieu, J. Margulies, M. Martens, and M. Shub. Newton's method on Riemannian manifolds and a geometric model for the human spine. IMA Journal of Numerical Analysis, 22(3):359–390, 2002.
[7] N. Agarwal, N. Boumal, B. Bullins, and C. Cartis. Adaptive regularization with cubics on manifolds. arXiv preprint arXiv:1806.00065, 2019.
[8] Zeyuan Allen-Zhu. Natasha 2: Faster non-convex optimization than SGD. In Advances in Neural Information Processing Systems, pages 2675–2686, 2018.
[9] G. C. Bento, O. P. Ferreira, and J. G. Melo. Iteration-complexity of gradient, subgradient and proximal point methods on Riemannian manifolds. Journal of Optimization Theory and Applications, 173(2):548–562, 2017.
[10] S. Bonnabel. Stochastic gradient descent on Riemannian manifolds. IEEE Transactions on Automatic Control, 58(9):2217–2229, 2013.
[11] N. Boumal, P.-A. Absil, and C. Cartis. Global rates of convergence for nonconvex optimization on manifolds. IMA Journal of Numerical Analysis, 2018.
[12] Leopold Cambier and P.-A. Absil. Robust low-rank matrix completion by Riemannian optimization.
SIAM Journal on Scientific Computing, 38(5):S440–S460, 2016.
[13] Hadi Daneshmand, Jonas Kohler, Aurelien Lucchi, and Thomas Hofmann. Escaping saddles with stochastic gradients. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1155–1164, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. PMLR.
[14] Simon S. Du, Chi Jin, Jason D. Lee, Michael I. Jordan, Aarti Singh, and Barnabas Poczos. Gradient descent can take exponential time to escape saddle points. In Advances in Neural Information Processing Systems, pages 1067–1077, 2017.
[15] A. Edelman, T. A. Arias, and S. T. Smith. The geometry of algorithms with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications, 20(2):303–353, 1998.
[16] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points: online stochastic gradient for tensor decomposition. In Conference on Learning Theory, pages 797–842, 2015.
[17] Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, and Michael I. Jordan. How to escape saddle points efficiently. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 1724–1732. JMLR.org, 2017.
[18] Chi Jin, Praneeth Netrapalli, Rong Ge, Sham M. Kakade, and Michael I. Jordan. Stochastic gradient descent escapes saddle points efficiently. arXiv:1902.04811, 2019.
[19] Chi Jin, Praneeth Netrapalli, and Michael I. Jordan. Accelerated gradient descent escapes saddle points faster than gradient descent. In Sébastien Bubeck, Vianney Perchet, and Philippe Rigollet, editors, Proceedings of the 31st Conference On Learning Theory, volume 75 of Proceedings of Machine Learning Research, pages 1042–1085. PMLR, 06–09 Jul 2018.
[20] Kfir Y. Levy. The power of normalization: Faster evasion of saddle points. arXiv:1611.04831, 2016.
[21] Mario Lezcano Casado. Trivializations for gradient-based optimization on manifolds. In Proceedings of NeurIPS, 2019.
[22] Aryan Mokhtari, Asuman Ozdaglar, and Ali Jadbabaie. Escaping saddle points in constrained optimization. In Advances in Neural Information Processing Systems, pages 3629–3639, 2018.
[23] Y. Nesterov and B. T. Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006.
[24] J. Nocedal and S. Wright. Numerical Optimization. Springer Verlag, 1999.
[25] Razvan Pascanu, Yann N. Dauphin, Surya Ganguli, and Yoshua Bengio. On the saddle point problem for non-convex optimization. arXiv:1405.4604, 2014.
[26] T. Pumir, S. Jelassi, and N. Boumal. Smoothed analysis of the low-rank approach for smooth semidefinite programs. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 2283–2292. Curran Associates, Inc., 2018.
[27] Hejian Sang and Jia Liu. Adaptive stochastic gradient Langevin dynamics: Taming convergence and saddle point escape time. arXiv:1805.09416, 2018.
[28] Hiroyuki Sato, Hiroyuki Kasai, and Bamdev Mishra. Riemannian stochastic variance reduced gradient. SIAM Journal on Optimization, 29(2):1444–1472, 2017.
[29] Ju Sun, Qing Qu, and John Wright. Complete dictionary recovery over the sphere I: Overview and the geometric picture. IEEE Transactions on Information Theory, 63(2):853–884, 2016.
[30] Yue Sun and Maryam Fazel. Escaping saddle points efficiently in equality-constrained optimization problems. ICML, 2018.
[31] Yue Sun, Nicolas Flammarion, and Maryam Fazel. Escaping from saddle points on Riemannian manifolds. In Proceedings of NeurIPS, 2019.
[32] Pavan Turaga, Ashok Veeraraghavan, and Rama Chellappa. Statistical analysis on Stiefel and Grassmann manifolds with applications in computer vision. IEEE Conference on Computer Vision and Pattern Recognition, 2008.
[33] Hongyi Zhang, Sashank J. Reddi, and Suvrit Sra. Riemannian SVRG: Fast stochastic optimization on Riemannian manifolds. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 4592–4600. Curran Associates, Inc., 2016.
[34] J. Zhang and S. Zhang. A cubic regularized Newton's method over Riemannian manifolds. arXiv preprint arXiv:1805.05565, 2018.
[35] Teng Zhang and Yi Yang. Robust PCA by manifold optimization. Journal of Machine Learning Research, 19:1–39, 2018.