# ADAGRAD Avoids Saddle Points

Kimon Antonakopoulos 1 2, Panayotis Mertikopoulos 3 4, Georgios Piliouras 5, Xiao Wang 6

Abstract

Adaptive first-order methods in optimization are prominent in machine learning and data science owing to their ability to automatically adapt to the landscape of the function being optimized. However, their convergence guarantees are typically stated in terms of vanishing gradient norms, which leaves open the issue of converging to undesirable saddle points (or even local maximizers). In this paper, we focus on the ADAGRAD family of algorithms with scalar, diagonal or full-matrix preconditioning and we examine the question of whether the method's trajectories avoid saddle points. A major challenge that arises here is that ADAGRAD's step-size (or, more accurately, the method's preconditioner) evolves over time in a filtration-dependent way, i.e., as a function of all gradients observed in earlier iterations; as a result, avoidance results for methods with a constant or vanishing step-size do not apply. We resolve this challenge by combining a series of step-size stabilization arguments with a recursive representation of the ADAGRAD preconditioner that allows us to employ stable manifold techniques and ultimately show that the induced trajectories avoid saddle points from almost any initial condition.

Authors in alphabetical order. 1Laboratory for Information and Inference Systems, IEM, STI, EPFL, 1015 Lausanne, Switzerland. 2This work was done when KA was with Univ. Grenoble Alpes, CNRS, Inria, Grenoble INP, LIG, 38000 Grenoble, France. 3Univ. Grenoble Alpes, CNRS, Inria, Grenoble INP, LIG, 38000 Grenoble, France. 4Criteo AI Lab. 5Singapore University of Technology and Design. 6Shanghai University of Finance and Economics. Correspondence to: Xiao Wang. Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

1 Introduction

Deep learning architectures have brought forth a revolution in numerous application areas, from computer vision and recommender systems to speech recognition and natural language processing [6, 19]. Although gradient descent (and its variants) is the mainstay training tool for such models, it comes with a significant inherent drawback: the gradient steps taken at each iteration are essentially Markovian, in the sense that information gained about the model's loss landscape over time is not taken into account when performing an update. For this reason, adaptive gradient algorithms have emerged as an essential ingredient of contemporary machine learning models and architectures: by incorporating data and knowledge from gradients observed in earlier iterations, adaptive methods perform more informed gradient steps in later iterations, and they are able to adapt efficiently to the landscape of the function being optimized. The blueprint for most adaptive first-order methods, including ADAM [17], ADAMNC [32], AMSGRAD [32], and RMSPROP, is the ADAGRAD family of algorithms that was introduced concurrently by Duchi et al. [10] and McMahan & Streeter [24].
In the unconstrained case (which is the playground of choice for most adaptive methods of this type), the ADAGRAD algorithm proceeds as a gradient descent method with a matrix-valued step-size, typically referred to as a preconditioner, which is defined recursively by taking the square root of the sum of squares of past gradients (possibly tensored, depending on the specific variant of the method). Owing to this clever preconditioning mechanism, ADAGRAD excels in solving convex-structured problems with sparse gradients, while remaining competitive in environments with full (dense) gradients. Specifically, in convex problems with Lipschitz continuous objectives, ADAGRAD attains an $\mathcal{O}(1/\sqrt{T})$ value convergence rate after $T$ queries to a first-order oracle (stochastic or deterministic). This rate improves to $\mathcal{O}(\log T/T)$ if the problem's objective is strongly convex [10],¹ and to $\mathcal{O}(1/T)$ if the problem's objective is Lipschitz smooth [1, 2, 22].² In this regard, ADAGRAD is not order-optimal because it does not attain the iconic $\mathcal{O}(1/T^2)$ accelerated convergence rate of Nesterov [26]; however, other adaptive methods based on the ADAGRAD template do achieve this, like the ACCELEGRAD proposal of [22], or the more recent UNIXGRAD and UNDERGRAD algorithms by [16] and [4, 34] respectively.

In the non-convex world (which is of greater interest to deep learning applications), ADAGRAD has been shown to attain an $\mathcal{O}(1/T)$ convergence rate when the optimizer has access to a perfect, deterministic gradient oracle, and an $\mathcal{O}(1/\sqrt{T})$ convergence rate when only stochastic gradients are available [23, 35]. Importantly, the merit function in both cases is no longer the value of the objective function, but the sum of squared gradient norms. As a result, the above guarantees translate to a convergence rate for the method's best iterate, i.e., the queried point with the least (true) gradient norm.³ In this regard, ADAGRAD is an order-optimal method in the non-convex case, which, coupled with its simplicity and its ability to exploit sparse gradients, makes it an ideal choice for many problems with moderate-to-high dimensionality and a sparse solution structure. At the same time, it should be noted that the only guarantee provided in non-convex settings is that of a vanishing gradient, not a value minimization certificate (local or global). This leaves open not only the question of global versus local optimality, but an even more fundamental one:

Do the trajectories of ADAGRAD avoid saddle points?

This question is at the core of our paper; to put our contributions in the proper context, we begin by discussing the related work on the topic.

¹Or if the noise in the method's gradient oracle is a relative percentage of the gradient norm, cf. the recent paper [3].
²Importantly, the first of these rates continues to hold in the stochastic case, but the second does not.

Related work. The literature on saddle-point avoidance is quite extensive, so some general remarks are in order. First, to the best of our knowledge, existing results concern almost exclusively non-adaptive methods (we discuss the single exception that we are aware of below). These results can be subsequently classified into results for deterministic and stochastic gradient descent (depending on the oracle in question); perhaps surprisingly, these two branches of the literature do not intersect and, for reasons that we explain below, the insights and techniques cannot be ported from one regime to the other.
Historically, the first avoidance results were obtained in the stochastic setting in the early 1990s, with Pemantle [30] and Brandière & Duflo [7] being the first to establish the avoidance of hyperbolic saddle points for stochastic gradient descent methods.⁴ Specifically, they showed that the trajectories of any vanishing step-size stochastic approximation of a gradient flow avoid hyperbolic saddle points with probability 1, from any initial condition. More recently, and under a somewhat different set of assumptions for the problem's objective function, Ge et al. [12] showed that stochastic gradient descent (SGD) escapes strict saddle points (i.e., stationary points where the Hessian of the objective has at least one negative eigenvalue, but could also have zero eigenvalues), and produces iterates close to second-order optimal stationary points with high probability. Daneshmand et al. [8] further refined this result by obtaining positive probability results for second-order stationary points, while Mertikopoulos et al. [25] and Hsieh et al. [14] showed that strict saddles are avoided with probability 1, not only by SGD, but by any Robbins-Monro approximation of a gradient flow (including stochastic extra-gradient, optimistic gradient, and several other non-adaptive first-order methods). Finally, Staib et al. [33] examined a variant of SGD with adaptive preconditioning (including AMSGRAD and RMSPROP) and obtained a strict saddle escape result in the spirit of Ge et al. [12]; to the best of our knowledge, this is the only escape result for adaptive, stochastic methods, and we discuss it in detail later in the paper.

³This creates an issue in the stochastic case because it is not possible to identify the point with the least gradient norm when only a single run of random gradient observations is available. In our paper, we only treat the deterministic case, so this issue does not arise.
⁴A saddle point is hyperbolic if the Hessian of the objective function is invertible and has at least one negative eigenvalue at said point.

On the deterministic side of the literature, Lee et al. [20] showed that the trajectories of deterministic gradient descent avoid strict saddle points from any initial condition. In a concurrent paper, Panageas & Piliouras [27] established a more general version of this result concerning non-isolated (i.e., continua of) saddle points, including ridges, talwegs, or other manifolds of non-minimizing stationary points. These two approaches were shortly afterwards unified into a generalized methodology that was able to address numerous distinct first-order optimization methods, including gradient descent, block coordinate descent, mirror descent and variants thereof [21]. Since these first works, several extensions and refinements have appeared, regarding the rate of convergence to second-order optimal points [9, 15], zeroth-order methods [11], constrained distributed optimization [29] and, finally, gradient descent with a vanishing step-size [28].

The key difference between the two regimes, deterministic and stochastic, is that stochastic results invariably rely on a positive excitation assumption for the noise: not necessarily that it is isotropic, but that it excites all directions in space in a uniformly positive amount (for a precise definition, see [5, 30]).
In this regard, the noise in the optimizer's gradient queries can be viewed as providing a boost for escaping saddle points (though the extent of this boost can be relatively small in high-dimensional problems, a question that has been examined in depth in the relevant literature). By contrast, this stochastic boost is completely absent in the analysis of deterministic gradient descent schemes; as a result, the techniques employed in stochastic escape papers are likewise completely different from the techniques employed in the deterministic avoidance literature. More to the point, because the noise profile is always assumed to be persistent in the stochastic literature, results about stochastic methods do not imply anything for the deterministic case and, of course, likewise for the converse.

Comparison of techniques. We outline below the technical challenges of our approach and the techniques we employed relative to other works in the literature. All previous papers in the deterministic regime, with the single exception of [28], apply standard machinery from the dynamical systems literature, specifically standard variants of the celebrated stable manifold theorem [31]. This is a standard tool for analyzing the behavior of a dynamical system in the neighborhood of a stationary point; however, in its standard formulation, this theorem applies only to autonomous (i.e., time-independent) smooth maps or flows. As such, this framework turns out to be too restrictive for many interesting machine learning applications where the map producing the dynamics is itself evolving over time. Such non-autonomous dynamical systems require particular care and case-by-case analysis, as they are intractable in their full generality. As a first step in this direction, gradient descent with vanishing step-sizes was shown to provably avoid saddle points via a novel, tailored version of the stable manifold theorem [28].

The dynamical system arising from ADAGRAD (this paper) poses unique and novel challenges that put us well outside any prior approach. These differences are as follows:

- The update rule of ADAGRAD is a filtration-dependent function of its entire history, depending on time as well as on all previous states.
- The step-size matrix of ADAGRAD could be either vanishing or non-vanishing, depending on the landscape encountered; when it is non-vanishing, its norm converges to a non-zero constant for each initial condition $x_0$.

In view of the above, there is no a priori reason to expect that ADAGRAD should avoid saddle points in the same way that gradient descent does. In more detail, the vanishing step-size regime requires a completely different center-stable manifold theorem than the constant step-size case, and since ADAGRAD interpolates between the two, neither analysis nor technique is sufficient on its own. In addition, because we have a matrix-valued step-size (the preconditioning matrix), it is crucial to establish sufficient control over its spectrum (and, in particular, its minimum and maximum eigenvalues). We achieve this by proving the existence of a strictly positive-definite limit for the method's preconditioner which, combined with a Markovian reframing of the ADAGRAD update sequence, allows us to decompose the underlying dynamics into a stable system plus a small residual term which does not affect the method's convergence properties.
Then, leveraging a Lipschitz-type condition for this remainder term allows us to derive an ADAGRAD-tailored stable manifold theorem that precludes the algorithm's convergence to strict saddle points.

2 Setup and preliminaries

2.1 Problem setup

In this paper, we will focus on non-convex optimization problems of the form

minimize $f(x)$, subject to $x \in \mathbb{R}^d$. (Opt)

In the above, $f : \mathbb{R}^d \to \mathbb{R} \cup \{\infty\}$ is assumed to be lower bounded and continuously differentiable, i.e., the following hold:

1. $\inf_{x \in \mathbb{R}^d} f(x) > -\infty$.
2. There exists some positive constant $L > 0$ such that
$$\|\nabla f(x) - \nabla f(y)\| \le L\,\|x - y\| \qquad \text{(LS)}$$
for all $x, y \in \mathbb{R}^d$.

An important property which follows directly from (LS) is the so-called descent inequality:
$$f(x) \le f(y) + \langle \nabla f(y), x - y \rangle + \frac{L}{2}\,\|x - y\|^2 \qquad \text{(Descent)}$$
for all $x, y \in \mathbb{R}^d$. Heuristically, (Descent) allows us to upper-bound our objective by a quadratic majorant. This property will play a crucial role in what follows.

Another condition on $f$ concerns its critical points. In particular, we say that $x^\ast \in \mathbb{R}^d$ is a strict saddle point if $x^\ast$ is a critical point and the Hessian of $f$ at $x^\ast$ has at least one negative eigenvalue, i.e., $\nabla f(x^\ast) = 0$ and $\lambda_{\min}(\nabla^2 f(x^\ast)) < 0$. Throughout this paper, we simply write "saddle point" when there is no ambiguity.

2.2 Algorithm and Problem

In this paper, we will investigate gradient descent with an adaptive step-size in the spirit of the ADAGRAD algorithm of Duchi et al. [10]. Formally, the algorithm is described by the following recursive formula:
$$x_{t+1} = x_t - \Gamma_t\,\nabla f(x_t), \qquad \text{(ADAGRAD)}$$
where the step-size $\Gamma_t$ comes in the following three variants:

ADAGRAD with squared-norm adaptation:
$$\Gamma_t = \gamma_t I, \qquad \gamma_t = \Big(\delta_0^2 + \sum\nolimits_{s=0}^{t} \|\nabla f(x_s)\|^2\Big)^{-1/2}. \qquad \text{(ADANORM)}$$

ADAGRAD with diagonal step-size scaling:
$$\Gamma_t = G_t^{-1/2}, \qquad G_t = \delta_0^2 I + \operatorname{diag}\Big(\sum\nolimits_{s=0}^{t} \nabla f(x_s)\,\nabla f(x_s)^\top\Big). \qquad \text{(ADADIAG)}$$

ADAGRAD with full-matrix preconditioning:
$$\Gamma_t = G_t^{-1/2}, \qquad G_t = \delta_0^2 I + \sum\nolimits_{s=0}^{t} \nabla f(x_s)\,\nabla f(x_s)^\top. \qquad \text{(ADAFULL)}$$

We clarify that, throughout this paper, all the step-size policies and gradients are deterministic. The problem we address in this article is:

Do ADAGRAD algorithms provably avoid saddle points?

Questions of this saddle-avoidance type have stimulated the study of non-convex optimization, machine learning and dynamical systems in recent years, and they are an essential step towards a better understanding of the power of deterministic ADAGRAD algorithms.
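To make the three preconditioning policies concrete, here is a minimal NumPy sketch of deterministic gradient descent with each of the (ADANORM), (ADADIAG) and (ADAFULL) step-sizes. The function name, the test objective and all numerical values are illustrative assumptions made for this sketch and are not part of the paper.

```python
import numpy as np

def adagrad(grad, x0, variant="norm", delta0=1.0, T=1000):
    """Deterministic ADAGRAD with scalar, diagonal, or full-matrix preconditioning.

    grad    : callable returning the exact gradient of f at a point
    variant : "norm" (ADANORM), "diag" (ADADIAG) or "full" (ADAFULL)
    delta0  : regularization parameter delta_0 > 0
    """
    x = np.asarray(x0, dtype=float)
    d = x.size
    s_norm = delta0**2               # delta_0^2 + sum of squared gradient norms
    s_diag = delta0**2 * np.ones(d)  # diagonal entries of G_t
    G = delta0**2 * np.eye(d)        # full matrix G_t
    for _ in range(T):
        g = grad(x)
        if variant == "norm":
            s_norm += g @ g
            x = x - g / np.sqrt(s_norm)
        elif variant == "diag":
            s_diag += g**2
            x = x - g / np.sqrt(s_diag)
        else:  # full-matrix preconditioning: Gamma_t = G_t^{-1/2}
            G += np.outer(g, g)
            w, V = np.linalg.eigh(G)
            Gamma = V @ np.diag(1.0 / np.sqrt(w)) @ V.T
            x = x - Gamma @ g
    return x

# Illustrative call on a simple quadratic f(x) = x^T A x / 2 (not from the paper):
A = np.diag([10.0, 1.0])
x_final = adagrad(lambda x: A @ x, x0=[3.0, -2.0], variant="full")
```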
2.3 Technical Preliminaries

For reference, we list some fundamental concepts and results that will be frequently referred to in our analysis and proofs. Standard references for matrix calculus and Banach fixed point arguments are [13] and [18] respectively.

Dynamical systems. Let $\mathcal{T} = \mathbb{R}$ or $\mathbb{Z}$. A smooth dynamical system on $\mathbb{R}^d$ is a continuously differentiable mapping $\phi : \mathcal{T} \times \mathbb{R}^d \to \mathbb{R}^d$, written $\phi(t, x) = \phi_t(x)$, such that (i) $\phi_0 : \mathbb{R}^d \to \mathbb{R}^d$ is the identity mapping, and (ii) $\phi_t \circ \phi_s = \phi_{t+s}$ for all $t, s \in \mathcal{T}$. In our setting, the composition of the first $T$ mappings defined by the update rule can be seen as a mapping $\phi(T, \cdot)$ sending the initial point $x_0$ to $\phi_T(x_0)$.

Diffeomorphism. A differentiable mapping $\phi : \mathbb{R}^d \to \mathbb{R}^d$ is a local diffeomorphism at $x$ if the Jacobian matrix $D\phi(x)$ is invertible.

Remark 1. A typical example of a diffeomorphism is gradient descent with a small constant step-size, whose update mapping is $\phi(x) = x - \eta\,\nabla f(x)$. In our setting, the step-size policy is time-dependent and the mapping $\phi(t, x) = x - \Gamma_t\,\nabla f(x)$ is expected to be a diffeomorphism on $\mathbb{R}^d$ for each $t \in \mathbb{N}$.

Matrix preliminaries. For Hermitian matrices $G, H$ we write $G \preccurlyeq H$ (or $H \succcurlyeq G$) to mean that $H - G$ is positive semidefinite. In particular, $H \succcurlyeq 0$ indicates that $H$ is positive semidefinite. This order is known as the Löwner partial order. If $H$ is positive definite, i.e., positive semidefinite and invertible, we write $H \succ 0$. The following two results are frequently applied in the stabilization analysis of the adaptive step-size matrix $\Gamma_t$.

Löwner-Heinz inequality: If $A \succcurlyeq B \succcurlyeq 0$ and $0 \le r \le 1$, then $A^r \succcurlyeq B^r$.

Weyl's monotonicity lemma: If $H \succcurlyeq 0$ and the eigenvalues of $A + H$ and $A$ are ordered as $\lambda_1(A + H) \ge \dots \ge \lambda_d(A + H)$ and $\lambda_1(A) \ge \dots \ge \lambda_d(A)$, then $\lambda_i(A + H) \ge \lambda_i(A)$ for all $i = 1, \dots, d$.

Banach fixed point theorem: Let $(X, d)$ be a complete metric space; then every contraction mapping $T : X \to X$ has a unique fixed point.

Remark 2. The matrix preliminaries are used in the stabilization analysis of the step-size policy $\Gamma_t$. The Banach fixed point theorem plays a crucial role in the proof of Theorem 1, where the complete metric space consists of sequences and the operator $T$ is a discrete analogue of the integral operator of the continuous-time dynamical system. Uniqueness means that the sequence (generated by ADAGRAD) converging to the saddle point under discussion is unique.

3 Analysis and results

In this section we show that (ADANORM), (ADADIAG) and (ADAFULL) avoid saddle points from almost every initial condition. Our approach consists of three parts:

- Show that the sequence of adaptive matrices $\Gamma_t$, $t = 1, 2, \dots$, converges to a (strictly) positive-definite matrix.
- Analyze the local structure of the ADAGRAD iterative dynamics based on the stabilization of the method's preconditioner.
- Prove a local stable manifold theorem for the ADAGRAD dynamics and then extend the result globally.

We will elaborate on this in the following subsections.

3.1 Stabilization of the preconditioner

Proposition 1. Let $\Gamma_t$ be one of the adaptive step-size policies (ADANORM), (ADADIAG) or (ADAFULL). Then the following statements hold:

1. For each initial point $x_0$, the eigenvalues of $\Gamma_t$ converge to strictly positive numbers, i.e., $\lim_{t\to\infty} \lambda_i(\Gamma_t) > 0$ for all $i = 1, \dots, d$.
2. For each sequence $\{x_t\}_{t\in\mathbb{N}}$ generated by ADAGRAD, the sum of squared gradient norms is finite, i.e., $\sum_{t=0}^{\infty} \|\nabla f(x_t)\|^2 < \infty$.
3. The limit of $\{\Gamma_t\}_{t\in\mathbb{N}}$ exists and, in particular, the limit is positive definite, i.e., $\lim_{t\to\infty} \Gamma_t = \Gamma$ with $\Gamma \succ 0$.

Remark 3. An immediate consequence of the above stabilization result is that the ADAGRAD algorithms under study have a gradient decay rate of $\mathcal{O}(1/T)$. In particular, by Part 2 of Proposition 1 and the construction of the ADAGRAD preconditioning matrix (see also Propositions A.1-A.3 in Appendix A), we immediately get that $\sum_{t=1}^{T} \|\nabla f(X_t)\|^2$ is bounded by a constant multiple of $f(X_1) - \inf f$, with the constant depending on $L$ and the preconditioner. In turn, this yields the rate $\min_{t=1,\dots,T} \|\nabla f(X_t)\|^2 = \mathcal{O}(1/T)$ or, if we pick $\hat X_T$ uniformly at random from $X_1, \dots, X_T$, we get $\mathbb{E}\big[\|\nabla f(\hat X_T)\|^2\big] = \mathcal{O}(1/T)$.

In the first statement of the above proposition, and for the case of (ADANORM), we simply regard $\Gamma_t$ as the product of the adaptive scalar step-size and the identity matrix. In the proof of the proposition, we show the stabilization of $\Gamma_t$ and the summability of $\|\nabla f(x_t)\|^2$ separately for the three adaptive policies. To provide some intuition, we sketch below the main idea of the proof for (ADANORM); the proofs for the other two ADAGRAD variants follow the same strategy, with additional techniques from matrix analysis. In what follows, we use the notation of the appendix, i.e., we write $\gamma_t = \big(\delta_0^2 + \sum_{s=0}^{t} \|\nabla f(x_s)\|^2\big)^{-1/2}$ to emphasize that $\gamma_t$ is the scalar step-size in (ADANORM).

We begin by noting that, since $\gamma_t$ is decreasing and bounded from below, its limit exists, i.e., $\lim_{t\to\infty} \gamma_t = \inf_{t\in\mathbb{N}} \gamma_t = \gamma_\infty \ge 0$. The proof is completed by contradiction: assume that $\gamma_\infty = 0$.
Then, by the fact that $f$ is smooth, we have
$$f(x_{t+1}) \le f(x_t) + \langle \nabla f(x_t), x_{t+1} - x_t \rangle + \frac{L}{2}\,\|x_{t+1} - x_t\|^2.$$
By the update rule of (ADANORM), we further have
$$x_{t+1} - x_t = -\gamma_t\,\nabla f(x_t). \qquad (1)$$
Thus, by rearranging the descent inequality, we obtain the upper bound:
$$\frac{\gamma_t}{2}\,\|\nabla f(x_t)\|^2 \le f(x_t) - f(x_{t+1}) + \frac{\gamma_t}{2}\,[L\gamma_t - 1]\,\|\nabla f(x_t)\|^2.$$
Summing over $t = 0, \dots, T$ and using the fact that $f(x_{T+1}) \ge \inf_t f(x_t)$, we have
$$\frac{1}{2}\sum_{t=0}^{T} \gamma_t\,\|\nabla f(x_t)\|^2 \le f(x_0) - \inf_t f(x_t) + \sum_{t=0}^{T} \frac{\gamma_t}{2}\,[L\gamma_t - 1]\,\|\nabla f(x_t)\|^2.$$
Since we assume that $\gamma_t \to 0$ as $t \to \infty$, there must exist some $t_0$ such that $L\gamma_t - 1 < 0$ for all $t > t_0$. Therefore, the right-hand side of the above inequality is bounded uniformly in $T$, because the terms of $\sum_{t=0}^{T} \frac{\gamma_t}{2}\,[L\gamma_t - 1]\,\|\nabla f(x_t)\|^2$ are non-positive for $t > t_0$, so this sum is bounded by its first $t_0$ terms. We thus have
$$\frac{1}{2}\sum_{t=0}^{\infty} \gamma_t\,\|\nabla f(x_t)\|^2 < +\infty.$$
However, by a direct calculation (detailed in the appendix), we also have the lower bound
$$\sum_{t=0}^{T} \gamma_t\,\|\nabla f(x_t)\|^2 \ge \gamma_T \sum_{t=0}^{T} \|\nabla f(x_t)\|^2 = \frac{1}{\gamma_T} - \gamma_T\,\delta_0^2.$$
Again by our assumption that $\lim_{t\to\infty} \gamma_t = 0$, letting $T \to \infty$ we conclude that $\sum_{t=0}^{\infty} \gamma_t\,\|\nabla f(x_t)\|^2 = +\infty$, a contradiction which yields the desired result.

Remark 4. The proof of Proposition 1 for the matrix-based variants of ADAGRAD follows the same template. Specifically, the first step is to show that the method's preconditioning matrix is non-increasing in the Löwner order, i.e., $\Gamma_t \succcurlyeq \Gamma_{t+1}$ for all $t$. Then, by Weyl's monotonicity theorem, this result subsequently translates to the eigenvalues of $\Gamma_t$, i.e., $\lambda_i(\Gamma_t) \ge \lambda_i(\Gamma_{t+1})$ for all $i = 1, \dots, d$ (where $\lambda_i(\cdot)$ denotes the $i$-th eigenvalue of the matrix in question). In view of this, by invoking a similar series of steps based on the descent inequality and an eigenvalue-per-eigenvalue decomposition, it can be shown that the limit of each $\lambda_i(\Gamma_t)$ is strictly positive, which in turn can be used to show that $\Gamma_t$ converges to a (strictly) positive-definite matrix and that the sum of the squared gradient norms of $f$ is finite. For the precise statements and proofs, we refer the reader to Propositions A.2 and A.3 in Appendix A.
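As a numerical sanity check on Proposition 1, the following sketch runs the full-matrix variant on a toy two-dimensional objective and tracks the running sum of squared gradient norms together with the smallest eigenvalue of $\Gamma_t$. The objective, the initial condition and all numerical choices are illustrative assumptions made only for this example.

```python
import numpy as np

# Toy non-convex objective (illustrative, not from the paper):
# f(x, y) = x**2 - y**2 + y**4 has a strict saddle at the origin.
def grad_f(z):
    x, y = z
    return np.array([2.0 * x, -2.0 * y + 4.0 * y**3])

delta0 = 1.0
G = delta0**2 * np.eye(2)          # full-matrix accumulator G_t
z = np.array([0.5, 0.3])           # generic initial condition
grad_norms, min_eigs = [], []

for t in range(2000):
    g = grad_f(z)
    G += np.outer(g, g)
    w, V = np.linalg.eigh(G)
    Gamma = V @ np.diag(w ** -0.5) @ V.T   # Gamma_t = G_t^{-1/2}
    z = z - Gamma @ g
    grad_norms.append(g @ g)
    min_eigs.append(1.0 / np.sqrt(w[-1]))  # smallest eigenvalue of Gamma_t

# Proposition 1 in action: the squared gradient norms are summable and the
# smallest eigenvalue of Gamma_t settles at a strictly positive value.
print("running sum of squared gradient norms:", sum(grad_norms))
print("lambda_min(Gamma_t) over the last 100 iterations:", min(min_eigs[-100:]))
```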
3.2 Local structure of the ADAGRAD dynamics

The first essential technique in proving saddle avoidance for the ADAGRAD algorithms is the following. Since the increment of each iterate is the product of a preconditioning matrix and a gradient vector, all three ADAGRAD algorithms can be decomposed into a stabilized linear part and a residual term. More precisely, all of the above algorithms have the form
$$x_{t+1} = x_t - \Gamma_t\,\nabla f(x_t), \qquad (2)$$
where $\Gamma_t$ is a sequence of positive-definite matrices that converges to a symmetric positive-definite matrix $\Gamma$. Since $\Gamma_t$ has the limit matrix $\Gamma$, we can write $\Gamma_t = \Gamma + (\Gamma_t - \Gamma)$. Thus, the algorithm (2) can be written as
$$x_{t+1} = x_t - \Gamma_t\,\nabla f(x_t) = x_t - (\Gamma + \Gamma_t - \Gamma)\,\nabla f(x_t) = x_t - \Gamma\,\nabla f(x_t) - (\Gamma_t - \Gamma)\,\nabla f(x_t).$$
Without loss of generality, we assume that $0$ is a strict saddle point of $f$, i.e., $\nabla f(0) = 0$; the Taylor expansion of $\nabla f$ at $0$ is then
$$\nabla f(x) = \nabla f(0) + \nabla^2 f(0)\,x + \theta(x) = \nabla^2 f(0)\,x + \theta(x).$$
With this Taylor expansion in a neighborhood of $0$, we replace the first $\nabla f(x_t)$ with $\nabla^2 f(0)\,x_t + \theta(x_t)$, provided that $x_t$ is taken from the small neighborhood of $0$ where the expansion is performed. We thus obtain an equivalent expression of the dynamical system (2) through the following calculation:
$$x_{t+1} = x_t - \Gamma_t\,\nabla f(x_t) = x_t - \Gamma\big(\nabla^2 f(0)\,x_t + \theta(x_t)\big) - (\Gamma_t - \Gamma)\,\nabla f(x_t) = \big(I - \Gamma\,\nabla^2 f(0)\big)\,x_t - \Gamma\,\theta(x_t) - (\Gamma_t - \Gamma)\,\nabla f(x_t).$$
We denote the non-linear part of the above dynamical system by $\eta(t, x)$, i.e.,
$$\eta(t, x) = -\Gamma\,\theta(x) - (\Gamma_t - \Gamma)\,\nabla f(x). \qquad (3)$$
In the Lyapunov-Perron method, a Lipschitz-type condition on the whole remainder is a crucial property that enables us to prove the existence of a local stable manifold. By the Taylor expansion we trivially get $\theta(x) = \nabla f(x) - \nabla^2 f(0)\,x$, so the differential of $\theta(x)$ becomes
$$D\theta(x) = D\big(\nabla f(x) - \nabla^2 f(0)\,x\big) = \nabla^2 f(x) - \nabla^2 f(0),$$
with the Lipschitzness of $\theta(x)$ being a consequence of the boundedness of $D\theta(x)$. Since the gradient $\nabla f(x)$ is Lipschitz by assumption, the other part of $\eta(t, x)$ satisfies the Lipschitz-type condition as long as $t$ is large enough, since the norm of $\Gamma_t - \Gamma$ becomes arbitrarily small as $t$ goes to infinity. The formal statement is provided in the following proposition, and we defer the detailed proof to Appendix B.

Proposition 2. With $\eta(t, x)$ defined as in (3), for any $\epsilon > 0$ there exist a neighborhood $B$ of $0$ and some $t_0 \in \mathbb{N}$ such that, for all $x, y \in B$ and $t \ge t_0$,
$$\|\eta(t, x) - \eta(t, y)\| \le \epsilon\,\|x - y\|.$$
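To illustrate the decomposition in (2)-(3) and the Lipschitz property asserted in Proposition 2, here is a small numerical sketch, assuming the same toy objective as in the previous snippet. The limit preconditioner $\Gamma$ is approximated by the last preconditioner of a finite (ADAFULL) run; all names and constants are illustrative assumptions rather than objects defined in the paper.

```python
import numpy as np

# Same toy objective as before: f(x, y) = x**2 - y**2 + y**4 (strict saddle at 0).
def grad_f(z):
    x, y = z
    return np.array([2.0 * x, -2.0 * y + 4.0 * y**3])

hess0 = np.diag([2.0, -2.0])             # Hessian of f at the saddle point 0

# Generate the preconditioner sequence Gamma_t along an (ADAFULL) run.
G = np.eye(2)
z = np.array([0.5, 0.3])
Gammas = []
for t in range(500):
    g = grad_f(z)
    G += np.outer(g, g)
    w, V = np.linalg.eigh(G)
    Gamma_t = V @ np.diag(w ** -0.5) @ V.T
    Gammas.append(Gamma_t)
    z = z - Gamma_t @ g

Gamma = Gammas[-1]                       # proxy for the limit preconditioner

def eta(t, x):
    """Residual of the decomposition x_{t+1} = (I - Gamma @ hess0) x_t + eta(t, x_t)."""
    theta = grad_f(x) - hess0 @ x        # Taylor remainder of the gradient at 0
    return -Gamma @ theta - (Gammas[t] - Gamma) @ grad_f(x)

# Empirical Lipschitz ratio of eta(t, .) near 0 for large t; Proposition 2 says
# this can be made arbitrarily small by shrinking the ball and increasing t.
rng = np.random.default_rng(0)
t = 400
ratios = []
for _ in range(1000):
    x, y = 1e-2 * rng.standard_normal((2, 2))
    ratios.append(np.linalg.norm(eta(t, x) - eta(t, y)) / np.linalg.norm(x - y))
print("max empirical Lipschitz ratio of eta near 0:", max(ratios))
```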
3.3 ADAGRAD avoids saddle points

Given the properties of the remainder $\eta(t, x)$, we are now ready to state the local stable-manifold theorem corresponding to the ADAGRAD family of algorithms.

Theorem 1. Suppose that $1 \ge \lambda_1 \ge \dots \ge \lambda_s > 0 > \lambda_{s+1} \ge \dots \ge \lambda_d$ and let $H$ be the diagonal matrix $H = \operatorname{diag}(\lambda_1, \dots, \lambda_d)$. Suppose that, for any $\epsilon > 0$, there exists a ball $B(\delta)$ centered at $0$ with radius $\delta$ such that the mapping $\eta(t, x)$ satisfies $\eta(t, 0) = 0$ and $\|\eta(t, x) - \eta(t, y)\| \le \epsilon\,\|x - y\|$ for all $x, y \in B(\delta)$. Suppose further that $\gamma(t, x)$ is a function satisfying $0 < c \le \gamma(t, x) \le 1$. Then the dynamical system
$$x_{t+1} = \big(I - \gamma(t, x_0)\,H\big)\,x_t + \eta(t, x_t) \qquad (4)$$
has an $s$-dimensional stable manifold at $0$. In particular, the stable manifold exists for the case $\gamma(t, x) = 1$, i.e., for the dynamical system $x_{t+1} = (I - H)\,x_t + \eta(t, x_t)$.

By an $s$-dimensional manifold, we mean a set that can be represented as the graph of a function. The scheme for proving the existence of a stable manifold for a dynamical system is called the Lyapunov-Perron method, which originated in the study of the structural stability of dynamical systems defined by ordinary differential equations. The goal is to show that, if the dynamical system started at an initial point $x_0$ converges to an unstable fixed point (in our setting, a saddle point), then $x_0$ lies on the graph of some function from the stable subspace to the unstable subspace, with respect to the eigenspace decomposition of the linearization of the dynamical system.

We give a quick review of the Lyapunov-Perron method for continuous-time dynamical systems; for a detailed study, we recommend [31] as a reference. Consider the dynamical system defined by
$$\frac{dx}{dt} = A(t)\,x + R(t, x), \qquad (5)$$
where $A(t)$ is a time-dependent matrix. If the solution $u(t, x_0)$ generated by the dynamical system with some initial condition $x_0$ converges to an unstable fixed point, then $u(t, x_0)$ must be a fixed point of the integral operator $T$ given by
$$T u(t, x_0) = U(t)\,x_0 + \int_0^t U(t - s)\,R(s, u(s, x_0))\,ds - \int_t^{\infty} V(t - s)\,R(s, u(s, x_0))\,ds.$$
We will skip the discussion of $U(t)$ and $V(t)$, since they play no role here other than giving intuition for the form of the discrete version of $T$. The underlying Banach space consists of curves converging to the fixed point and having initial points whose stable components are all equal. The Banach fixed point theorem then ensures the existence and uniqueness of the local stable manifold of the dynamical system (5). The main challenge in proving Theorem 1 is to formulate the discrete and adaptive version of the operator $T$, and then to show that $T$ is a contraction map on the space of sequences converging to saddle points.

The reason one cannot treat the dynamical system arising from the ADAGRAD algorithms as straightforwardly as in previous works is that the step-size is a matrix that involves the entire history of the iterations. Our main claim is the following: suppose the sequence $\{x_t\}_{t\in\mathbb{N}}$ is generated by the dynamical system of an ADAGRAD algorithm and this sequence converges to a saddle point, without loss of generality $0$. Then we only have to focus on the sequence of step-size matrices $\{\Gamma_t\}_{t\in\mathbb{N}}$ generated by the iterations. We will treat $\{\Gamma_t\}_{t\in\mathbb{N}}$ as a fixed, pre-generated sequence; furthermore, this sequence actually depends only on the initial point $x_0$, so whenever necessary we write $\Gamma_t(x_0)$ to emphasize this property. Theorem 1 then implies a more general result: for any sequence $\{\Gamma_t(x_0)\}_{t\in\mathbb{N}}$, if the initial condition $x_0$ makes the sequence $\{x_t\}_{t\in\mathbb{N}}$ converge to a saddle point, then $x_0$ lies on an $s$-dimensional manifold which is the graph of a function from the stable subspace to the unstable subspace.

To obtain a full discrete analogue of the solution $u(t, x_0)$ of the continuous case, we regard the sequence $\{x_t\}_{t\in\mathbb{N}}$ as a sequence of functions $\{x_t(x_0)\}_{t\in\mathbb{N}}$ of the initial condition $x_0$. This viewpoint is essential in the proof of Theorem 1. The discrete version of $T$, acting on the space of sequences, is as follows:
$$(Tx)_{t+1} = B(t, 0)\,x_0^+ + \sum_{i=0}^{t} B(t, i+1)\,\eta^+(i, x_0, x_i) - \sum_{i=0}^{\infty} C(t+1+i, t+1)^{-1}\,\eta^-(t+1+i, x_0, x_{t+1+i}),$$
where the definitions of $B(t, 0)$ and $C(t+1+i, t+1)$ are elaborated in the appendix. $T$ transforms a sequence $\{x_t\}_{t\in\mathbb{N}}$ into a new sequence $\{(Tx)_t\}_{t\in\mathbb{N}}$, and locally we will focus on the space of sequences converging to the saddle point. The main ingredients for showing the existence of the local stable manifold are the following:

- The transformed sequence $\{(Tx)_t\}_{t\in\mathbb{N}}$ converges to the fixed point $0$ as long as $\{x_t\}_{t\in\mathbb{N}}$ does (Proposition C.1);
- The sequences $\{x_t\}_{t\in\mathbb{N}}$ converging to $0$ whose initial points have the same stable component, say $x_0^+ = a$, form a complete metric space, denoted $X(a, 0)$ (Lemma C.4);
- The operator $T$ is a contraction mapping acting on $X(a, 0)$ (Lemma C.6).

Therefore, applying the Banach fixed point theorem, we can conclude that, for each fixed stable component of the initial condition $x_0$, there exists a unique sequence converging to $0$; this sequence gives the unique correspondence between the stable and unstable components of $x_0$, which can be written as $x_0^- = \phi(x_0^+)$.

Note that the dynamical system of the ADAGRAD algorithms is conjugate to the dynamical system above. By assumption, the Hessian of $f$ at the isolated strict saddle point $0$ is symmetric and diagonalizable. For a symmetric positive-definite matrix $\Gamma$, the matrix $\Gamma^{-1/2}\big(\Gamma\,\nabla^2 f(0)\big)\Gamma^{1/2} = \Gamma^{1/2}\,\nabla^2 f(0)\,\Gamma^{1/2}$ has the same eigenvalues as $\Gamma\,\nabla^2 f(0)$, while $\Gamma^{1/2}\,\nabla^2 f(0)\,\Gamma^{1/2}$ has the same number of positive and negative eigenvalues as $\nabla^2 f(0)$ does. Thus $\Gamma^{1/2}\,\nabla^2 f(0)\,\Gamma^{1/2}$ is diagonalizable, and this implies that $\Gamma\,\nabla^2 f(0)$ is also diagonalizable. Moreover, the numbers of positive and negative eigenvalues of $\Gamma\,\nabla^2 f(0)$ agree with those of $\nabla^2 f(0)$. Suppose that the diagonalization of $\Gamma\,\nabla^2 f(0)$ is achieved by the linear transformation matrix $Q$, i.e., $H$ is a diagonal matrix and $\Gamma\,\nabla^2 f(0) = Q^{-1} H Q$. Then we have
$$x_{t+1} = \big(I - Q^{-1} H Q\big)\,x_t + \eta(t, x_t) \qquad (6)$$
$$= Q^{-1}(I - H)\,Q\,x_t + \eta(t, x_t), \qquad (7)$$
which is equivalent to $Q x_{t+1} = (I - H)\,Q x_t + Q\,\eta(t, x_t)$. If we denote $y_t = Q x_t$, we have a dynamical system in terms of $y_t$ as follows:
$$y_{t+1} = (I - H)\,y_t + Q\,\eta(t, Q^{-1} y_t). \qquad (8)$$
We leave to the appendix (Lemma C.7) the complete argument showing that $Q\,\eta(t, Q^{-1} y_t)$ satisfies the Lipschitz condition of Theorem 1, as well as the passage from the existence of a local stable manifold for the dynamical system (8) to that of the system in $x_t$. Note that the existence of a local stable manifold for the ADAGRAD algorithm implies that, locally, the set of initial points converging to the saddle point is of measure zero; to extend the result to the whole Euclidean space, we need the following proposition, which ensures that the mappings defined by the ADAGRAD algorithms are diffeomorphisms.

Proposition 3. If the parameter $\delta_0$ is chosen sufficiently large relative to the Lipschitz constant $L$, then (ADANORM), (ADADIAG) and (ADAFULL) are local diffeomorphisms on $\mathbb{R}^d$.

The main difficulty in showing that the ADAGRAD algorithms are diffeomorphisms comes from the fact that these algorithms depend on all past iterations. However, it suffices to show that, for each iterate, the map defined by the algorithm is a diffeomorphism, i.e., for the iteration $x_{t+1} = x_t - \Gamma_t\,\nabla f(x_t)$, we need to show that the map $\varphi(x) := x - \Gamma_t\,\nabla f(x)$, where $\Gamma_t$ is the preconditioning matrix from (ADAFULL), (ADANORM), or (ADADIAG), is a diffeomorphism. Take (ADAFULL) for example, where
$$\Gamma_t = \Big(\delta_0^2 I + \sum\nolimits_{s=0}^{t} \nabla f(x_s)\,\nabla f(x_s)^\top\Big)^{-1/2}.$$
To show that $\varphi(x)$ is a diffeomorphism at the $t$-th iterate, we split $\Gamma_t$ as follows:
$$\Gamma_t = \Big(\delta_0^2 I + S + \nabla f(x_t)\,\nabla f(x_t)^\top\Big)^{-1/2}, \qquad S = \sum\nolimits_{s=0}^{t-1} \nabla f(x_s)\,\nabla f(x_s)^\top.$$
Note that only $\nabla f(x_t)\,\nabla f(x_t)^\top$ depends on $x_t$, and thus the remaining terms $\delta_0^2 I$ and $S$ can be treated as constants when proving that the $t$-th iterate map is a diffeomorphism. To be precise, the mapping
$$\varphi(x) = x - \Big(\delta_0^2 I + S + \nabla f(x)\,\nabla f(x)^\top\Big)^{-1/2}\,\nabla f(x)$$
is expected to be a diffeomorphism as long as $\delta_0$ is properly tuned. The standard approach for showing that $\varphi(x)$ is a diffeomorphism is to compute its Jacobian matrix; the claim then follows from the inverse function theorem if the determinant of $D\varphi(x)$ is positive.

Now, regarding the calculation of the Jacobian matrix, (ADANORM) is fundamentally different from the other two variants. The reason is that in (ADANORM) the preconditioner $\Gamma_t$ is essentially a scalar function, namely $\gamma_t = \big(\delta_0^2 + \sum_{s=0}^{t} \|\nabla f(x_s)\|^2\big)^{-1/2}$, so the Jacobian matrix of $\varphi(x)$ can be computed explicitly if we write it as
$$\varphi(x) = x - \frac{1}{\sqrt{\delta_0^2 + S + \|\nabla f(x)\|^2}}\,\nabla f(x), \qquad S = \sum\nolimits_{s=0}^{t-1} \|\nabla f(x_s)\|^2.$$
By contrast, the diffeomorphism proofs for (ADAFULL) and (ADADIAG) require a different set of arguments; for (ADAFULL) in particular, it is more convenient to analyze the determinant of the Jacobian via the conditions that $\Gamma_t$ satisfies. Since the essential strategy is to show that the determinant of $D\varphi(x)$ is positive, the intuition is that once $\delta_0$ is taken to be large, the determinant of $D\varphi(x)$ can be made arbitrarily close to 1. The assumptions on the boundedness of the (higher-order) partial derivatives of $f$ guarantee that the determinant of $D\varphi(x)$ is close to 1 as long as $\delta_0$ is large. The detailed analysis is provided in Appendices A and B.

Combining the new stable manifold theorem and Proposition 3, we finally obtain:

Theorem 2. (ADANORM), (ADADIAG) and (ADAFULL) avoid strict saddle points from almost any initialization.

It is straightforward to conclude saddle avoidance if the saddle points are assumed to be isolated; the proof of the last theorem extends the saddle avoidance guarantee from countably many saddle points to uncountably many. We leave the full proof to the appendix.

3.4 Comparison with related results

We conclude this section with a discussion of a series of related results in the literature.
To put things in context, it is important to first describe in detail the type of statements obtained in the escape literature to which the work of Staib et al. [33] belongs. To begin with, the typical escape result and, in particular, the work of Staib et al. [33] follows the template below: 1. Fix some probability threshold δ > 0. 2. Run a stochastic gradient-based algorithm for a predetermined number of iterations T with a fixed step-size η (both η and T given as a function of δ). 3. Then, with probability at least 1 δ, at least one of the iterates produced by the algorithm under study will be close to a second-order stationary point (and hence away from all saddle points). [The existence of a second-order stationary point is assumed by default]. In view of this, the first important difference with results of this type is that escape results only guarantee that some iterate of the algorithm under study will be close to a secondorder stationary point. In particular, the work of Staib et al. [33] on adaptive algorithms leaves open the possibility that the algorithm may revisit a saddle point infinitely often, and it cannot exclude the event that the limit points of the algorithm may contain strict saddles. By contrast, our paper rules out exactly this behavior, so it is complementary to the analysis of Staib et al. [33] in this regard. The second fundamental difference with the escape literature is that it is typically assumed therein that the algorithm under study is subject to persistent noise lower-bounded along any direction (see for example Definition 4.2 in Staib et al. [33] and the discussion right after). More concretely, this means that the iterates of the algorithms studied in this literature are subject to continual random shocks, a fact which greatly facilitates the escape from saddle points (in the sense described above). This is readily seen in the bounds for T given by Staib et al. [33], which are of the form T = O(1/ν4), with ν > 0 denoting the minimum noise level along any direction. [In particular, it is not possible to get deterministic results by setting the variance of the noise to 0 in Staib et al. [33].] These differences are also reflected in the divergent tools and techniques required to establish avoidance results compared to the escape literature. Specifically, thanks to the persistent noise in the setup of standard escape results, the analysis does not require any delicate center stable manifold arguments. However, these arguments are indispensable for excluding strict saddles as limit points of the underlying dynamics, a fact which serves to explain the gulf in techniques between Staib et al. [33] and our paper. As far as we are aware, there is no comparable stable manifold theorem for adaptive algorithms in the literature. In regards to the stochastic setting, to the best of our knowledge, the only avoidance results for stochastic algorithms are [7, 14, 25, 30]. These stochastic avoidance results concern exclusively non-adaptive algorithms with persistent noise. Obtaining avoidance results for stochastic adaptive methods would be a very fruitful direction for future research, but not one which can be attacked at this stage. 4 Concluding remarks In this paper we examined the saddle-point avoidance properties of the ADAGRAD family of methods (ranging from scalar to full-matrix preconditioning), and we showed that all policies under study avoid saddle points from almost any initial condition. 
A major challenge in our analysis is that the dynamical system arising from ADAGRAD is not only time-dependent, but filtration-dependent. Nonetheless, after an extensive stabilization analysis for the method's preconditioner, the induced dynamical system reduces to a form that enables us to apply the Lyapunov-Perron method and prove a new stable manifold theorem for ADAGRAD. These techniques and results not only advance our understanding of adaptive gradient methods, but they also initiate the study of saddle-point avoidance results for other methods, such as adaptive mirror descent; we leave this to future work.

Acknowledgments

KA is grateful for financial support by the Swiss National Science Foundation (SNSF) under grant number 200021_205011. PM is grateful for financial support by the French National Research Agency (ANR) in the framework of the Investissements d'avenir program (ANR-15-IDEX02), the LabEx PERSYVAL (ANR-11-LABX-0025-01), MIAI@Grenoble Alpes (ANR-19-P3IA-0003), and the bilateral ANR-NRF grant ALIAS (ANR-19-CE48-0018-01). GP acknowledges that this research/project is supported in part by the National Research Foundation, Singapore under its AI Singapore Program (AISG Award No: AISG2RP-2020-016), NRF 2018 Fellowship NRF-NRFF201807, NRF2019-NRF-ANR095 ALIAS grant, grant PIESGP-AI-2020-01, AME Programmatic Fund (Grant No. A20H6b0151) from the Agency for Science, Technology and Research (A*STAR) and Provost's Chair Professorship grant RGEPPV2101. XW acknowledges Grant 202110458 from SUFE and support from the Shanghai Research Center for Data Science and Decision Technology.

References

[1] Antonakopoulos, K. and Mertikopoulos, P. Adaptive first-order methods revisited: Convex optimization without Lipschitz requirements. In NeurIPS 21: Proceedings of the 35th International Conference on Neural Information Processing Systems, 2021. [2] Antonakopoulos, K., Belmega, E. V., and Mertikopoulos, P. Adaptive extra-gradient methods for min-max optimization and games. In ICLR 21: Proceedings of the 2021 International Conference on Learning Representations, 2021. [3] Antonakopoulos, K., Pethick, T., Kavis, A., Mertikopoulos, P., and Cevher, V. Sifting through the noise: Universal first-order methods for stochastic variational inequalities. In NeurIPS 21: Proceedings of the 35th International Conference on Neural Information Processing Systems, 2021. [4] Antonakopoulos, K., Vu, D. Q., Cevher, V., Levy, K. Y., and Mertikopoulos, P. UnderGrad: A universal black-box optimization method with almost dimension-free convergence rate guarantees. In ICML 22: Proceedings of the 39th International Conference on Machine Learning, 2022. [5] Benaïm, M. and Hirsch, M. W. Asymptotic pseudotrajectories and chain recurrent flows, with applications. Journal of Dynamics and Differential Equations, 8(1):141-176, 1996. [6] Bengio, Y., Courville, A., and Vincent, P. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798-1828, 2013. [7] Brandière, O. and Duflo, M. Les algorithmes stochastiques contournent-ils les pièges ? Annales de l'Institut Henri Poincaré, Probabilités et Statistiques, 32(3):395-427, 1996. [8] Daneshmand, H., Kohler, J., Lucchi, A., and Hofmann, T. Escaping saddles with stochastic gradients. In ICML 18: Proceedings of the 35th International Conference on Machine Learning, 2018. [9] Du, S. S., Jin, C., Lee, J. D., Jordan, M. I., Póczos, B., and Singh, A.
Gradient descent can take exponential time to escape saddle points. In NIPS 17: Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017. [10] Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121-2159, 2011. [11] Flokas, L., Vlatakis-Gkaragkounis, E. V., and Piliouras, G. Efficiently avoiding saddle points with zero order methods: No gradients required. In NeurIPS 19: Proceedings of the 33rd International Conference on Neural Information Processing Systems, 2019. [12] Ge, R., Huang, F., Jin, C., and Yuan, Y. Escaping from saddle points: Online stochastic gradient for tensor decomposition. In COLT 15: Proceedings of the 28th Annual Conference on Learning Theory, 2015. [13] Horn, R. A. and Johnson, C. R. Matrix Analysis. Cambridge University Press, Cambridge, 1985. [14] Hsieh, Y.-P., Mertikopoulos, P., and Cevher, V. The limits of min-max optimization algorithms: Convergence to spurious non-critical sets. In ICML 21: Proceedings of the 38th International Conference on Machine Learning, 2021. [15] Jin, C., Ge, R., Netrapalli, P., Kakade, S. M., and Jordan, M. I. How to escape saddle points efficiently. In ICML 17: Proceedings of the 34th International Conference on Machine Learning, 2017. URL http://proceedings.mlr.press/v70/jin17a.html. [16] Kavis, A., Levy, K. Y., Bach, F., and Cevher, V. UnixGrad: A universal, adaptive algorithm with optimal guarantees for constrained optimization. In NeurIPS 19: Proceedings of the 33rd International Conference on Neural Information Processing Systems, 2019. [17] Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. https://arxiv.org/abs/1412.6980, 2014. [18] Latif, A. Topics in Fixed Point Theory. Springer, 2014. [19] LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. Nature, 521(7553):436-444, 2015. [20] Lee, J. D., Simchowitz, M., Jordan, M. I., and Recht, B. Gradient descent only converges to minimizers. In COLT 16: Proceedings of the 29th Annual Conference on Learning Theory, 2016. [21] Lee, J. D., Panageas, I., Piliouras, G., Simchowitz, M., Jordan, M. I., and Recht, B. First-order methods almost always avoid strict saddle points. Mathematical Programming, 176(1):311-337, February 2019. [22] Levy, K. Y., Yurtsever, A., and Cevher, V. Online adaptive methods, universality and acceleration. In NeurIPS 18: Proceedings of the 32nd International Conference on Neural Information Processing Systems, 2018. [23] Li, X. and Orabona, F. On the convergence of stochastic gradient descent with adaptive stepsizes. In AISTATS 19: Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, 2019. [24] McMahan, H. B. and Streeter, M. Adaptive bound optimization for online convex optimization. In COLT 10: Proceedings of the 23rd Annual Conference on Learning Theory, 2010. [25] Mertikopoulos, P., Hallak, N., Kavis, A., and Cevher, V. On the almost sure convergence of stochastic gradient descent in non-convex problems. In NeurIPS 20: Proceedings of the 34th International Conference on Neural Information Processing Systems, 2020. [26] Nesterov, Y. A method for unconstrained convex minimization problem with the rate of convergence O(1/k^2). Proceedings of the USSR Academy of Sciences, 269:543-547, 1983. [27] Panageas, I. and Piliouras, G. Gradient descent only converges to minimizers: Non-isolated critical points and invariant regions.
In ITCS 17: Proceedings of the 8th Conference on Innovations in Theoretical Computer Science, 2017. [28] Panageas, I., Piliouras, G., and Wang, X. First-order methods almost always avoid saddle points: The case of vanishing step-sizes. In Neur IPS 19: Proceedings of the 33rd International Conference on Neural Information Processing Systems, 2019. [29] Panageas, I., Piliouras, G., and Wang, X. Multiplicative weights updates as a distributed constrained optimization algorithm: Convergence to second-order stationary points almost always. In International Conference on Machine Learning, pp. 4961 4969. PMLR, 2019. [30] Pemantle, R. Nonconvergence to unstable points in urn models and stochastic aproximations. Annals of Probability, 18(2):698 712, April 1990. [31] Perko, L. Differential Equations and Dynamical Systems. Springer, 2001. Ada Grad Avoids Saddle Points [32] Reddi, S. J., Kale, S., and Kumar, S. On the convergence of Adam and beyond. In ICLR 18: Proceedings of the 2018 International Conference on Learning Representations, 2018. [33] Staib, M., Reddi, S., Kale, S., Kumar, S., and Sra, S. Escaping saddle points with adaptive gradient methods. In ICML 19: Proceedings of the 36th International Conference on Machine Learning, 2019. [34] Vu, D. Q., Antonakopoulos, K., and Mertikopoulos, P. Fast routing under uncertainty: Adaptive learning in congestion games with exponential weights. In Neur IPS 21: Proceedings of the 35th International Conference on Neural Information Processing Systems, 2021. [35] Ward, R., Wu, X., and Bottou, L. Ada Grad stepsizes: Sharp convergence over nonconvex landscapes, from any initialization. In ICML 19: Proceedings of the 36th International Conference on Machine Learning, 2019. Ada Grad Avoids Saddle Points A Stabilization of the ADAGRAD preconditioners In this section we shall investigate the asymptotic behaviour of the classical gradient descent algorithmic scheme run with different types of step-sizes. More precisely, in what follows we shall investigate the case of a scalar, diagonal and full matrix step-sizes. For the sake of convenience we develop their respective analysis individually. A.1 Ada Grad with scalar step-size We start by investigating the particular case of the gradient descent, xt+1 = xt γt f(xt) (GD) run with the adaptive scalar step-size policy: δ2 0 + Pt s=0 f(xs) 2 (Ada Norm) with δ2 0 > 0. The first result concerns the asymptotic stabilization of the adaptive step-size around a (strictly) positive value γ > 0. Proposition A.1. Assume that f : Rd R is smooth and xt are the iterates of (GD) run with the adaptive step-size policy (Ada Norm). Then, the following hold: 1. The (Ada Norm) step-size γt converges to some strictly positive value γ , i.e., γt inf t N γt = γ > 0 (A.1) 2. The sequence { f(xt) 2 }t N is summable, i.e., t=0 f(xt) 2 < + (A.2) Proof. We shall start with the first property. Since γt is decreasing and bounded from below by 0, we have. that its limit exists and more precisely: lim t + γt = inf t N γt = γ 0 Assume that γ = 0. Then, by the fact that f is smooth, we can choose positive number β > L, and we have: f(xt+1) f(xt) + f(xt), xt+1 xt + β 2 xt+1 xt 2 = f(xt) γt f(xt) 2 + βγ2 t 2 f(xt) 2 2γt f(xt) 2 1 2γt f(xt) 2 + βγ2 t 2 f(xt) 2 2γt f(xt) 2 + γt 2 [βγt 1] f(xt) 2 which yields after rearranging, 1 2γt f(xt) 2 f(xt) f(xt+1) + γt 2 [βγt 1] f(xt) 2 (A.3) One the other hand, one may directly verify that the quantity γt 2 [βγt 1] f(xt) 2 becomes non-positive whenever γt 1/β. 
Now, by the assumption that γ = 0 there exists some t0 N such that: β for all t > t0 Ada Grad Avoids Saddle Points So, after telescoping (A.3), we have: t=0 γt f(xt) 2 f(x1) inf t f(xt) + 2 [βγt 1] f(xt) 2 = f(x1) inf t f(xt) + 2 [βγt 1] f(xt) 2 + 2 [βγt 1] f(xt) 2 f(x1) inf t f(xt) + 2 [βγt 1] f(xt) 2 with the last inequality being obtained by the definition of t0. We proceed by bounding the quantity 1 2 PT t=0 γt f(xt) 2 from below. More precisely, we have: t=0 γt f(xt) 2 = 1 δ2 0 + Pt s=0 f(xs) 2 δ2 0 + PT t=0 f(xt) 2 t=0 f(xt) 2 δ2 0 + PT t=0 f(xt) 2 t=1 f(xt) 2 δ2 0 = 1 2γT γT δ2 0 2 with the last inequality being obtained by the fact that γt 1/δ0 for all t = 1, 2, . . . So, summarizing the above estimations we get: 1 2γT f(x1) inf t f(xt) + δ0 2 [βγt 1] f(xt) 2 (A.4) Now, by letting T + we have that 1 2γT + , since we assumed that γt 0 and hence we get that: + f(x1) inf t f(xt) + δ0 2 [βγt 1] f(xt) 2 < + (A.5) which is a contradiction. Hence, we readily get that γ > 0 and the result follows. For the second claim, we have that: t=0 f(xt) 2 = lim T + t=0 f(xt) 2 with the last strict inequality being obtained by the fact that γ > 0 being invoking our first claim and hence the result follows. Ada Grad Avoids Saddle Points A.2 Ada Grad with diagonal adaptation We next investigate the diagonal Ada Grad method. This is given by the following formula: xt+1 = xt Γt f(xt) (A.6) where Γt Rd d is a sequence of matrices defined as the inverse square root of Gt: δ2 0I + diag s=0 f(xs) f(xs) !# 1 2 (Ada Diag) Lemma A.1. Assume that xt aer the iterates of Ada Grad algorithm with step-size policy (Ada Diag). Then the following hold: 1. The sequence of matrices {Γt}t N is non-increasing in the Löwner sense, i.e., Γt Γt+1 for all t = 1, 2, ... 2. The sequence of eigenvalues is non-increasing, i.e., λi(Γt) λi(Γt+1) Proof. We show the first claim. By definition of Gt in the (Ada Diag), we have that Gt+1 Gt = diag f(xt+1) f(xt+1) 0 which means that Gt is a Löwner non-decreasing sequence, i.e., Therefore, by applying Löwner-Heinz inequality, we can have 1 2 t for all t = 1, 2, ... and then Γt Γt+1. The second claim is straightforward since Gt is diagonal, thus Γt is diagonal, and the eigenvalues are the diagonal entries,i.e.,we have that λi(Γt) λi(Γt+1). Proposition A.2. Assume that f : Rd R is smooth and xt are the iterates run with with adaptive step-size policy (Ada Diag). Then, the following hold: 1. The sequence of the eigenvalues {λi(Γt)}t N converges to strictly positive value λ i for all i = 1, 2, ... i.e., lim t λi(Γt) = inf t N λi(Γt) = λ i > 0 for all i = 1, 2, ...d 2. The sequence { f(xt) 2 }t N is summable, i.e., t=0 f(xt) 2 < . Ada Grad Avoids Saddle Points Proof. Since {Γt}t N is a decreasing sequence of diagonal positive definite matrices, the eigenvalues λi(Γt) is bounded below by 0, i.e., lim t λi(Γt) = inf t N λi(Γt) = λ i 0 for some λ i . We will show that λ i is actually strictly positive. By the fact that f is smooth, we can choose β > L and have: f(xt+1) f(xt) + f(xt), xt+1 xt + β 2 xt+1 xt 2 = f(xt) f(xt) Γt f(xt) + β 2 Γt f(xt) 2 2 f(xt) Γt f(xt) 1 2 f(xt) [I βΓt]Γt f(xt) 2 f(xt) Γt f(xt) + 1 xi (xt) 2 (βλ2 i (Γt) λi(Γt)). By rearranging, we have 1 2 f(xt) Γt f(xt) f(xt) f(xt+1) + 1 xi (xt) 2 (βλ2 i (Γt) λi(Γt)) (A.7) t=0 f(xt) Γt f(xt) f(x1) inf t f(xt) + 1 xi (xt) 2 (βλ2 i (Γt) λi(Γt)). 
(A.8) Since λi(Γt) is assumed to approach 0 as t , there must exist some t0, such that for all t > t0, βλ2 i (Γt) λi(Γt) < 0, and this implies t=0 f(xt) Γt f(xt) f(x1) inf t f(xt) + 1 xi (xt) 2 (βλ2 i (Γt) λi(Γt)) for T > t0. On the other hand, the lower bound can be estimated as follows: t=0 f(xt) Γt f(xt) 1 t=0 f(xt) ΓT f(xt) h f xi (xt) i2 δ2 0 + PT s=1 h f xi (xs) i2 δ2 0 + PT s=0 h f xi (xs) i2 where i is the one with respect to λi(Γt) that goes to 0 as t . t=0 f(xt) Γt f(xt) 1 δ2 0 + PT s=0 h f xi (xs) i2 Ada Grad Avoids Saddle Points δ2 0 + PT s=0 h f xi (xs) i2 xi (xt) 2 δ2 0 = 1 2γT γT δ2 0 2 where we denote δ2 0 + PT s=0 h f xi (xs) i2 , and the last inequality comes from the fact that we can choose T such that γT < 1 δ0 . By rearranging and combining previous estimate, we have: t=0 f(xt) Γt f(xt) + δ0 2 + f(x1) inf t f(xt) + 1 xi (xt) 2 (βλ2 i (Γt) λi(Γt)) and letting T , we have γT 0, and therefore t=0 f(xt) Γt f(xt) + δ0 which is a contradiction. To show that the sequence of gradient norm is summable, it sufficies to show that for each component i, P t=0 h f xi (xt) i2 < . This is true since we have xi (xt) 2 = lim T = lim T 1 γ2 T δ2 0 where γT is from the argument in proving the first claim, and the last inequality holds also because of the first claim, since for all i, the limit lim T 1 γ2 T exists and is finite. This proves our assertion and completes our proof. A.3 Ada Grad with full matrix adaptation Finally, we examine the full matrix version of the Ada Grad-type methods. In particular, this is given by the following recursive formula: xt+1 = xt Γt f(xt) (A.9) Ada Grad Avoids Saddle Points where Γt Rd d is a sequence of symmetric (full dimensional) matrices defined as the inverse square root of Gt: s=0 f(xs) f(xs) # 1 2 (ADAFULL) Lemma A.2. Assume that xt are the iterates of (A.9) run with the adaptive step-size policy (ADAFULL). Then, the following hold: 1. The sequence of matices {Γt}t N is non-increasing in the Löwner sense, i.e., Γt Γt+1 for all t = 1, 2, . . . (A.10) 2. The sequence of eigenvalues λi(Γt) is non-increasing, i.e., λi(Γt) λi(Γt+1) (A.11) Proof. We begin with the first claim. By definition of Gt we have: Gt+1 Gt = f(xt+1) f(xt+1) 0 (A.12) which in turn yields that Gt is a Löwner non-decreasing sequence, i.e.,: Gt+1 Gt for all t = 1, 2, . . . (A.13) Hence, by applying Löwner-Heinz inequality we readily get: 1 2 t for all t = 1, 2, . . . (A.14) and so by applying Weyl s monotonicity thoerem for Γt = G 1 2 t we have: and λi(Γt) λi(Γt+1). We proceed by providing the following proposition Proposition A.3. Assume that f : Rd R is smooth and xt are the iterates of (A.9) run with the adaptive step-size policy (ADAFULL). Then, the following hold: 1. The sequence of the eigenvalues {λi(Γt)}t N converges to a strictly positive value λ i for all i = 1, 2, . . . , d, i.e., lim t + λi(Γt) = inf t N λi(Γt) = λ i > 0 for all i = 1, 2, . . . d (A.15) 2. The sequence { f(xt) 2 }t N is summable, i.e., t=0 f(xt) 2 < + (A.16) Proof. We start with the proof of our first claim. By combining the fact that {Γt}t N is a decreasing sequence of matrices and Weyl s monotonicity inequality we have that for every i = 1, 2, . . . d the respective sequence of eigenvalues λi(Γt) is non-increasing. Moreover, since Γt is positive definite for all t = 1, 2, . . . we readily get for all i = 1, 2, . . . d the sequence {λi(Γt)}t N is bounded from below by 0. Therefore, for all i = 1, 2, . . . 
d the limit of the sequence {λi(Γt)}t N exists and more precisely: lim t + λi(Γt) = inf t N λi(Γt) = λ i 0 for all i = 1, 2, . . . d (A.17) Ada Grad Avoids Saddle Points We assume now that there exists some i0 {1, 2, . . . , d} such that: lim t + λi0(Γt) = 0 (A.18) Then, by the fact that f is smooth, we can choose β > L and have: f(xt+1) f(xt) + f(xt), xt+1 xt + β 2 xt+1 xt 2 = f(xt) f(xt) Γt f(xt) + β 2 Γt f(xt) 2 = f(xt) f(xt) Γt f(xt) + β 2 f(xt) Γ t f(xt)Γt 2 f(xt) Γt f(xt) 1 2 f(xt) I βΓ t Γt f(xt) and by rearranging we have: 1 2 f(xt) Γt f(xt) f(xt) f(xt+1) 1 2 f(xt) I βΓ t Γt f(xt) = f(xt) f(xt+1) 1 2 f(xt) [I βΓt] Γt f(xt) Now, since Γt is decreasing we have: 1 δ0 I = Γ0 Γt (A.19) where under the assumption δ0 > β yields: I βΓt 0 for all t = 1, 2, . . . (A.20) and hence by telescoping we have: t=0 f(xt) Γt f(xt) f(x1) inf t N f(xt) (A.21) On the other hand, we bound the quantity 1 2 PT t=0 f(xt) Γt f(xt) from below as follows: t=0 f(xt) Γt f(xt) t=0 f(xt) ΓT f(xt) t=0 tr(ΓT f(xt) f(xt) ) t=0 f(xt) f(xt) ) t=0 f(xt) f(xt) δ2 0I = tr(ΓT GT ) δ2 0 tr(ΓT ) Now since by definition Γt = G 1 2 t we have that ΓT GT = G 1 2 T = Γ 1 T . Furthermore by the monotonicity of the trace operator (cf. +++) we have that: δ2 0 tr(ΓT ) δ2 0 tr(Γ0) (A.22) and by the fact Γ0 = 1 δ0 I we have: δ2 0 tr(ΓT ) δ0d (A.23) Ada Grad Avoids Saddle Points and hence we have: t=0 f(xt) Γt f(xt) tr(Γ 1 T ) δ0d 1 λi(ΓT ) δ0d 1 λi0(ΓT ) δ0d Therefore, summarizing the above estimations we get: 1 λi0(ΓT ) f(x1) inf t N f(xt) + δ0d (A.24) Now, by letting T + we get that 1 λi0(ΓT ) + since we assumed that λi0(ΓT ) 0. So, summarizing the above estimations: + f(x1) inf t N f(xt) + δ0d < + (A.25) which is a contradiction. Hence, for all i = 1, 2, . . . d the sequence of the eigenvalues λi(Γt) converges to a strictly positive value λ i > 0 and therefore the result follows. Now, we turn our attention towards the proof of our second claim. In particular, by the above we have: t=0 f(xt) Γt f(xt) f(x1) inf t N f(xt) (A.26) Moreover, we have: t=0 f(xt) Γt f(xt) t=0 f(xt) ΓT f(xt) t=0 f(xt) f(xt) t=0 f(xt) 2 So, we have: t=1 f(xt) 2 f(x1) inf t f(xt) (A.27) Now, by letting T + and recalling from our previous claim that: lim T + λmin(ΓT ) = λ min > 0 (A.28) we get: T X t=0 f(xt) 2 1 λ min f(x1) inf t N f(xt) < + (A.29) and the result follows. We proceed by showing that matrix sequence {Gt}t N itself stabilizes asymptotically around a positive definite matix G . Formally, we have the following proposition. Ada Grad Avoids Saddle Points Proposition A.4. Assume that f : Rd R is smooth and xt are the iterates of (A.9) run with the adaptive step-size policy (ADAFULL). Then, the limit of {Gt}t N exists and in particular we have: lim t + Gt = G with G 0 (A.30) Proof. We start by showing the limit existence of Gt. Fix i = j {1, 2, . . . d}. We then have: For the diagonal terms of Gt: [Gt]i,i: s=0 f(xs) f(xs) # f(xs) f(xs) s=0 ( f i(xs))2 t=0 f(xs) f(xs) with the last inequality being obtained by Proposition A.3. So, since the sequence {[Gt]i,i}t N is a non-decreasing and upper-bounded its limit exists. For the terms [Gt]i,j we have: s=0 f(xs) f(xs) # f(xs) f(xs) s=0 f i(xs) f j(xs) Now, by applying Cauchy-Schwartz inequality we have: s=0 | f i(xs) f j(xs)| s=0 ( f i(xs))2 s=0 ( f i(xs))2 s=1 f(xt) f(xt) 2 with the last strict inequality being obtained by Proposition A.3. So, we get that Pt s=1 f i(xs) f j(xs) converges absolutely, which yields that it converges. This in turn yields that the limit of [Gt]i,j exists. 
We proceed by showing that the matrix sequence {G_t}_{t∈N} itself stabilizes asymptotically around a positive definite matrix G^*. Formally, we have the following proposition.

Proposition A.4. Assume that f: R^d → R is smooth and x_t are the iterates of (A.9) run with the adaptive step-size policy (ADAFULL). Then the limit of {G_t}_{t∈N} exists and, in particular, we have
$$\lim_{t\to+\infty} G_t = G^* \quad \text{with } G^* \succ 0. \qquad (A.30)$$

Proof. We start by showing that the limit of G_t exists. Fix i ≠ j ∈ {1, 2, ..., d}. For the diagonal terms [G_t]_{i,i} we have
$$[G_t]_{i,i} = \delta_0^2 + \sum_{s=1}^{t} \big(\partial_i f(x_s)\big)^2 \le \delta_0^2 + \sum_{s=1}^{\infty} \|\nabla f(x_s)\|^2 < +\infty,$$
with the last inequality obtained from Proposition A.3. So, since the sequence {[G_t]_{i,i}}_{t∈N} is non-decreasing and upper-bounded, its limit exists. For the terms [G_t]_{i,j} we have
$$[G_t]_{i,j} = \sum_{s=1}^{t} \partial_i f(x_s)\,\partial_j f(x_s).$$
Now, by applying the Cauchy–Schwarz inequality, we have
$$\sum_{s=1}^{t} \big|\partial_i f(x_s)\,\partial_j f(x_s)\big| \le \Big(\sum_{s=1}^{t}\big(\partial_i f(x_s)\big)^2\Big)^{1/2}\Big(\sum_{s=1}^{t}\big(\partial_j f(x_s)\big)^2\Big)^{1/2} \le \sum_{s=1}^{\infty} \|\nabla f(x_s)\|^2 < +\infty,$$
with the last inequality again obtained from Proposition A.3. So we get that Σ_s ∂_i f(x_s)∂_j f(x_s) converges absolutely, which yields that it converges. This in turn yields that the limit of [G_t]_{i,j} exists.

So, summarizing, we get that the limit of G_t exists, i.e.,
$$\lim_{t\to+\infty} G_t = G^*. \qquad (A.31)$$
On the other hand, concerning the positive definiteness of G^*, we have for all i = 1, 2, ..., d:
$$\lambda_i(G^*) = \lim_{t\to+\infty}\lambda_i(G_t) = \lim_{t\to+\infty}\lambda_i(\Gamma_t^{-2}) = \lim_{t\to+\infty}\frac{1}{\lambda_i(\Gamma_t)^2} = \frac{1}{(\lambda_i^*)^2},$$
which by invoking Proposition A.3 yields
$$\lambda_i(G^*) > 0 \quad \text{for all } i = 1, 2, \ldots, d, \qquad (A.32)$$
which concludes the proof. ∎

B Proof of Proposition 2

Proof. Recall that, in the context preceding Proposition 2, we denoted η(t, x) = −Γθ(x) − (Γ_t − Γ)∇f(x). The proof consists of two parts.

Part 1: Recall that θ(x) is the remainder of the Taylor expansion of ∇f(x) at 0, i.e.,
$$\nabla f(x) = \nabla f(0) + \nabla^2 f(0)x + \theta(x) = \nabla^2 f(0)x + \theta(x), \qquad (B.1)\text{–}(B.2)$$
so that θ(x) = ∇f(x) − ∇²f(0)x. The differential of θ(x) is Dθ(x) = D(∇f(x) − ∇²f(0)x) = ∇²f(x) − ∇²f(0). From the fundamental theorem of calculus and the chain rule, we have
$$\theta(x) - \theta(y) = \int_0^1 \frac{d}{dt}\theta\big(tx + (1-t)y\big)\,dt = \int_0^1 D\theta(z)\big|_{z=tx+(1-t)y}\,\frac{d}{dt}\big(tx+(1-t)y\big)\,dt = \int_0^1 D\theta(z)\big|_{z=tx+(1-t)y}\,(x-y)\,dt. \qquad (B.3)\text{–}(B.5)$$
Thus, the norm of the difference θ(x) − θ(y) can be estimated as
$$\|\theta(x) - \theta(y)\| \le \int_0^1 \big\|D\theta(z)\big|_{z=tx+(1-t)y}(x-y)\big\|\,dt \le \Big(\int_0^1 \big\|D\theta(z)\big|_{z=tx+(1-t)y}\big\|\,dt\Big)\|x-y\|, \qquad (B.6)\text{–}(B.9)$$
where Dθ(z)|_{z=tx+(1−t)y} = ∇²f(z) − ∇²f(0), evaluated at z = tx + (1−t)y. Since ∇²f(x) is assumed to be Lipschitz, for any ϵ > 0 there exists a δ-ball B around 0 such that ‖∇²f(z) − ∇²f(0)‖ ≤ ϵ for any z ∈ B, and this completes the proof of Part 1.

Part 2: Since the limit of Γ_t is Γ, given any ϵ > 0 there exists t_0 such that ‖Γ_t − Γ‖ < ϵ for all t > t_0. On the other hand, the gradient ∇f(x) is assumed to be Lipschitz; thus, for x, y ∈ B and t > t_0, it holds that ‖Γ_t − Γ‖‖∇f(x) − ∇f(y)‖ ≤ ϵL‖x − y‖. Therefore, the norm of the difference η(t, x) − η(t, y) is estimated as
$$\|\eta(t,x) - \eta(t,y)\| = \big\|\Gamma\theta(x) + (\Gamma_t - \Gamma)\nabla f(x) - \Gamma\theta(y) - (\Gamma_t - \Gamma)\nabla f(y)\big\| \le \|\Gamma\|\,\|\theta(x) - \theta(y)\| + \|\Gamma_t - \Gamma\|\,\|\nabla f(x) - \nabla f(y)\|. \qquad (B.10)\text{–}(B.12)$$
For a given ϵ > 0, find a neighborhood B of 0 such that for all x, y ∈ B,
$$\|\theta(x) - \theta(y)\| \le \frac{\epsilon}{2\|\Gamma\|}\|x - y\|, \qquad \|\nabla f(x) - \nabla f(y)\| \le L\|x - y\|, \qquad \|\Gamma_t - \Gamma\| \le \frac{\epsilon}{2L} \ \text{ for } t \ge t_0.$$
Thus, we have ‖η(t, x) − η(t, y)‖ ≤ ϵ‖x − y‖, and the proof is complete. ∎
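To make Part 1 concrete, here is a small numerical sketch with an illustrative two-dimensional function (chosen for this example only, not taken from the paper) that has a saddle at the origin. It estimates the Lipschitz constant of the Taylor remainder θ(x) = ∇f(x) − ∇²f(0)x on shrinking boxes around 0 and confirms that it shrinks to 0, which is exactly what Part 1 of Proposition 2 asserts.

```python
import numpy as np

# Illustrative smooth function with a saddle at the origin (not from the paper):
# f(x, y) = x^2 - y^2 + x^3 * y, so grad f(0) = 0 and Hess f(0) = diag(2, -2).
def grad_f(p):
    x, y = p
    return np.array([2 * x + 3 * x**2 * y, -2 * y + x**3])

H0 = np.diag([2.0, -2.0])
theta = lambda p: grad_f(p) - H0 @ p          # Taylor remainder of grad f at 0

rng = np.random.default_rng(0)
for delta in [1.0, 0.1, 0.01]:
    # Empirical Lipschitz constant of theta on the cube of radius delta around 0:
    # it shrinks to 0 as delta shrinks, as Part 1 of Proposition 2 asserts.
    lip = 0.0
    for _ in range(2000):
        u, v = (rng.uniform(-delta, delta, 2) for _ in range(2))
        if not np.allclose(u, v):
            lip = max(lip, np.linalg.norm(theta(u) - theta(v)) / np.linalg.norm(u - v))
    print(f"delta = {delta:5.2f}   empirical Lipschitz constant of theta ~ {lip:.5f}")
```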
C Proof of Theorem 1

In the proof of Theorem 1, we will denote ξ = δ_0² for convenience. Define
$$A(m, n) = (I - \gamma(m, x_0)H)\cdots(I - \gamma(n, x_0)H) = B(m, n) \oplus C(m, n) \quad \text{for } m \ge n, \qquad A(m, n) = I \ \text{for } m < n, \qquad (C.1)$$
where B(m, n) = (I − γ(m, x_0)H^+)···(I − γ(n, x_0)H^+) and C(m, n) = (I − γ(m, x_0)H^−)···(I − γ(n, x_0)H^−). For a vector v, we denote by v^+ its stable component and by v^− its unstable component, i.e., v^+ ∈ E^s and v^− ∈ E^u, with E^s ⊕ E^u = T_0R^d = R^d. Then the solution x_{t+1} = (x^+_{t+1}, x^−_{t+1}) can be written in stable–unstable form:
$$x^+_{t+1} = B(t, 0)x^+_0 + \sum_{i=0}^{t} B(t, i+1)\eta^+(i, x_i), \qquad x^-_{t+1} = C(t, 0)x^-_0 + \sum_{i=0}^{t} C(t, i+1)\eta^-(i, x_i).$$
Then x^−_0 can be written as
$$x^-_0 = C(t, 0)^{-1}x^-_{t+1} - C(t, 0)^{-1}\sum_{i=0}^{t} C(t, i+1)\eta^-(i, x_i). \qquad (C.3)$$
Simplify the notation by writing C_i = I − γ(i, x_0)H^−, so that
$$C(t, 0) = C_t\cdots C_0 \quad \text{and} \quad C(t, 0)^{-1} = C_0^{-1}\cdots C_t^{-1}. \qquad (C.4)$$
Since H^− is the diagonal matrix of all negative eigenvalues, the inverse of C_i is
$$C_i^{-1} = \operatorname{diag}\Big(\frac{1}{1 - \gamma(i, x_0)\lambda_{s+1}}, \ldots, \frac{1}{1 - \gamma(i, x_0)\lambda_d}\Big),$$
and the entries satisfy
$$1 > \frac{1}{1 - \gamma(i, x_0)\lambda_{s+1}} \ge \cdots \ge \frac{1}{1 - \gamma(i, x_0)\lambda_d} > 0.$$
Since γ(i, x_0) > c is uniformly bounded away from 0 in a neighborhood of 0, the norm of C_i^{-1} satisfies ‖C_i^{-1}‖ ≤ 1/(1 − cλ_{s+1}) < 1, and therefore C(t, 0)^{-1} → 0 as t → ∞. Using the expression (C.4), equation (C.3) can be written as
$$x^-_0 = C(t, 0)^{-1}x^-_{t+1} - C(t, 0)^{-1}\sum_{i=0}^{t} C(t, i+1)\eta^-(i, x_0, x_i) = C_0^{-1}\cdots C_t^{-1}x^-_{t+1} - \Big[C_0^{-1}\eta^-(0, x_0, x_0) + \cdots + C_0^{-1}\cdots C_t^{-1}\eta^-(t, x_0, x_t)\Big]. \qquad (C.5)\text{–}(C.6)$$
If x_t → 0, then the sequence stays bounded in a neighborhood of 0.

So, assuming that x_t is bounded, we can push t to ∞ (since the identity holds for every t) and obtain
$$x^-_0 = -\sum_{i=1}^{\infty} C(i-1, 0)^{-1}\eta^-(i-1, x_0, x_{i-1}). \qquad (C.7)$$
So far the above derivation is heuristic, and no existence or uniqueness is guaranteed. But we can go one step further to see where the "stable manifold" comes from. Saying that the initial condition x_0 lies on a manifold is equivalent to saying that x_0 lies on the graph of some mapping from the stable space to the unstable space, i.e., (x^+_0, x^−_0) satisfies x^−_0 = φ(x^+_0) for some φ: E^s → E^u. The right-hand side of (C.7) contains C(i−1, 0)^{-1}, which involves x_0 = (x^+_0, x^−_0), and the sequence x_{i−1} is also determined by the initial condition x_0, so we can write x_{i−1} as a function of x_0: x_{i−1} = x_{i−1}(x^+_0, x^−_0). Therefore, (C.7) can be rewritten as
$$x^-_0 = -\sum_{i=1}^{\infty} C\big(i-1, 0, (x^+_0, x^-_0)\big)^{-1}\eta^-\big(i-1, (x^+_0, x^-_0), x_{i-1}(x^+_0, x^-_0)\big),$$
where the right-hand side is a function of (x^+_0, x^−_0), which we denote by Φ(x^+_0, x^−_0). Then the stable manifold x^−_0 = φ(x^+_0) (as an implicit function) is expected to be solved from the equation x^−_0 = Φ(x^+_0, x^−_0).

It suffices to show that, for the sequence generated by the dynamical system of the ADAGRAD algorithm with initial condition x_0 = (a, x^−_0) (when a satisfies a certain condition), the unstable component x^−_0 is uniquely determined by a. This comes from the fact that the operator T is a contraction mapping on the space of convergent sequences (with limit 0) whose 0-th terms have the same stable component (this is shown in the remainder of this appendix). Since the sequence that converges to 0 while being generated by the dynamical system is the unique fixed point of T, the initial condition x_0 is also unique. But the stable component x^+_0 is fixed, so the uniqueness of the unstable component x^−_0 implies the existence of some function φ: E^s → E^u such that x^−_0 = φ(x^+_0).

Lemma C.1. Suppose
$$x^-_0 = -\sum_{i=1}^{\infty} C(i-1, 0)^{-1}\eta^-(i-1, x_0, x_{i-1}).$$
Then
$$x_{t+1} = B(t, 0)x^+_0 + \sum_{i=0}^{t} B(t, i+1)\eta^+(i, x_0, x_i) - \sum_{i=0}^{\infty} C(t+1+i, t+1)^{-1}\eta^-(t+1+i, x_0, x_{t+1+i}). \qquad (C.8)$$

Proof. It suffices to show that
$$x^-_{t+1} = -\sum_{i=0}^{\infty} C(t+1+i, t+1)^{-1}\eta^-(t+1+i, x_0, x_{t+1+i}). \qquad (C.9)$$
It has been shown that
$$x^-_{t+1} = C(t, 0)x^-_0 + \sum_{i=0}^{t} C(t, i+1)\eta^-(i, x_0, x_i), \qquad (C.10)$$
and we assumed
$$x^-_0 = -\sum_{i=1}^{\infty} C(i-1, 0)^{-1}\eta^-(i-1, x_0, x_{i-1}). \qquad (C.11)$$
Plugging (C.11) into (C.10):
$$x^-_{t+1} = -C(t, 0)\sum_{i=1}^{\infty} C(i-1, 0)^{-1}\eta^-(i-1, x_0, x_{i-1}) + \sum_{i=0}^{t} C(t, i+1)\eta^-(i, x_0, x_i). \qquad (C.12)$$
Splitting Σ_{i=1}^{∞} C(i−1, 0)^{-1}η^−(i−1, x_0, x_{i−1}) into Σ_{i=1}^{t+1}(·) + Σ_{i=t+2}^{∞}(·), we have
$$I := -C(t, 0)\sum_{i=1}^{t+1} C(i-1, 0)^{-1}\eta^-(i-1, x_0, x_{i-1}) - C(t, 0)\sum_{i=t+2}^{\infty} C(i-1, 0)^{-1}\eta^-(i-1, x_0, x_{i-1}) \qquad (C.13)\text{–}(C.14)$$
$$= -\sum_{i=0}^{t} C(t, i+1)\eta^-(i, x_0, x_i) - \sum_{i=0}^{\infty} C(t+1+i, t+1)^{-1}\eta^-(t+1+i, x_0, x_{t+1+i}), \qquad (C.15)$$
where we used that C(t, 0)C(i−1, 0)^{-1} = C(t, i) for i ≤ t+1 and C(t, 0)C(i−1, 0)^{-1} = C(i−1, t+1)^{-1} for i ≥ t+2. Therefore, by putting I back into (C.12), we obtain
$$x^-_{t+1} = -\sum_{i=0}^{\infty} C(t+1+i, t+1)^{-1}\eta^-(t+1+i, x_0, x_{t+1+i}),$$
which is exactly (C.9). ∎

The idea of the Lyapunov–Perron method is to consider the right-hand side of (C.8) as an operator acting on a sequence {x_t}_{t∈N}; note that this sequence is arbitrary and not necessarily generated by gradient descent or any other algorithm. Specifically, we define the action of T on a sequence {x_t}_{t∈N} by
$$(Tx)_{t+1} = B(t, 0)x^+_0 + \sum_{i=0}^{t} B(t, i+1)\eta^+(i, x_0, x_i) - \sum_{i=0}^{\infty} C(t+1+i, t+1)^{-1}\eta^-(t+1+i, x_0, x_{t+1+i}), \qquad t \ge 0.$$
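The stable/unstable variation-of-constants representation above is an exact identity, and the sketch below verifies it numerically on a toy diagonal system. The eigenvalues, the step-size sequence standing in for γ(t, x_0), and the smooth remainder θ are illustrative choices made for this example only.

```python
import numpy as np

# Toy check of the stable/unstable variation-of-constants formula used in the proof:
# for x_{t+1} = (I - gamma_t H) x_t + eta_t with H = diag(H+, H-), the components obey
#   x^+_{t+1} = B(t,0) x^+_0 + sum_i B(t,i+1) eta^+_i,
#   x^-_{t+1} = C(t,0) x^-_0 + sum_i C(t,i+1) eta^-_i.
lam = np.array([1.0, 0.5, -0.7])                 # lambda_1, lambda_2 > 0 > lambda_3
s = 2                                            # first s coordinates form the stable block
H = np.diag(lam)
gamma = lambda t: 0.3 / np.sqrt(1.0 + t)         # illustrative positive step-size sequence
theta = lambda x: 0.1 * np.array([x[1] * x[2], x[0] * x[2], x[0] * x[1]])  # smooth remainder
eta = lambda t, x: gamma(t) * theta(x)

T_hor = 15
xs, etas = [np.array([0.4, -0.3, 0.2])], []
for t in range(T_hor + 1):
    etas.append(eta(t, xs[t]))
    xs.append((np.eye(3) - gamma(t) * H) @ xs[t] + etas[t])

def prod(block, m, n):
    # B(m,n) (block "+") or C(m,n) (block "-"); identity when m < n.
    idx = slice(0, s) if block == "+" else slice(s, 3)
    P = np.eye(idx.stop - idx.start)
    for k in range(n, m + 1):
        P = (np.eye(idx.stop - idx.start) - gamma(k) * np.diag(lam[idx])) @ P
    return P

t = T_hor
plus = prod("+", t, 0) @ xs[0][:s] + sum(prod("+", t, i + 1) @ etas[i][:s] for i in range(t + 1))
minus = prod("-", t, 0) @ xs[0][s:] + sum(prod("-", t, i + 1) @ etas[i][s:] for i in range(t + 1))
print(np.allclose(plus, xs[t + 1][:s]), np.allclose(minus, xs[t + 1][s:]))   # True True
```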
Moreover, by the convention that B(m, n) = I if m < n, the 0-th term of the transformed sequence {(Tx)_t}_{t∈N} is, formally,
$$(Tx)_0 = x^+_0 - \sum_{i=0}^{\infty} C(i, 0)^{-1}\eta^-(i, x_0, x_i),$$
where the first term lies in E^s and the series lies in E^u. This means that the action of the operator T preserves the stable component of the 0-th term of any sequence on which T acts. Another important property of T is the following: in view of the definitions of B(m, n), C(m, n) and η(t, x_0, x_t), if the sequence {x_t}_{t∈N} is generated by the algorithm (adaptive gradient descent), i.e.,
$$x_{t+1} = (I - \gamma(t, x_0)H)x_t + \eta(t, x_0, x_t),$$
and converges to 0 (so that Lemma C.1 applies), then such a sequence is a "fixed point" of the operator T, i.e., Tx = x, where x = {x_t}_{t∈N} and Tx = {(Tx)_t}_{t∈N}.

Proposition C.1. lim_{t→∞}(Tx)_t = 0 if x_t → 0.

Proof. The matrix B(t, 0) is of the form (I − γ(t, x_0)H^+)···(I − γ(0, x_0)H^+), where H^+ is the diagonal matrix of positive eigenvalues, so B(t, 0) → 0 as t → ∞. Combining this with Lemmas C.2 and C.3, we conclude the result. ∎

Lemma C.2. Suppose that lim_{t→∞} x_t = 0. Then
$$\lim_{t\to\infty}\sum_{i=0}^{t} B(t, i+1)\eta^+(i, x_0, x_i) = 0.$$

Proof. Denote γ_t = γ(t, x_0) for short, and estimate:
$$\Big\|\sum_{i=0}^{t} B(t, i+1)\eta^+(i, x_0, x_i)\Big\| \le \sum_{i=0}^{t}\|B(t, i+1)\|\,\|\eta^+(i, x_0, x_i)\| \le \sum_{i=0}^{t}\|B(t, i+1)\|\,\|\eta(i, x_0, x_i)\| \le \epsilon\sum_{i=0}^{t}\|B(t, i+1)\|\,\gamma_i\|x_i\| = \epsilon\sum_{i=0}^{t}(1-\gamma_t\lambda_s)\cdots(1-\gamma_{i+1}\lambda_s)\,\gamma_i\|x_i\|.$$
The last equality, i.e., ‖B(t, i+1)‖ = (1 − γ_tλ_s)···(1 − γ_{i+1}λ_s), follows from the definition of B(t, i+1) as the diagonal matrix
$$B(t, i+1) = \operatorname{diag}\big((1-\gamma_t\lambda_1)\cdots(1-\gamma_{i+1}\lambda_1),\; \ldots,\; (1-\gamma_t\lambda_s)\cdots(1-\gamma_{i+1}\lambda_s)\big), \qquad \lambda_1 \ge \cdots \ge \lambda_s > 0.$$
Denote
$$S_t := \sum_{i=0}^{t}(1-\gamma_t\lambda_s)\cdots(1-\gamma_{i+1}\lambda_s)\,\gamma_i\|x_i\| = (1-\gamma_t\lambda_s)\cdots(1-\gamma_1\lambda_s)\gamma_0\|x_0\| + \cdots + \gamma_t\|x_t\|.$$
Then
$$S_{t+1} = (1-\gamma_{t+1}\lambda_s)\cdots(1-\gamma_1\lambda_s)\gamma_0\|x_0\| + \cdots + (1-\gamma_{t+1}\lambda_s)\gamma_t\|x_t\| + \gamma_{t+1}\|x_{t+1}\| = (1-\gamma_{t+1}\lambda_s)S_t + \gamma_{t+1}\|x_{t+1}\| = S_t + \gamma_{t+1}\big(\|x_{t+1}\| - \lambda_s S_t\big).$$
Subtracting S_t, we have S_{t+1} − S_t = γ_{t+1}(‖x_{t+1}‖ − λ_sS_t), and the following observation for the sequence {S_t}_{t∈N} is immediate:
Case 1: If S_{t+1} − S_t > 0, then ‖x_{t+1}‖ − λ_sS_t > 0;
Case 2: If S_{t+1} − S_t < 0, then ‖x_{t+1}‖ − λ_sS_t < 0;
Case 3: If S_{t+1} − S_t = 0, then ‖x_{t+1}‖ − λ_sS_t = 0.
Note that Case 3 trivially implies that S_t → 0, by the lemma's assumption that ‖x_t‖ → 0. Moreover, Case 1 can be interpreted as follows: if the sequence S_t increases successively over an interval of integers t ∈ [n, n+m], then the terms of S_t over this interval are dominated by terms of the convergent sequence {‖x_t‖/λ_s}_{t∈N}, because
$$\|x_{t+1}\| - \lambda_s S_t > 0 \iff S_t < \frac{\|x_{t+1}\|}{\lambda_s}.$$
Therefore, the interval of integers over which S_t increases successively cannot have infinite length, and thus Case 2 has to occur infinitely often. Then the sequence S_t must have one of the following two patterns:
Pattern 1: S_t has infinitely many intervals of integers on which it is increasing successively;
Pattern 2: S_t is decreasing after a certain t ∈ N.
If S_t is of Pattern 1, then lim sup_t S_t ≤ (1/λ_s) lim_t ‖x_t‖ = 0. If S_t is of Pattern 2, then S_t is convergent and we can denote its limit by S_∞ ∈ [0, ∞). Taking the limit as t → ∞ in the relation S_{t+1} − S_t = γ_{t+1}(‖x_{t+1}‖ − λ_sS_t), the left-hand side tends to 0 while ‖x_{t+1}‖ − λ_sS_t → −λ_sS_∞; since γ_{t+1} ≥ c > 0 (recall that γ_t = γ(t, x_0) is uniformly bounded away from 0 if x_0 is taken from a ball centered at 0), it must hold that S_∞ = 0. So, by the estimate at the beginning of the proof,
$$\Big\|\sum_{i=0}^{t} B(t, i+1)\eta^+(i, x_0, x_i)\Big\| \le \epsilon\,S_t \to 0$$
provided that lim_{t→∞} x_t = 0. The proof is complete. ∎
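The key step above is that the recursion S_{t+1} = (1 − γ_{t+1}λ_s)S_t + γ_{t+1}‖x_{t+1}‖ drives S_t to 0 whenever ‖x_t‖ → 0 and γ_t stays bounded in [c, 1/√ξ]. The following tiny sketch illustrates this; the step-size sequence and the decaying norms are illustrative stand-ins, not the paper's quantities.

```python
import numpy as np

# Illustration of the weighted-sum recursion from the proof of Lemma C.2:
#   S_{t+1} = (1 - gamma_{t+1} * lam_s) * S_t + gamma_{t+1} * ||x_{t+1}||.
# With gamma_t in [c, c + 0.3] and ||x_t|| -> 0, S_t is driven to 0.
lam_s, c = 0.5, 0.2
rng = np.random.default_rng(1)
gamma = lambda t: c + 0.3 * rng.random()            # bounded away from 0, bounded above
x_norm = lambda t: 5.0 / (1.0 + t) ** 1.5           # any sequence with ||x_t|| -> 0

S = gamma(0) * x_norm(0)
for t in range(1, 20001):
    S = (1.0 - gamma(t) * lam_s) * S + gamma(t) * x_norm(t)
print("S_t after 20000 steps:", S)   # ~ 0, matching the conclusion of Lemma C.2
```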
Lemma C.3. Suppose that lim_{t→∞} x_t = 0. Then
$$\lim_{t\to\infty}\sum_{i=0}^{\infty} C(t+1+i, t+1)^{-1}\eta^-(t+1+i, x_0, x_{t+1+i}) = 0.$$

Proof. The estimate gives
$$\Big\|\sum_{i=0}^{\infty} C(t+1+i, t+1)^{-1}\eta^-(t+1+i, x_0, x_{t+1+i})\Big\| \le \sum_{i=0}^{\infty}\|C(t+1+i, t+1)^{-1}\|\,\|\eta(t+1+i, x_0, x_{t+1+i})\| \le \epsilon\sum_{i=0}^{\infty}\|C(t+1+i, t+1)^{-1}\|\,\gamma_{t+1+i}\|x_{t+1+i}\|.$$
Since Γ(x_0) = lim_{t→∞} γ(t, x_0) is a continuous function of x_0, if x_0 is taken from a compact ball centered at 0, then γ(t, x_0) is uniformly bounded above. Thus there exists a constant K > 0 such that γ(t, x_0) ≤ K for all t ∈ N and all x_0 ∈ B(0) with x^+_0 = a. On the other hand, the norm of x_{t+1+i} for i ≥ 0 satisfies ‖x_{t+1+i}‖ ≤ sup_{k>t}‖x_k‖, where sup_{k>t}‖x_k‖ only depends on t. So
$$\epsilon\sum_{i=0}^{\infty}\|C(t+1+i, t+1)^{-1}\|\,\gamma_{t+1+i}\|x_{t+1+i}\| \le \epsilon K\,\sup_{k>t}\|x_k\|\sum_{i=0}^{\infty}\|C(t+1+i, t+1)^{-1}\|. \qquad (C.16)$$
The definition of C(m, n) gives
$$C(t+1+i, t+1)^{-1} = (I - \gamma_{t+1}H^-)^{-1}\cdots(I - \gamma_{t+1+i}H^-)^{-1} = \operatorname{diag}\Big(\prod_{k=0}^{i}\frac{1}{1-\gamma_{t+1+k}\lambda_{s+1}},\; \ldots,\; \prod_{k=0}^{i}\frac{1}{1-\gamma_{t+1+k}\lambda_d}\Big).$$
Recall that 0 > λ_{s+1} ≥ ··· ≥ λ_d (so that |λ_{s+1}| = −λ_{s+1}), hence
$$\prod_{k=0}^{i}\frac{1}{1-\gamma_{t+1+k}\lambda_{s+1}} \ge \cdots \ge \prod_{k=0}^{i}\frac{1}{1-\gamma_{t+1+k}\lambda_d},$$
and therefore the operator norm of C(t+1+i, t+1)^{-1} is
$$\|C(t+1+i, t+1)^{-1}\| = \prod_{k=0}^{i}\frac{1}{1-\gamma_{t+1+k}\lambda_{s+1}}.$$
Furthermore, γ_t = γ(t, x_0) ≥ c, i.e., it is uniformly bounded away from 0 if x_0 is taken from a ball centered at 0, so 1/(1 − γ_tλ_{s+1}) ≤ 1/(1 − cλ_{s+1}), and thus
$$\|C(t+1+i, t+1)^{-1}\| \le \Big(\frac{1}{1-c\lambda_{s+1}}\Big)^{i+1}.$$
Since the ratio 1/(1 − cλ_{s+1}) is less than 1, the series of inverse norms is dominated by a convergent geometric series:
$$\sum_{i=0}^{\infty}\|C(t+1+i, t+1)^{-1}\| \le \sum_{i=0}^{\infty}\Big(\frac{1}{1-c\lambda_{s+1}}\Big)^{i+1} = \frac{1}{c\,|\lambda_{s+1}|} =: \mathcal{C}.$$
Combining this with the estimate at the beginning and (C.16), we conclude that
$$\Big\|\sum_{i=0}^{\infty} C(t+1+i, t+1)^{-1}\eta^-(t+1+i, x_0, x_{t+1+i})\Big\| \le \epsilon K\,\sup_{k>t}\|x_k\|\sum_{i=0}^{\infty}\|C(t+1+i, t+1)^{-1}\| \le \epsilon K\,\mathcal{C}\,\sup_{k>t}\|x_k\| = \frac{\epsilon K}{c\,|\lambda_{s+1}|}\,\sup_{k>t}\|x_k\|. \qquad (C.17)\text{–}(C.22)$$
By the assumption that lim_{t→∞} x_t = 0, we have sup_{k>t}‖x_k‖ → 0 as t → ∞, so the left-hand side tends to 0 as t → ∞, and the proof is complete. ∎
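The geometric domination used above is easy to check numerically: the tail products ‖C(t+1+i, t+1)^{-1}‖ decay at least geometrically, so their sum stays below 1/(c|λ_{s+1}|). The step-size sequence below is an illustrative stand-in for γ(t, x_0).

```python
import numpy as np

# Check of the tail bound in the proof of Lemma C.3: with gamma_t >= c > 0 and lambda_{s+1} < 0,
#   ||C(t+1+i, t+1)^{-1}|| <= (1/(1 - c*lambda_{s+1}))^{i+1},
# so the series of inverse norms is bounded by the geometric sum 1/(c*|lambda_{s+1}|).
lam_u, c = -0.8, 0.2
gamma = lambda t: c + 0.3 / (1.0 + t)           # bounded away from 0, so gamma(t) >= c

t, terms = 5, []
for i in range(200):
    norm_Cinv = np.prod([1.0 / (1.0 - gamma(k) * lam_u) for k in range(t + 1, t + 2 + i)])
    terms.append(norm_Cinv)
rho = 1.0 / (1.0 - c * lam_u)
print("truncated series of inverse norms:", sum(terms))
print("geometric bound 1/(c*|lam_u|)    :", 1.0 / (c * abs(lam_u)), "=", rho / (1.0 - rho))
```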
We summarize two important properties of the operator T. Suppose {x_t}_{t∈N} is a sequence in a ball centered at 0 such that lim_{t→∞} x_t = 0 and x^+_0 = a. Then the transformed sequence {(Tx)_t}_{t∈N} satisfies:
1. (Tx)^+_0 = a;
2. lim_{t→∞}(Tx)_t = 0.
In other words, the operator T transforms a sequence converging to 0 whose 0-th term has stable component a into another sequence converging to 0 whose 0-th term has stable component a. Let X(a, 0) = {{x_t}_{t∈N} : lim_{t→∞} x_t = 0, x^+_0 = a}. Define the metric on X(a, 0) as follows.

Definition 1. Let u = {u_t}_{t∈N} and v = {v_t}_{t∈N} be two sequences in X(a, 0). Then the distance d(u, v) is defined as d(u, v) = sup_{n≥0}{‖u_n − v_n‖}.

The following lemma shows that the sequences satisfying the above two properties form a complete metric space.

Lemma C.4. X(a, 0) is a complete metric space.

Proof. Let u_1 = {u_{1,j}}_{j∈N}, u_2 = {u_{2,j}}_{j∈N}, ..., u_i = {u_{i,j}}_{j∈N}, ... be a sequence of sequences, each converging to 0 and with u^+_{i,0} = a, i.e., the 0-th term of u_i has stable component equal to a. Suppose that {u_i}_{i∈N} is Cauchy, i.e., given any ϵ > 0 there exists an integer L > 0 such that
$$d(u_n, u_m) = \sup_{j\ge 0}\{\|u_{n,j} - u_{m,j}\|\} < \epsilon$$
for all n, m > L. This implies that, for each j, ‖u_{n,j} − u_{m,j}‖ < ϵ for all n, m > L; equivalently, each sequence {u_{k,j}}_{k∈N} with fixed j is Cauchy. Therefore, for each j there exists a point u_{∞,j} such that lim_{k→∞} u_{k,j} = u_{∞,j}. We denote the limit sequence by u_∞ = {u_{∞,j}}_{j∈N}. Letting m → ∞, we have ‖u_{n,j} − u_{∞,j}‖ ≤ ϵ for all n > L and all j. Since lim_{j→∞} u_{n,j} = 0, we can conclude that lim_{j→∞} u_{∞,j} = 0; moreover u^+_{∞,0} = a, since u^+_{k,0} = a for every k. This means that the limit sequence u_∞ belongs to X(a, 0). ∎

Denote by X(0, a, δ) the space of sequences {x_t}_{t∈N} satisfying x^+_0 = a, lim_{t→∞} x_t = 0, and ‖x_t‖ ≤ δ for all t ∈ N.

Lemma C.5. Suppose (ϵ, δ) is a pair of positive constants such that the Lipschitz condition is satisfied. Then ϵ can be adjusted so that, for a sequence {x_t}_{t∈N} with x^+_0 = a (‖a‖ ≤ δ), ‖x^+_t‖ ≤ δ and ‖x^−_t‖ ≤ δ for all t ≥ 0, the transformed sequence Tx satisfies (Tx)^+_0 = a, ‖(Tx)^+_t‖ ≤ δ, and ‖(Tx)^−_t‖ ≤ δ for all t ≥ 0.

Proof. When t = 0, we have
$$(Tx)^-_0 = -\sum_{i=0}^{\infty} C(i, 0)^{-1}\eta^-(i, x_0, x_i),$$
and the estimate gives
$$\|(Tx)^-_0\| \le \sum_{i=0}^{\infty}\|C(i, 0)^{-1}\|\,\|\eta^-(i, x_0, x_i)\|. \qquad (C.23)\text{–}(C.25)$$
Since ‖η^−(i, x_0, x)‖ ≤ ‖η(i, x_0, x)‖ = γ(i, x_0)‖θ(x_i)‖, the Lipschitz condition ‖θ(x)‖ ≤ ϵ‖x‖ gives
$$\|\eta^-(i, x_0, x_i)\| \le \gamma(i, x_0)\,\epsilon\,\|x_i\| = \gamma(i, x_0)\,\epsilon\sqrt{\|x_i^+\|^2 + \|x_i^-\|^2} \le \gamma(i, x_0)\,\epsilon\sqrt{\delta^2 + \delta^2} = \sqrt{2}\,\gamma(i, x_0)\,\epsilon\,\delta. \qquad (C.26)\text{–}(C.30)$$
Then the estimate of ‖(Tx)^−_0‖ can be completed as follows:
$$\|(Tx)^-_0\| \le \sqrt{2}\,\epsilon\,\delta\sum_{i=0}^{\infty}\|C(i, 0)^{-1}\|\,\gamma(i, x_0) \le \frac{\sqrt{2}\,\epsilon\,\delta}{\sqrt{\xi}}\sum_{i=0}^{\infty}\|C(i, 0)^{-1}\| \le \frac{\sqrt{2}\,\epsilon\,\delta}{\sqrt{\xi}\,c\,|\lambda_{s+1}|}. \qquad (C.31)\text{–}(C.34)$$
The second inequality is due to the fact that γ(i, x_0) ≤ γ(0, x_0), which is uniformly bounded above by 1/√ξ. As long as ϵ is chosen so that ϵ ≤ c|λ_{s+1}|√(ξ/2), we have ‖(Tx)^−_0‖ ≤ δ.

To estimate the norms ‖(Tx)^+_t‖ and ‖(Tx)^−_t‖ for t ≥ 1, it is equivalent to estimate ‖(Tx)^±_{t+1}‖ for t ≥ 0. We have
$$\|(Tx)^+_{t+1}\| \le \|B(t, 0)a\| + \sum_{i=0}^{t}\|B(t, i+1)\|\,\|\eta^+(i, x_0, x_i)\|, \qquad (C.36)\text{–}(C.37)$$
and, as before,
$$\|\eta^+(i, x_0, x_i)\| \le \|\eta(i, x_0, x_i)\| = \gamma(i, x_0)\|\theta(x_i)\| \le \gamma(i, x_0)\,\epsilon\,\|x_i\| \le \sqrt{2}\,\gamma(i, x_0)\,\epsilon\,\delta. \qquad (C.38)\text{–}(C.41)$$
Furthermore, recall that B(t, i+1) = (I − γ(t, x_0)H^+)···(I − γ(i+1, x_0)H^+) and γ(i, x_0) ≥ c, so the norm satisfies
$$\|B(t, i+1)\| = (1 - \gamma(t, x_0)\lambda_s)\cdots(1 - \gamma(i+1, x_0)\lambda_s) \le \underbrace{(1 - c\lambda_s)\cdots(1 - c\lambda_s)}_{t-i \text{ factors}} = (1 - c\lambda_s)^{t-i}, \qquad (C.42)\text{–}(C.44)$$
and therefore
$$\sum_{i=0}^{t}\|B(t, i+1)\| \le \sum_{i=0}^{t}(1 - c\lambda_s)^{t-i} = \frac{1 - (1 - c\lambda_s)^{t+1}}{c\lambda_s} \le \frac{1}{c\lambda_s}. \qquad (C.45)\text{–}(C.48)$$
Putting these together,
$$\|(Tx)^+_{t+1}\| \le \|B(t, 0)\|\,\delta + \gamma(0, x_0)\,\sqrt{2}\,\epsilon\,\delta\sum_{i=0}^{t}\|B(t, i+1)\| \le (1 - c\lambda_s)^{t+1}\delta + \frac{\sqrt{2}\,\epsilon\,\delta}{\sqrt{\xi}\,c\lambda_s} \le \Big[(1 - c\lambda_s) + \frac{\sqrt{2}\,\epsilon}{\sqrt{\xi}\,c\lambda_s}\Big]\delta.$$
The last bound shows that ϵ should be chosen so that
$$(1 - c\lambda_s) + \frac{\sqrt{2}\,\epsilon}{\sqrt{\xi}\,c\lambda_s} \le 1, \quad \text{i.e.,} \quad \epsilon \le c^2\lambda_s^2\sqrt{\xi/2},$$
so that ‖(Tx)^+_{t+1}‖ ≤ δ for all t ≥ 0. To estimate ‖(Tx)^−_{t+1}‖, we first recall that
$$(Tx)^-_{t+1} = -\sum_{i=0}^{\infty} C(t+1+i, t+1)^{-1}\eta^-(t+1+i, x_0, x_{t+1+i}),$$
so that, as before,
$$\|(Tx)^-_{t+1}\| \le \sum_{i=0}^{\infty}\|C(t+1+i, t+1)^{-1}\|\,\|\eta^-(t+1+i, x_0, x_{t+1+i})\| \le \sqrt{2}\,\epsilon\,\delta\sum_{i=0}^{\infty}\|C(t+1+i, t+1)^{-1}\|\,\gamma(t+1+i, x_0) \le \frac{\sqrt{2}\,\epsilon\,\delta}{\sqrt{\xi}\,c\,|\lambda_{s+1}|}. \qquad (C.54)\text{–}(C.60)$$
From the above estimate we need to choose ϵ so that ϵ ≤ c|λ_{s+1}|√(ξ/2), and then ‖(Tx)^−_{t+1}‖ ≤ δ. In summary, ϵ needs to be chosen so that
$$\epsilon \le \sqrt{\xi/2}\,\min\big\{c\,|\lambda_{s+1}|,\; c^2\lambda_s^2\big\}, \qquad (C.66)$$
so that the operator T maps the sequences in the way described in the lemma. ∎
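As a quick sanity check on the threshold (C.66) as reconstructed above (and, looking ahead, on the additional contraction thresholds of Lemma C.6 below), the following snippet evaluates them for illustrative values of ξ = δ_0², c, λ_s and λ_{s+1}; the resulting admissible ϵ is strictly positive, so a ball B(0, δ) on which θ is ϵ-Lipschitz always exists. The numerical values are arbitrary choices for this example.

```python
import numpy as np

# Admissible Lipschitz level for theta, cf. (C.66) and the contraction conditions of Lemma C.6.
xi, c, lam_s, lam_u = 4.0, 0.2, 1.0, -0.8      # illustrative delta_0^2, c, lambda_s, lambda_{s+1}
eps_C66 = np.sqrt(xi / 2) * min(c * abs(lam_u), c**2 * lam_s**2)
eps_contraction = min(c * abs(lam_u) * np.sqrt(xi),
                      c * lam_s * abs(lam_u) * np.sqrt(xi) / (lam_s + abs(lam_u)))
print("eps allowed by (C.66)      :", eps_C66)
print("eps allowed by Lemma C.6   :", eps_contraction)
print("overall admissible eps  <= :", min(eps_C66, eps_contraction))   # strictly positive
```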
Lemma C.6. Suppose that the function η(t, z, x) has the form η(t, z, x) = γ(t, z)θ(x), and that θ(x) satisfies: for any ϵ > 0 there exists δ > 0 such that ‖θ(x) − θ(y)‖ ≤ ϵ‖x − y‖ for all x, y ∈ B(δ). Then T is a contraction mapping on X(0, a, δ), i.e., there exists a positive constant κ < 1 such that d(Tu, Tv) ≤ κ d(u, v) for all u, v ∈ X(0, a, δ).

Remark 5. In this lemma we use γ(t, z) for all sequences, ignoring the fact that different sequences have different initial conditions. This is because to any sequence γ(t, z) we can associate a dynamical system — for instance, the initial one we are interested in — and we are actually looking for the stable manifold of this dynamical system at 0, so the sequence γ(t, z) can be regarded as independent of the sequences {x_t}_{t∈N} fed into T.

Proof. Let u = {u_t}_{t∈N} and v = {v_t}_{t∈N} be two sequences in X(0, a, δ). Then the 0-th term of the difference is
$$(Tu)_0 - (Tv)_0 = \Big[u^+_0 - \sum_{i=0}^{\infty}C(i, 0)^{-1}\eta^-(i, u_0, u_i)\Big] - \Big[v^+_0 - \sum_{i=0}^{\infty}C(i, 0)^{-1}\eta^-(i, v_0, v_i)\Big] = -\sum_{i=0}^{\infty}C(i, 0)^{-1}\big(\eta^-(i, u_0, u_i) - \eta^-(i, v_0, v_i)\big), \qquad (C.70)$$
since u^+_0 = v^+_0 = a. So the norm is estimated as follows:
$$\|(Tu)_0 - (Tv)_0\| = \|(Tu)^-_0 - (Tv)^-_0\| \le \sum_{i=0}^{\infty}\|C(i, 0)^{-1}\|\,\|\eta(i, z, u_i) - \eta(i, z, v_i)\| = \sum_{i=0}^{\infty}\|C(i, 0)^{-1}\|\,\gamma(i, z)\,\|\theta(u_i) - \theta(v_i)\|. \qquad (C.71)\text{–}(C.76)$$
Since ‖θ(u_i) − θ(v_i)‖ ≤ ϵ‖u_i − v_i‖ and γ(i, z) ≤ γ(0, z), we have
$$\|(Tu)_0 - (Tv)_0\| \le \gamma(0, z)\,\epsilon\sum_{i=0}^{\infty}\|C(i, 0)^{-1}\|\,\|u_i - v_i\| \le \gamma(0, z)\,\epsilon\,\frac{1}{c\,|\lambda_{s+1}|}\,\sup_{i\ge 0}\|u_i - v_i\| \le \frac{\epsilon}{\sqrt{\xi}\,c\,|\lambda_{s+1}|}\,d(u, v). \qquad (C.77)\text{–}(C.82)$$
Therefore, the first condition on ϵ is that ϵ/(√ξ c|λ_{s+1}|) < 1, which implies ϵ < c|λ_{s+1}|√ξ.

The remaining terms, i.e., (Tu)_{t+1} − (Tv)_{t+1} with t ≥ 0, give the following estimate:
$$(Tu)_{t+1} - (Tv)_{t+1} = B(t, 0)(u^+_0 - v^+_0) + \sum_{i=0}^{t}B(t, i+1)\big(\eta^+(i, z, u_i) - \eta^+(i, z, v_i)\big) - \sum_{i=0}^{\infty}C(t+1+i, t+1)^{-1}\big(\eta^-(t+1+i, z, u_{t+1+i}) - \eta^-(t+1+i, z, v_{t+1+i})\big),$$
and the first term vanishes since u^+_0 = v^+_0. So the norm of the difference satisfies
$$\|(Tu)_{t+1} - (Tv)_{t+1}\| \le \sum_{i=0}^{t}\|B(t, i+1)\|\,\|\eta^+(i, z, u_i) - \eta^+(i, z, v_i)\| + \sum_{i=0}^{\infty}\|C(t+1+i, t+1)^{-1}\|\,\|\eta^-(t+1+i, z, u_{t+1+i}) - \eta^-(t+1+i, z, v_{t+1+i})\|. \qquad (C.83)\text{–}(C.94)$$
Since η(t, z, x) = γ(t, z)θ(x), the Lipschitz condition on θ gives
$$\|\eta^+(i, z, u_i) - \eta^+(i, z, v_i)\| \le \|\eta(i, z, u_i) - \eta(i, z, v_i)\| = \gamma(i, z)\|\theta(u_i) - \theta(v_i)\| \le \gamma(i, z)\,\epsilon\,\|u_i - v_i\|, \qquad (C.95)\text{–}(C.98)$$
and the same argument gives the estimate on η^−:
$$\|\eta^-(t+1+i, z, u_{t+1+i}) - \eta^-(t+1+i, z, v_{t+1+i})\| \le \gamma(t+1+i, z)\,\epsilon\,\|u_{t+1+i} - v_{t+1+i}\|.$$
The norm ‖(Tu)_{t+1} − (Tv)_{t+1}‖ can therefore be estimated as
$$\|(Tu)_{t+1} - (Tv)_{t+1}\| \le \Big(\sum_{i=0}^{t}\|B(t, i+1)\|\,\gamma(i, z) + \sum_{i=0}^{\infty}\|C(t+1+i, t+1)^{-1}\|\,\gamma(t+1+i, z)\Big)\epsilon\,\sup_{t\ge 0}\|u_t - v_t\|, \qquad (C.99)\text{–}(C.104)$$
and since γ(i, z) ≤ γ(0, z) ≤ 1/√ξ, this simplifies to
$$\|(Tu)_{t+1} - (Tv)_{t+1}\| \le \Big(\sum_{i=0}^{t}\|B(t, i+1)\| + \sum_{i=0}^{\infty}\|C(t+1+i, t+1)^{-1}\|\Big)\gamma(0, z)\,\epsilon\,d(u, v) \le \Big(\frac{1}{c\lambda_s} + \frac{1}{c\,|\lambda_{s+1}|}\Big)\frac{\epsilon}{\sqrt{\xi}}\,d(u, v). \qquad (C.105)\text{–}(C.107)$$
The coefficient (1/(cλ_s) + 1/(c|λ_{s+1}|)) ϵ/√ξ needs to be adjusted so that it is less than 1, and this can be achieved by tuning ϵ so that
$$\epsilon < \frac{c\,\lambda_s\,|\lambda_{s+1}|\,\sqrt{\xi}}{\lambda_s + |\lambda_{s+1}|}.$$
Since c|λ_{s+1}|√(ξ/2) < c|λ_{s+1}|√ξ, combining the above with the condition on ϵ from the previous lemma, we conclude that
$$\epsilon \le \min\Big\{c\,|\lambda_{s+1}|\sqrt{\xi/2},\;\; c^2\lambda_s^2\sqrt{\xi/2},\;\; \frac{c\,\lambda_s\,|\lambda_{s+1}|\,\sqrt{\xi}}{\lambda_s + |\lambda_{s+1}|}\Big\}$$
suffices to make T a contraction mapping, i.e., there exists κ < 1 such that d(Tu, Tv) ≤ κ d(u, v). ∎
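The contraction can be observed directly on finite truncations of sequences. The sketch below implements the operator T (one-dimensional stable and unstable blocks, an illustrative step-size sequence standing in for γ(t, x_0), and a small smooth remainder θ — all illustrative choices) and estimates d(Tu, Tv)/d(u, v) for two random sequences in X(0, a, δ); with a small remainder and a small δ the ratio is well below 1, as Lemma C.6 requires.

```python
import numpy as np

lam_s, lam_u = 1.0, -0.8
gamma = lambda t: 0.2 + 0.3 / (1.0 + t)                  # c = 0.2, gamma(0) = 0.5
theta = lambda x: 0.05 * np.array([x[0] * x[1], x[0] ** 2])
eta = lambda t, x: gamma(t) * theta(x)

def B(m, n):            # product (1 - gamma_m lam_s)...(1 - gamma_n lam_s); equals 1 if m < n
    return np.prod([1.0 - gamma(k) * lam_s for k in range(n, m + 1)])

def Cinv(m, n):         # inverse of C(m, n) for the (scalar) unstable block
    return np.prod([1.0 / (1.0 - gamma(k) * lam_u) for k in range(n, m + 1)])

def T(x):               # x has shape (N, 2); entries beyond the truncation are treated as 0
    N, a = len(x), x[0][0]
    etas = [eta(i, x[i]) for i in range(N)]
    out = np.zeros_like(x)
    out[0] = [a, -sum(Cinv(i, 0) * etas[i][1] for i in range(N))]
    for t in range(N - 1):
        plus = B(t, 0) * a + sum(B(t, i + 1) * etas[i][0] for i in range(t + 1))
        minus = -sum(Cinv(t + 1 + i, t + 1) * etas[t + 1 + i][1] for i in range(N - t - 1))
        out[t + 1] = [plus, minus]
    return out

rng = np.random.default_rng(2)
N, delta, a = 30, 0.1, 0.05
u = delta * rng.uniform(-1, 1, (N, 2)); u[0, 0] = a      # same stable component a
v = delta * rng.uniform(-1, 1, (N, 2)); v[0, 0] = a
d = lambda p, q: np.linalg.norm(p - q, axis=1).max()      # the sup metric of Definition 1
print("contraction ratio d(Tu,Tv)/d(u,v) =", d(T(u), T(v)) / d(u, v))   # well below 1
```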
The following lemma guarantees that, by a linear transformation whose matrix comes from the diagonalization of the Hessian at the saddle point, the stable manifold of the diagonalized dynamical system can be carried over to the stable manifold of the diagonalizable dynamical system.

Lemma C.7. Let G be a diagonalizable real matrix. Then the dynamical system
$$x_{t+1} = (I - \gamma(t, x_0)G)x_t - \gamma(t, x_0)\theta(x_t)$$
has a local stable manifold at 0.

Proof. In this proof we write γ_t = γ(t, x_0) for short. Since G is diagonalizable, there exists an invertible matrix Q such that G = Q^{-1}HQ, and hence QGQ^{-1} = H. Consider the linear map z = φ(x) = Qx, and note that φ induces a new dynamical system for z:
$$Q^{-1}z_{t+1} = (I - \gamma_t G)Q^{-1}z_t - \gamma_t\theta(Q^{-1}z_t).$$
Multiplying by Q from the left on both sides, we have
$$z_{t+1} = Q(I - \gamma_t G)Q^{-1}z_t - \gamma_t Q\theta(Q^{-1}z_t) = (I - \gamma_t H)z_t - \gamma_t\hat\theta(z_t), \qquad (C.108)\text{–}(C.109)$$
where θ̂(z) = Qθ(Q^{-1}z). We next verify that θ̂ satisfies the Lipschitz condition, i.e., given any ϵ > 0 there exists δ' > 0 such that
$$\|\hat\theta(w_1) - \hat\theta(w_2)\| = \|Q\theta(Q^{-1}w_1) - Q\theta(Q^{-1}w_2)\| \le \epsilon\|w_1 - w_2\| \qquad (C.110)\text{–}(C.111)$$
for all w_1, w_2 ∈ B(0, δ'). For any given ϵ > 0 and the fixed linear isomorphism Q, there exists δ > 0 such that
$$\|\theta(u_1) - \theta(u_2)\| \le \frac{\epsilon}{\|Q\|\,\|Q^{-1}\|}\|u_1 - u_2\|$$
for all u_1, u_2 ∈ B(0, δ). Denote V := Q(B(0, δ)), i.e., V = {w ∈ R^d : w = Qu for some u ∈ B(0, δ)}. Since x ↦ Qx is a linear diffeomorphism (a change of basis has full rank), V is an open neighborhood of 0. Therefore, there exists an open ball around 0 with radius δ', denoted B(0, δ'), such that B(0, δ') ⊂ V. By the definition of V, for any w_1, w_2 ∈ B(0, δ') ⊂ V there exist u_1, u_2 ∈ B(0, δ) such that
$$w_1 = Qu_1, \quad w_2 = Qu_2, \qquad (C.112)$$
with inverse transformation
$$u_1 = Q^{-1}w_1, \quad u_2 = Q^{-1}w_2. \qquad (C.113)$$
We then have
$$\|\hat\theta(w_1) - \hat\theta(w_2)\| = \|Q\theta(u_1) - Q\theta(u_2)\| \le \|Q\|\,\|\theta(u_1) - \theta(u_2)\| \le \|Q\|\,\frac{\epsilon}{\|Q\|\,\|Q^{-1}\|}\|Q^{-1}w_1 - Q^{-1}w_2\| \le \epsilon\|w_1 - w_2\|. \qquad (C.114)\text{–}(C.120)$$
This completes the verification. Thus the stable manifold (a measure-zero set of initial conditions) of the dynamical system with diagonal linear part can be carried over to the dynamical system with diagonalizable linear part by the linear map Q. ∎
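The conjugation step in Lemma C.7 is an exact algebraic identity, which the short sketch below confirms numerically: iterating the original system in x-coordinates and mapping by Q gives the same trajectory as iterating the conjugated system in z-coordinates. The matrices G, Q, the step-sizes and the remainder θ are illustrative choices for this example.

```python
import numpy as np

# Change of variables of Lemma C.7: with G = Q^{-1} H Q diagonalizable, z = Q x turns
#   x_{t+1} = (I - gamma_t G) x_t - gamma_t * theta(x_t)        into
#   z_{t+1} = (I - gamma_t H) z_t - gamma_t * theta_hat(z_t),   theta_hat(z) = Q theta(Q^{-1} z).
H = np.diag([1.0, -0.5])
Q = np.array([[1.0, 0.3], [0.2, 1.0]])
Qinv = np.linalg.inv(Q)
G = Qinv @ H @ Q
gamma = lambda t: 0.4 / np.sqrt(1.0 + t)
theta = lambda x: 0.05 * np.array([x[0] * x[1], x[0] ** 2])      # smooth remainder, theta(0) = 0
theta_hat = lambda z: Q @ theta(Qinv @ z)

x = np.array([0.3, -0.2])
z = Q @ x
for t in range(30):
    x = (np.eye(2) - gamma(t) * G) @ x - gamma(t) * theta(x)
    z = (np.eye(2) - gamma(t) * H) @ z - gamma(t) * theta_hat(z)
print(np.allclose(Q @ x, z))   # True: the two systems are conjugate under z = Q x
```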
D Proof of Proposition 3

We proceed by showing that ADAGRAD-Norm, ADAGRAD-Diag and Full ADAGRAD are diffeomorphisms if the parameter δ_0 is chosen properly. Throughout the proof, we denote ξ = δ_0² for convenience.

Proof. By "diffeomorphism" we mean that the map sending the current iterate x_t to the next iterate is a diffeomorphism. Recall that the ADAGRAD algorithms have the following form:
$$x_{t+1} = x_t - \Gamma_t\nabla f(x_t), \qquad \Gamma_t = \Big(\xi I + \sum_{s=1}^{t}\nabla f(x_s)\nabla f(x_s)^\top\Big)^{-1/2}$$
for Full ADAGRAD, which we consider first. Since the matrix Γ_t can be written as
$$\Gamma_t = \big(\xi I + S + \nabla f(x_t)\nabla f(x_t)^\top\big)^{-1/2}, \qquad S = \sum_{s=1}^{t-1}\nabla f(x_s)\nabla f(x_s)^\top,$$
the update rule, considered as a mapping acting on x_t, can be reduced to the form
$$\varphi(x) = x - \Omega(x)\nabla f(x), \qquad \Omega(x) = \Gamma_t(x) = \begin{pmatrix}\omega_{11} & \cdots & \omega_{1d} \\ \vdots & & \vdots \\ \omega_{d1} & \cdots & \omega_{dd}\end{pmatrix},$$
where the entries ω_{ij} are functions of x, i.e., ω_{ij} = ω_{ij}(x). The differential of φ(x) is Dφ(x) = I − D(Ω(x)∇f(x)). By the matrix product rule, the i-th row of D(Ω(x)∇f(x)) is Ω_i(x)∇²f(x) + ∇f(x)^⊤ ∂Ω_i(x)/∂x, where Ω_i(x) denotes the i-th row of Ω(x); in other words,
$$D\big(\Omega(x)\nabla f(x)\big) = \Omega(x)\nabla^2 f(x) + \begin{pmatrix}\nabla f(x)^\top\,\partial\Omega_1(x)/\partial x \\ \vdots \\ \nabla f(x)^\top\,\partial\Omega_d(x)/\partial x\end{pmatrix}.$$
Note that the first matrix, Ω(x)∇²f(x), can have norm as small as desired if ξ is taken large enough.

On the other hand, the second matrix is also of small norm if ξ is taken to be large. Indeed, the matrix Ω satisfies the property
$$\Omega^2(x)\big(\xi I + S + \nabla f(x)\nabla f(x)^\top\big) = I,$$
where S is a matrix independent of x; entrywise,
$$\sum_k \big(\Omega^2(x)\big)_{ik}\big(\xi\delta_{kj} + S_{kj} + \partial_k f\,\partial_j f\big) = \delta_{ij}. \qquad (D.3)\text{–}(D.4)$$
Taking the partial derivative with respect to any x_α, we have
$$\sum_k \frac{\partial(\Omega^2)_{ik}}{\partial x_\alpha}\big(\xi\delta_{kj} + S_{kj} + \partial_k f\,\partial_j f\big) + \sum_k (\Omega^2)_{ik}\,\frac{\partial}{\partial x_\alpha}\big(\xi\delta_{kj} + S_{kj} + \partial_k f\,\partial_j f\big) = 0.$$
By the assumption that ∂(∂_k f ∂_j f)/∂x_α is bounded, and since ‖Ω²‖ can be made as small as we please by taking ξ large, the latter term Σ_k(Ω²)_{ik}∂(ξδ_{kj} + S_{kj} + ∂_k f ∂_j f)/∂x_α is uniformly bounded. And thus
$$\frac{\partial\Omega^2}{\partial x_\alpha} = -\,\Omega^2\,\frac{\partial\big(\nabla f(x)\nabla f(x)^\top\big)}{\partial x_\alpha}\,\Omega^2,$$
whose norm vanishes as ξ → ∞; this in turn implies that the norm of ∂Ω/∂x_α approaches 0 as ξ → ∞. We have thus shown that det Dφ → 1 as ξ → ∞, which means that φ is a diffeomorphism for ξ large enough.

The above argument applies verbatim to ADAGRAD-Diag. To see that ADAGRAD-Norm is also a diffeomorphism, we carry out an explicit calculation. The algorithm
$$x_{t+1} = x_t - \frac{1}{\sqrt{\sum_{i=0}^{t}\|\nabla f(x_i)\|^2 + \xi}}\,\nabla f(x_t)$$
can be written as the update rule
$$g(t, x) = x - \frac{1}{\sqrt{S_t + \|\nabla f(x)\|^2}}\,\nabla f(x), \qquad S_t = \sum_{i=0}^{t-1}\|\nabla f(x_i)\|^2 + \xi.$$
For each fixed t ∈ N we can suppress the index t, write γ(x) = 1/√(S + ‖∇f(x)‖²), and keep in mind that S can be chosen as large as we please. Then the update rule is written as g(x) = x − γ(x)∇f(x). The Jacobian of g is
$$Dg(x) = I - \nabla f(x)\nabla\gamma(x)^\top - \gamma(x)\nabla^2 f(x). \qquad (D.5)$$
The term γ(x)∇²f(x) is bounded and can be made arbitrarily small if γ(x) is small, because ∇²f(x) is assumed to be bounded. The rank-one term ∇f(x)∇γ(x)^⊤ is a matrix with entries of the form ∂_i f(x)·∂γ(x)/∂x_j. Since we assumed at the outset that the algorithm runs in a compact subset of R^d, the partial derivatives ∂f/∂x_j are uniformly bounded. Moreover,
$$\frac{\partial\gamma(x)}{\partial x_j} = -\,\frac{\sum_i \partial_i f(x)\,\partial^2_{ij} f(x)}{\big(S + \|\nabla f(x)\|^2\big)^{3/2}}. \qquad (D.6)$$
Recall that for a matrix A the entrywise ℓ_1 norm is ‖A‖_1 = Σ_{i,j}|A_{ij}|, and the equivalence of matrix norms (up to dimension-dependent factors) shows the boundedness of the entries ∂²f/∂x_i∂x_k provided ‖∇²f(x)‖_2 < L, even without the compactness assumption. Thus we have proven that ∂γ/∂x_j can be made arbitrarily small as long as S is large enough. This shows that, at each point x, the Jacobian Dg(x) can be made arbitrarily close to the identity I. Since the determinant det(Dg(x)) is a continuous function of the entries of Dg(x), we have det(Dg(x)) → 1 as γ(x) → 0, and there exists a large enough S (so that γ(x) is close enough to 0) at each point x such that det(Dg(x)) is strictly positive. If x ranges in a compact set, there exists a uniform S such that the update rule g(x) is a local diffeomorphism everywhere. ∎
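The determinant argument for ADAGRAD-Norm is easy to visualize numerically: as S (i.e., δ_0² plus the accumulated squared gradient norms) grows, the Jacobian of the one-step map approaches the identity and its determinant approaches 1. The toy function below is an illustrative choice, and the Jacobian is computed by finite differences.

```python
import numpy as np

# Jacobian of one ADAGRAD-Norm step g(x) = x - grad f(x) / sqrt(S + ||grad f(x)||^2),
# for a toy smooth function f(x, y) = x^2 - y^2 + x*y^2 (illustrative choice).
def grad_f(p):
    x, y = p
    return np.array([2 * x + y**2, -2 * y + 2 * x * y])

def g(p, S):
    gr = grad_f(p)
    return p - gr / np.sqrt(S + gr @ gr)

def jacobian(p, S, h=1e-6):
    J = np.zeros((2, 2))
    for j in range(2):
        e = np.zeros(2); e[j] = h
        J[:, j] = (g(p + e, S) - g(p - e, S)) / (2 * h)
    return J

p = np.array([0.7, -0.4])
for S in [1.0, 10.0, 100.0, 10000.0]:
    J = jacobian(p, S)
    print(f"S = {S:8.1f}   det Dg = {np.linalg.det(J):+.6f}   ||Dg - I|| = {np.linalg.norm(J - np.eye(2)):.4f}")
# As S grows, det Dg -> 1 and Dg -> I, so g is a local diffeomorphism once S is large enough.
```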
E Proof of Theorem 2

We complete the proof of the last theorem by extending the local existence of the stable manifold to a global statement; the measure-zero property of the corresponding set of initial conditions then follows from the fact that the manifold has lower dimension than R^d.

Proof. Note that the dynamical system defined by the ADAGRAD algorithms determines each iterate from the time t and the initial condition x_0; thus the t-th iterate can be thought of as the image of a mapping depending on t and x_0, which we denote by ψ(t, x_0), i.e., x_{t+1} = ψ(t, x_0). We define ψ(m, n, x) = ψ(m, ..., ψ(n+1, ψ(n, x))...) for m > n. The stable set of a set of fixed points A*, denoted W^s(A*), of the dynamical system defined by ψ(t, x) is
$$W^s(A^*) = \{x_0 : \lim_{k\to\infty}\psi(k, 0, x_0) \in A^*\}.$$
Fix a point x_0 ∈ W^s(A*). Since ψ(k, 0, x_0) → x^* ∈ A*, there exists some non-negative integer T such that
$$\psi(t, 0, x_0) \in \bigcup_{x^*_i\in A^*} U_{x^*_i} \quad \text{for all } t \ge T.$$
So ψ(t, 0, x_0) ∈ U_{x^*_i} for some x^*_i ∈ A* and all t ≥ T. This is equivalent to ψ(T + k, T, ψ(T, 0, x_0)) ∈ U_{x^*_i} for all k ≥ 0, which implies that ψ(T, 0, x_0) ∈ ψ^{-1}(T + k, T, U_{x^*_i}) for all k ≥ 0. We then have
$$\psi(T, 0, x_0) \in \bigcap_{k=0}^{\infty}\psi^{-1}(T + k, T, U_{x^*_i}).$$

Denote S_{i,T} := ∩_{k=0}^{∞} ψ^{-1}(T + k, T, U_{x^*_i}); the above relation is then equivalent to x_0 ∈ ψ^{-1}(T, 0, S_{i,T}). Taking the union over all non-negative integers T, we have
$$x_0 \in \bigcup_{T=0}^{\infty}\psi^{-1}(T, 0, S_{i,T}),$$
and taking the union over all i we obtain
$$x_0 \in \bigcup_{i}\bigcup_{T=0}^{\infty}\psi^{-1}(T, 0, S_{i,T}), \quad \text{implying that} \quad W^s(A^*) \subseteq \bigcup_{i}\bigcup_{T=0}^{\infty}\psi^{-1}(T, 0, S_{i,T}).$$
Since S_{i,T} ⊂ W_n(x^*) and W_n(x^*) has codimension at least 1, it follows that S_{i,T} has measure 0 with respect to the volume measure induced by the Riemannian metric on M. Since the image of a set of measure zero under a diffeomorphism is of measure zero, and a countable union of measure-zero sets is still of measure zero, we obtain that W^s(A*) is of measure zero. ∎