Heavy-ball Algorithms Always Escape Saddle Points

Tao Sun¹, Dongsheng Li¹, Zhe Quan², Hao Jiang¹, Shengguo Li¹ and Yong Dou¹
¹College of Computer, National University of Defense Technology
²Hunan University
nudtsuntao@163.com, dsli@nudt.edu.cn, quanzhe@hnu.edu.cn, {haojiang, nudtlsg, yongdou}@nudt.edu.cn

Abstract

Nonconvex optimization algorithms with random initialization have attracted increasing attention recently. It has been shown that many first-order methods always avoid saddle points when started from random points. In this paper, we answer the question: can nonconvex heavy-ball algorithms with random initialization avoid saddle points? The answer is yes! Directly applying the existing proof technique to heavy-ball algorithms is difficult because each heavy-ball iteration involves both the current and the previous point, so the algorithm cannot be formulated as an iteration x_{k+1} = g(x_k) under some mapping g. To this end, we design a new mapping on an enlarged space. After suitable transformations, the heavy-ball algorithm can be interpreted as iterations of this mapping. Theoretically, we prove that heavy-ball gradient descent enjoys a larger stepsize than gradient descent when escaping saddle points. The heavy-ball proximal point algorithm is also considered; we prove that it, too, always escapes saddle points.

1 Introduction

We consider the smooth nonconvex optimization problem

minimize_{x∈ℝ^N} f(x), (1)

where f is a differentiable closed function whose gradient is L-Lipschitz continuous (f may be nonconvex). Moreover, we assume f is C² over ℝ^N, that is, f is twice differentiable and ∇²f(x) is continuous. This paper is devoted to the study of two heavy-ball algorithms for minimizing f with random initializations. Specifically, we focus on whether these two algorithms escape saddle points.
1.1 Heavy-ball Algorithms

We consider the Heavy-Ball Gradient Descent (HBGD) with random initializations x_0, x_1, presented as

x_{k+1} = x_k − γ∇f(x_k) + β(x_k − x_{k−1}), k ≥ 1, (2)

where γ is the stepsize and β is the inertial parameter. If β = 0, the algorithm reduces to basic gradient descent. The heavy-ball method is also known as the momentum method and has been widely used in various settings to accelerate the gradient descent iteration. Theoretically, it has been proved that HBGD enjoys a better convergence factor than both gradient descent and Nesterov's accelerated gradient method, with linear convergence rates, under the condition that the objective function is twice continuously differentiable, strongly convex, and has a Lipschitz continuous gradient. Under the convex and smooth assumption on the objective function, the ergodic O(1/k) rate in terms of the objective value, i.e., f((1/k)·Σ_{i=1}^{k} x_i) − min f = O(1/k), was proved by [Ghadimi et al., 2015]. HBGD was proved to converge linearly in the strongly convex case by [Ghadimi et al., 2015], but the authors used a somewhat restrictive assumption on the parameter β, which leaves only a small range of admissible values for β when the strong convexity constant is tiny. Due to the inertial term β(x_k − x_{k−1}), HBGD breaks the Fejér monotonicity that gradient descent obeys. Consequently, it is difficult to prove its non-ergodic convergence rates. To this end, [Sun et al., 2019] designed a novel Lyapunov analysis and derived better convergence results, such as the non-ergodic sublinear rate f(x_k) − min f = O(1/k) and a larger stepsize for linear convergence. [Combettes and Glaudin, 2017] developed and studied heavy-ball methods for fixed-point iterations of operators. In the nonconvex case, if the objective function is nonconvex but Lipschitz differentiable, [Zavriev and Kostyuk, 1993] proved that the sequence generated by heavy-ball gradient descent converges to a critical point, but without specifying related convergence rates.
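As a concrete illustration of iteration (2) (our own sketch, not from the paper), the following Python snippet runs HBGD on a simple strongly convex quadratic; the Hessian `H`, stepsize, and inertial parameter are illustrative choices satisfying 0 < γ < 2(1 − β)/L:

```python
import numpy as np

def hbgd(grad_f, x0, gamma, beta, iters):
    """Heavy-Ball Gradient Descent: x_{k+1} = x_k - gamma*grad_f(x_k) + beta*(x_k - x_{k-1})."""
    x_prev, x = x0.copy(), x0.copy()  # take x_1 = x_0, so the first inertial term vanishes
    for _ in range(iters):
        x, x_prev = x - gamma * grad_f(x) + beta * (x - x_prev), x
    return x

# Example: f(x) = 0.5 * x^T diag(1, 10) x, so L = 10 and the unique minimizer is 0.
H = np.array([1.0, 10.0])
grad_f = lambda x: H * x
x_star = hbgd(grad_f, np.array([5.0, -3.0]), gamma=0.12, beta=0.3, iters=500)
print(np.linalg.norm(x_star))  # close to 0
```

Here γ = 0.12 respects the bound 2(1 − β)/L = 0.14 from Lemma 2 below, while already exceeding the 1/L = 0.1 threshold known for plain gradient descent.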
With the semi-algebraic assumption on the objective function, [Ochs et al., 2014] proved that the sequence generated by HBGD converges to some critical point. Actually, [Ochs et al., 2014] also extended HBGD to a nonsmooth composite optimization problem. [Xu and Yin, 2013] developed and analyzed heavy-ball-type algorithms for tensor minimization problems. [Loizou and Richtárik, 2017a; Loizou and Richtárik, 2017b] introduced stochastic versions of heavy-ball algorithms and their relations with existing methods. On the other hand, [Alvarez, 2000] studied a second-order ODE whose implicit discretization yields the Heavy-Ball Proximal Point Algorithm (HBPPA), mathematically described as

x_{k+1} = Prox_{γf}(x_k + β(x_k − x_{k−1})), (3)

which admits the acceleration of the proximal point algorithm. In fact, setting β = 0 pushes the scheme back to the basic proximal point algorithm. The convergence of HBPPA was studied by [Alvarez, 2000] under several assumptions on the parameters γ and β when f is convex. Noticing that ∇f is a maximal monotone operator in the convex case, [Alvarez and Attouch, 2001] extended HBPPA to solve the inclusion problem. An inexact HBPPA was proposed and studied by [Moudafi and Elizabeth, 2003]. Existing works on HBPPA are all based on the convexity of f. In this paper, we present the convergence of nonconvex HBPPA, including gradient convergence under the smooth assumption and sequence convergence under the semi-algebraic assumption. The performance of HBGD and HBPPA in convex settings is outside the interest of this paper. What we care about is: can these two algorithms escape saddle points when f is nonconvex? This paper answers that question.

Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19)
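Scheme (3) can also be instantiated in a few lines (again our own toy example, not from the paper): for a quadratic f(x) = ½xᵀQx the proximal map has the closed form Prox_{γf}(z) = (I + γQ)⁻¹z, obtained by setting the gradient of γf(x) + ½‖x − z‖² to zero. The matrix `Q`, γ, and β below are illustrative:

```python
import numpy as np

def hbppa(prox_gf, x0, beta, iters):
    """Heavy-Ball Proximal Point: x_{k+1} = Prox_{gamma f}(x_k + beta*(x_k - x_{k-1}))."""
    x_prev, x = x0.copy(), x0.copy()
    for _ in range(iters):
        x, x_prev = prox_gf(x + beta * (x - x_prev)), x
    return x

# For f(x) = 0.5 x^T Q x the proximal map is linear: Prox_{gamma f}(z) = (I + gamma*Q)^{-1} z.
Q = np.array([[2.0, 0.5], [0.5, 1.0]])
gamma = 0.8
M = np.linalg.inv(np.eye(2) + gamma * Q)
x_star = hbppa(lambda z: M @ z, np.array([4.0, -2.0]), beta=0.4, iters=300)
print(np.linalg.norm(x_star))  # close to 0
```

The choice β = 0.4 < 1/2 matches the parameter range used in Lemma 3 below.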
1.2 First-order Algorithms Escape Saddle Points

In the nonconvex continuous optimization community, saddle points, which are usually unavoidable and far outnumber local minima, have long been regarded as a major obstacle. In many applications, such as tensor decomposition [Ge et al., 2015], matrix completion [Ge et al., 2016], and dictionary learning [Sun et al., 2017], all local optima are also globally optimal. In some networks, under several conditions, local optima are close to the global optima [Chaudhari et al., 2016]. Although these conditions are barely satisfied by many networks, the conclusion coincides with existing experimental observations. Thus, if all local minimizers are relatively good, we just need to push the iterates to escape the saddle points. How to escape saddle points has therefore become a hot topic for first-order optimization algorithms. [Pemantle, 1990] established the convergence of Robbins-Monro stochastic approximation to local minimizers by introducing sufficiently large unbiased noise. For tensor problems, [Ge et al., 2015] provided quantitative rates on the convergence of noisy gradient descent. For general smooth nonconvex functions, the analysis of noisy gradient descent was established by [Jin et al., 2017a]. In a later paper, [Jin et al., 2017b] considered noisy accelerated gradient descent and established the related convergence theory. Another technique for escaping saddle points is random initialization. It has been shown that, even without any random perturbations, a good initialization can make gradient descent converge to the global minimum for various nonconvex problems such as matrix factorization [Keshavan et al., 2010], phase retrieval [Cai et al., 2016], and dictionary learning [Arora et al., 2015].
[Lee et al., 2016; Lee et al., 2017] proved that plenty of first-order nonconvex optimization algorithms with random initializations converge to local minima with probability 1. From a mathematical perspective, the core technique consists of two parts: (1) the algorithm is reformulated as the iteration of a mapping; (2) the stable manifold theorem [Shub, 2013] is employed. In this paper, we establish the theoretical guarantees for heavy-ball algorithms to escape saddle points.

1.3 Difficulty in Analyzing Heavy-ball Iterations

The difficulty in studying the performance of heavy-ball methods with random initializations lies in how to reformulate the algorithm as the iteration of a mapping. Taking gradient descent as an example, one can use g := I − γ∇f with γ being the stepsize; gradient descent can then be represented as x_{k+1} = g(x_k). Constructing the mapping for gradient descent is obvious and easy. But for heavy-ball methods, the construction task is quite different due to the inertial term β(x_k − x_{k−1}): the point x_{k+1} is generated from both the current point x_k and the previous one x_{k−1}. Thus, it is impossible to find a mapping g that rewrites the heavy-ball iteration as x_{k+1} = g(x_k).

1.4 Contribution

Since the analysis of nonconvex HBPPA is still missing, we establish the related theory in this paper. With recently developed techniques in variational analysis, the sequence convergence of nonconvex HBPPA is also studied. But our main interest is still the analysis of HBGD and HBPPA with random initializations. We prove guarantees for both algorithms to escape saddle points under mild assumptions. Directly applying the existing method is infeasible for heavy-ball algorithms; thus, we propose a novel construction technique. It can be proved that heavy-ball gradient descent enjoys a larger stepsize than gradient descent for escaping saddle points with random starting points, which indicates an advantage of heavy-ball methods.
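To make the bundling construction concrete, the following sketch (our own illustration; the two-dimensional toy gradient, γ, and β are assumptions for the example) checks that iterating an augmented map ĝ(x_k, x_{k−1}) := (x_{k+1}, x_k) on ℝ^{2N} reproduces the heavy-ball recursion (2):

```python
import numpy as np

gamma, beta = 0.1, 0.5
grad_f = lambda x: np.array([2.0, -1.0]) * x  # toy gradient; second coordinate has negative curvature

def g_hat(w):
    """Augmented map on R^{2N} (here N = 2): (x_k, x_{k-1}) -> (x_{k+1}, x_k)."""
    x, x_prev = w[:2], w[2:]
    x_next = x - gamma * grad_f(x) + beta * (x - x_prev)
    return np.concatenate([x_next, x])

# Run the heavy-ball recursion directly ...
x_prev = x = np.array([1.0, 1.0])
for _ in range(10):
    x, x_prev = x - gamma * grad_f(x) + beta * (x - x_prev), x

# ... and via the augmented map; both produce the same iterate.
w = np.array([1.0, 1.0, 1.0, 1.0])
for _ in range(10):
    w = g_hat(w)
print(np.allclose(w[:2], x))  # True
```

The point of the construction is exactly this equivalence: results about fixed points of the single map ĝ transfer back to the two-point heavy-ball recursion.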
2 Preliminaries

In this part, we introduce several basic definitions and tool lemmas. For a closed function J (not necessarily convex), the proximal map is defined as

Prox_{γJ}(z) := argmin_x { γJ(x) + ½‖x − z‖² }. (4)

Under the condition that J is differentiable, if y = Prox_{γJ}(z), the K.K.T. condition yields γ∇J(y) + y − z = 0. Given an arbitrary point x ∈ ℝ^N, it holds that

γJ(y) + ½‖y − z‖² ≤ γJ(x) + ½‖x − z‖². (5)

For matrices A and B, A ∼ B denotes that there exist invertible matrices P and Q such that A = PBQ. We use I_N and 0_N to denote the N×N identity and zero matrices, respectively, and det(A) is the determinant of A. It is easy to see that if A ∼ B, then det(A) = 0 ⟺ det(B) = 0.

Definition 1 (Strict Saddle) We present the following two definitions.
1. A point x* is a critical point of f if ∇f(x*) = 0. A critical point x* is isolated if there is a neighborhood U around x* in which x* is the only critical point.
2. A point x* is a strict saddle point of f if x* is a critical point and λ_min(∇²f(x*)) < 0.
Let χ*_f denote the set of strict saddle points of f.

Let g be a mapping; its domain is defined as dom(g) := {x | g(x) ≠ ∅}, and g^k denotes the k-fold composition of g. Our interest is the sequence generated by

x_k = g(x_{k−1}) = g^k(x_0). (6)

Definition 2 (Global Stable Set) The global stable set of the strict saddles is the set of initial points from which the iteration of the mapping g converges to a strict saddle, i.e., W_g := {x_0 : lim_k g^k(x_0) ∈ χ*}. For an iterative algorithm M, W_M := {x_0 : lim_k x_k ∈ χ*}, where (x_1, x_2, …) is generated by M with starting point x_0.

Definition 3 (Unstable fixed point) The differential of the mapping g, denoted Dg(x), is a linear operator from T(x) to T(g(x)), where T(x) is the tangent space of X at the point x. Given a curve γ in X with γ(0) = x and (dγ/dt)(0) := v ∈ T(x), the linear operator is defined as Dg(x)v = (d(g∘γ)/dt)(0) ∈ T(g(x)).
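As a quick numerical sanity check of the proximal map just introduced (our own example, not from the paper): for J(x) = ½‖x‖², definition (4) has the closed form Prox_{γJ}(z) = z/(1 + γ), and both the K.K.T. condition γ∇J(y) + y − z = 0 and inequality (5) can be verified directly:

```python
import numpy as np

gamma = 0.7
z = np.array([3.0, -1.5])

# For J(x) = 0.5*||x||^2, argmin_x { gamma*J(x) + 0.5*||x - z||^2 } = z / (1 + gamma).
y = z / (1.0 + gamma)

grad_J = lambda x: x
residual = gamma * grad_J(y) + y - z          # K.K.T. condition: should vanish
print(np.linalg.norm(residual))               # ~0

# The prox inequality (5), tested at the particular point x = z:
obj = lambda x: gamma * 0.5 * np.dot(x, x) + 0.5 * np.dot(x - z, x - z)
print(obj(y) <= obj(z) + 1e-12)               # True
```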
Let

A*_g := {x : g(x) = x, ∃λ(Dg(x)) such that |λ(Dg(x))| > 1}

be the set of fixed points where the differential has at least one eigenvalue with magnitude greater than one. These are the unstable fixed points. With the previous definitions, the authors of [Lee et al., 2017] presented the following result. The core technique is the stable manifold theorem given in [Shub, 2013].

Lemma 1 Let g be a C¹ mapping with det(Dg(x)) ≠ 0 for all x ∈ dom(g). Assume the sequence (x_k)_{k≥0} is generated by scheme (6). Then the set of initial points that converge to an unstable fixed point has measure zero, i.e., µ({x_0 : lim_k x_k ∈ A*_g}) = 0. Further, if χ* ⊆ A*_g, then µ(W_g) = 0.

3 Main Results

In this section, we present the theoretical guarantees for the heavy-ball algorithms presented above to escape saddle points. We need a smoothness assumption on the objective function.

Assumption 1 The objective function f is twice differentiable and ‖∇²f(x)‖₂ ≤ L, where ‖·‖₂ denotes the spectral norm of a matrix. Moreover, the objective function satisfies inf_{x∈ℝ^N} f(x) > −∞.

The smoothness assumption is used to derive the differential of the designed mapping, and the lower boundedness assumption is used to bound the descent quantities from below.

Lemma 2 Let (x_k)_{k≥0} be generated by HBGD (2) with 0 < β < 1. If the stepsize is selected as 0 < γ < 2(1 − β)/L, we have µ{(x_0, x_1) : lim_k x_k ∈ χ*_f} = 0 and µ{W_HBGD} = 0.

It is easy to see that if β < 1/2, the upper bound on the stepsize is larger than 1/L for HBGD with random initializations, and the upper bound is close to 2/L provided β is sufficiently small. [Lee et al., 2016] proved that the stepsize for gradient descent with random initializations is required to be smaller than 1/L. Lemma 2 means a larger stepsize can be used under the heavy-ball scheme, which demonstrates an advantage of the heavy-ball method. The difficulty of the analysis for HBGD has been presented in Sec. 1: it is impossible to find a mapping such that x_{k+1} = g(x_k) directly.
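To illustrate the stepsize range of Lemma 2 (a toy experiment of ours, not from the paper), take the strict saddle f(x) = ½(x₁² − x₂²), so L = 1 and the origin is a strict saddle. With β = 0.25 the bound 2(1 − β)/L = 1.5 permits γ = 1.2 > 1/L, and HBGD started slightly off the saddle still contracts along the positive-curvature direction while moving away along the negative-curvature one:

```python
import numpy as np

# f(x) = 0.5*(x1^2 - x2^2): strict saddle at 0, Hessian diag(1, -1), L = 1.
grad_f = lambda x: np.array([x[0], -x[1]])

gamma, beta = 1.2, 0.25             # gamma exceeds 1/L but stays below 2*(1-beta)/L = 1.5
x_prev = x = np.array([0.5, 1e-3])  # start slightly off the saddle, as a random start would be
for _ in range(50):
    x, x_prev = x - gamma * grad_f(x) + beta * (x - x_prev), x

# Contracts along the +curvature direction, escapes along the -curvature direction.
print(abs(x[0]) < 1e-3, abs(x[1]) > 1e3)  # True True
```

Since this f is unbounded below, "escaping" here simply means the iterates leave every neighborhood of the saddle; on a function that is bounded below they would instead head toward a local minimizer.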
Noticing that the kth iteration involves both x_k and x_{k−1}, we turn to bundling these two points together, i.e., (x_k, x_{k−1}). The task then becomes finding a mapping ĝ such that (x_{k+1}, x_k) = ĝ(x_k, x_{k−1}); the mapping ĝ is from ℝ^{2N} to ℝ^{2N}. After building the relation between the unstable fixed points of ĝ and the strict saddle points, the theorem can be proved. Different from the existing result in [Lemma 1, [Sun et al., 2019]], we require β ≠ 0 here; this guarantees that the differential of the constructed mapping in the proofs is invertible. Note also that this differential is non-symmetric. Lemma 2 actually means Prob((x_0, x_1) ∈ W_HBGD) = 0, that is, HBGD always escapes saddle points provided the stepsize is set properly. Lemma 2 only says that (x_k)_{k≥0} will not fall into a saddle point; whether the sequence converges or not is beyond the scope of this lemma. In the nonconvex optimization community, this question is answered by posing the semi-algebraicity assumption on the objective function [Łojasiewicz, 1993; Kurdyka, 1998] or the isolatedness assumption on the critical points [Lange, 2013]. Thus, we can derive the following result.

Theorem 1 Assume that the conditions of Lemma 2 hold and f is coercive, i.e., f(x) → +∞ as ‖x‖ → +∞. Suppose the starting points x_0 and x_1 are chosen at random and one of the following conditions is satisfied:
1. f is semi-algebraic;
2. all critical points of f are isolated.
Then, for HBGD, lim_k x_k exists and the limit x* is a local minimizer of f with probability 1.

Now, we study HBPPA. First, we provide its convergence in the nonconvex setting.

Lemma 3 Let (x_k)_{k≥0} be generated by scheme (3) with 0 < β < 1/2 and γ > 0. Then, we have lim_k ∇f(x_k) = 0. Furthermore, if the function f is semi-algebraic and coercive, (x_k)_{k≥0} converges to some critical point x*.

To obtain the sequence convergence of HBPPA, besides semi-algebraicity, coercivity is also needed; it is used to guarantee the boundedness of the generated sequence.
In Lemma 3, f is assumed to be differentiable. Actually, the smoothness setting can be removed; alternatively, one needs the notion of subdifferentials of closed functions [Rockafellar and Wets, 2009], which is frequently used in variational analysis. In the smooth case, the subdifferential is just the gradient. With the same parameters, the sequence generated by HBPPA still provably converges to a critical point of f. We are now ready to present the guarantee for HBPPA to escape saddle points.

Lemma 4 Let (x_k)_{k≥0} be generated by scheme (3) with 0 < β < 1/2. If the stepsize is selected as 0 < γ < 1/L, we have µ{(x_0, x_1) : lim_k x_k ∈ χ*_f} = 0 and µ{W_HBPPA} = 0.

For HBPPA, the stepsize requirement coincides with the existing result [Lee et al., 2017]. The main idea of the proof of Lemma 4 is similar to that of Lemma 2, but more complicated in its details.

Theorem 2 Assume that the conditions of Lemma 4 hold and f is coercive. Suppose the starting points x_0 and x_1 are chosen at random and one of the following conditions is satisfied:
1. f is semi-algebraic;
2. all critical points of f are isolated.
Then, for HBPPA, lim_k x_k exists and the limit x* is a local minimizer of f with probability 1.

4 Proofs

This section collects the proofs for this paper.

4.1 Proof of Lemma 2

To guarantee the convergence of HBGD, from [Lemma 1, [Sun et al., 2019]], we need 0 ≤ β < 1 and 0 < γ < 2(1 − β)/L. Let y, z ∈ ℝ^N and w := (y, z), and define the map

F(w) = F(y, z) := ( y − γ∇f(y) + β(y − z), y ).

It is easy to check that if F(w) = w, then ∇f(y) = 0 and y = z. We further denote w_k := (x_k, x_{k−1}) ∈ ℝ^{2N}. It can easily be verified that the heavy-ball method can be reformulated as

w_{k+1} = F(w_k). (7)

To use Lemma 1, we turn to the proofs of two facts:
1. det(DF(w)) ≠ 0 for all w ∈ ℝ^{2N};
2. (χ*_f × χ*_f) ∩ {(y, z) ∈ ℝ^{2N} | y = z} ⊆ A*_F.
Proof of fact 1: Direct calculation gives

DF(w) = [[(1 + β)I_N − γ∇²f(y), −βI_N], [I_N, 0_N]].

We use the shorthand notation A(y) := (1 + β)I_N − γ∇²f(y) and obtain the relation

DF(w) = [[A(y), −βI_N], [I_N, 0_N]] ∼ [[0_N, −βI_N], [I_N, 0_N]].

Therefore, det(DF(w)) = ±β^N ≠ 0 for any w ∈ ℝ^{2N}.

Proof of fact 2: For any strict saddle x*, by the symmetry of ∇²f(x*), we have λ_max(A(x*)) > 1 + β. For λ ∈ ℝ with λ ≠ 0 (0 is not an eigenvalue, since DF(w) was just shown to be nonsingular), denoting w* = (x*, x*), we consider

λI_{2N} − DF(w*) = [[λI_N − A(x*), βI_N], [−I_N, λI_N]].

After simplifications, we get

[[λI_N − A(x*), βI_N], [−I_N, λI_N]] ∼ [[(λ + β/λ)I_N − A(x*), 0_N], [−I_N, λI_N]].

This indicates det(λI_{2N} − DF(w*)) = 0 ⟺ det((λ + β/λ)I_N − A(x*)) = 0. Denote the function s(t) := t + β/t for t > 0; the eigenvalues of DF(w*) are the roots of equations of this form. We exploit the two roots of λ + β/λ = λ_max(A(x*)). It is easy to check that λ_max(A(x*))² − 4β > (1 − β)² ≥ 0, which means this equation has two real roots, denoted 0 < λ̲ < λ̄. Noting λ_max(A(x*)) > 1 + β, we have s(λ̄) > s(1). Since s(t) is increasing on [√β, +∞) and 0 ≤ β < 1, s(t) is increasing on [1, +∞). Thus, there exists some real number λ₀ > 1 such that λ̄ ≥ λ₀. Hence λ_max(DF(w*)) ≥ λ̄ ≥ λ₀ > 1, i.e., w* ∈ A*_F. That also gives

{(x_0, x_1) : lim_k x_k ∈ χ*_f} ⊆ {w_0 : lim_k w_k ∈ A*_F}.

Using Lemma 1, µ{w_0 : lim_k w_k ∈ A*_F} = 0, and then

µ{(x_0, x_1) : lim_k x_k ∈ χ*_f} ≤ µ{w_0 : lim_k w_k ∈ A*_F} = 0.

On the other hand, by the nonnegativity of the measure, µ{(x_0, x_1) : lim_k x_k ∈ χ*_f} = 0.

4.2 Proof of Theorem 1

The first case has been proved in [Ochs et al., 2014], where (x_k)_{k≥0} is shown to converge to some critical point x*. For the second case, [Proposition 12.4.1, [Lange, 2013]] and equation (10) imply that the set of cluster points of (x_k)_{k≥0} is finite: otherwise, this set would be connected, which contradicts the fact that all critical points are isolated. Then, using [Proposition 12.4.1, [Lange, 2013]], the cluster set has only one element x*, and hence (x_k)_{k≥0} → x*. Therefore, in both cases, it can be proved that (x_k)_{k≥0} → x* ∈ ℝ^N.
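The two facts can also be checked numerically on a toy strict saddle (our own sanity check; the Hessian ∇²f(x*) = diag(1, −1) and the values of β, γ are illustrative assumptions): |det(DF)| equals β^N, and DF(w*) has a real eigenvalue larger than 1.

```python
import numpy as np

beta, gamma, N = 0.5, 0.4, 2
hess = np.diag([1.0, -1.0])                 # Hessian of f at a strict saddle x*
A = (1 + beta) * np.eye(N) - gamma * hess   # A(x*) = (1+beta)*I - gamma*Hess

# DF(w*) = [[A, -beta*I], [I, 0]]
DF = np.block([[A, -beta * np.eye(N)], [np.eye(N), np.zeros((N, N))]])

print(np.isclose(abs(np.linalg.det(DF)), beta**N))  # fact 1: |det(DF)| = beta^N != 0
eigs = np.linalg.eigvals(DF)
print(max(abs(e.real) for e in eigs if abs(e.imag) < 1e-12) > 1)  # fact 2: a real eigenvalue exceeds 1
```

Along the negative-curvature direction the pair equation λ + β/λ = λ_max(A(x*)) = 1.9 has the real roots ≈ 0.32 and ≈ 1.58, matching the λ̄ > 1 argument above.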
With Lemma 2, x* is a local minimizer with probability 1 when the starting points x_0, x_1 are random. The theorem is then proved.

4.3 Proof of Lemma 3

This subsection contains two parts: Part 1 proves the gradient convergence, while Part 2 contains the sequence convergence proof.

Part 1. In (5), substituting y ← x_{k+1}, z ← x_k + β(x_k − x_{k−1}), x ← x_k, J ← f, we get

γf(x_{k+1}) + ½‖x_{k+1} − x_k − β(x_k − x_{k−1})‖² ≤ γf(x_k) + ½‖β(x_k − x_{k−1})‖².

After simplifications, we then get

f(x_{k+1}) + (1/(2γ))‖x_{k+1} − x_k‖² ≤ f(x_k) + (β/γ)⟨x_{k+1} − x_k, x_k − x_{k−1}⟩
≤ f(x_k) + (β/(2γ))‖x_{k+1} − x_k‖² + (β/(2γ))‖x_k − x_{k−1}‖². (8)

Denote a new function

H(w) = H(x, y) := f(x) + (β/(2γ))‖x − y‖²

and the sequence w_k := (x_k, x_{k−1}). Then inequality (8) yields the descent property

H(w_{k+1}) + ((1 − 2β)/(2γ))‖x_{k+1} − x_k‖² ≤ H(w_k), k ≥ 1. (9)

With the fact inf_{x∈ℝ^N} f(x) > −∞, we have inf_k H(w_k) > −∞. The setting 0 < β < 1/2 then yields

lim_k ‖x_{k+1} − x_k‖ = 0. (10)

On the other hand, from the property of the proximal map,

γ∇f(x_{k+1}) + x_{k+1} − x_k − β(x_k − x_{k−1}) = 0, (11)

which gives

lim_k ‖∇f(x_k)‖ ≤ (1/γ)( lim_k ‖x_{k+1} − x_k‖ + β · lim_k ‖x_k − x_{k−1}‖ ) = 0. (12)

Part 2. We first prove that H(x, y) is coercive provided f is coercive. If this claim fails, there exists some C > 0 such that f(x) ≤ H(x, y) ≤ C while ‖(x, y)‖ → +∞. By the coercivity of f, there exists some C₁ > 0 such that ‖x‖ ≤ C₁. On the other hand, ‖x − y‖ is bounded due to the lower boundedness of f; thus ‖y‖ is bounded, which contradicts ‖(x, y)‖ → +∞. From (9), we then see that (w_k)_{k≥1} is bounded, and hence (x_k)_{k≥0} is bounded. The descent property of (H(w_k))_{k≥1} and the continuity of H directly give

lim_k H(w_k) = H(w*). (13)

This means the sequence has a cluster point x*, and (12) means x* is a critical point of f. It is easy to see that w* := (x*, x*) is a critical point of H. Without loss of generality, we assume w_k ≠ w* for all k ≥ 1. Noting that polynomial functions are semi-algebraic and that the sum of semi-algebraic functions is still semi-algebraic, H(w) is then semi-algebraic.
Denote the set of all cluster points of (w_k)_{k≥0} as Ω. From [Lemma 3.6, [Bolte et al., 2014]], there exist η, ε ∈ (0, +∞] and a continuous concave function ϕ : [0, η) → ℝ₊ such that
1. ϕ(0) = 0; ϕ is C¹ on (0, η); and ϕ′(s) > 0 for all s ∈ (0, η);
2. for all w ∈ {w | H(w*) < H(w) < H(w*) + η} ∩ {w | dist(w, Ω) < ε}, it holds that

ϕ′(H(w) − H(w*)) · ‖∇H(w)‖ ≥ 1. (14)

For the η, ε given above, when k is large enough, w_k falls into the set {w | H(w*) < H(w) < H(w*) + η} ∩ {w | dist(w, Ω) < ε}. The concavity of ϕ gives

ϕ(H(w_k) − H(w*)) − ϕ(H(w_{k+1}) − H(w*)) ≥ ϕ′(H(w_k) − H(w*)) · (H(w_k) − H(w_{k+1}))
≥(a) ν · ϕ′(H(w_k) − H(w*)) · ‖x_{k+1} − x_k‖²
≥(b) ν‖x_{k+1} − x_k‖² / ‖∇H(w_k)‖, (15)

where ν = (1 − 2β)/(2γ), a) uses (9), and b) comes from (14). Using (11) (with k replaced by k − 1), the gradient of H satisfies

‖∇H(w_k)‖ ≤ ‖∇f(x_k)‖ + (2β/γ)‖x_k − x_{k−1}‖ ≤(c) (2/γ)(‖x_k − x_{k−1}‖ + ‖x_{k−1} − x_{k−2}‖), (16)

where c) depends on (11) and we used the fact 0 < β < 1/2. Combining (15) and (16),

‖x_{k+1} − x_k‖ ≤ √( (2/(νγ)) (‖x_k − x_{k−1}‖ + ‖x_{k−1} − x_{k−2}‖) [ϕ(H(w_k) − H(w*)) − ϕ(H(w_{k+1}) − H(w*))] )
≤(d) (4/(νγ)) [ϕ(H(w_k) − H(w*)) − ϕ(H(w_{k+1}) − H(w*))] + (1/8)(‖x_k − x_{k−1}‖ + ‖x_{k−1} − x_{k−2}‖), (17)

where d) employs the inequality √(xy) ≤ (x + y)/2 with x = (8/(νγ))[ϕ(H(w_k) − H(w*)) − ϕ(H(w_{k+1}) − H(w*))] and y = (1/4)(‖x_k − x_{k−1}‖ + ‖x_{k−1} − x_{k−2}‖). Summing both sides of (17) from a sufficiently large k to K, telescoping, and rearranging, then letting K → +∞ and using (10), (13), ϕ(0) = 0, and the continuity of ϕ, we obtain

Σ_{i=k}^{+∞} ‖x_{i+1} − x_i‖ ≤ (16/(3νγ)) ϕ(H(w_k) − H(w*)) + (1/3)‖x_k − x_{k−1}‖ + (1/6)‖x_{k−1} − x_{k−2}‖ < +∞. (18)

We are then led to Σ_{i=0}^{+∞} ‖x_{i+1} − x_i‖ < +∞, which means (x_k)_{k≥0} is a Cauchy sequence and hence convergent. Since x* is a cluster point, (x_k)_{k≥0} → x*.

4.4 Proof of Lemma 4

In this proof, we denote the mapping

F(w) = F(y, z) := ( Prox_{γf}(y + β(y − z)), y ).

To calculate the differential of F, we denote g(y, z) := Prox_{γf}(y + β(y − z)). It is easy to see that

DF(w) = [[∂g/∂y, ∂g/∂z], [I_N, 0_N]]. (19)

The definition of the proximal map then gives

0 = γ∇f[g(y, z)] + g(y, z) − (y + β(y − z)). (20)
By implicit differentiation with respect to y, we get

0 = γ∇²f[g(y, z)] · ∂g/∂y + ∂g/∂y − (1 + β)I_N.

Since 0 < γ < 1/L, the matrix γ∇²f[g(y, z)] + I_N is invertible; we use the shorthand notation A(y, z) := (γ∇²f[g(y, z)] + I_N)⁻¹. For z, the same procedure gives

∂g/∂y = (1 + β)A(y, z), ∂g/∂z = −βA(y, z). (21)

With (19) and (21), we are led to

DF(w) = [[(1 + β)A(y, z), −βA(y, z)], [I_N, 0_N]] ∼ [[0_N, −βA(y, z)], [I_N, 0_N]].

This means det(DF(w)) ≠ 0 for any w ∈ ℝ^{2N}. On the other hand, if x* is a strict saddle, then g(x*, x*) = x* and λ_max(A(x*, x*)) > 1. For λ ∈ ℝ, denoting w* = (x*, x*), we consider

λI_{2N} − DF(w*) = [[λI_N − (1 + β)A(x*, x*), βA(x*, x*)], [−I_N, λI_N]].

After simplifications, we arrive at det(λ²I_N + (β − (1 + β)λ)A(x*, x*)) = 0, and we consider the equation

H(λ) := λ² + (β − (1 + β)λ) λ_max(A(x*, x*)) = 0.

Direct calculation gives

(1 + β)² λ_max(A(x*, x*))² − 4β λ_max(A(x*, x*)) = 4β λ_max(A(x*, x*)) (λ_max(A(x*, x*)) − 1) + (1 − β)² λ_max(A(x*, x*))² ≥ 0.

Thus, the equation enjoys two real roots, denoted λ̲ ≤ λ̄, and it is easy to see that λ̄ > 0. Noticing that H(1) = 1 − λ_max(A(x*, x*)) < 0 and lim_{λ→+∞} H(λ) = +∞, we have λ̄ > 1. That means λ_max(DF(w*)) ≥ λ̄ > 1; thus, w* ∈ A*_F. Consequently, the proof is completed by Lemma 1.

4.5 Proof of Theorem 2

This proof is similar to that of Theorem 1 and is not reproduced.

5 Conclusion

In this paper, we proved that HBGD and HBPPA always escape saddle points with random initializations. This paper also established the convergence of nonconvex HBPPA. The core step in the proofs is bundling the current and the previous point into one point in an enlarged space; the heavy-ball algorithms can then be represented as iterations of a mapping. An interesting finding is that HBGD can enjoy a larger stepsize than gradient descent for escaping saddle points.
Acknowledgments

This work is sponsored in part by the National Key R&D Program of China (2018YFB0204300), the Major State Research Development Program (2016YFB0201305), the Natural Science Foundation of Hunan Province (2018JJ3616), the National Natural Science Foundation for the Youth of China (61602166), the Natural Science Foundation of Hunan (806175290082), the Natural Science Foundation of NUDT (ZK18-03-01), and the National Natural Science Foundation of China (11401580).

References

[Alvarez and Attouch, 2001] Felipe Alvarez and Hedy Attouch. An inertial proximal method for maximal monotone operators via discretization of a nonlinear oscillator with damping. Set-Valued Analysis, 9(1-2):3–11, 2001.
[Alvarez, 2000] Felipe Alvarez. On the minimizing property of a second order dissipative system in Hilbert spaces. SIAM Journal on Control and Optimization, 38(4):1102–1119, 2000.
[Arora et al., 2015] Sanjeev Arora, Rong Ge, Tengyu Ma, and Ankur Moitra. Simple, efficient, and neural algorithms for sparse coding. 2015.
[Bolte et al., 2014] Jérôme Bolte, Shoham Sabach, and Marc Teboulle. Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Mathematical Programming, 146(1-2):459–494, 2014.
[Cai et al., 2016] T. Tony Cai, Xiaodong Li, and Zongming Ma. Optimal rates of convergence for noisy sparse phase retrieval via thresholded Wirtinger flow. The Annals of Statistics, 44(5):2221–2251, 2016.
[Chaudhari et al., 2016] Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-SGD: Biasing gradient descent into wide valleys. arXiv preprint arXiv:1611.01838, 2016.
[Combettes and Glaudin, 2017] Patrick L. Combettes and Lilian E. Glaudin.
Quasi-nonexpansive iterations on the affine hull of orbits: From Mann's mean value algorithm to inertial methods. SIAM Journal on Optimization, 27(4), 2017.
[Ge et al., 2015] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points: online stochastic gradient for tensor decomposition. In Conference on Learning Theory, pages 797–842, 2015.
[Ge et al., 2016] Rong Ge, Jason D. Lee, and Tengyu Ma. Matrix completion has no spurious local minimum. In Advances in Neural Information Processing Systems, pages 2973–2981, 2016.
[Ghadimi et al., 2015] Euhanna Ghadimi, Hamid Reza Feyzmahdavian, and Mikael Johansson. Global convergence of the heavy-ball method for convex optimization. In European Control Conference (ECC), pages 310–315. IEEE, 2015.
[Jin et al., 2017a] Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, and Michael I. Jordan. How to escape saddle points efficiently. In Proceedings of the 34th International Conference on Machine Learning, pages 1724–1732. JMLR.org, 2017.
[Jin et al., 2017b] Chi Jin, Praneeth Netrapalli, and Michael I. Jordan. Accelerated gradient descent escapes saddle points faster than gradient descent. arXiv preprint arXiv:1711.10456, 2017.
[Keshavan et al., 2010] Raghunandan H. Keshavan, Andrea Montanari, and Sewoong Oh. Matrix completion from a few entries. IEEE Transactions on Information Theory, 56(6):2980–2998, 2010.
[Kurdyka, 1998] Krzysztof Kurdyka. On gradients of functions definable in o-minimal structures. In Annales de l'Institut Fourier, volume 48, pages 769–784, 1998.
[Lange, 2013] Kenneth Lange. Elementary optimization. In Optimization, pages 1–21. Springer, 2013.
[Lee et al., 2016] Jason D. Lee, Max Simchowitz, Michael I. Jordan, and Benjamin Recht. Gradient descent only converges to minimizers. In Conference on Learning Theory, pages 1246–1257, 2016.
[Lee et al., 2017] Jason D. Lee, Ioannis Panageas, Georgios Piliouras, Max Simchowitz, Michael I. Jordan, and Benjamin Recht. First-order methods almost always avoid saddle points. To appear in Mathematical Programming, 2017.
[Loizou and Richtárik, 2017a] Nicolas Loizou and Peter Richtárik. Linearly convergent stochastic heavy ball method for minimizing generalization error. arXiv preprint arXiv:1710.10737, 2017.
[Loizou and Richtárik, 2017b] Nicolas Loizou and Peter Richtárik. Momentum and stochastic momentum for stochastic gradient, Newton, proximal point and subspace descent methods. arXiv preprint arXiv:1712.09677, 2017.
[Łojasiewicz, 1993] Stanislas Łojasiewicz. Sur la géométrie semi- et sous-analytique. Annales de l'Institut Fourier, 43(5):1575–1595, 1993.
[Moudafi and Elizabeth, 2003] Abdellatif Moudafi and E. Elizabeth. Approximate inertial proximal methods using the enlargement of maximal monotone operators. International Journal of Pure and Applied Mathematics, 5(3):283–299, 2003.
[Ochs et al., 2014] Peter Ochs, Yunjin Chen, Thomas Brox, and Thomas Pock. iPiano: Inertial proximal algorithm for nonconvex optimization. SIAM Journal on Imaging Sciences, 7(2):1388–1419, 2014.
[Pemantle, 1990] Robin Pemantle. Nonconvergence to unstable points in urn models and stochastic approximations. The Annals of Probability, 18(2):698–712, 1990.
[Rockafellar and Wets, 2009] R. Tyrrell Rockafellar and Roger J-B Wets. Variational Analysis, volume 317. Springer Science & Business Media, 2009.
[Shub, 2013] Michael Shub. Global Stability of Dynamical Systems. Springer Science & Business Media, 2013.
[Sun et al., 2017] Ju Sun, Qing Qu, and John Wright. Complete dictionary recovery over the sphere I: Overview and the geometric picture. IEEE Transactions on Information Theory, 63(2):853–884, 2017.
[Sun et al., 2019] Tao Sun, Penghang Yin, Dongsheng Li, Chun Huang, Lei Guan, and Hao Jiang. Non-ergodic convergence analysis of heavy-ball algorithms. In AAAI, 2019.
[Xu and Yin, 2013] Yangyang Xu and Wotao Yin. A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion. SIAM Journal on Imaging Sciences, 6(3):1758–1789, 2013.
[Zavriev and Kostyuk, 1993] S. K. Zavriev and F. V. Kostyuk. Heavy-ball method in nonconvex optimization problems. Computational Mathematics and Modeling, 4(4):336–341, 1993.