# The Physical Systems Behind Optimization Algorithms

Lin F. Yang (Princeton University, lin.yang@princeton.edu) · Raman Arora (Johns Hopkins University, arora@cs.jhu.edu) · Vladimir Braverman (Johns Hopkins University, vova@cs.jhu.edu) · Tuo Zhao (Georgia Institute of Technology, tourzhao@gatech.edu)

*Work was done while the author was at Johns Hopkins University. This work is partially supported by the National Science Foundation under grant numbers 1546482, 1447639, 1650041 and 1652257, the ONR Award N00014-18-1-2364, the Israel Science Foundation grant #897/13, a Minerva Foundation grant, and DARPA award W911NF1820267. Corresponding author.*

*32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.*

**Abstract.** We use differential-equation-based approaches to provide physical insights into the dynamics of popular optimization algorithms in machine learning. In particular, we study gradient descent, proximal gradient descent, coordinate gradient descent, proximal coordinate gradient descent, and Newton's method, as well as their Nesterov-accelerated variants, in a unified framework motivated by a natural connection between optimization algorithms and physical systems. Our analysis applies to more general algorithms and to optimization problems beyond convexity and strong convexity, e.g., under the Polyak-Łojasiewicz and error bound conditions (possibly nonconvex).

## 1 Introduction

Many machine learning problems can be cast as an optimization problem of the form
$$x^* = \operatorname*{argmin}_{x \in \mathcal{X}} f(x), \tag{1.1}$$
where $\mathcal{X} \subseteq \mathbb{R}^d$ and $f : \mathcal{X} \to \mathbb{R}$ is a continuously differentiable function. For simplicity, we assume that $f$ is convex or approximately convex (more on this later). Perhaps the earliest algorithm for solving (1.1) is the vanilla gradient descent (VGD) algorithm, which dates back to Euler and Lagrange. VGD is simple, intuitive, and easy to implement in practice. For large-scale problems, it is usually more scalable than more sophisticated algorithms (e.g., Newton's method). Existing state-of-the-art analysis shows that VGD achieves an $O(1/k)$ convergence rate for smooth convex functions and a linear convergence rate for strongly convex functions, where $k$ is the number of iterations [11].

Recently, a class of Nesterov's accelerated gradient (NAG) algorithms has gained popularity in the statistical signal processing and machine learning communities. These algorithms combine the vanilla gradient descent step with an additional momentum term at each iteration. Such a modification, though simple, has a profound impact: NAG attains faster convergence than VGD. Specifically, NAG achieves $O(1/k^2)$ convergence for smooth convex functions, and linear convergence with a better constant term for strongly convex functions [11].

Another closely related class of algorithms is randomized coordinate gradient descent (RCGD). These algorithms take a gradient-descent-type step in each iteration, but only with respect to a single coordinate. RCGD has convergence rates similar to those of VGD, but a smaller overall computational complexity, since its per-iteration cost is much smaller than that of VGD [10, 7]. More recently, [5, 2] applied Nesterov's acceleration to RCGD, proposed accelerated randomized coordinate gradient (ARCG) algorithms, and established similar accelerated convergence rates for ARCG.
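For concreteness, here is a minimal sketch of the RCGD update just described (Python; the quadratic objective, step size, and iteration count are illustrative choices of ours, not from the paper):

```python
import numpy as np

def rcgd(grad, x0, eta, num_iters, rng=np.random.default_rng(0)):
    """Randomized coordinate gradient descent: each iteration updates
    one uniformly sampled coordinate with step size eta."""
    x = x0.copy()
    for _ in range(num_iters):
        i = rng.integers(x.size)      # pick a coordinate uniformly at random
        x[i] -= eta * grad(x)[i]      # gradient step on coordinate i only
    return x

# Toy usage: a smooth convex quadratic f(x) = 0.5 * x' A x.
A = np.array([[3.0, 0.5], [0.5, 1.0]])
grad = lambda x: A @ x
print(rcgd(grad, np.array([1.0, -2.0]), eta=0.3, num_iters=500))
```

In practice one would compute only the $i$-th partial derivative rather than the full gradient; the full-gradient call here just keeps the sketch short.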
Another line of research focuses on relaxing convexity and strong convexity to alternative regularity conditions, including the restricted secant inequality, error bound, Polyak-Łojasiewicz, and quadratic growth conditions. These conditions have been shown to hold for many optimization problems in machine learning, and faster convergence rates have been established under them (e.g., [8, 6, 9, 20, 3, 4]).

Although various theoretical results have been established, the proofs of convergence under these regularity conditions rely heavily on algebraic tricks that are sometimes arguably mysterious. To this end, a popular recent trend in the analysis of optimization algorithms is to study gradient descent as a discretization of gradient flow; these approaches often provide a clear interpretation of the continuous approximation of the algorithmic dynamics [16, 17]. In [16], the authors propose a framework for studying discrete algorithmic systems in the limit of infinitesimal time step. They show that Nesterov's accelerated gradient (NAG) algorithm can be described by an ordinary differential equation (ODE) as the time step tends to zero. In [17], the authors study a more general family of ODEs that essentially correspond to accelerated gradient algorithms. All these analyses, however, lack a natural interpretation in terms of the physical systems behind the optimization algorithms. Therefore, they do not clearly explain why momentum leads to acceleration. Moreover, these analyses only consider general convex conditions and gradient-descent-type algorithms, and are not applicable to either the aforementioned relaxed conditions or coordinate-gradient-type algorithms (due to the randomized coordinate selection).

**Our Contribution (I):** We provide novel physics-based insights into differential equation approaches for optimization. In particular, we connect optimization algorithms to natural physical systems through differential equations, which allows us to establish a unified theory for understanding these algorithms. Specifically, we consider the VGD, NAG, RCGD, and ARCG algorithms. All of them are associated with damped oscillator systems, with different particle masses and damping coefficients. For example, VGD corresponds to a massless particle system, while NAG corresponds to a massive particle system. A damped oscillator system naturally dissipates its mechanical energy, and the decay rate of the mechanical energy is connected to the convergence rate of the algorithm. Our results match the convergence rates of all algorithms considered here to those known in the existing literature. We show that for a massless system, the convergence rate depends only on the gradient (force field) and the smoothness of the function, whereas a massive particle system has an energy decay rate governed by the ratio between the mass and the damping coefficient. We further show that optimal algorithms such as NAG correspond to an oscillator system near critical damping. This phenomenon is well known in the physics literature: a critically damped system undergoes the fastest energy dissipation. We believe that this view can potentially help design novel optimization algorithms in a more intuitive manner. As pointed out by the anonymous reviewers, some of the intuitions we provide also appear in [13]; however, we give a more detailed analysis in this paper.
**Our Contribution (II):** We provide a new analysis for more general optimization problems beyond general convexity and strong convexity, as well as for more general algorithms. Specifically, we provide several concrete examples: (1) VGD achieves linear convergence under the Polyak-Łojasiewicz (PŁ) condition (possibly nonconvex), matching the state-of-the-art result in [4]; (2) NAG achieves accelerated linear convergence (with a better constant term) under both the general convexity and quadratic growth conditions, matching the state-of-the-art result in [19]; (3) coordinate-gradient-type algorithms share the same ODE approximation as gradient-type algorithms, and our analysis involves a more refined infinitesimal analysis; (4) Newton's method achieves linear convergence under the strong convexity and self-concordance conditions. See Table 1 for a summary. Due to space limitations, we present the extension to nonsmooth composite optimization problems in the appendix.

Table 1: Our contribution compared with [16, 17]. Each cell reads [16]/[17]/Ours, where "R" indicates that the corresponding result is available and "--" that it is not.

|                    | VGD      | NAG     | RCGD     | ARCG     | Newton   |
|--------------------|----------|---------|----------|----------|----------|
| General convex     | --/--/R  | R/R/R   | --/--/R  | --/--/R  | --/R/--  |
| Strongly convex    | --/--/R  | --/--/R | --/--/R  | --/--/R  | --/--/R  |
| Proximal variants  | --/--/R  | R/--/R  | --/--/R  | --/--/R  | --/--/R  |
| PŁ condition       | --/--/R  | --/--/R | --/--/R  | --/--/R  | --/--/-- |
| Physical systems   | --/--/R  | --/--/R | --/--/R  | --/--/R  | --/--/R  |

Recently, an independent work considered a framework similar to ours for analyzing first-order optimization algorithms [18]. While their focus is on bridging the gap between discrete algorithmic analysis and continuous approximation, ours is on understanding the physical systems behind the optimization algorithms. The two perspectives are essential and complementary to each other.

Before we proceed, we introduce our assumptions on the objective $f$.

**Assumption 1.1** ($L$-smooth). There exists a constant $L > 0$ such that for any $x, y \in \mathbb{R}^d$, we have $\|\nabla f(x) - \nabla f(y)\| \le L\|x - y\|$.

**Assumption 1.2** (µ-strongly convex). There exists a constant $\mu > 0$ such that for any $x, y \in \mathbb{R}^d$, we have $f(x) \ge f(y) + \langle \nabla f(y), x - y \rangle + \frac{\mu}{2}\|x - y\|^2$.

**Assumption 1.3** ($L_{\max}$-coordinate-smooth). There exists a constant $L_{\max} > 0$ such that for any $x, y \in \mathbb{R}^d$, we have $|\nabla_j f(x) - \nabla_j f(x_{\backslash j}, y_j)| \le L_{\max}|x_j - y_j|$ for all $j = 1, \ldots, d$.

The $L_{\max}$-coordinate-smooth condition is satisfied by many machine learning problems, such as ridge regression and logistic regression. For convenience, we define $\kappa = L/\mu$ and $\kappa_{\max} = L_{\max}/\mu$. Note that $L_{\max} \le L \le dL_{\max}$ and $\kappa_{\max} \le \kappa \le d\kappa_{\max}$.

## 2 From Optimization Algorithms to ODEs

We develop a unified representation for the continuous approximations of the aforementioned optimization algorithms. Our analysis is inspired by [16], where the NAG algorithm for general convex functions is approximated by an ordinary differential equation in the limit of infinitesimal time step. We start with VGD and NAG, and later show that RCGD and ARCG can be approximated by the same ODE. For self-containedness, we present a brief review of the popular optimization algorithms (VGD, NAG, RCGD, ARCG, and Newton) in Appendix A.

### 2.1 A Unified Framework for Continuous Approximation Analysis

With an infinitesimal step size in mind, we rewrite VGD and NAG in the following generic form:
$$x^{(k)} = y^{(k-1)} - \eta \nabla f(y^{(k-1)}) \quad \text{and} \quad y^{(k)} = x^{(k)} + \alpha\,(x^{(k)} - x^{(k-1)}). \tag{2.1}$$
For VGD, $\alpha = 0$; for NAG, $\alpha = \frac{1 - \sqrt{\mu\eta}}{1 + \sqrt{\mu\eta}}$ when $f$ is strongly convex, and $\alpha = \frac{k-1}{k+2}$ when $f$ is general convex. Eliminating $y$, we rewrite (2.1) as
$$x^{(k+1)} - x^{(k)} - \alpha\,(x^{(k)} - x^{(k-1)}) + \eta \nabla f\big(x^{(k)} + \alpha(x^{(k)} - x^{(k-1)})\big) = 0. \tag{2.2}$$
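Before taking the continuous-time limit, a minimal sketch of the generic update (2.1) may help fix ideas (Python; the toy objective $f(x) = \frac{1}{2}\|x\|^2$, step size, and iteration count are illustrative assumptions):

```python
import numpy as np

def generic_step(x_prev, x_curr, alpha_k, eta, grad):
    """One step of (2.1): a gradient step from the extrapolated point
    y = x_curr + alpha_k * (x_curr - x_prev)."""
    y = x_curr + alpha_k * (x_curr - x_prev)
    return x_curr, y - eta * grad(y)

grad = lambda x: x                     # toy objective f(x) = 0.5 * ||x||^2
x_prev = x_curr = np.array([1.0, -2.0])
for k in range(1, 200):
    alpha_k = (k - 1) / (k + 2)        # NAG schedule; alpha_k = 0 gives VGD
    x_prev, x_curr = generic_step(x_prev, x_curr, alpha_k, eta=0.1, grad=grad)
print(x_curr)                          # approaches the minimizer [0, 0]
```

Setting $\alpha_k = 0$ recovers VGD, while the $(k-1)/(k+2)$ schedule recovers NAG for general convex $f$.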
When considering the continuous-time limit of (2.2), it is not immediately clear how continuous time relates to the iteration count $k$. We therefore let $h$ denote a time scaling factor and study the possible choices of $h$ later on. We define a continuous time variable $t = kh$ with
$$X(t) = x^{(\lfloor t/h \rfloor)} = x^{(k)}, \tag{2.3}$$
where $k$ is the iteration index, and $X(t)$ for $t$ from $0$ to $\infty$ is a trajectory characterizing the dynamics of the algorithm. Throughout the paper, we may omit $(t)$ when it is clear from the context. Note that our definition (2.3) is very different from [16], where $t$ is defined as $t = k\sqrt{\eta}$, i.e., $h$ is fixed to $\sqrt{\eta}$. Our new definition has several advantages: (1) It leads to a unified analysis of both VGD and NAG. Specifically, if we followed the convention of [16], we would need to redefine $t = k\eta$ for VGD, which differs from $t = k\sqrt{\eta}$ for NAG. (2) It is more flexible and leads to a unified analysis of both gradient-type (VGD and NAG) and coordinate-gradient-type algorithms (RCGD and ARCG), regardless of their different step sizes, e.g., $\eta = 1/L$ for VGD and NAG, and $\eta = 1/L_{\max}$ for RCGD and ARCG. (3) It is equivalent to [16] only when $h = \sqrt{\eta}$. As we show later, however, $h = \sqrt{\eta}$ is a natural requirement of a massive particle system rather than an artificial choice of $h$.

We then derive the differential equation for (2.2). By the Taylor expansions
$$x^{(k+1)} - x^{(k)} = \dot{X}(t)h + \tfrac{1}{2}\ddot{X}(t)h^2 + o(h^2), \qquad x^{(k)} - x^{(k-1)} = \dot{X}(t)h - \tfrac{1}{2}\ddot{X}(t)h^2 + o(h^2),$$
and
$$\eta \nabla f\big(x^{(k)} + \alpha(x^{(k)} - x^{(k-1)})\big) = \eta \nabla f(X(t)) + O(\eta h),$$
where $\dot{X}(t) = \frac{dX(t)}{dt}$ and $\ddot{X}(t) = \frac{d^2X(t)}{dt^2}$, we can rewrite (2.2) as
$$\frac{(1+\alpha)h^2}{2\eta}\ddot{X}(t) + \frac{(1-\alpha)h}{\eta}\dot{X}(t) + \nabla f(X(t)) + O(h) = 0. \tag{2.4}$$
Taking the limit $h \to 0$, we rewrite (2.4) in the more convenient form
$$m\ddot{X}(t) + c\dot{X}(t) + \nabla f(X(t)) = 0. \tag{2.5}$$
Equation (2.5) describes exactly a damped oscillator system in $d$ dimensions, with $m := \frac{(1+\alpha)h^2}{2\eta}$ as the particle mass, $c := \frac{(1-\alpha)h}{\eta}$ as the damping coefficient, and $f(x)$ as the potential field.

Let us now consider how to choose $h$ in different settings. The basic principle is that both $m$ and $c$ must remain finite in the limit $h, \eta \to 0$; in other words, the physical system must be valid. Take VGD as an example, for which $\alpha = 0$. In this case, the only valid setting is $h = \Theta(\eta)$, under which $m \to 0$ and $c \to c_0$ for some constant $c_0$. We call such a particle system massless. For NAG, it can likewise be verified that only $h = \Theta(\sqrt{\eta})$ results in a valid physical system, and the system is massive ($0 < m < \infty$, $0 \le c < \infty$). We thus have a unified framework for choosing the correct time scaling factor $h$.
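To make these scaling choices concrete, a small numeric sketch (using the mappings $m = \frac{(1+\alpha)h^2}{2\eta}$ and $c = \frac{(1-\alpha)h}{\eta}$ from (2.4), with $\mu = 1$ as an illustrative value):

```python
import numpy as np

def mass_damping(alpha, eta, h):
    """Particle mass and damping coefficient induced by (2.4)/(2.5)."""
    m = (1 + alpha) * h ** 2 / (2 * eta)
    c = (1 - alpha) * h / eta
    return m, c

mu = 1.0
for eta in [1e-2, 1e-4, 1e-6]:
    # VGD: alpha = 0 with h = eta. Massless limit: m -> 0, c -> 1.
    m_vgd, c_vgd = mass_damping(0.0, eta, h=eta)
    # NAG (strongly convex): alpha = (1 - sqrt(mu*eta)) / (1 + sqrt(mu*eta))
    # with h = sqrt(eta). Massive limit: m -> 1, c -> 2*sqrt(mu) = 2*sqrt(m*mu).
    a = (1 - np.sqrt(mu * eta)) / (1 + np.sqrt(mu * eta))
    m_nag, c_nag = mass_damping(a, eta, h=np.sqrt(eta))
    print(f"eta={eta:.0e}  VGD (m, c)=({m_vgd:.1e}, {c_vgd:.2f})  "
          f"NAG (m, c)=({m_nag:.3f}, {c_nag:.3f})")
```

As $\eta$ decreases, VGD's mass vanishes while NAG's $(m, c)$ approaches a pair with $c^2 \approx 4m\mu$, i.e., near critical damping, foreshadowing the discussion below.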
### 2.2 A Physical System: Damped Harmonic Oscillator

In classical mechanics, the harmonic oscillator is one of the first mechanical systems that admits an exact solution. The system consists of a massive particle and a restoring force; a typical example is a massive particle attached to a massless spring. The spring always tends to return to its equilibrium position: when it is stretched or compressed, it exerts a force on the particle, always pointing toward the equilibrium. The energy stored in the spring is
$$V(X) := \tfrac{1}{2}KX^2,$$
where $X$ denotes the displacement of the spring and $K$ is the Hooke's constant of the spring. In the physics literature, $V(X)$ is called the potential energy.

*Figure 1: An illustration of harmonic oscillators: a massive particle (mass $m$) attached to a massless spring, with the equilibrium position marked. (Top) Undamped harmonic oscillator; (Bottom) Damped harmonic oscillator with damping coefficient $c$.*

When one end of the spring is attached to a fixed point and the other end to a freely moving particle of mass $m$, we obtain a harmonic oscillator, as illustrated in Figure 1. If there is no friction on the particle, then by Newton's law the system is described by the differential equation
$$m\ddot{X} + KX = 0,$$
where $\ddot{X} := d^2X/dt^2$ is the acceleration of the particle. If we compress the spring and release it at point $x_0$, the system starts oscillating: at time $t$, the position of the particle is $X(t) = x_0\cos(\omega t)$, where $\omega = \sqrt{K/m}$ is the oscillation frequency. Such a system has two physical properties: (1) the total energy $E(t) := V(X(t)) + \mathcal{K}(\dot{X}(t)) = V(x_0)$ is constant, where $\mathcal{K}(\dot{X}) := \frac{1}{2}m\dot{X}^2$ is the kinetic energy of the system; this is called energy conservation in physics. (2) The system never stops.

The harmonic oscillator is closely related to optimization algorithms. As we show later, all of the aforementioned optimization algorithms simply simulate a system in which a particle falls inside a given potential. From an optimization perspective, the equilibrium is essentially the minimizer of the quadratic potential function $V(x) = \frac{1}{2}Kx^2$, and the desired behavior is that the particle stops at the minimizer. A simple harmonic oscillator, however, does not correspond to a convergent algorithm, since the system never stops: at the equilibrium the particle has its largest kinetic energy, and the inertia of the massive particle drives it away from the equilibrium.

One natural way to stop the particle at the equilibrium is to add damping to the system, which dissipates the mechanical energy, just as in real-world mechanics. A simple damping is a force proportional to the negative velocity of the particle (e.g., submerging the system in a viscous fluid), defined as $F_f = -c\dot{X}$, where $c$ is the viscous damping coefficient. If the potential energy of the system is $f(x)$, the differential equation of the system is
$$m\ddot{X} + c\dot{X} + \nabla f(X) = 0. \tag{2.6}$$
For the quadratic potential $f(x) = \frac{K}{2}\|x - x^*\|^2$, the energy exhibits exponential decay: $E(t) \propto \exp(-ct/(2m))$ for an underdamped or nearly critically damped system (i.e., $c^2 \lesssim 4mK$). For an overdamped system (i.e., $c^2 > 4mK$), the energy decay is
$$E(t) \propto \exp\!\Big(-\frac{\big(c - \sqrt{c^2 - 4mK}\big)t}{m}\Big).$$
In the extremely overdamped case $c^2 \gg 4mK$, we have $\frac{c - \sqrt{c^2 - 4mK}}{m} \approx \frac{2K}{c}$, so $E(t) \propto \exp(-2Kt/c)$: the decay no longer depends on the particle mass, and the system behaves as if the particle had no mass. In the language of optimization, the corresponding algorithm converges linearly. Note that when the system is underdamped or critically damped, the convergence rate depends only on the ratio $c/m$ and not on $K$. The fastest convergence is obtained when the system is critically damped, $c^2 = 4mK$.
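A minimal simulation of (2.6) for a scalar quadratic potential makes the three damping regimes visible (the mass, Hooke's constant, damping grid, and horizon are illustrative choices):

```python
import numpy as np
from scipy.integrate import solve_ivp

m, K = 1.0, 1.0                      # mass and Hooke's constant (illustrative)
c_crit = 2.0 * np.sqrt(m * K)        # critical damping: c^2 = 4 m K

def energy_at(c, t_end=20.0, x0=1.0):
    """Integrate m x'' + c x' + K x = 0 and return the final total energy."""
    rhs = lambda t, s: [s[1], -(c * s[1] + K * s[0]) / m]
    sol = solve_ivp(rhs, (0.0, t_end), [x0, 0.0], rtol=1e-8, atol=1e-10)
    x, v = sol.y[:, -1]
    return 0.5 * K * x ** 2 + 0.5 * m * v ** 2

for c in [0.2 * c_crit, c_crit, 5.0 * c_crit]:   # under / critical / over
    print(f"c = {c:4.1f}: E(20) = {energy_at(c):.3e}")
# The critically damped system dissipates energy fastest; heavy overdamping
# leaves a slow mode with rate about 2K/c, independent of the mass m.
```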
### 2.3 Sufficient Conditions for Convergence

For notational simplicity, we assume that $x^* = 0$ is a global minimum of $f$ with $f(x^*) = 0$. The potential energy of the particle system is simply defined as $V(t) := V(X(t)) := f(X(t))$. A sufficient condition for an algorithm to converge to the optimum is that the corresponding potential energy $V$ decreases over time; the rate of decrease determines the convergence rate of the algorithm.

**Theorem 2.1.** Let $\gamma(t) > 0$ be a nondecreasing function of $t$ and $\Gamma(t) \ge 0$ a nonnegative function. Suppose that $\gamma(t)$ and $\Gamma(t)$ satisfy
$$\frac{d\big(\gamma(t)(V(t) + \Gamma(t))\big)}{dt} \le 0 \quad \text{and} \quad \lim_{t \to 0^+} \gamma(t)\big(V(t) + \Gamma(t)\big) < \infty.$$
Then the convergence rate of the algorithm is characterized by $1/\gamma(t)$.

*Proof.* Since $\frac{d(\gamma(t)(V(t)+\Gamma(t)))}{dt} \le 0$, we have
$$\gamma(t)\big(V(t) + \Gamma(t)\big) \le \gamma(0^+)\big(f(X(0^+)) + \Gamma(0^+)\big).$$
This further implies
$$f(X(t)) \le V(t) + \Gamma(t) \le \frac{\gamma(0^+)\big(f(X(0^+)) + \Gamma(0^+)\big)}{\gamma(t)}. \qquad \square$$

In words, $\gamma(t)\big[V(t) + \Gamma(t)\big]$ serves as a Lyapunov function of the system. We say that an algorithm is $(1/\gamma)$-convergent if the potential energy decays at rate $O(1/\gamma)$. For example, $\gamma(t) = e^{at}$ corresponds to linear convergence and $\gamma(t) = at$ to sublinear convergence, where $a$ is a constant independent of $t$. In the following section, we apply Theorem 2.1 to different problems by choosing different $\gamma$'s and $\Gamma$'s.

## 3 Convergence Rate in Continuous Time

We now derive the convergence rates of different algorithms for different families of objective functions. Given our proposed framework, we only need to find $\gamma$ and $\Gamma$ to characterize the energy decay.

### 3.1 Convergence Analysis of VGD

We study the convergence of VGD for two classes of functions: (1) general convex functions, for which [11] has shown that VGD achieves $O(L/k)$ convergence; and (2) functions satisfying the Polyak-Łojasiewicz (PŁ) condition, defined as follows [14, 4].

**Assumption 3.1** (µ-PŁ). We say that $f$ satisfies the µ-PŁ condition if there exists a constant $\mu > 0$ such that for any $x \in \mathbb{R}^d$,
$$f(x) - f(x^*) \le \frac{1}{2\mu}\|\nabla f(x)\|^2.$$

[4] has shown that the PŁ condition is the weakest among the following conditions: strong convexity (SC), essential strong convexity (ESC), weak strong convexity (WSC), the restricted secant inequality (RSI), and the error bound (EB). Thus, the convergence analysis under the PŁ condition naturally extends to all of the above conditions. Please refer to [4] for detailed definitions and analyses, as well as various examples in machine learning satisfying this condition.

#### 3.1.1 Sublinear Convergence for General Convex Functions

Choosing $\Gamma(t) = \frac{c\|X\|^2}{2t}$ and $\gamma(t) = t$, and using $c\dot{X} = -\nabla f(X)$ for the massless system, we have
$$\frac{d\big(\gamma(t)(V(t) + \Gamma(t))\big)}{dt} = f(X(t)) + t\big\langle \nabla f(X(t)), \dot{X}(t)\big\rangle + \big\langle \dot{X}(t), cX(t)\big\rangle = f(X(t)) - \big\langle \nabla f(X(t)), X(t)\big\rangle - \frac{t}{c}\|\nabla f(X(t))\|^2 \le 0,$$
where the last inequality follows from the convexity of $f$. Thus, Theorem 2.1 implies
$$f(X(t)) \le \frac{c\|x_0\|^2}{2t}. \tag{3.1}$$
Plugging $t = kh$ and $c = h/\eta$ into (3.1) and setting $\eta = 1/L$, we match the convergence rate in [11]:
$$f(x^{(k)}) \le \frac{c\|x_0\|^2}{2kh} = \frac{L\|x_0\|^2}{2k}. \tag{3.2}$$

#### 3.1.2 Linear Convergence Under the Polyak-Łojasiewicz Condition

For VGD, (2.5) implies $\dot{X} = -\frac{1}{c}\nabla f(X(t))$. Choosing $\Gamma(t) = 0$ and $\gamma(t) = \exp\big(\frac{2\mu t}{c}\big)$, we obtain
$$\frac{d\big(\gamma(t)(V(t) + \Gamma(t))\big)}{dt} = \gamma(t)\Big(\frac{2\mu}{c} f(X(t)) + \big\langle \nabla f(X(t)), \dot{X}(t)\big\rangle\Big) = \gamma(t)\Big(\frac{2\mu}{c} f(X(t)) - \frac{1}{c}\|\nabla f(X(t))\|^2\Big).$$
By the µ-PŁ condition, $f(X(t)) \le \frac{1}{2\mu}\|\nabla f(X(t))\|^2$ for all $t$, so
$$\frac{d\big(\gamma(t)(V(t) + \Gamma(t))\big)}{dt} \le 0.$$
By Theorem 2.1, for some constant $C$ depending on $x_0$, we obtain
$$f(X(t)) \le C\exp\Big(-\frac{2\mu t}{c}\Big), \tag{3.3}$$
which matches the behavior of an extremely overdamped harmonic oscillator. Plugging $t = kh$ and $c = h/\eta$ into (3.3) and setting $\eta = 1/L$, we match the convergence rate in [4]:
$$f(x^{(k)}) \le C\exp\Big(-\frac{2\mu}{L}k\Big) \tag{3.4}$$
for some constant $C$ depending on $x^{(0)}$.
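As a quick numeric check of (3.4), one can run VGD on a strongly convex quadratic (strong convexity implies the µ-PŁ condition); all problem data below are illustrative:

```python
import numpy as np

A = np.diag([10.0, 1.0])            # f(x) = 0.5 x' A x, so L = 10, mu = 1
L, mu = 10.0, 1.0
f = lambda x: 0.5 * x @ A @ x
x0 = np.array([1.0, 1.0])
x, eta = x0.copy(), 1.0 / L
for k in range(200):
    x = x - eta * (A @ x)           # VGD step
bound = f(x0) * np.exp(-2 * mu / L * 200)
print(f"f(x_200) = {f(x):.3e}  vs  PL-condition bound ~ {bound:.3e}")
```

The observed value sits below the predicted envelope $C\exp(-2\mu k/L)$ with $C = f(x^{(0)})$, as (3.4) requires.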
### 3.2 Convergence Analysis of NAG

We study the convergence of NAG for a class of convex functions satisfying the Polyak-Łojasiewicz (PŁ) condition. The convergence of NAG for general convex functions has been studied in [16] and is therefore omitted. [11] has shown that NAG achieves linear convergence for strongly convex functions; our analysis shows that strong convexity can be relaxed, as it is for VGD. In contrast to VGD, however, NAG requires $f$ to be convex. For an $L$-smooth convex function satisfying the µ-PŁ condition, the particle mass and damping coefficient are
$$m = \frac{h^2}{\eta} \quad \text{and} \quad c = \frac{2\sqrt{\mu}\,h}{\sqrt{\eta}} = 2\sqrt{m\mu}.$$
By [4], under convexity, PŁ is equivalent to quadratic growth (QG). Formally, we assume that $f$ satisfies the following condition.

**Assumption 3.2** (µ-QG). We say that $f$ satisfies the µ-QG condition if there exists a constant $\mu > 0$ such that for any $x \in \mathbb{R}^d$,
$$f(x) - f(x^*) \ge \frac{\mu}{2}\|x - x^*\|^2.$$

We then proceed with the proof for NAG. We first define two parameters, $\lambda$ and $\sigma$, and let $\gamma(t) = \exp(\lambda ct)$ and $\Gamma(t) = \frac{m}{2}\|\dot{X} + \sigma cX\|^2$. For properly chosen $\lambda$ and $\sigma$, we show that the condition required in Theorem 2.1 is satisfied. Recall that our proposed physical system has kinetic energy $\frac{m}{2}\|\dot{X}(t)\|^2$; in contrast to the undamped system, NAG takes an effective velocity $\dot{X} + \sigma cX$ in the viscous fluid. By simple manipulation,
$$\frac{d(V(t) + \Gamma(t))}{dt} = \big\langle \nabla f(X), \dot{X}\big\rangle + m\big\langle \ddot{X} + \sigma c\dot{X},\ \dot{X} + \sigma cX\big\rangle.$$
We then observe
$$\exp(-\lambda ct)\,\frac{d\big(\gamma(t)(V(t) + \Gamma(t))\big)}{dt} = \lambda c f(X) + \frac{\lambda cm}{2}\|\dot{X} + \sigma cX\|^2 + \frac{d(V(t) + \Gamma(t))}{dt} \le \lambda c\Big(1 + \frac{m\sigma^2c^2}{\mu}\Big) f(X) + \Big\langle \dot{X},\ \Big(\frac{\lambda cm}{2} + m\sigma c\Big)\dot{X} + \nabla f(X) + m\ddot{X}\Big\rangle + \big\langle X,\ (\lambda\sigma mc^2 + m\sigma^2c^2)\dot{X} + m\sigma c\ddot{X}\big\rangle,$$
where the inequality applies the µ-QG condition $\|X\|^2 \le \frac{2}{\mu}f(X)$ to the $\|X\|^2$ term arising from expanding $\|\dot{X} + \sigma cX\|^2$. Since $c^2 = 4m\mu$, we argue that if positive $\sigma$ and $\lambda$ satisfy
$$m(\lambda + \sigma) = 1 \quad \text{and} \quad \lambda\Big(1 + \frac{m\sigma^2c^2}{\mu}\Big) \le \sigma, \tag{3.5}$$
then we guarantee $\frac{d(\gamma(t)(V(t)+\Gamma(t)))}{dt} \le 0$. Indeed, substituting $m\ddot{X} = -c\dot{X} - \nabla f(X)$ from (2.5) and using $m(\lambda + \sigma) = 1$, we obtain
$$\Big\langle \dot{X},\ \Big(\frac{\lambda cm}{2} + m\sigma c\Big)\dot{X} + \nabla f(X) + m\ddot{X}\Big\rangle = -\frac{\lambda mc}{2}\|\dot{X}\|^2 \le 0$$
and
$$\big\langle X,\ (\lambda\sigma mc^2 + m\sigma^2c^2)\dot{X} + m\sigma c\ddot{X}\big\rangle = -\sigma c\big\langle X, \nabla f(X)\big\rangle.$$
By convexity of $f$, $\langle X, \nabla f(X)\rangle \ge f(X)$, so
$$\lambda c\Big(1 + \frac{m\sigma^2c^2}{\mu}\Big) f(X) - \sigma c\big\langle X, \nabla f(X)\big\rangle \le \sigma c f(X) - \sigma c\big\langle X, \nabla f(X)\big\rangle \le 0.$$
To make (3.5) hold, it suffices to set $\sigma = \frac{4}{5m}$ and $\lambda = \frac{1}{5m}$. By Theorem 2.1, we obtain
$$f(X(t)) \le C\exp\Big(-\frac{ct}{5m}\Big) \tag{3.6}$$
for some constant $C$ depending on $x^{(0)}$. Plugging $t = hk$, $m = \frac{h^2}{\eta}$, $c = 2\sqrt{m\mu}$, and $\eta = \frac{1}{L}$ into (3.6), we have
$$f(x^{(k)}) \le C\exp\Big(-\frac{2}{5}\sqrt{\frac{\mu}{L}}\,k\Big). \tag{3.7}$$
Compared with VGD, NAG improves the constant term in the convergence rate for convex functions satisfying the PŁ condition from $L/\mu$ to $\sqrt{L/\mu}$. This matches the algorithmic proof of [11] for strongly convex functions, and of [19] for convex functions satisfying the QG condition.

### 3.3 Convergence Analysis of RCGD and ARCG

Our proposed framework also justifies the convergence analysis of the RCGD and ARCG algorithms. We will show that the trajectory of RCGD converges weakly to that of VGD, so that our analysis for VGD applies directly. Conditioned on $x^{(k)}$, the RCGD update is
$$x^{(k)}_i = x^{(k-1)}_i - \eta \nabla_i f(x^{(k-1)}) \quad \text{and} \quad x^{(k)}_{\backslash i} = x^{(k-1)}_{\backslash i}, \tag{3.8}$$
where $\eta$ is the step size and the coordinate $i$ is selected uniformly at random from $\{1, 2, \ldots, d\}$. Fixing a coordinate $i$, we compute the expectation and variance of its increment:
$$\mathbb{E}\big[x^{(k)}_i - x^{(k-1)}_i \,\big|\, x^{(k-1)}\big] = -\frac{\eta}{d}\nabla_i f(x^{(k-1)}) \quad \text{and} \quad \mathrm{Var}\big[x^{(k)}_i - x^{(k-1)}_i \,\big|\, x^{(k-1)}\big] = \frac{\eta^2(d-1)}{d^2}\big(\nabla_i f(x^{(k-1)})\big)^2.$$
We take the infinitesimal time scaling factor $h = \Theta(\eta)$ as in Section 2.1 and write $\widetilde{X}^h(t) := x^{(\lfloor t/h \rfloor)}$. We prove that for each $i \in [d]$, $\widetilde{X}^h_i(t)$ converges weakly to a deterministic function $X_i(t)$ as $\eta \to 0$. Specifically, we rewrite (3.8) as
$$\widetilde{X}^h_i(t + h) - \widetilde{X}^h_i(t) = -\eta \nabla_i f\big(\widetilde{X}^h(t)\big) \quad \text{for the selected coordinate } i. \tag{3.9}$$
Taking the limit $\eta \to 0$ at a fixed time $t$, we have $|X_i(t + h) - X_i(t)| = O(\eta)$ and
$$\frac{1}{\eta}\,\mathbb{E}\big[\widetilde{X}^h(t + h) - \widetilde{X}^h(t) \,\big|\, \widetilde{X}^h(t)\big] = -\frac{1}{d}\nabla f\big(\widetilde{X}^h(t)\big) + O(h).$$
Since $\|\nabla f(\widetilde{X}^h(t))\|^2$ is bounded at time $t$, we also have $\frac{1}{\eta}\mathrm{Var}\big[\widetilde{X}^h(t+h) - \widetilde{X}^h(t) \,\big|\, \widetilde{X}^h(t)\big] = O(h)$. Using an infinitesimal generator argument [1], we conclude that $\widetilde{X}^h(t)$ converges weakly, as $h \to 0$, to the $X(t)$ satisfying
$$\dot{X}(t) + \frac{1}{d}\nabla f(X(t)) = 0 \quad \text{with} \quad X(0) = x^{(0)}.$$
Since $\eta \le \frac{1}{L_{\max}}$, by (3.4) we have
$$f(x^{(k)}) \le C_1\exp\Big(-\frac{2\mu}{dL_{\max}}k\Big)$$
for some constant $C_1$ depending on $x^{(0)}$. The analysis for general convex functions follows similarly; matching the rate as in (3.2), we obtain
$$f(x^{(k)}) \le \frac{c\|x_0\|^2}{2kh} = \frac{dL_{\max}\|x_0\|^2}{2k}.$$
Repeating the above argument for ARCG, we obtain that the trajectory $\widetilde{X}^h(t)$ converges weakly to the $X(t)$ satisfying
$$m\ddot{X}(t) + c\dot{X}(t) + \nabla f(X(t)) = 0.$$
For general convex functions, we have $m = \frac{h^2}{\eta'}$ and $c = \frac{3m}{t}$, where $\eta' = \frac{\eta}{d}$. By the analysis of [16], we have
$$f(x^{(k)}) \le \frac{C_2 d}{k^2}$$
for some constant $C_2$ depending on $x^{(0)}$ and $L_{\max}$. For convex functions satisfying the µ-QG condition, $m = \frac{h^2}{\eta'}$ and $c = 2\sqrt{m\mu/d}$. By (3.7), we obtain
$$f(x^{(k)}) \le C_3\exp\Big(-\frac{2}{5d}\sqrt{\frac{\mu}{L_{\max}}}\,k\Big)$$
for some constant $C_3$ depending on $x^{(0)}$.
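The weak-convergence claim above can be illustrated numerically: for small $\eta$, a single RCGD sample path tracks the rescaled gradient flow $\dot{X} = -\frac{1}{d}\nabla f(X)$ (the objective, step size, and horizon below are illustrative choices):

```python
import numpy as np
from scipy.integrate import solve_ivp

A = np.array([[3.0, 0.5], [0.5, 1.0]])          # f(x) = 0.5 x' A x
grad = lambda x: A @ x
d, eta, T = 2, 1e-3, 2.0
rng = np.random.default_rng(0)

x = np.array([1.0, -2.0])
for _ in range(int(T / eta)):                   # h = eta, so k = T / h steps
    i = rng.integers(d)
    x[i] -= eta * grad(x)[i]                    # one RCGD step

flow = solve_ivp(lambda t, z: -grad(z) / d, (0.0, T), [1.0, -2.0], rtol=1e-8)
print("RCGD:", x, " ODE:", flow.y[:, -1])       # nearly identical for small eta
```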
### 3.4 Convergence Analysis of Newton's Method

Newton's method is a second-order algorithm. Although it differs from both VGD and NAG, we can fit it into our proposed framework by choosing $\eta = \frac{1}{L}$ and replacing the gradient with $L\big(\nabla^2 f(X)\big)^{-1}\nabla f(X)$. We consider only the case where $f$ is µ-strongly convex, $L$-smooth, and ν-self-concordant. By (2.5), if $h/\eta$ does not vanish in the limit $h \to 0$, we arrive at a similar equation,
$$\mathcal{C}\dot{X} + \nabla f(X) = 0, \quad \text{where} \quad \mathcal{C} = h\nabla^2 f(X)$$
is the viscosity tensor of the system. In such a system, the function $f$ determines not only the gradient field but also a viscosity tensor field: the particle moves as if submerged in an anisotropic fluid that exhibits different viscosities along different directions. We release the particle at a point $x_0$ sufficiently close to the minimizer $0$, i.e., $\|x_0 - 0\| \le \zeta$ for some parameter $\zeta$ determined by $\nu$, $\mu$, and $L$. Now consider the decay of the potential energy $V(X) := f(X)$. Applying Theorem 2.1 with $\gamma(t) = \exp\big(\frac{t}{2h}\big)$ and $\Gamma(t) = 0$, we need
$$\frac{d\big(\gamma(t)f(X)\big)}{dt} = \gamma(t)\Big(\frac{1}{2h}f(X) - \frac{1}{h}\big\langle \nabla f(X), (\nabla^2 f(X))^{-1}\nabla f(X)\big\rangle\Big) \le 0.$$
By simple calculus, $\nabla f(X) = \big(\int_0^1 \nabla^2 f((1 - s)X)\,ds\big)X$. By the self-concordance condition,
$$(1 - \nu s\|X\|_X)^2\,\nabla^2 f(X) \preceq \nabla^2 f((1 - s)X) \preceq \frac{1}{(1 - \nu s\|X\|_X)^2}\,\nabla^2 f(X),$$
where $\|v\|_X^2 = v^\top \nabla^2 f(X)\,v \in [\mu\|v\|^2, L\|v\|^2]$. Let $\beta = \nu\zeta\sqrt{L}$, which can be made a small constant (say $\beta \le 1/2$) by choosing $\zeta$ small. By integration and the convexity of $f$,
$$(1 - \beta)\,\nabla^2 f(X) \preceq \int_0^1 \nabla^2 f((1 - s)X)\,ds \preceq \frac{1}{1 - \beta}\,\nabla^2 f(X) \quad \Longrightarrow \quad \big\langle \nabla f(X), (\nabla^2 f(X))^{-1}\nabla f(X)\big\rangle \ge \frac{1}{2}\big\langle \nabla f(X), X\big\rangle \ge 0,$$
and convexity further gives $\langle \nabla f(X), X\rangle \ge f(X)$, so the required inequality holds. Theorem 2.1 then yields $f(X(t)) \le C\exp(-t/(2h))$, i.e., $f(x^{(k)}) \le C\exp(-k/2)$ after substituting $t = kh$.

Note that our proposed ODE framework only proves local linear convergence for Newton's method under the strong convexity, smoothness, and self-concordance conditions. The convergence rate contains an absolute constant that does not depend on $\mu$ and $L$. This partially justifies the superior local convergence of Newton's method on ill-conditioned problems with very small $\mu$ and very large $L$. The existing literature, however, proves local quadratic convergence of Newton's method, which is better than our ODE-type analysis. This is mainly because the discrete algorithmic analysis takes advantage of large step sizes, whereas the ODE only characterizes small step sizes and therefore fails to capture quadratic convergence.

## 4 Numerical Simulations

We present an illustration of our theoretical analysis in Figure 2. We consider the strongly convex quadratic program
$$f(x) = \frac{1}{2}x^\top Hx, \quad \text{where} \quad H = \begin{pmatrix} 300 & 1 \\ 1 & 50 \end{pmatrix}.$$
Clearly, $f(x)$ is strongly convex and $x^* = [0, 0]^\top$ is the minimizer. We choose $\eta = 10^{-4}$ for VGD and NAG, and $\eta = 2 \times 10^{-4}$ for RCGD and ARCG. The trajectories of VGD and NAG are obtained by MATLAB's default ODE solver.

*Figure 2: The algorithmic iterates and continuous trajectories on a simple quadratic program.*
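A sketch reproducing the VGD part of this experiment (in Python with `scipy`, whereas the paper used MATLAB's ODE solver; the iteration count and tolerances are illustrative choices):

```python
import numpy as np
from scipy.integrate import solve_ivp

H = np.array([[300.0, 1.0], [1.0, 50.0]])
grad = lambda x: H @ x
x0 = np.array([1.0, 1.0])
eta = 1e-4

# VGD iterates, and the corresponding massless-ODE trajectory c X' = -grad f
# with the scaling h = eta (so c = h / eta = 1 and X(k h) should track x^(k)).
x = x0.copy()
for k in range(2000):
    x = x - eta * grad(x)
h = eta
ode = solve_ivp(lambda t, z: -grad(z), (0.0, 2000 * h), x0, rtol=1e-9)
print("VGD iterate:", x)
print("ODE state  :", ode.y[:, -1])   # the two nearly coincide for small eta
```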
## 5 Discussions

We conclude with a more detailed interpretation of our proposed system from a physics perspective.

**Consequence of particle mass.** As shown in Section 2, a massless particle system ($m = 0$) describes the vanilla gradient descent algorithm. By Newton's law, a zero-mass particle can achieve infinite acceleration and has an infinitesimal response time to any force acting on it. Thus the particle is locked onto the force field (the gradient field) of the potential $f$: the velocity of the particle is always proportional to the restoring force acting on it. The convergence rate of the algorithm is determined only by the function $f$ and the damping coefficient. The mechanical energy is stored entirely in the force field (as potential energy) rather than in kinetic energy. In a massive particle system, by contrast, the mechanical energy is also partially stored in the kinetic energy of the particle; therefore, even when the force field is weak, the particle can maintain a high speed.

**Damping and convergence rate.** For a quadratic potential $V(x) = \frac{\mu}{2}\|x\|^2$, the system exhibits exponential energy decay, with an exponent factor that depends on the mass $m$, the damping coefficient $c$, and the properties of the function (e.g., the PŁ coefficient). As discussed in Section 2, the decay is fastest when the system is critically damped, i.e., $c^2 = 4m\mu$; for either underdamped or overdamped systems, the decay is slower. For a potential function $f$ satisfying convexity and the µ-PŁ condition, NAG corresponds to a nearly critically damped system, whereas VGD corresponds to an extremely overdamped system, i.e., $c^2 \gg 4m\mu$. Moreover, we can achieve different rates of acceleration by choosing different $m/c$ ratios for NAG, i.e.,
$$\alpha = \frac{1 - (\mu\eta)^s}{1 + (\mu\eta)^s}$$
for some absolute constant $s > 0$. However, $s = 1/2$ achieves the fastest convergence rate, since it corresponds exactly to critical damping, $c^2 = 4m\mu$.

**Connecting the PŁ condition to Hooke's law.** Together, the µ-PŁ and convexity conditions naturally mimic the properties of a quadratic potential $V$, i.e., a damped harmonic oscillator. Specifically, the µ-PŁ condition
$$\frac{1}{2\mu}\|\nabla V(X)\|^2 \ge V(X)$$
guarantees that the force field is strong enough: the left-hand side is exactly the potential energy of a spring under Hooke's law, with Hooke's constant $\mu$, force $\nabla V(X)$, and corresponding displacement $\nabla V(X)/\mu$. Moreover, the convexity condition $V(X) \le \langle \nabla V(X), X\rangle$ guarantees that the force field has a large component pointing toward the equilibrium (acting as a restoring force). As indicated in [4], PŁ is a much weaker condition than strong convexity; some functions satisfying a local PŁ condition are not even convex, e.g., objectives arising in matrix factorization. The connection between the PŁ condition and Hooke's law indicates that strong convexity is not the fundamental characterization of linear convergence: any other condition that enforces a form of Hooke's law should yield linear convergence as well.

## References

[1] Stewart N. Ethier and Thomas G. Kurtz. *Markov Processes: Characterization and Convergence*, volume 282. John Wiley & Sons, 2009.

[2] Olivier Fercoq and Peter Richtárik. Accelerated, parallel, and proximal coordinate descent. *SIAM Journal on Optimization*, 25(4):1997–2023, 2015.

[3] Pinghua Gong and Jieping Ye. Linear convergence of variance-reduced stochastic gradient without strong convexity. arXiv preprint arXiv:1406.1102, 2014.

[4] Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. In *Joint European Conference on Machine Learning and Knowledge Discovery in Databases*, pages 795–811. Springer, 2016.

[5] Qihang Lin, Zhaosong Lu, and Lin Xiao. An accelerated proximal coordinate gradient method. In *Advances in Neural Information Processing Systems*, pages 3059–3067, 2014.

[6] Ji Liu and Stephen J. Wright. Asynchronous stochastic coordinate descent: Parallelism and convergence properties. *SIAM Journal on Optimization*, 25(1):351–376, 2015.

[7] Zhaosong Lu and Lin Xiao. On the complexity analysis of randomized block-coordinate descent methods. *Mathematical Programming*, 152(1-2):615–642, 2015.
[8] Zhi-Quan Luo and Paul Tseng. Error bounds and convergence analysis of feasible descent methods: a general approach. *Annals of Operations Research*, 46(1):157–178, 1993.

[9] Ion Necoara, Yurii Nesterov, and François Glineur. Linear convergence of first order methods for non-strongly convex optimization. arXiv preprint arXiv:1504.06298, 2015.

[10] Yurii Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. *SIAM Journal on Optimization*, 22(2):341–362, 2012.

[11] Yurii Nesterov. *Introductory Lectures on Convex Optimization: A Basic Course*, volume 87. Springer Science & Business Media, 2013.

[12] Jorge Nocedal and Stephen Wright. *Numerical Optimization*. Springer Science & Business Media, 2006.

[13] Boris T. Polyak. Some methods of speeding up the convergence of iteration methods. *USSR Computational Mathematics and Mathematical Physics*, 4(5):1–17, 1964.

[14] Boris T. Polyak. Gradient methods for minimizing functionals. *Zhurnal Vychislitel'noi Matematiki i Matematicheskoi Fiziki*, 3(4):643–653, 1963.

[15] Ralph Tyrell Rockafellar. *Convex Analysis*. Princeton University Press, 2015.

[16] Weijie Su, Stephen Boyd, and Emmanuel Candès. A differential equation for modeling Nesterov's accelerated gradient method: theory and insights. In *Advances in Neural Information Processing Systems*, pages 2510–2518, 2014.

[17] Andre Wibisono, Ashia C. Wilson, and Michael I. Jordan. A variational perspective on accelerated methods in optimization. arXiv preprint arXiv:1603.04245, 2016.

[18] Ashia C. Wilson, Benjamin Recht, and Michael I. Jordan. A Lyapunov analysis of momentum methods in optimization. arXiv preprint arXiv:1611.02635, 2016.

[19] Hui Zhang. New analysis of linear convergence of gradient-type methods via unifying error bound conditions. arXiv preprint arXiv:1606.00269, 2016.

[20] Hui Zhang and Wotao Yin. Gradient methods for convex minimization: better rates under weaker conditions. arXiv preprint arXiv:1303.4645, 2013.