# on_the_curved_geometry_of_accelerated_optimization__6fa1e572.pdf On the Curved Geometry of Accelerated Optimization Aaron Defazio Facebook AI Research New York In this work we propose a differential geometric motivation for Nesterov s accelerated gradient method (AGM) for strongly-convex problems. By considering the optimization procedure as occurring on a Riemannian manifold with a natural structure, The AGM method can be seen as the proximal point method applied in this curved space. This viewpoint can also be extended to the continuous time case, where the accelerated gradient method arises from the natural block-implicit Euler discretization of an ODE on the manifold. We provide an analysis of the convergence rate of this ODE for quadratic objectives. 1 Introduction The core algorithms of convex optimization are gradient descent (GD) and the accelerated gradient method (AGM). These methods are rarely used directly, more often they occur as the building blocks for distributed, composite, or non-convex optimization. In order to build upon these components, it is helpful to understand not just how they work, but why. The gradient method is well understood in this sense. It is commonly viewed as following a direction of steepest descent or as minimizing a quadratic upper bound. These interpretations provide a motivation for the method as well as suggesting a potential convergence proof strategy. The accelerated gradient method in contrast has an identity crisis. Its equational form is remarkably malleable, allowing for many different ways of writing the same updates. We list a number of these forms in Table 1. Nesterov s original motivation for the AGM method used the concept of estimate sequences. Unfortunately, estimate sequences do not necessarily yield the simplest accelerated methods when generalized, such as for the composite case (Beck and Teboulle 2009, Nesterov 2007), and they have not been successfully applied in the important finite-sum (variance reduced) optimization setting. Because of the complexity of estimate sequences, the AGM method is commonly motivated as a form of momentum. This view is flawed as a way of introducing the AGM method from first principles, as the most natural way of using momentum yields the heavy ball method instead: xk+1 = xk γ f xk + β xk xk 1 , which arises from discretizing the physics of a particle in a potential well with additional friction. The heavy-ball method does not achieve an accelerated convergence rate on general convex problems, suggesting that momentum, per se, is not the reason for acceleration. Another contemporary view is the linear-coupling interpretation of Allen-Zhu and Orecchia [2017], which views the AGM method as an interpolation between gradient descent and mirror descent. We take a more geometric view in our interpretation. In this work we motivate the AGM by introducing it as an application of the proximal-point method: xk = arg min x x xk 1 2o . 33rd Conference on Neural Information Processing Systems (Neur IPS 2019), Vancouver, Canada. The proximal point (PP) method is perhaps as foundational as the gradient descent method, although it sees even less direct use as each step requires solving a regularized subproblem, in contrast to the closed form steps for GD and AGM. The PP method gains remarkable convergence rate properties in exchange for the computational difficulty, including convergence for any positive step-size. We construct the AGM by applying a dual form of the proximal point method in a curved space. Each step follows a geodesic on a manifold in a sense we make precise in Section 4. We use the term curved with respect to a coordinate system, rather than a coordinate free notion of curvature such as the Riemannian curvature. We first give a brief introduction to the concepts from differential geometry necessary to understand our motivation. The equational form that our argument yields is much closer to those that have been successfully applied in practice, particularly for the minimization of finite sums [Lan and Zhou, 2017, Zhang and Xiao, 2017]. 2 Connections An (affine) connection is a type of structure on a manifold that can be used to define and compute geodesics. Geodesics in this sense represent curves of zero acceleration. These geodesics are more general concepts than Riemannian geodesics induced by the Riemannian connection, not necessarily representing the shortest path in any metric. Indeed, we will define multiple connections on the same manifold that lead to completely different geodesics. Given a n dimensional coordinate system, a connection is defined by n3 numbers at every point x on the manifold, called the connection coefficients (or Christoffel symbols) Γ k ij(x). A geodesic is a path γ : [0, 1] M (in our case M = Rn) between two points x and y can then be computed as the unique solution γ(t) = x(t) to the system of ordinary differential equations [Lee, 1997, Page 58, Eq 4.11]: d2γi dt2 .= d2xi j,k Γ i jk(x)dxj with boundary conditions x(0) = x and x(1) = y. Here xi denotes the ith component of x expressed in the same coordinate system as the connection. 3 Divergences induce Hessian manifold structure Let φ be a smooth strongly convex function defined on Rn. The Bregman divergence generated by φ: Bφ(x, y) = φ(x) φ(y) φ(y), x y , and its derivatives can be used to define a remarkable amount of structure on the domain of φ. In particular, we can define a Riemannian manifold, together with two dually flat connections with biorthogonal coordinate systems [Amari and Nagaoka, 2000, Shima, 2007]. This structure is also known as a Hessian manifold. Topologically it is M = Rn with the following additional geometric structures. Riemannian structure Riemannian manifolds have the additional structure of a metric tensor (a generalized dot-product), defined on their tangent spaces. We denote the vector space of tangent vectors at a point x as Tx M. If we express the tangent vectors with respect to the Euclidean basis, the metric at a point x is a quadratic form with the Hessian matrix H(x) = 2 x B(x, y) = 2φ(x) of φ at x: gx(u, v) = u T H(x)v. Biorthogonal coordinate systems Central to the notion of a manifold is the invariance to the choice of coordinate system. We can express a point on the manifold as well as a point in the tangent space using any coordinate system that is most convenient. Of course, when we wish to perform calculations on the manifold we must be careful to express all quantities in that coordinate system. Euclidean coordinates ei are the most natural on our Hessian manifold, however there is another coordinate system which is naturally dual to ei, and ties the manifold structure directly to duality theory in optimization. Table 1: Equivalent forms of Nesterov s method for µ-strongly convex, L-smooth f. Proofs of the stated relations are available in the appendix. Form Name Algorithm Relations Nesterov [2013] form I yk = αγvk + γxk xk+1 = yk 1 vk+1 = (1 α) vk + αµ µ/L γNes = µ. Nesterov [2013] form II xk+1 = yk 1 yk+1 = xk+1 + β xk+1 xk βNes = Sutskever et al. [2013] pk+1 = βpk 1 L f xk + βpk , xk+1 = xk + pk+1 pk+1 Sut = xk+1 Nes xk Nes, yk Nes = xk Sut + βpk Sut. Modern Momentum1 pk+1 = βpk + f(xk), xk+1 = xk 1 f(xk) + βpk+1 . xk mod = xk Sut + βpk Sut = yk Nes, pk mod = Lpk Sut. Auslender and Teboulle [2006] yk = (1 θ)ˆxk + θzk, zk+1 = zk γ ˆxk = (1 θ)ˆxk + θzk+1. θAT = 1 βNes, ˆxk AT = xk Nes, yk AT = yk Nes = xk mod, γAT = 1/L. Lan and Zhou [2017] xk = α(xk 1 xk 2) + xk 1, xk = xk + τxk 1 gk = f(xk), xk = xk 1 1 xk Lan = zk AT, xk Lan = yk AT, τLan = 1 θAT αLan = 1 θAT. Recall that for a convex function φ we may define the convex conjugate φ (y) = maxx { x, y φ(x)} . The dual coordinate system we define simply identifies each point x, when expressed in Euclidean ( primal ) coordinates, with the vector of dual coordinates: Our assumptions of smoothness and strong convexity imply this is a one-to-one mapping, with inverse given by x = φ (y). The remarkable fact that the gradient of the conjugate is the inverse of the gradient is a key building block of the theory in this paper. The notion of biorthogonality refers to natural tangent space coordinates of these two systems. A tangent vector v at a point x can be converted to a vector u of dual (tangent space) coordinates using matrix multiplication with the Hessian [Shima, 2007]: u = H(x)v, (1) Given the definition of the metric above, it is easy to see that if we have two vectors v1 and v2, we may express v2 in dual coordinates u2 so that the metric tensor takes the simple form: gx(v1, v2) = v T 1 H(x)v2 = v T 1 H(x) H(x) 1u2 = v T 1 u2, which is the biorthogonal relation between the two tangent space coordinate systems. Dual Flat Connections There is an obvious connection Γ (E) we can apply to the Hessian manifold, the Euclidean connection that trivially identifies straight lines in Rn as geodesics. Normally when we perform gradient descent 1Py Torch & Tensorflow (for instance) implement this form. Evaluating the gradient and function at the current iterate xk instead of a shifted point makes it more consistent with gradient descent when wrapped in a generic optimization layer. 0.2 0.1 0.0 0.1 0.2 0.3 Euclidean (primal) coordinates Riemannian geodesic Dual connection geodesic Euclidean geodesic 0.20 0.15 0.10 0.05 0.00 0.05 0.10 0.15 0.20 0.25 0.4 Dual coordinates Riemannian geodesic Dual connection geodesic Euclidean geodesic Figure 1: Illustrative geodesics for f(x) = 1 4 Ax 4 , with A = [2, 1; 1, 3]. Viewing them from both coordinate systems highlights the duality. Contour lines are for f and f respectively. in Rn we are implicitly following a geodesic of this connection. The connection coefficients Γ (E)k ij are all zero when this connection is expressed in Euclidean coordinates. A connection that has Γ k ij = 0 with respect to some coordinate system is a flat connection. The Hessian manifold admits another flat connection, which we will call the dual connection, as it corresponds to straight lines in the dual coordinate system established above. In particular each dual geodesic can be expressed in primal coordinates γ(t) as a solution to the equation: φ (γ(t)) = at + b, for vectors a, b representing the initial velocity and point respectively (both represented in dual coordinates) that depend on the boundary conditions. This is quite easy to solve using the relation φ 1 = φ discussed above. For instance, a geodesic γ : [0, 1] M between two arbitrary points x and y under the dual connection could be computed explicitly in Euclidean coordinates as: γ(t) = φ (t φ(y) + (1 t) φ(x)) . If we instead know the initial velocity we can find the endpoint with: y = φ φ(x) + H(xk)v . (2) The flatness of the dual connection Γ (D) is crucial to its computability in practice. If we instead try to compute the geodesic in Euclidean coordinates using the geodesic ODE, we have to work with the connection coefficients which involve third derivatives of φ (taking the form of double those of the Riemannian connection Γ (R)): Γ (D)k ij (x) = 2Γ (R)k ij = H(x) 1 ( H(x))i The Riemannian connection s geodesics are similarly difficult to compute directly from the ODE (they also can t generally be expressed in a simpler form). 4 Bregman proximal operators follow geodesics Bregman divergences arise in optimization primarily through their use in proximal steps. A Bregman proximal operation balances finding a minimizer of a given function f with maintaining proximity to a given point y, measured using a Bregman divergence instead of a distance metric: xk = arg min x f(x) + ρBφ(x, xk 1) . (3) A core application of this would be the mirror descent step [Nemirovski and Yudin, 1983, Beck and Teboulle, 2003], where the operation is applied to a linearized version of f instead of f directly: xk = arg min x x, f(xk 1) + ρBφ(x, xk 1) . Bregman proximal operations can be interpreted as geodesic steps with respect to the dual connection. The key idea is that given an input point xk 1, they output a point x such that the velocity of the connecting geodesic is equal to 1 ρf(x) at x. This velocity is measured in the flat coordinate system of the connection, the dual coordinates. To see why, consider a geodesic γ(t) = (1 t) φ(xk 1) + t φ(xk). Here xk 1 and xk are in primal coordinates and γ(t) is in dual coordinates. The velocity is d dtγ(t) = φ(xk) φ(xk 1). Contrast to the optimality condition of the Bregman prox (Equation 3): ρ f(xk) = φ(xk) φ(xk 1). For instance, when using the Euclidean penalty the step is: xk = arg minx f(x) + ρ 2 x xk 1 2 . The final velocity is just xk xk 1, and so xk xk 1 = 1 ρ f(xk), which is the solution of the proximal operation. 5 Primal-Dual form of the proximal point method The proximal point method is the building block from which we will construct the accelerated gradient method. Consider the basic form of the proximal point method applied to a strongly convex function f. At each step, the iterate xk is constructed from xk 1 by solving the proximal operation subproblem given an inverse step size parameter η: xk = arg min x x xk 1 2o . (4) This step can be considered an implicit form of the gradient step, where the gradient is evaluated at the end-point of the step instead of the beginning: xk = xk 1 1 which is just the optimality condition of the subproblem in Equation 4, found by taking the derivative f(x) + ηx ηxk 1 to be zero. A remarkable property of the proximal operation becomes apparent when we rearrange this formula, namely that the solution to the operation is not a single point but a primal-dual pair, whose weighted sum is equal to the input point: η f(xk) = xk 1. If we define gk = f(xk), the primal-dual pair obeys a duality relation: gk = f(xk) and xk = f (gk), which allows us to interchange primal and dual quantities freely. Indeed we may write the condition in a dual form as: η gk = xk 1, (5) which is the optimality condition for the proximal operation: gk = arg min g g ηxk 1 2 . Our goal in this section is to express the proximal point method in terms of a dual step, and while this equation involves the dual function f , it is not a step in the sense that gk is formed by a proximal operation from gk 1. We can manipulate this formula further to get an update of the form we want, by simply adding and subtracting gk 1 from 5: η gk 1 + xk 1 1 Which gives the updates: gk = arg min g f (g) g, xk 1 1 xk = xk 1 1 We call this the primal-dual form of the proximal point method. 6 Acceleration as a change of geometry The proximal point method is rarely used in practice due to the difficulty of computing the solution to the proximal subproblem. It is natural then to consider modifications of the subproblem to make it more tractable. The subproblem becomes particularly simple if we replace the proximal operation with a Bregman proximal operation with respect to f , gk = arg min g f (g) g, xk 1 1 η gk 1 + τBf (g, gk 1) . We have additionally changed the penalty parameter to a new constant τ, which is necessary as the change to the Bregman divergence changes the scaling of distances. We discuss this further below. Recall from Section 4 that Bregman proximal operations follow geodesics. The key idea is that we are now following a geodesic in the dual connection of φ = f , using the notation of Section 3, which is a straight-line in the primal coordinates of f due to the flatness of the connection (Section 3). Due to the flatness property, a simple closed-form solution can be derived by equating the derivative to 0: f (gk) xk 1 1 η gk 1 + τ f (gk) τ f (gk 1) = 0, therefore gk = f (1 + τ) 1 xk 1 1 η gk 1 + τ f (gk 1) . This formula gives gk in terms of the derivative of known quantities, as f (gk 1) is known from the previous step as the point at which we evaluated the derivative at. We will denote this argument to the derivative operation y, so that gk = f(yk). It no longer holds that gk = f(xk) after the change of divergence. Using this relation, y can be computed each step via the update: yk = xk 1 1 ηgk 1 + τyk 1 In order to match the accelerated gradient method exactly we need some additional flexibility in the step size used in the yk update. To this end we introduce an additional constant α in front of gk 1, which is 1 for the proximal point variant. The full method is as follows: Bregman form of the accelerated gradient method yk = xk 1 α η gk 1 + τyk 1 gk = f(yk), xk = xk 1 1 This is very close to the equational form of Nesterov s method explored by Lan and Zhou [2017], with the change that they assume an explicit regularizer is used, whereas we assume strong convexity of f. Indeed we have chosen our notation so that the constants match. This form is algebraically equivalent to other known forms of the accelerated gradient method for appropriate choice of constants. Table 1 shows the direct relation between the many known ways of writing the accelerated gradient method in the strongly-convex case (Proofs of these relations are in the Appendix). When f is µ-strongly convex and L-smooth, existing theory implies an accelerated geometric convergence rate of at least 1 p µ L for the parameter settings [Nesterov, 2013]: η = µL, τ = L η , α= τ 1+τ . In contrast, the primal-dual form of the proximal point method achieves at least that convergence rate for parameters: η = µL, τ = 1 The difference in τ arises from the difference in the scaling of the Bregman penalty compared to the Euclidean penalty. The Bregman generator f is strongly convex with constant 1/L whereas the Euclidean generator 1 2 2 is strongly convex with constant 1, so the change in scale requires rescaling by L. 6.1 Interpretations After the change in geometry, the g update no longer gives a dual point that is directly the gradient of the primal iterate. However, notice that the term we are attempting to minimize in the g step: f (g) g, xk 1 α has a fixed point of f gk = xk 1 α η gk, which is precisely an α-weight version of the proximal point s key property from Equation 5. Essentially we have relaxed the proximal-point method. Instead of this relation holding precisely at every step, we are instead constantly taking steps which pull g closer to satisfying it. 6.2 Inertial form The primal-dual view of the proximal point method can also be written in terms of the quantity zk 1 = xk 1 α η gk 1 instead of xk 1. This form is useful for the construction of ODEs that model the discrete dynamics. Under this change of variables the updates are: gk = arg min g f (g) g, zk 1 + 1 zk = zk 1 1 η gk gk 1 . (7) 6.3 Relation to the heavy ball method Consider Equation 6 with α = 0, which removes the over-extrapolation before the proximal operation. If we define β = τ 1+τ we may write the method as: xk = xk 1 1 η f (yk 1), yk = βyk 1 + (1 β) xk. We can eliminate xk from the yk update above by plugging in the xk step equation, then using the yk update from the previous step in the form (1 β) xk 1 = yk 1 βyk 2 : yk = βyk 1 + (1 β) xk 1 1 = βyk 1 (1 β) 1 η f (yk 1) + yk 1 βyk 2 = yk 1 (1 β) 1 η f (yk 1) + β yk 1 yk 2 . This has the exact form of the heavy ball method with step size (1 β) /η and momentum β. We can also derive the heavy ball method by starting from the saddle-point expression for f: min x f(x) = min x max g { x, g f (g)} . The alternating-block gradient descent/ascent method on the objective x, g f (g) with step-size γ is simply: gk = gk 1 + 1 γ xk 1 f (gk 1) , xk = xk 1 γgk. If we instead perform a Bregman proximal update in the dual geometry for the g part, we arrive at the same equations as we had for the primal-dual proximal point method but with α = 0, yielding the heavy ball method. In order to get the accelerated gradient method instead of the heavy ball method, the extra inertia that arises from starting from the proximal point method instead of the saddle point formulation appears to be crucial. 7 Dual geometry in continuous time The inertial form (Equation 7) of the proximal point method can be formulated as an ODE in a very natural way, by mapping zk zk 1 z and gk gk 1 g, and taking x and g to be at time t. This is the inverse of the Euler class of discretizations applied separately to the two terms, which is the most natural way to discretize an ODE. The resulting proximal point ODE is: g = fg(z, g, t) .= 1 τ f (g) + 1 z = fz(z, g, t) .= 1 We have suppressed the dependence on t of each quantity for notational simplicity. We can treat g more formally as a point g M on a Hessian manifold M. Then the solution for the g variable of the ODE is a curve γ(t) : I T M from an interval I to the tangent bundle on the manifold so the velocity γ(t) Tg M obeys the ODE: γ(t) = fg(z, g, t). The right hand side of the ODE is a point in the tangent space of the manifold at γ(t), expressed in Euclidean coordinates. We can now apply the same partial change of geometry that we used in the discrete case. We will consider the quantity fg(z, g, t) to be a tangent vector in dual tangent space coordinates For the φ = f Hessian manifold, instead of its primal tangent space coordinates (which would leave the ODE unchanged). The variable g remains in primal coordinates with respect to φ, so we must add to the ODE a change of coordinates for the tangent vector, yielding: g = 2f (g) 1 fg(z, g, t), where we have used the inverse of Equation 1, with φ = f . We can rewrite this as: fg(z, g, t) = 2f (g) g = d giving the AGM ODE system: d dt f (g) = 1 τ f (g) + 1 It is now easily seen that the implicit Euler update for the g variable with z fixed now corresponds to the solution of the Bregman proximal operation considered in the discrete case. So this ODE is a natural continuous time analogue to the accelerated gradient method. Convergence in continuous time 1.2 1.0 0.8 0.6 0.4 0.2 0.0 0.2 0.4 Prox ODE AGM ODE Prox AGM Figure 2: Paths for the quadratic problem f(x) = 1 2x T Ax with A = [2, 1; 1, 3]. The natural analogy to convergence in continuous time is known as the decay rate of the ODE. A sufficient condition for an ODE with parameters u = [z; g] to decay with constant ρ is: u(t) u exp ( tρ) u(0) u , where u is a fixed point. We can relate this to the discrete case by noting that exp( tρ) = limk (1 t kρ)k, so given our discrete-time convergence rate is proportional to (1 p µ/L)k, we would expect values of ρ proportional to p µ/L if the ODE behaves similarly to the discrete process. We have been able to establish this result for both the proximal and AGM ODEs for quadratic objectives (proof in the Appendix in the supplementary material). Theorem 1. The proximal and AGM ODEs decay with at least the following rates for µ-strongly convex and Lsmooth quadratic objective functions when using the same hyper-parameters as in the discrete case: ρprox µ µ+ Figure 2 contrasts the convergence of the discrete and continuous variants. The two methods have quite distinct paths whose shape is shared by their ODE counterparts. 8 Related Work The application of Bregman divergence to the analysis of continuous time views of the accelerated gradient method has recently been explored by Wibisono et al. [2016] and Wilson et al. [2018]. Their approaches do not use the Bregman divergence of f , a key factor of our approach. The Bregman divergence of a function φ occurs explicitly as a term in a Hamiltonian, in contrast to our view of φ as curving space. The accelerated gradient method has been shown to be modeled by a momentum of the form ODE X + c(t) X + f(x) = 0 by Su et al. [2014]. Natural discretizations of their ODE result in the heavy-ball method instead of the accelerated gradient method, unlike our form which can produce both based on the choice of α. The fine-grained properties of momentum ODEs have also been studied in the quadratic case by Scieur et al. [2017]. A primal-dual form of the regularized accelerated gradient method appears in Lan and Zhou [2017]. Our form can be seen as a special case of their form when the regularizer is zero. Our work extends theirs, providing an understanding of the role that geometry plays in unifying acceleration and implicit steps. The Riemannian connection induced by a function has been heavily explored in the optimization literature as part of the natural gradient method [Amari, 1998], although other connections on this manifold are less explored. The dual-flat connections have primarily seen use in the informationgeometry setting for optimization over distributions [Amari and Nagaoka, 2000]. The accelerated gradient method is not the only way to achieve accelerated rates among first order methods. Other techniques include the Geometric descent method of Bubeck et al. [2015], where a bounding ball is updated at each step that encloses two other balls, a very different approach. The method described by Nemirovski and Yudin [1983] is also notable as being closer to the conjugate gradient method than other accelerated approaches, but at the expense of requiring a 2D search instead of a 1D line search at each step. 9 Conclusion We believe the tools of differential geometry may provide a new and insightful avenue for the analysis of accelerated optimization methods. The analysis we provide in this work is a first step in this direction. The advantage of the differential geometric approach is that it provides high level tools that make the derivation of acceleration easier to state. This derivation, from the proximal point method to the accelerated gradient method, is in our opinion not nearly as mysterious as the other known approaches to understanding acceleration. Zeyuan Allen-Zhu and Lorenzo Orecchia. Linear Coupling: An Ultimate Unification of Gradient and Mirror Descent. In Proceedings of the 8th Innovations in Theoretical Computer Science, ITCS 17, 2017. Full version available at http://arxiv.org/abs/1407.1537. Shun-Ichi Amari. Natural gradient works efficiently in learning. Neural computation, 10(2):251 276, 1998. Shun-Ichi Amari and Hiroshi Nagaoka. Methods of Information Geometry. Oxford University Press, 2000. Alfred Auslender and Marc Teboulle. Interior gradient and proximal methods for convex and conic optimization. SIAM Journal on Optimization, 16(3):697 725, 2006. Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 2003. Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. IMAGING SCIENCES, 2009. Stephen Boyd, Laurent El Ghaoui, Eric Feron, and Venkataramanan Balakrishnan. Linear Matrix Inequalities in System and Control Theory. Society for Industrial and Applied Mathematics, 1994. Sébastien Bubeck, Yin Tat Lee, and Mohit Singh. A geometric alternative to nesterov s accelerated gradient descent, 2015. Guanghui Lan and Yi Zhou. An optimal randomized incremental gradient method. Mathematical programming, pages 1 49, 2017. John M. Lee. Riemannian Manifolds : An introduction to curvature. Springer-Verlag, 1997. Arkadi Nemirovski and David Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley, 1983. Yu. Nesterov. Accelerating the cubic regularization of newton s method on convex problems. Mathematical Programming, 2008. Yurii Nesterov. A method of solving a convex programming problem with convergence rate O(1/k2). In Soviet Mathematics Doklady, volume 27, pages 372 376, 1983. Yurii Nesterov. Gradient methods for minimizing composite objective function,. Technical report, CORE, 2007. Yurii Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013. Ernest K Ryu and Stephen Boyd. Primer on monotone operator methods. Appl. Comput. Math, 15(1): 3 43, 2016. Damien Scieur, Vincent Roulet, Francis Bach, and Alexandre d Aspremont. Integration methods and optimization algorithms. In Advances in Neural Information Processing Systems 30, 2017. Hirohiko Shima. The Geometry of Hessian Structures. World Scientific, 2007. Weijie Su, Stephen Boyd, and Emmanuel Candes. A differential equation for modeling Nesterov s accelerated gradient method: Theory and insights. Advances in Neural Information Processing Systems, 2014. Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In International conference on machine learning, pages 1139 1147, 2013. Andre Wibisono, Ashia C Wilson, and Michael I Jordan. A variational perspective on accelerated methods in optimization. Proceedings of the National Academy of Sciences, 113(47):E7351 E7358, 2016. Ashia C Wilson, Benjamin Recht, and Michael I Jordan. A Lyapunov analysis of momentum methods in optimization. ar Xiv preprint ar Xiv:1611.02635, 2018. Yuchen Zhang and Lin Xiao. Stochastic primal-dual coordinate method for regularized empirical risk minimization. The Journal of Machine Learning Research, 18(1):2939 2980, 2017.