# discrete_variational_calculus_for_accelerated_optimization__812c24a3.pdf Journal of Machine Learning Research 24 (2023) 1-33 Submitted 11/21; Revised 11/22; Published 1/23 Discrete Variational Calculus for Accelerated Optimization C edric M. Campos cedric.mcampos@urjc.es Departamento de Matem atica Aplicada, Ciencia e Ingenier ıa de los Materiales y Tecnolog ıa Electr onica Universidad Rey Juan Carlos Calle Tulip an s/n, 28933 M ostoles, Spain Alejandro Mahillo almahill@unizar.es Departamento de Matem aticas Instituto Universitario de Matem aticas y Aplicaciones Universidad de Zaragoza Calle de Pedro Cerbuna 12, 50009 Zaragoza, Spain David Mart ın de Diego david.martin@icmat.es Instituto de Ciencias Matem aticas (CSIC-UAM-UC3M-UCM) Calle Nicol as Cabrera 13-15, 28049 Madrid, Spain Editor: Sayan Mukluk Many of the new developments in machine learning are connected with gradient-based optimization methods. Recently, these methods have been studied using a variational perspective (Betancourt et al., 2018). This has opened up the possibility of introducing variational and symplectic methods using geometric integration. In particular, in this paper, we introduce variational integrators (Marsden and West, 2001) which allow us to derive different methods for optimization. Using both Hamilton s and Lagrange-d Alembert s principle, we derive two families of optimization methods in one-to-one correspondence that generalize Polyak s heavy ball (Polyak, 1964) and Nesterov s accelerated gradient (Nesterov, 1983), the second of which mimics the behavior of the latter reducing the oscillations of classical momentum methods. However, since the systems considered are explicitly time-dependent, the preservation of symplecticity of autonomous systems occurs here solely on the fibers. Several experiments exemplify the result. Keywords: Polyak s heavy ball, Nesterov s accelerated gradient, momentum methods, variational integrators, Bregman Lagrangians 1. Introduction Many of the literature on machine learning and data analysis is connected with gradientbased optimization methods (see Polak, 1997; Nesterov, 2018; and references therein). The computations often involve large data and parameter sets and then, not only the computational efficiency is a crucial point, but the optimization theory also plays a fundamental role. A typical optimization problem is: argmin f(x) , x Q , (1.1) 2023 C edric M. Campos, Alejandro Mahillo and David Mart ın de Diego. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v24/21-1323.html. C.M. Campos, A. Mahillo and D. Mart ın de Diego where we assume that Q is a convex set in Rn and f is a continuously differentiable convex function with Lipschitzian gradient. In this case, one of the most extended algorithms to achieve (1.1) is Nesterov s accelerated gradient (Nesterov, 1983; Su et al., 2016) which may take the following form: yk+1 = xk η f(xk) , xk+1 = yk+1 + k k + 3(yk+1 yk) , starting from an initial condition x0 (see more details in Sections 2 and 7). An important observation was made by Su et al. (2016) showing that the continuous limit of Nesterov s method is a time-dependent second order differential equation. Moreover, Wibisono et al. (2016) show that this system of differential equations has a variational origin (see also Wibisono, 2016). In particular, they take as point of departure this variational approach that captures acceleration in continuous time considering a particular type of timedependent Lagrangian functions, called Bregman Lagrangians (see Section 3). In a recent paper, Betancourt et al. (2018) introduce symplectic (and presymplectic) integrators for the differential equations associated with accelerated optimizations methods (see Sanz-Serna and Calvo, 1994; Hairer et al., 2010; Blanes and Casas, 2016, for an introduction to symplectic integration). They use the Hamiltonian formalism since it is possible to extend the phase space to turn the system into a time-independent Hamiltonian system and apply there standard symplectic techniques (see Marthinsen and Owren, 2016; Celledoni et al., 2020). For recent improvements of this approach using adaptive Hamiltonian variational integrators, see Duruisseaux et al. (2021). In our paper we set an alternative route: The idea is to use variational integrators adapted to an explicit time-dependent framework and external forces (see Marsden and West, 2001, and references therein) to derive a whole family of optimizations methods. The theory of discrete variational mechanics has reached maturity in recent years by combining results of differential geometry, classical mechanics, and numerical integration. Roughly speaking, the continuous Lagrangian L: TQ R is substituted by a discrete Lagrangian Ld : Q Q R. Observe that, by replacing the standard velocity phase space TQ with Q Q, we are discretizing a velocity vector by two (in principle) close points. With the unique information of the discrete Lagrangian we can define the discrete action sum and, applying standard variational techniques, we derive a system of second order difference equations known as discrete Euler-Lagrange equations. The numerical order of the methods is obtained using variational error analysis (see Marsden and West, 2001; Patrick and Cuell, 2009). Moreover, it is possible to derive a discrete version of Noether s theorem relating the symmetries of the discrete Lagrangian with conserved quantities. The derived methods are automatically symplectic and, perhaps more importantly, easily adapted to other situations as, for instance, Lie group integrators, time-dependent Lagrangians, forced systems, optimal control theory, holonomic and nonholonomic mechanics, field theories, etc. The Lagrangian functions described in Section 3, Bregman Lagrangians, are those explicitly time-dependent that typically arise on accelerated optimization. The geometry for time-dependent systems is different from symplectic geometry, in particular, the phase space is odd dimensional. In this case, an appropriate geometric framework is given by cosymplectic geometry (see Libermann, 1959; Cappelletti-Montano et al., 2013; and ref- Discrete Variational Calculus for Accelerated Optimization erences therein). In Section 4, we introduce the cosymplectic structure associated with a time-dependent Hamiltonian system (induced by a time-dependent Lagrangian) and also an interesting symplectic preservation property associated with the restriction of the Hamiltonian flow to the fibers of the projection onto the time variable (Theorem 1). Having in mind this geometrical framework, we introduce in Section 5 discrete variational mechanics for time-dependent Lagrangians with fixed time step (compare with Marsden and West, 2001, for variable time step). Moreover, we recover the symplectic character on fibers of the continuous Hamiltonian flow. We show the feasibility of constructing variational integrators using similar techniques to the developed for the autonomous case that, in some interesting cases, are in addition explicit and, consequently, reduce the computational cost. An example of such methods is the second-order difference equation xk+1 = xk ηk f(xk) + µk(xk xk 1) , a type of momentum-descent method widely studied in the literature and whose origin goes back to Polyak (1964). Momentum methods allow to accelerate gradient descent by taking into account the speed achieved by the method at the last update. However, because of that speed, momentum methods can overpass the minimum. Nesterov s method tries to anticipate future information reducing the typical oscillations of classical momentum methods towards the minimum. In Section 6 we adapt our construction of variational integrators to add external forces using discrete Lagrange-d Alembert s principle (see Marsden and West, 2001). Upon this machinery, we derive in Section 7 two families of momentum methods in mutual bijective correspondence, one of which corresponds to Nesterov s method (see Theorem 6). Finally, for Section 8, many methods and numerical simulations have been implemented in Julia v1.8.2. We optimize several test functions with our methodology and other methods that appeared recently in the literature. One of the test functions is reused afterwards for a machine learning example. 2. From Gradient Descent to Nesterov s Accelerated Gradient In this section we give a historical perspective of Nesterov s accelerated gradient from gradient descent with a threefold objective: First, properly introduce the methods of interest and their properties, second, give an overall view of the elements to take under consideration, and, third, set some of the notation. Although the first method that comes to mind to solve the optimization problem (1.1) is Newton-Raphson, the first dynamical one is due to Cauchy (1847). His method, known as Gradient Descent (GD), is the one-step method xk+1 = xk η f(xk) , (2.1) where η is the step size parameter or, as it is referred in the machine learning community, the learning rate. It is readily seen that this method is a simple discretization of the first order ODE x = f(x) , (2.2) from which it takes its dynamical nature. What is perhaps not so readily seen is that, given an initial condition x0, the trajectories obtained from both equations, xk and x(t), converge C.M. Campos, A. Mahillo and D. Mart ın de Diego to the argument of minima x . In particular, xk converges linearly to x while the function values f(xk) do so to the global minimum f(x ) at a rate of O(1/k) (Polyak, 1964, 1987). An initial improvement over GD was given by Polyak (1964): He introduced a novel two-step method, Polyak s Heavy Ball (PHB), also known as Classical Momentum (CM) after Sutskever et al. (2013). As it was originally presented, PHB/CM takes the form of the two-step method xk+1 = xk ηP(xk) + µ(xk xk 1) , (2.3) where P is a functional operator for which a root is sought and µ, η are small positive constants that condition the convergence of the method. In comparison with (2.1), (2.3) adds a new term, xk xk 1, the momentum of the discrete motion that incorporates past information in an amount controlled by µ, the so called momentum coefficient. When P is conservative, that is, when P = f, Polyak showed that, although the method s trajectory still converges linearly as GD s, it does so faster than GD s, that is, with a smaller geometric ratio (Polyak, 1964, 1987). The continuous analogue of (2.3) is the second order ODE x + ν(t) x + η(t)P(x) = 0 , (2.4) that turns out to be the equation of motion of a Lagrangian system when P = f (Lemma 4). Then x(t) traces the motion of a point mass in a well given by f. We therefore drop P and stick hereon with f. A further an crucial step towards improving GD (and PHB/CM) was given in 1983 by Nesterov, a former student of Polyak. He presented a new method, coined after him as Nesterov s Accelerated Gradient (NAG), similar to PHB/CM but with a slight change of unexpected consequences. A naive derivation from (2.3) is almost immediate: Introduce a new variable yk in (2.3) so it can be easily rewritten as the equivalent method yk+1 = xk ηk f(xk) , (2.5a) xk+1 = yk+1 + µk(xk xk 1) , (2.5b) where discrete-time dependence has been added to the coefficients µ, η for convenience. Replace the x s of the momentum term (right hand side of the second equation, Eq. 2.5b) by y s to get the new and non-equivalent method yk+1 = xk ηk f( xk) , (2.6a) xk+1 = yk+1 + µk( yk+1 yk) , (2.6b) where the bars are added to distinguish more easily both methods and underline that the sequences of points that they define are in fact different. This latter method (2.6) is NAG as it is usually presented. Nesterov showed in turn that his method accelerates the convergence rate of the function values down to O(1/k2) (see Nesterov, 1983, 2018). The original values of ηk, µk given by Nesterov are rather intricate, a simpler and commonly used version is yk+1 = xk η f( xk) , (2.7a) xk+1 = yk+1 + k k + 3( yk+1 yk) , (2.7b) Discrete Variational Calculus for Accelerated Optimization with η > 0 constant. As it is shown in Su et al. (2016), a continuous analogue of (2.7) is t x + f(x) = 0 , (2.8) which is but a particular case of PHB/CM s continuous analogue (2.4). Besides that, Su et al. (2016) also show that the function values converge to the minimum at an inverse quadratic rate, that is, f(x(t)) = f(x ) + O(1/t2). More generally (Remark 13), (2.6) is a natural discretization of a perturbed ODE of the form x + ν(t) x + η(t) f(x) = εF(x, x, t) , (2.9) which also is the equation of motion of a Lagrangian system (Lemma 4). In fact, it is this variational origin that Wibisono et al. (2016) take as point of departure. Once a particular type of time-dependent Lagrangian functions is considered, a subfamily of the so called Bregman Lagrangians, the variational approach captures acceleration in continuous-time into the derived discrete schemes achieving, in this case, a function value convergence rate of O(t1 n) with n 3 (see also Wibisono, 2016). 3. Bregman Lagrangians A Bregman Lagrangian is roughly speaking a time-dependent mechanical Lagrangian whose kinetic part is close to be a metric. They are built upon Bregman divergences (Br egman, 1967), a particular case of divergence functions. Bregman Lagrangians allow to define variational problems whose solutions minimize an objective function at an exponential rate (Betancourt et al., 2018). A divergence function over a manifold Q is a twice differentiable function B: Q Q R+ such that for all x, y Q we have: B(x, y) 0 and B(x, x) = 0; x B(x, x) = y B(x, x); and 2 xy B(x, x) is negative-definite. Divergence functions appear as pseudo-distances that are non-negative but are not, in general, symmetric. A typical divergence function over Q = Rn associated with a differentiable strictly convex function Φ: Rn R is the Bregman divergence: BΦ(x, y) = Φ(x) Φ(y) dΦ(y), x y . Observe that it is the remainder of the first order Taylor expansion of Φ around y evaluated at x, a sort of Hessian metric. Given a Bregman divergence over Rn, let us consider the time-dependent kinetic energy K(x, x, t) = BΦ(x + e α(t) x, x) , and the time-dependent potential energy U(x, t) = eβ(t)f(x) , C.M. Campos, A. Mahillo and D. Mart ın de Diego from which we define the Bregman Lagrangian L: TRn TR R by L(x, x, t) = eα(t)+γ(t) K(x, x, t) U(x, t) = eα(t)+γ(t) Φ(x + e α(t) x) Φ(x) e α(t) dΦ(x), x eβ(t)f(x) , where the time-dependent functions α(t), β(t), γ(t) are chosen to produce different algorithms. These functions verify what Wibisono et al. (2016) refer to as ideal scaling conditions, namely, γ(t) = eα(t) and β(t) eα(t) . (3.1) The first condition greatly simplifies several expressions that can be derived from the Bregman Lagrangian. For instance, when γ(t) = eα(t) is met, the associated Euler-Lagrange equations reduce to 2Φ x + e α(t) x d x + e α(t) x + eα(t)+β(t) f(x) = 0 . The second condition ensures convergence of the underlying trajectories to the minimum at a rate not slower than O(e β(t)). In the particular case where Φ(x) = 1 2 x 2, for which BΦ(x, y) = 1 2 x y 2, the Bregman Lagrangian takes the simple form L(x, x, t) = a(t)1 2 x 2 b(t)f(x) , (3.2) with a(t) = eγ(t) α(t) and b(t) = eα(t)+β(t)+γ(t). 4. Geometry of the Time-Dependent Lagrangian and Hamiltonian Formalisms Since Bregman Lagrangians are time-dependent, in this section, we introduce some needed geometric ingredients about non-autonomous mechanics and highlight some of their main invariance properties (see Abraham and Marsden, 1978; Libermann and Marle, 1987; de Le on and R. Rodrigues, 1987). Let Q be a manifold and TQ its tangent bundle. Coordinates (xi) on Q induce coordinates (xi, xi) on TQ. Therefore we have natural coordinates (xi, xi, t) on TQ R which is the velocity phase space for time-dependent systems. Given two instants (time values) a, b R, with a < b, and corresponding positions xa, xb Q, consider the set of curves: C2 a,b = C2([a, b], xa, xb) = {σ: [a, b] Q | σ C2 with σ(a) = xa, σ(b) = xb} . Given a time-dependent Lagrangian function L: TQ R R, define the action functional JL : C2 a,b R JL(σ) = Z b a L(σ (t), t) dt , (4.1) where σ : [a, b] TQ. Discrete Variational Calculus for Accelerated Optimization Using variational calculus, the critical points of JL are locally characterized by the solutions of the Euler-Lagrange equations: xi = 0 , 1 i n = dim Q . (4.2) For time-dependent Lagrangians it is possible to check that the energy EL : TQ R R, EL = L L = xi L where is the Liouville vector field on TQ (Libermann and Marle, 1987), is not, in general, preserved since d EL We now pass to the Hamiltonian formalism using the Legendre transformation FL: TQ R T Q R , where T Q is the cotangent bundle of Q whose natural coordinates are (xi, pi). The Legendre transformation is locally given by FL(xi, xi, t) = xi, L We assume that the Legendre transformation is a diffeomorphism (that is, the Lagrangian is hyperregular) and define the Hamiltonian function H : T Q R R by H = EL (FL) 1 , which induces the cosymplectic structure (ΩH, ηR) on T Q R with dt := pr 2 dt , ΩH = d(pr 1 θQ Hdt) = ΩQ + d H dt , where pri, i = 1, 2, are the projections to each Cartesian factor and θQ denotes the Liouville 1-form on T Q (Abraham and Marsden, 1978), given in induced coordinates by θQ = pi dxi. We also denote by ΩQ = d pr 1 θQ the pullback of the canonical symplectic 2-form ωQ = dθQ on T Q. In coordinates, ΩQ = dxi dpi. (Observe that now ΩQ is presymplectic since ker ΩQ = span{ / t}.) Therefore in induced coordinates (xi, pi, t): ΩH = dxi dpi + d H dt , ηR = dt . We define the evolution vector field EH X(T Q R) by i EHΩH = 0 , i EHdt = 1 . (4.3) In local coordinates the evolution vector field is: C.M. Campos, A. Mahillo and D. Mart ın de Diego The integral curves of EH are given by: t = 1 , xi = H pi , pi = H The integral curves of EH are precisely the curves of the form t 7 FL(σ (t), t) where σ: I Q is a solution of the Euler-Lagrange equations for L: TQ R R. From Equation (4.3) we deduce that the flow of EH verifies the following preservation properties LEHΩH = LEH(ΩQ + d H dt) = 0 , LEHdt = 0 . (4.5) Denote by Ψs : U T Q R T Q R the flow of the evolution vector field EH, where U is an open subset of T Q R. Observe that Ψs(αq, t) = (Ψt,s(αq), t + s) , αq T q Q , where Ψt,s(αq) = pr1(Ψs(αq, t)). Therefore from the flow of EH we induce a map Ψt,s : Ut T Q T Q , where Ut = {αq T Q | (αq, t) U}. Observe that if we know Ψt,s for all t, we can recover the flow Ψs of EH. From Equations (4.5) we deduce that Ψ s(ΩQ + d H dt) = ΩQ + d H dt , Ψ s(dt) = dt . (4.6) The following theorem relates the preservation properties (4.6) with the symplecticity of the map family {Ψt,s : T Q T Q}. Theorem 1 We have that Ψt,s : Ut T Q T Q is a symplectomorphism, that is, Ψ t,sωQ = ωQ. Proof First, observe that any vector Y(αq,t) T(αq,t)(T Q R) admits a unique decomposition: Y(αq,t) = Yαq(t) + Yt(αq) , where Yαq(t) Tαq T Q and Yt(αq) Tt R. Moreover, we have that dt, Yαq(t) = 0. Therefore, if we restrict ourselves to vectors tangent to the pr2-fibers T(αq,t) pr 1 2 (t) = V(αq,t) pr2 then we have the decomposition Y(αq,t) = Yαq(t) + 0t = Yαq(t) V(αq,t) pr2 Tαq T Q . From the second preservation property given in (4.5) we deduce that 0 = (dt)(αq,t), Yαq(t) = (Ψ sdt)(αq,t), Yαq(t) = (dt)Ψs(αq,t), TΨs(Yαq(t)) . Therefore TΨs(Yαq(t)) V(Ψs,t(αq),t+s)pr2 TΨt,s(αq)T Q and TΨs(Yαq(t)) = TΨt,s(Yαq(t)) + 0t+s TΨt,s(Yαq(t)) . Discrete Variational Calculus for Accelerated Optimization Using the first identity in (4.5) we deduce that (ωQ)αq(Yαq(t), Yαq(t)) = (ΩQ + d H dt)(αq,t)(Yαq(t), Yαq(t)) = (ΩQ + d H dt)(Ψs,t(αq),t+s)(TΨs(Yαq(t)), TΨs( Yαq(t))) = (ωQ)Ψs,t(αq)(TΨt,s(Yαq(t)), TΨt,s( Yαq(t))) where Yαq(t), Yαq(t) Tαq T Q T(αq,t) pr 1 2 (t). We conclude that Ψ t,sωQ = ωQ. 5. Discrete Variational Methods for Time-Dependent Lagrangian Systems Consider the set of discrete paths (or sequences) on Q for a fixed number of steps N N, that is, the set Cd(Q) = {xd : {0, 1, . . . , N} Q} = Q (N+1) Q . Then an appropriate discrete interpretation of the velocities are pairs in Q Q, a discrete version of TQ. A discrete time-dependent Lagrangian is a family of maps Lk d : Q Q R , k N , for which we define the discrete action map on the space of sequences as k=0 Lk d(xk, xk+1) , xd Cd(Q) . If we consider variations of xd with fixed end points x0 and x N and extremize Sd over x1, . . . , x N 1, we obtain the discrete Euler-Lagrange equations (DEL for short) xk Sd(xd) = D1Lk d(xk, xk+1) + D2Lk 1 d (xk 1, xk) = 0 , for all k = 1, . . . , N 1, where D1 and D2 denote the partial derivatives with respect to the first and second components, respectively. If, for all k, Lk d is regular, that is, the matrix D12Lk d = 2Lk d xk xk+1 is non-singular, then we locally obtain a well defined family of discrete Lagrangian maps: Fk,k+1 : Q Q Q Q (xk, xk+1) 7 (xk+1, xk+2(xk, xk+1, k)) . where the value of xk+2 is determined in terms of xk, xk+1 and k. In this setting, we can define two discrete Legendre transformations associated with Lk d, F Lk d : Q Q T Q, by the expressions F+Lk d : (xk, xk+1) 7 (xk+1, D2Lk d(xk, xk+1)) , F Lk d : (xk, xk+1) 7 (xk, D1Lk d(xk, xk+1)) . C.M. Campos, A. Mahillo and D. Mart ın de Diego We can also define the evolution of the discrete system on the Hamiltonian side, Fk,k+1 : T Q T Q, by any of the formulas: Fk,k+1 = F+Lk d (F Lk d) 1 = F+Lk d Fk 1,k (F+Lk 1 d ) 1 = F Lk+1 d Fk,k+1 (F Lk d) 1 , thanks to the commutativity of the following diagram. (xk 1, xk) Fk 1,k (xk, xk+1) F Lk d Fk,k+1 / (xk+1, xk+2) F Lk+1 d (xk, pk) Fk,k+1 / (xk+1, pk+1) Proposition 2 The discrete Hamiltonian map Fk,k+1 : (T Q, ωQ) (T Q, ωQ) is a symplectic transformation, that is, ( Fk,k+1) ωQ = ωQ . Proof Using similar arguments to the autonomous case (Marsden and West, 2001), we deduce that (F+Lk d) ωQ = (F Lk d) ωQ . From the definition of Fk,k+1 : T Q T Q we deduce that ( Fk,k+1) ωQ = (F+Lk d (F Lk d) 1) ωQ = ((F Lk d) ) 1((F+Lk d) ωQ) = ωQ . Given the map Fk,k+1(qk, pk) = (qk+1, pk+1), we immediately have the map (xk, pk, kh) 7 (xk+1, pk+1, (k + 1)h) on T Q R where we now give explicit information of the evolution of discrete time. Let see the relation of these discrete maps Fk,k+1 : Q Q Q Q and Fk,k+1 : T Q T Q with the Euler-Lagrange equations and Hamilton equations of a time-dependent Lagrangian system. Given a regular Lagrangian function L: TQ R R and a sufficiently small time step h > 0, we are going to define an h-and-k-dependent family of discrete Lagrangian functions Lk d,h : Q Q R as an infinitesimal approximation to the continuous action JL defined in expression (4.1). As intermediate step, we first consider the exact time-dependent discrete Lagrangian associated with a regular Lagrangian L which is given by the expression Lk,E d,h (x0, x1) = 1 kh L(x0,1(t), x0,1(t), t) dt , where x0,1(t) is the unique solution of the Euler-Lagrange equations for L with x0,1(kh) = x0 and x0,1((k + 1)h) = x1, (see Hartman, 2002; Marrero et al., 2016). Then for a sufficiently Discrete Variational Calculus for Accelerated Optimization small h, the solutions of the DEL for Lk,E d,h lie on the solutions of the Euler-Lagrange equations for L (Theorem 1.6.4, Marsden and West, 2001). In practice, Lk,E d,h (x0, x1) will not be available, therefore we take an approximation, Lk d,h(x0, x1) Lk,E d,h (x0, x1) , using some quadrature rule. Then, as we have seen, the scheme derived from the DEL will be geometric integrators for Equations (4.4) preserving the symplectic form in the sense of Theorem 1 (see Patrick and Cuell, 2009). Remark 3 As we have commented in the Introduction, one of the main advantages of the proposed approach is the possibility to use other options to derive different numerical methods for optimization by only discretizing a unique function, the action functional. Of course, there are many different ways to do it (Marsden and West, 2001). For instance, we can combine several discrete Lagrangians together to obtain a new discrete Lagrangian with higher order (composition methods) or similarly obtaining splitting methods (Campos and Sanz-Serna, 2017). Also, we can easily derive symplectic partitioned Runge-Kutta methods or symplectic Garlekin methods using polynomial approximations of the trajectories and a numerical quadrature to approximate the action integral (Campos, 2014). Moreover, it is possible to adapt the variational integrators to a non-euclidean setting using appropriate retraction maps. 6. Discretization of Lagrangian Systems with Forces Our intention here is to continue looking for numerical approximations to the time-dependent Euler-Lagrange equations but considering additionally an external force that decreases jointly with the time-step parameter h. With it, we will obtain a whole family of algorithms whose behavior resembles that of the Nesterov method. Fortunately, discrete mechanics is also adapted to the case of external forces (see Marsden and West, 2001). To this end, in addition to a time-dependent Lagrangian function L: TQ R R, we have an external force given by a fiber preserving map F : TQ R T Q given locally by F(x, x, t) = (xi, Fi(x, x, t)) . Given the force f, we derive the equations of motion of the forced system modifying Hamilton s principle to the Lagrange-d Alembert principle, which seeks curves σ C2 a,b satisfying a L(σ (t), t) dt + Z b a F(σ (t), t)δσ(t) dt = 0 , (6.1) for all δσ TσC2 a,b. Using integration by parts, we derive the forced Euler-Lagrange equations, which have the following coordinate expression: To discretize these equations, we consider, as before, a family of Lagrangian functions Lk d : Q Q R and two discrete forces (F k d ) : Q Q T Q, which are fiber preserving C.M. Campos, A. Mahillo and D. Mart ın de Diego in the sense that πQ (F k d ) = pr , where pr (x , x+) = x . Combing both forces, we obtain F k d : Q Q T (Q Q) by F k d (xk, xk+1), (δxk, δxk+1) = (F k d ) (xk, xk+1)δxk + (F k d )+(xk, xk+1)δxk+1 . As in (6.1), we have a discrete version of the Lagrange-d Alembert principle for the discrete forced system given by Lk d and F k d : k=0 Lk d(xk, xk+1) + k=0 F k d (xk, xk+1), (δxk, δxk+1) = 0 , for all variations {δxk}N k=0 vanishing at the endpoints, that is, δx0 = δx N = 0. This is equivalent to the forced discrete Euler-Lagrange equations: D1Lk d(xk, xk+1) + D2Lk 1 d (xk 1, xk) + (F k d ) (xk, xk+1) + (F k 1 d )+(xk 1, xk) = 0 , (6.2) for all k = 1, . . . , N 1. 7. The Variational Derivation of PHB/CM and NAG As seen in Section 2, NAG can be derived naively from PHB/CM. Besides, under suitable conditions on µ, η and the starting points, both methods converge to a minimum of f, the latter, NAG, doing so faster (Polyak, 1964; Nesterov, 1983). Questions arise: What makes NAG faster than PHB/CM? Can this be exploited to obtain even faster methods? Can it be generalized? Questions that boil down to how NAG is fundamentally derived from PHB/CM. To begin with, note that the NAG equations (2.6) can be rewritten only in terms of the x s, as in (2.3), or only in terms of the y s, yielding the equations xk = µk xk 1 ηk f( xk) µk ηk f( xk) ηk 1 f( xk 1) , (7.1a) yk = µk 1 yk 1 ηk f( yk + µk 1 yk 1) . (7.1b) The first, Eq. (7.1a), when compared with (2.3) shows an extra term, µk ηk f( xk) ηk 1 f( xk 1) , that in fact points to the very origin of the method: an additional forcing term. The second, Eq. (7.1b), compared again with (2.3) shows that the y-trajectory is obtained almost as if it was computed by PHB/CM but evaluating f at a future point, yk + µk 1 yk 1, which informs better the method on how to advance towards the minimum. Besides the convergence towards the minimum, it can be shown that both methods are a time discretization of the second order differential equation (Su et al., 2016) x + ν(t) x + η(t) f(x) = 0 , (7.2) a well known fact in the literature, equation that furthermore is variational in general (see Lemma 4). A fact that is not so well known is that NAG discretizes better the equation Discrete Variational Calculus for Accelerated Optimization when including a force term proportional to the underlying time step and, moreover, it can be derived, as well as PHB/CM, from a variational approach, in the geometric integration sense (see Sections 5 and 6). We first give a rather simple and direct result whose purpose is to establish properly the continuous setting over which the discretizations will be built and from which the methods can be derived. Lemma 4 Given a vector field P : Rn Rn, consider the second order differential equation x + ν(t) x + η(t)P(x) = εd h η(t)P(x) i , (7.3) where ν, η: R+ R are continuous time-dependent real valued functions and where ε R is a constant. If P is conservative, that is, if P = f, then (7.3) corresponds to the equation of motion of the (forced) time-dependent Lagrangian system: L(x, x, t) = a(t)1 2 x 2 b(t)f(x) , (7.4a) F(x, x, t) = εa(t)d a(t)P(x) , (7.4b) in which f is the field s potential and where a(t) = exp( R t 0 ν(s)ds) , and b(t) = a(t)η(t) , (7.5) Proof Assume P is conservative and let f denote its potential. Then the Euler-Lagrange equation for (7.4) is a(t) x + a (t) x + b(t)P(x) = εa(t)d a(t)P(x) . (7.6) Dividing by a(t), and taking into account that, from (7.5) we have that ν(t) = a (t) a(t) and η(t) = b(t) a(t) , (7.7) we obtain (7.3). Remark 5 P being conservative is not a necessary condition for (7.3) to be derived from the Lagrange-d Alembert principle. In order to be variational, Equation (7.6) requires a Lagrangian of the form L(x, x, t) = a(t)1 2 x2 + c(x, t), x + d(x, t) for some unknown functions c, d, which implies t + b(t)P(x) . A vector field P satisfying this last relation need not be conservative. C.M. Campos, A. Mahillo and D. Mart ın de Diego Next, the result that links the previous continuous equation (7.3) with PHB/CM and NAG, showing, in particular, that NAG is a forced version of PHB/CM, between which the transition is immediate. Theorem 6 Given a real valued function f : Rn R and a vector field P : Rn Rn, consider the time-dependent discrete Lagrangian and forces Lk d(z0, z1) = ak 1 2 z1 z0 2 b k f(z0) b+ k+1f(z1) , (7.8a) (F k d ) (z0, z1) = ak 1 ak (b k + b+ k )P(z0) , and (7.8b) (F k d )+(z0, z1) = (b k + b+ k )P(z0) . (7.8c) where {ak}k 0, {b k }k 0, {b+ k }k 0, are arbitrary sequences of real numbers. If f is regular enough, so that P = f, and ak is never null, then the free and forced equations of motion for Lk d and (Lk d, (F k d ) , (F k d )+) are, respectively, equivalent to the following recursive schemes yk+1 = xk ηk P(xk) , yk+1 = xk ηk P( xk) , (7.9a) xk+1 = yk+1 + µk(xk xk 1) , xk+1 = yk+1 + µk( yk+1 yk) , (7.9b) µk+1 = ak ak+1 and ηk = b k + b+ k ak , (7.10) for k 0. Conversely, given a vector field P : Rn Rn and two arbitrary sequences of real numbers {µk+1}k 0 and {ηk}k 0, consider the sequences of pairs of points given in equations (7.9a)- (7.9b). If P is conservative and µk+1 is never null, then both schemes are variational. Moreover, they are equivalent to the equations of motions for the free and forced timedependent discrete Lagrangian systems given in (7.8a)- (7.8c), for which f is the field s potential and a0 = 1, ak+1 = ak/µk+1, k 0 , b k = 1 2akηk, k 0 . (7.11) Proof Partial differentiation of the Lagrangian gives D1Lk d(z0, z1) = ak z0 b k f(z0) , D2Lk d(z0, z1) = ak z0 b+ k+1 f(z1) , from where it is readily seen that the DEL equations with forces (6.2) are ak+1 z1 + ak z0 (b k+1 + b+ k+1) f(z1) = = ak ak+1 (b k+1 + b+ k+1) f(z1) (b k + b+ k ) f(z0) , where the RHS is null for the non-forced DEL equations. Dividing by ak+1 and using the relations (7.10), we get z1 µk+1 z0 + ηk+1 f(z1) = µk+1 (ηk+1 f(z1) ηk f(z0)) , (7.12) Discrete Variational Calculus for Accelerated Optimization where, again, the right hand side is null for the non-forced case. Replacing zi with xk+i in the non-forced case, and with xk+i in the forced one, taking into account that P = f, and using the equations in (7.9a), we obtain those in (7.9b). The converse is immediate. Now remarks are in order that will summarize some points that have been mentioned earlier and underline others that haven not been yet. Remark 7 (One-to-one correspondence) Note that both equations in (7.9a) are formally the same, whereas in (7.9b) there is a difference in the last term: While PHB/CM uses x s, NAG considers y s. This is a slight change that nonetheless defines different schemes and in which the forcing term is hidden. This result not only shows that NAG is a forced version of PHB/CM, but that the schemes are in a natural bijective correspondence. Remark 8 (Strategies) A specific method is defined by a strategy: a pair of coefficients (µ, η). This work and especially Theorem 6 focus on variational strategies, those pairs µ, η: N R that can be derived variationally from a time-dependent Lagrangian of the form (3.2). Being the original strategy of Polyak s method constant, it falls within this class of variational momentum-descent methods, whereas the original method by Nesterov is non-variational and belongs to a broader class of generalized momentum-descent methods where the coefficients might depend on the objective function itself (confer with Polyak, 1964; Nesterov, 1983; see also Eq. 8.12). Remark 9 (Initial conditions) Usually the initial condition, (x0, v0) or (x0, p0) in phase space, or (x0, x1) in configuration space, is crucial for the proper simulation of the dynamical system. Here, however, the dynamics are a tool and generally an initial condition so that x1 = x0 or y0 = x0, where x0 = x0 is close to the minimum, will suffice, which corresponds to sticking the ball to the bowl s wall and leave it roll. Remark 10 (Natural trajectory) From the schemes definitions, the sequences {xk} k=0 and { xk} k=0 are the natural (on track) dynamical trajectories towards the minimum of f, while {yk} k=1 and { yk} k=1 are offroad marks that limit these trajectories like slalom flags. The latter are, however, asymptotically close to the former and, hence, to the minimum. Remark 11 (Discrete flow) If one wants to compare the trajectories obtained from both methods, perhaps Equations (7.9) are better suited. If one is solely interested in a simple implementation to compute the minimum, then Equations (2.3) and (7.1b) are a good alternative since they can easily be rewritten to give discrete flow updates in the form momentum first, then position as in Sutskever et al. (2013): xk = µk 1 xk 1 ηk f(xk) , yk = µk 1 yk 1 ηk f( yk + µk 1 yk 1) , (7.13a) xk+1 = xk + xk , yk+1 = yk + yk , (7.13b) where both methods should be initialized with x0 = y0 and x 1 = y 1 = 0. Both approaches, Equations (7.9) and Equations (7.13), have been considered for the simulations of Section 8. C.M. Campos, A. Mahillo and D. Mart ın de Diego Remark 12 (Force approximation) The action induced by the discrete forces in (7.8) is a second order approximation to the action induced by the continuous force (7.4b). Indeed, given continuous coefficients a(t), b(t), define ak = a(kh)/h2 and b k so that b k +b+ k = b(kh), and similarly for a continuous path x(t) and a variation δx(t) of it. Then k=0 F k d (xk, xk+1) (δxk, δxk+1) = ak (b k + b+ k ) f(xk) δxk + h k=1 (b k 1 + b+ k 1) f(xk 1) δxk a(t) b(t) f(x(t)) + b(t h) f(x(t h)) δx(t) dt + O(h2) h 2 h a(t h)d a(t) f(x(t)) δx(t) dt + O(h2) a(t) f(x(t)) δx(t) dt + O(h2) where we have considered a mid point quadrature rule to establish the second equality, the rest follows. Remark 13 (Force evanescence) As remarked, NAG can be viewed as an approximation to a forced continuous Lagrangian system, where the force is proportional to the time step that is used a posteriori for the discretization. Although its contribution is non-null along the whole trajectory, it however vanishes not only locally, when h decreases, but also asymptotically in time, when t or k increase. 8. Simulations Numerical experiments are performed considering different elements, namely: The time-dependent coefficients that appear in the Lagrangian, a(t) and b(t), for the simple case (3.2); The discretization scheme used to approximate the Lagrangian action; and obviously, The objective function to be minimized, f(x). 8.1 Lagrangian Coefficients We consider three different Lagrangians or, more precisely, three different pairs of timedependent coefficients (a(t), b(t)) given, as in (3.2), by time-dependent exponents triples (α(t), β(t), γ(t)) satisfying the ideal scaling conditions (3.1), which ensure that argmin f is an attractor of the underlying dynamical system. Discrete Variational Calculus for Accelerated Optimization 8.1.1 Potential Dilation The first Lagrangian under consideration is a potential dilation of a mechanical Lagrangian, namely, L(x, x, t) = tn 1 2 x 2 f(x) , (8.1) whose Euler-Lagrange equation is t x + f(x) = 0 , (8.2) which is the equation considered in Su et al. (2016). A naive choice of exponents from which to obtain the time-dependent coefficients a(t) = b(t) = tn is α(t) = 0, β(t) = 0, γ(t) = n log t. (8.3) However, they do not satisfy the ideal scaling conditions. A triple that does meet this requirement is α(t) = log p log t, β(t) = 2(log t log p), γ(t) = p log t + log p, (8.4) where p = n 1, but only when n 3, thus, showing the magic n = 3 in (8.2) of NAG (Su et al., 2016). It is, in fact, the only choice satisfying the scaling conditions and gives an optimal rate of convergence not slower than O(1/t2). 8.1.2 Modified Potential Dilation The Lagrangian L(x, x, t) = tn 1 2 x 2 Dtn 3f(x) , (8.5) whose Euler-Lagrange equation is t x + Dtn 3 f(x) = 0 , is the Lagrangian considered by Wibisono et al. (2016) for the metric case and corresponds to the exponents α(t) = log p log t, β(t) = p log t + log C, γ(t) = p log t + log p, (8.6) where C = D/p2 and p = n 1. They satisfy the ideal scaling conditions for n > 1, giving a rate of convergence not slower than O(1/tn 1) (confer with Wibisono et al., 2016, in particular for specific details on the constant C). 8.1.3 Exponential Dilation Finally, the exponentially dilated Lagrangian L(x, x, t) = eλt 1 2 x 2 f(x) , (8.7) C.M. Campos, A. Mahillo and D. Mart ın de Diego whose equation of motion is precisely the one of a mechanical system with linear damping, x + λ x + f(x) = 0 , (8.8) corresponds to the choice exponents α(t) = log λ, β(t) = 2 log λ, γ(t) = λt + log λ. (8.9) We note here three points. First, it is the unique choice for a(t) = b(t) = eλt that satisfies the ideal scaling conditions. Second, in principle this choice gives a theoretical convergence rate of O(1), fortunately it can be reduced down to o(1) (Attouch et al., 2021, Th. 2.5). Third and more notably, although the Lagrangian (8.7) is explicitly time-dependent, the Euler-Lagrange equation (8.8) is autonomous and it introduces a linear time-independent damping term in the equation. 8.2 Discretizations Using the trapezoidal rule to approximate the action of the above Lagrangians, we retrieve common NAG coefficients that appear in the literature. With some abuse of notation, we write in general a(k) := a(tk) + a(tk+1) 2h2 , and b (k) := b(tk) where tk = kh. 8.2.1 Bounded Coefficients from the Potential Dilation For the continuous time-dependent coefficients a(t) = b(t) = tn, with n = p + 1, the trapezoidal rule yields the discrete time-dependent coefficients a(k) = tn k + tn k+1 2h2 , b (k) = tn k 2 , from which we obtain the coefficients µ(k) = kn + (k 1)n kn + (k + 1)n , η(k) = 2kn kn + (k + 1)n h2. (8.10) Both coefficients are strictly increasing and bounded above by 1 and h2, respectively. To avoid integer overflow and slightly reduce computational cost in final implementations, these expressions can be simplified to µ(k) = 2k n 2k + n + o(1), η(k) = 2k 2k + n + o(1) h2. For the particular case n = 3, that is p = 2, we get µ(k + 3/2) = k k+3 + O(1/k3) as in (2.7). Discrete Variational Calculus for Accelerated Optimization 8.2.2 Unbounded Coefficients from the Modified Potential Dilation In this case, the continuous time-dependent coefficients are a(t) = tn and b(t) = Dt2n 3, as in Wibisono et al. (2016), which yield µ(k) = kn + (k 1)n kn + (k + 1)n , η(k) = D 2kn kn + (k + 1)n tn 3 k h2. (8.11) As earlier, µ(k) = 2k n 2k + n + o(1), η(k) = D 2k 2k + n + o(1) tn 3 k h2. This time, η is unbounded when n 4, and bounded above by Dh2 when n = 3. Similarly, Betancourt et al. (2018) suggest a palindromic split Hamiltonian method with 7 stages that can be recovered from the proposed perspective considering the discrete Lagrangian Lk d(xk, xk+1) = tn k+1/2 2 Dtn 3 k+1/2 f(xk) + f(xk+1) Note that it is not obtained by a trapezoidal approximation of the Lagrangian (8.5), but it is still a discretization of it for which µ(k) = 2k 1 n , η(k) = D(2k + 1)2n 3 + (2k 1)2n 3 2(2k + 1)2n 3 tn 3 k+1/2h2. η(k) = D 2k 2k + 2n 3 + o(1) tn 3 k+1/2h2 . 8.2.3 Constant Coefficients from the Exponential Dilation Taking a(t) = b(t) = eλt yields µ(k) = 1 + e λh 1 + e+λh , η(k) = 2 1 + eλh h2. (8.12) For λ = 1 and h = 0.1024, µ 0.9 and η 0.01, values that often appear in the literature, as in Sutskever et al. (2013). In general, any pair of constant coefficients µ, η > 0 can be obtained from values λ R, h > 0. 8.2.4 The Actual Method by Wibisono, Wilson, and Jordan Strictly speaking, the method by Wibisono et al. (2016) is based on NAG, but it differs from it. Although they consider the previous Lagrangian (8.5) and experiment with it in Betancourt et al. (2018); Jordan (2018), what is proposed in Wibisono et al. (2016) is the C.M. Campos, A. Mahillo and D. Mart ın de Diego 3-phase scheme xk+1 = p k + pzk + k k + pyk , (8.13a) yk = argmin y fp 1(y; xk) + N php y xk p , (8.13b) zk = zk 1 Dk ptp 2 k h2 f(yk) , (8.13c) with z0 = y0 = x0 and where p = n 1, fp 1(y; xk) is the (p 1)-th Taylor expansion of f about xk and N is a constant related to D and p that ensures convergence. Note that for n = 3 (p = 2), the optimization problem (8.13b) is explicit and reduces to N h2 f(xk) , (8.14) but it is implicit in general, which increases the cost of the method, aside of having to compute the Hessian and higher derivatives of f, either explicitly or by autodifferentiation. In Section 8.4, we will refer to method (8.13) by WWJ, after the authors, and consider it as a (modified) NAG method. 8.3 Objective Functions Several objective functions are considered: a highly dimensional quadratic function with tridiagonal matrix representation, a generalized Rosenbrock function, yet another test function for momentum-descent methods, and one that combines a generalized logistic function with a mean loss and that is often used in Neural Networks. 8.3.1 Highly Dimensional Quadratic Function In Betancourt et al. (2018), they consider the quadratic map on Rn 2 x, Σ 1x , (8.15) where Σ is the matrix whose elements are Σij = ρ|i j|, with ρ = 0.9 and n = 50, and whose inverse Σ 1 is the tridiagonal matrix Σ 1 = 1 1 ρ2 1 ρ ρ 1 + ρ2 ρ ... ... ... ρ 1 + ρ2 ρ ρ 1 8.3.2 Generalized Rosenbrock Function The Rosenbrock function (Rosenbrock, 1960/61), whose expression is f(x, y) = (a x)2 + b(y x2)2 , Discrete Variational Calculus for Accelerated Optimization with a, b > 0, represents a banana-shaped flat-valley surrounded by steep walls with a unique critical point and global minimum at a, a2, whose search by numerical means is difficult, hence its use to test and benchmark optimizers. We consider here its generalization to higher dimensions, n > 2, namely (1 xi)2 + 100(xi+1 x2 i )2 . (8.16) As in the two-dimensional case, the function has a global minimum at (1, 1, . . . , 1) but, unlike it, also has a local minimum close to ( 1, 1, . . . , 1) (the higher is the dimension, the closer it gets). 8.3.3 Yet Another Test Function (YATF) Another example that might hinder the search of a minimum is the following f(x, y) = sin(2x2 y2 + 3) cos(x + 1 exp(2y)) , (8.17) which has a local minimum close to (0.32, 1.60). 8.3.4 Multinomial Logistic Regression In artificial neural networks (ANN), the activation function of a node (or neuron) defines the output of that node given an input (or set of inputs). Supervised learning is a learning paradigm of the training process of the ANN. Different choices are available for the activation function and the training process, which is a subject that might be in some cases controversial within the ANN community, but that is not the object of this work. We consider a shallow neural network with a single layer for classification in n classes or targets given m features. Upon reaching the neurons, the inputs (features) are weighted and possibly biased, which in fact is the model to be determined through the learning process, w Rm n, b Rn : x Rm 7 w x + b Rn . The chosen activation function is a classifier, a generalization of the logistic function, the multinomial logistic function (a.k.a. softmax), σ(z) = ezj P k ezk As loss function, we choose the cross-entropy or log-loss, H(ˆy, y) = y, log ˆy , where ˆy Rn is the computed output, y {0, 1}n with P j yj = 1 is the expected output (class), and log is applied componentwise. We take as objective function the average loss over a training dataset D of length |D|, namely, f(w, b) = 1 (x,y) D y, (log σ)(w x + b) . (8.18) It is important to note here that f is the sum of convex functions and therefore itself convex, however f need not have a global minimum but an asymptotic infimum: 0 as, for instance, exp(t) opposed to cosh(t). This depends on the data fed for the training process. C.M. Campos, A. Mahillo and D. Mart ın de Diego 8.4 Experiments Many experiments have been performed, we optimized each test function using the aforementioned methods (and others) set with different parameters and initial guesses, from which we present here a small but suggestive sample. In summary, we minimize the test functions (8.16) and (8.17), the quadratic function (8.15), and the loss (8.18) with the PHB/CM and NAG methods (7.9) given by the constant, bounded and unbounded coefficients (8.12), (8.10), and (8.11), respectively, and WWJ s method (8.13), the latter three for n = 3, 4, for a total of seven methods. Each objective is minimized using its own initial guess, time-step, and number of epochs, which are fixed for the seven simulations. We set Rosenbrock s test function with 30 dimensions and seek for its global minimum at (1, . . . , 1) from (0, . . . , 0) at a pace of h = 0.01 for 20000 epochs. In the case of the YATF, there is a local minimum near (0.32, 1.60) which we seek from ( 0.25, 0.35) with time-step h = 0.01 for 3800 epochs. A random point on a sphere of radius 50 is the initial guess for the 50-dimensional quadratic function, whose global minimum, sought for 10000 epochs with h = 0.1, is clearly at the origin. For the convergence tests, the log-loss function is fed with 10 arbitrary samples, with 4 features and 3 targets each, which defines an optimization problem of dimension 10 with no global minimum, hence we will seek for the infimum for 12000 epochs at a pace of h 0.945 from a random weight distribution with null biases. For an actual ANN test, the log-loss function is fed with the widely used Iris dataset (Dua and Graff, 2017; see below for more details). The methods have been implemented in Julia v1.8.2 (Bezanson et al., 2017), using solely as nonnative libraries NLsolve.jl (Mogensen et al., 2020) to solve the side problem (8.13c), Plots.jl (Breloff, 2021) and PGFPlots X.jl for plotting, and Pluto.jl for an interactive notebook. Methods, functions, and simulations are available online (Campos, 2022a,b). All plots except Figures 1 and 7 represent the objective function residual against the epoch in log-log scale. In Figure 1, we can see how the trajectories computed by PHB/CM and NAG for the YATF pass by its minimum about (0.32, 1.60). They start at the bottom left and go upward until they realize they have overreached the minimum and, about (0.7, 1.9), they back up. Although not shown in the figure, this motion is repeated successively, but each time they back up, they do so sooner and closer to the minimum, like a heavy ball in a bowl. The trajectories shown in the figure correspond to the iterations 765 to 1450 of the methods. Similarly we observe in Figure 2 that NAG oscillates less than PHB/CM. Each downward peak corresponds to the trajectory passing by the minimum of the YATF. In fact, the trajectories shown in Fig. 1 correspond to the first peak in blue of Fig. 2. These oscillations where the trajectories pass by the minimum of the objective function back and forth, and the fact that NAG oscillates less than PHB/CM slightly outperforming it are common aspects of all the simulations performed, reason why in the remaining figures we focus solely on NAG methods, where several aspects are worth noting. Figures 3-6 compare the seven methods enumerated above, one figure for each objective function presented in Section 8.3, and following the same order. Each figure is made up of two plots: the top one is composed of methods with n = 3, the bottom one of methods with n = 4, whereas both include the NAG method with constant coefficients for proper reference. Discrete Variational Calculus for Accelerated Optimization Yet Another Test Function 0.0 0.2 0.4 0.6 0.8 PHB/bounded (n = 3) NAG/bounded (n = 3) Figure 1: Trajectory slices nearby the local minimum of the YATF using PHB/CM (pale) and NAG (strong) with the bounded coefficients from the Lagrangian s polynomial dilation with n = 3. A nonlinear grayscale gradient indicates the minimum s location in black. Yet Another Test Function 103.0 103.3 PHB/bounded (n = 3) NAG/bounded (n = 3) PHB/bounded (n = 4) NAG/bounded (n = 4) PHB/constant NAG/constant Figure 2: YATF residual values along PHB/CM (pale) and NAG (strong) trajectories for coefficients from the polynomially dilated Lagrangian with n = 3 (blue) and n = 4 (violet), and from the exponentially dilated Lagrangian (green). C.M. Campos, A. Mahillo and D. Mart ın de Diego Quadratic function (dimension 50) NAG/constant NAG/bounded (n = 3) NAG/unbounded (n = 3) WWJ (n = 3) 101 102 103 104 NAG/constant NAG/bounded (n = 4) NAG/unbounded (n = 4) WWJ (n = 4) Figure 3: Quadratic test function values along trajectories computed with NAG for constant (green), bounded (blue), and unbounded (violet) coefficients, and WWJ (red); the latter three set with n = 3 (top) and n = 4 (bottom). Discrete Variational Calculus for Accelerated Optimization Rosenbrock s test function (dimension 30) NAG/constant NAG/bounded (n = 3) NAG/unbounded (n = 3) WWJ (n = 3) 103.0 103.5 104.0 NAG/constant NAG/bounded (n = 4) NAG/unbounded (n = 4) WWJ (n = 4) Figure 4: Rosenbrock s test function values along trajectories computed with NAG for constant (green), bounded (blue), and unbounded (violet) coefficients, and WWJ (red); the latter three set with n = 3 (top) and n = 4 (bottom). C.M. Campos, A. Mahillo and D. Mart ın de Diego Yet Another Test Function (dimension 2) NAG/constant NAG/bounded (n = 3) NAG/unbounded (n = 3) WWJ (n = 3) 103.0 103.3 NAG/constant NAG/bounded (n = 4) NAG/unbounded (n = 4) WWJ (n = 4) Figure 5: YATF residual values along trajectories computed with NAG for constant (green), bounded (blue), and unbounded (violet) coefficients, and WWJ (red); the latter three set with n = 3 (top) and n = 4 (bottom). Discrete Variational Calculus for Accelerated Optimization Multinomial logistic loss (dimension 10) NAG/constant NAG/bounded (n = 3) NAG/unbounded (n = 3) WWJ (n = 3) NAG/constant NAG/bounded (n = 4) NAG/unbounded (n = 4) WWJ (n = 4) Figure 6: Logistic loss values along trajectories computed with NAG for constant (green), bounded (blue), and unbounded (violet) coefficients, and WWJ (red); the latter three set with n = 3 (top) and n = 4 (bottom). C.M. Campos, A. Mahillo and D. Mart ın de Diego Observe first that, the three methods with n = 3 behave similarly for the four objective functions. This is not surprising since the Lagrangians considered reduce to a potentially dilated mechanical Lagrangian in which the objective function, in the cases of unbounded NAG and WWJ, is coupled with a further constant coefficient, D = 1/4 when n = 3, which slightly delays convergence when compared to bounded NAG. This coefficient needs to be small in order to ensure convergence according to Wibisono et al. (2016). When n = 4, these three methods clearly show their differences. Most importantly, unbounded NAG (8.11) blows up in Figure 3 (highly dimensional quadratic function). This is due to the increasing and unbounded learning rate coefficient η that is ignored when the gradient is almost null but brakes the method when it steps at a point where the gradient is not negligible. Therefore we infer that unbounded NAG is not suited for long runs when n 4 and should be restarted to avoid blow-up. Surprisingly WWJ is not affected by this problem even though the same coefficient appears in its definition, Eq. (8.13). The method is numerically stable thanks to the fast convergence of the trajectory towards the minimum. In fact, WWJ really shows up when n = 4, where it clearly outperforms its NAG counterpart, unbounded NAG, as well as bounded NAG, but may stall when the simulation has advanced, Fig. 3. Besides, there is a trick into its performance, when n 4 the method must solve a side optimization problem, (8.13c), which is solved here using the NLsolve.jl library, something that it somewhat redundant and suffers from the curse of dimensionality: it is around 10 times slower than the other methods for the low dimensional case (YATF) and more than 100 times slower in the high dimensional ones (Rosenbrock and quadratic function). Nonetheless, this disadvantage might decrease or even disappear when the Bregman divergence is not the usual Euclidean norm (something worth exploring), since the associated methods are unlikely to be explicit. With regards to constant NAG, the results are somehow paradoxical. Recall that the exponentially dilated Lagrangian is the only case in which a good convergence rate is not ensured, Eq. (8.9), however it is the method that in general performs the best, Figures 3-5. Furthermore, although within the Machine Learning community it is perhaps the most popular method among the analyzed, it is the one that has performed the worst in the purely ML scenario, Figure 6. Fig. 6 is where the theoretical convergence rates are better seen. In this figure, the values obtained by unbounded NAG and WWJ coalesce for both, n = 3 and n = 4. In addition to the previous optimization problems, we apply the analyzed methods to a simple classification problem using Firsher s Iris dataset (Dua and Graff, 2017). The dataset consists of 150 entries (or samples) with 7 fields: 4 features and 3 targets. Therefore we consider a shallow neural network with a single layer of 3 neurons with 4 inputs. The input data (the features) are weighted and biases upon entry into the network, the computed output is compared with the expected output (targets) using the multinomial logistic function. This process is summarized by the objective function (8.18). For simplicity, we only consider four methods: constant, bounded, and unbounded NAG, and WWJ, the latter three with n = 3. We train the network (or optimize the objective function) at an increasing number of epochs (from 25 up to 250) for 1000 runs. At each run and for each number of epochs, the samples are randomly split in two: 100 samples for training and 50 samples for testing. At the same time, the initial weights of the network are drawn from a normal distribution with null mean and standard deviation σ = 10. This Discrete Variational Calculus for Accelerated Optimization Multinomial logistic loss 25 50 75 100 150 250 accuracy (%) NAG/constant NAG/bounded (n = 4) NAG/unbounded (n = 4) WWJ (n = 4) Figure 7: Average accuracies for Iris dataset achieved by NAG with constant (green), bounded (blue), and unbounded (violet) coefficients, and WWJ (red); the latter three set with n = 4. initial guess is kept for the four methods. Then the network is trained and the accuracy of network with the computed weights is measured (percentage of correct matches for the testing data). Figure 7 shows the average accuracy along the 1000 runs for each method versus the number of epochs. This figure is clearly consistent with what is obtained in Fig. 6. 9. Conclusions and Future work In this paper, we have studied the relation between accelerated optimization and discrete variational calculus, proving a symplecticity property for the continuous differential equation in Theorem 1 which is also preserved by the corresponding discrete variational methods. We have derived Classical Momentum (CM) or Polyak s Heavy Ball (PHB), and Nesterov s Accelerated Gradient methods (NAG) from the discrete Hamiltonian and Lagrange-d Alembert principles in Theorem 6 adding forces in the picture and proven a one-to-one correspondence. Several simulations were performed showing the applicability of our techniques to optimization. Among all the methods, NAG with the constant coefficients from the exponentially dilated Lagrangian, aside from being the simplest choice, it also seems to be the best one for general purpose applications according to the simulations. In a future paper, we will study whether the proposed optimization algorithm generated by using Lagrange-d Alembert principle achieves the accelerated rate for minimizing both strongly convex functions and convex functions (Wibisono et al., 2016; Shi et al., 2019). The C.M. Campos, A. Mahillo and D. Mart ın de Diego main idea is to discretize, using discrete variational calculus, the continuous Euler-Lagrange equations (with or without forces) while maintaining their convergence rates (see Vaquero et al., 2021, for recent advances in this topic). Moreover, the extension to problems of accelerated optimization in manifolds will be given using discrete variational calculus and well-know optimization techniques with retraction maps (Absil et al., 2008; see also Lee et al., 2021). Acknowledgments D. Mart ın de Diego acknowledges financial support from the Spanish Ministry of Science and Innovation, under grants PID2019-106715GB-C21, from the Spanish National Research Council (CSIC), through the Ayuda extraordinaria a Centros de Excelencia Severo Ochoa R&D (CEX2019-000904-S). A. Mahillo would like to thank CSIC for its financial support through a JAE Intro scholarship. Ralph Abraham and Jerrold E. Marsden. Foundations of Mechanics. AMS Chelsea Publishing, Redwood City, CA, 2 edition, 1978. P.-A. Absil, R. Mahony, and R. Sepulchre. Optimization algorithms on matrix manifolds. Princeton University Press, Princeton, NJ, 2008. ISBN 978-0-691-13298-3. doi: 10.1515/ 9781400830244. URL https://doi.org/10.1515/9781400830244. With a foreword by Paul Van Dooren. Hedy Attouch, Zaki Chbani, and Hassan Riahi. Fast convex optimization via time scaling of damped inertial gradient dynamics. Pure Appl. Funct. Anal., 6(6):1081 1117, 2021. ISSN 2189-3756. Michael Betancourt, Michael I. Jordan, and Ashia C. Wilson. On symplectic optimization. ar Xiv, 1802.03653, 2018. JeffBezanson, Alan Edelman, Stefan Karpinski, and Viral B Shah. Julia: A fresh approach to numerical computing. SIAM review, 59(1):65 98, 2017. URL https://doi.org/10. 1137/141000671. Sergio Blanes and Fernando Casas. A concise introduction to geometric numerical integration. Monographs and Research Notes in Mathematics. CRC Press, Boca Raton, FL, 2016. ISBN 978-1-4822-6342-8. L. M. Br egman. A relaxation method of finding a common point of convex sets and its application to the solution of problems in convex programming. ˇZ. Vyˇcisl. Mat i Mat. Fiz., 7:620 631, 1967. ISSN 0044-4669. Tom Breloff. Plots.jl, May 2021. URL https://doi.org/10.5281/zenodo.4776893. C edric M. Campos. High order variational integrators: a polynomial approach. In Advances in differential equations and applications, volume 4 of SEMA SIMAI Springer Ser., pages Discrete Variational Calculus for Accelerated Optimization 249 258. Springer, Cham, 2014. doi: 10.1007/978-3-319-06953-1\ 24. URL https:// doi.org/10.1007/978-3-319-06953-1_24. C edric M. Campos and J. M. Sanz-Serna. Palindromic 3-stage splitting integrators, a roadmap. J. Comput. Phys., 346:340 355, 2017. ISSN 0021-9991. doi: 10.1016/j.jcp. 2017.06.006. URL https://doi.org/10.1016/j.jcp.2017.06.006. C edric M. Campos. Personal website, 2022a. URL https://cmcampos.xyz. C edric M. Campos. Research repository, 2022b. URL https://github.com/cmcampos-xyz. Beniamino Cappelletti-Montano, Antonio De Nicola, and Ivan Yudin. A survey on cosymplectic geometry. Rev. Math. Phys., 25(10):1343002, 55, 2013. ISSN 0129-055X. doi: 10.1142/S0129055X13430022. URL https://doi.org/10.1142/S0129055X13430022. Agustin-Louis Cauchy. M ethode g en erale pour la r esolution des syst emes d equations simultan ees. C. R. Acad. Sci., 25:536 -538, 1847. URL https://gallica.bnf.fr/ark: /12148/bpt6k2982c/f540.item. Elena Celledoni, Matthias J. Ehrhardt, Etmann Christian, Robert I Mc Lachlan, Brynjulf Owren, Carola-Bibiane Sch onlieb, and Sherry Ferdia. Structure preserving deep learning. ar Xiv, 2006.03364, 2020. URL https://arxiv.org/abs/2006.03364. Manuel de Le on and Paulo R. Rodrigues. Methods of Differential Geometry in Analytical Mechanics, volume 158. Elsevier, Amsterdam, 1987. ISBN 0-08-087269-7. Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. URL http:// archive.ics.uci.edu/ml. Valentin Duruisseaux, Jeremy Schmitt, and Melvin Leok. Adaptive Hamiltonian variational integrators and applications to symplectic accelerated optimization. SIAM J. Sci. Comput., 43(4):A2949 A2980, 2021. ISSN 1064-8275. doi: 10.1137/20M1383835. URL https://doi.org/10.1137/20M1383835. E. Hairer, C. Lubich, and G. Wanner. Geometric numerical integration, volume 31 of Springer Series in Computational Mathematics. Springer, Heidelberg, 2010. ISBN 978-3642-05157-9. Structure-preserving algorithms for ordinary differential equations, Reprint of the second (2006) edition. P. Hartman. Ordinary differential equations, volume 38 of Classics in Applied Mathematics. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, 2002. ISBN 0-89871-510-5. doi: 10.1137/1.9780898719222. URL http://dx.doi.org/10.1137/1. 9780898719222. Corrected reprint of the second (1982) edition [Birkh auser, Boston, MA; MR0658490 (83e:34002)], With a foreword by Peter Bates. Michael I. Jordan. Dynamical symplectic and stochastic perspectives on gradient-based optimization. In Proceedings of the International Congress of Mathematicians Rio de Janeiro 2018. Vol. I. Plenary lectures, pages 523 549. World Sci. Publ., Hackensack, NJ, 2018. C.M. Campos, A. Mahillo and D. Mart ın de Diego Taeyoung Lee, Molei Tao, and Melvin Leok. Variational symplectic accelerated optimization on lie groups. ar Xiv, 2103.14166, 2021. URL https://arxiv.org/abs/2103.14166. Paulette Libermann. Sur les automorphismes infinit esimaux des structures symplectiques et des structures de contact. In Colloque G eom. Diff. Globale (Bruxelles, 1958), pages 37 59. Centre Belge Rech. Math., Louvain, 1959. Paulette Libermann and Charles-Michel Marle. Symplectic geometry and analytical mechanics, volume 35 of Mathematics and its Applications. D. Reidel Publishing Co., Dordrecht, 1987. ISBN 90-277-2438-5. doi: 10.1007/978-94-009-3807-6. URL https: //doi.org/10.1007/978-94-009-3807-6. Translated from the French by Bertram Eugene Schwarzbach. J. C. Marrero, D. Mart ın de Diego, and E. Mart ınez. On the exact discrete lagrangian function for variational integrators: theory and applications, 2016. URL https://arxiv. org/abs/1608.01586. J. E. Marsden and M. West. Discrete mechanics and variational integrators. Acta Numer., 10:357 514, 2001. ISSN 0962-4929. doi: 10.1017/S096249290100006X. URL http://dx. doi.org/10.1017/S096249290100006X. H akon Marthinsen and Brynjulf Owren. Geometric integration of non-autonomous linear Hamiltonian problems. Adv. Comput. Math., 42(2):313 332, 2016. ISSN 1019-7168. doi: 10.1007/s10444-015-9425-0. URL https://doi.org/10.1007/s10444-015-9425-0. Patrick Kofod Mogensen, Kristoffer Carlsson, S ebastien Villemot, Spencer Lyon, Matthieu Gomez, Christopher Rackauckas, Tim Holy, David Widmann, Tony Kelman, Daniel Karrasch, Antoine Levitt, Asbjørn Nilsen Riseth, Carlo Lucibello, Changhyun Kwon, David Barton, Julia Tag Bot, Mateusz Baran, Miles Lubin, Sarthak Choudhury, Simon Byrne, Simon Christ, Takafumi Arakaki, Troels Arnfred Bojesen, benneti, and Miguel Raz Guzm an Macedo. Julianlsolvers/nlsolve.jl: v4.5.1, December 2020. URL https://doi.org/10.5281/zenodo.4404703. Yu. E. Nesterov. A method for solving the convex programming problem with convergence rate O(1/k2). Dokl. Akad. Nauk SSSR, 269(3):543 547, 1983. ISSN 0002-3264. Yurii Nesterov. Lectures on convex optimization, volume 137 of Springer Optimization and Its Applications. Springer, Cham, 2018. ISBN 978-3-319-91577-7; 9783-319-91578-4. doi: 10.1007/978-3-319-91578-4. URL https://doi.org/10.1007/ 978-3-319-91578-4. Second edition of [ MR2142598]. G. W. Patrick and C. Cuell. Error analysis of variational integrators of unconstrained Lagrangian systems. Numer. Math., 113(2):243 264, 2009. ISSN 0029-599X. doi: 10. 1007/s00211-009-0245-3. URL http://dx.doi.org/10.1007/s00211-009-0245-3. Elijah Polak. Optimization, volume 124 of Applied Mathematical Sciences. Springer-Verlag, New York, 1997. ISBN 0-387-94971-2. doi: 10.1007/978-1-4612-0663-7. URL https: //doi.org/10.1007/978-1-4612-0663-7. Algorithms and consistent approximations. Discrete Variational Calculus for Accelerated Optimization Boris T. Polyak. Some methods of speeding up the convergence of iterative methods. ˇZ. Vyˇcisl. Mat i Mat. Fiz., 4:791 803, 1964. ISSN 0044-4669. Boris T. Polyak. Introduction to optimization. Translations Series in Mathematics and Engineering. Optimization Software, Inc., Publications Division, New York, 1987. ISBN 0-911575-14-6. Translated from the Russian, With a foreword by Dimitri P. Bertsekas. H. H. Rosenbrock. An automatic method for finding the greatest or least value of a function. Comput. J., 3:175 184, 1960/61. ISSN 0010-4620. doi: 10.1093/comjnl/3.3.175. URL https://doi.org/10.1093/comjnl/3.3.175. J. M. Sanz-Serna and M. P. Calvo. Numerical Hamiltonian problems, volume 7 of Applied Mathematics and Mathematical Computation. Chapman & Hall, London, 1994. ISBN 0-412-54290-0. B Shi, Du S. S., Jordan M.I., and Su J.U. Acceleration via symplectic discretization of high-resolution differential equations. ar Xiv, 2019. URL https://arxiv.org/pdf/1902. 03694.pdf. Weijie Su, Stephen Boyd, and Emmanuel J. Cand es. A differential equation for modeling nesterov s accelerated gradient method: Theory and insights. Journal of Machine Learning Research, 17(153):1 43, 2016. URL http://jmlr.org/papers/v17/15-084.html. Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In Sanjoy Dasgupta and David Mc Allester, editors, Advances in Neural Information Processing Systems, volume 28 of Proceedings of Machine Learning Research, pages 1139 1147, Atlanta, Georgia, USA, 17 19 Jun 2013. PMLR. URL http://proceedings.mlr.press/v28/sutskever13.html. M. Vaquero, Mestres P., and J: Cort es. Resource-aware discretization of accelerated optimization flows. Preprint, 2021. URL http://carmenere.ucsd.edu/jorge/ publications/data/2020_Va Me Co-tac.pdf. Andre Wibisono, Ashia C. Wilson, and Michael I. Jordan. A variational perspective on accelerated methods in optimization. Proc. Natl. Acad. Sci. USA, 113(47):E7351 E7358, 2016. ISSN 0027-8424. doi: 10.1073/pnas.1614734113. URL https://doi.org/10.1073/ pnas.1614734113. Andre Yohannes Wibisono. Variational and Dynamical Perspectives On Learning and Optimization. Pro Quest LLC, Ann Arbor, MI, 2016. ISBN 978-1369-05764-5. Thesis (Ph.D.) University of California, Berkeley.