# discrete_variational_calculus_for_accelerated_optimization__812c24a3.pdf

Journal of Machine Learning Research 24 (2023) 1-33 Submitted 11/21; Revised 11/22; Published 1/23

Discrete Variational Calculus for Accelerated Optimization

C edric M. Campos cedric.mcampos@urjc.es Departamento de Matem atica Aplicada, Ciencia e Ingenier ıa de los Materiales y Tecnolog ıa Electr onica Universidad Rey Juan Carlos Calle Tulip an s/n, 28933 M ostoles, Spain

Alejandro Mahillo almahill@unizar.es Departamento de Matem aticas Instituto Universitario de Matem aticas y Aplicaciones Universidad de Zaragoza Calle de Pedro Cerbuna 12, 50009 Zaragoza, Spain

David Mart ın de Diego david.martin@icmat.es Instituto de Ciencias Matem aticas (CSIC-UAM-UC3M-UCM) Calle Nicol as Cabrera 13-15, 28049 Madrid, Spain

Editor: Sayan Mukluk

Many of the new developments in machine learning are connected with gradient-based optimization methods. Recently, these methods have been studied using a variational perspective (Betancourt et al., 2018). This has opened up the possibility of introducing variational and symplectic methods using geometric integration. In particular, in this paper, we introduce variational integrators (Marsden and West, 2001) which allow us to derive diﬀerent methods for optimization. Using both Hamilton s and Lagrange-d Alembert s principle, we derive two families of optimization methods in one-to-one correspondence that generalize Polyak s heavy ball (Polyak, 1964) and Nesterov s accelerated gradient (Nesterov, 1983), the second of which mimics the behavior of the latter reducing the oscillations of classical momentum methods. However, since the systems considered are explicitly time-dependent, the preservation of symplecticity of autonomous systems occurs here solely on the ﬁbers. Several experiments exemplify the result. Keywords: Polyak s heavy ball, Nesterov s accelerated gradient, momentum methods, variational integrators, Bregman Lagrangians

1. Introduction

Many of the literature on machine learning and data analysis is connected with gradientbased optimization methods (see Polak, 1997; Nesterov, 2018; and references therein). The computations often involve large data and parameter sets and then, not only the computational eﬃciency is a crucial point, but the optimization theory also plays a fundamental role. A typical optimization problem is:

argmin f(x) , x Q , (1.1)

2023 C edric M. Campos, Alejandro Mahillo and David Mart ın de Diego.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v24/21-1323.html.

C.M. Campos, A. Mahillo and D. Mart ın de Diego

where we assume that Q is a convex set in Rn and f is a continuously diﬀerentiable convex function with Lipschitzian gradient. In this case, one of the most extended algorithms to achieve (1.1) is Nesterov s accelerated gradient (Nesterov, 1983; Su et al., 2016) which may take the following form:

yk+1 = xk η f(xk) ,

xk+1 = yk+1 + k k + 3(yk+1 yk) ,

starting from an initial condition x0 (see more details in Sections 2 and 7). An important observation was made by Su et al. (2016) showing that the continuous limit of Nesterov s method is a time-dependent second order diﬀerential equation. Moreover, Wibisono et al. (2016) show that this system of diﬀerential equations has a variational origin (see also Wibisono, 2016). In particular, they take as point of departure this variational approach that captures acceleration in continuous time considering a particular type of timedependent Lagrangian functions, called Bregman Lagrangians (see Section 3). In a recent paper, Betancourt et al. (2018) introduce symplectic (and presymplectic) integrators for the diﬀerential equations associated with accelerated optimizations methods (see Sanz-Serna and Calvo, 1994; Hairer et al., 2010; Blanes and Casas, 2016, for an introduction to symplectic integration). They use the Hamiltonian formalism since it is possible to extend the phase space to turn the system into a time-independent Hamiltonian system and apply there standard symplectic techniques (see Marthinsen and Owren, 2016; Celledoni et al., 2020). For recent improvements of this approach using adaptive Hamiltonian variational integrators, see Duruisseaux et al. (2021). In our paper we set an alternative route: The idea is to use variational integrators adapted to an explicit time-dependent framework and external forces (see Marsden and West, 2001, and references therein) to derive a whole family of optimizations methods. The theory of discrete variational mechanics has reached maturity in recent years by combining results of diﬀerential geometry, classical mechanics, and numerical integration. Roughly speaking, the continuous Lagrangian L: TQ R is substituted by a discrete Lagrangian Ld : Q Q R. Observe that, by replacing the standard velocity phase space TQ with Q Q, we are discretizing a velocity vector by two (in principle) close points. With the unique information of the discrete Lagrangian we can deﬁne the discrete action sum and, applying standard variational techniques, we derive a system of second order diﬀerence equations known as discrete Euler-Lagrange equations. The numerical order of the methods is obtained using variational error analysis (see Marsden and West, 2001; Patrick and Cuell, 2009). Moreover, it is possible to derive a discrete version of Noether s theorem relating the symmetries of the discrete Lagrangian with conserved quantities. The derived methods are automatically symplectic and, perhaps more importantly, easily adapted to other situations as, for instance, Lie group integrators, time-dependent Lagrangians, forced systems, optimal control theory, holonomic and nonholonomic mechanics, ﬁeld theories, etc. The Lagrangian functions described in Section 3, Bregman Lagrangians, are those explicitly time-dependent that typically arise on accelerated optimization. The geometry for time-dependent systems is diﬀerent from symplectic geometry, in particular, the phase space is odd dimensional. In this case, an appropriate geometric framework is given by cosymplectic geometry (see Libermann, 1959; Cappelletti-Montano et al., 2013; and ref-

Discrete Variational Calculus for Accelerated Optimization

erences therein). In Section 4, we introduce the cosymplectic structure associated with a time-dependent Hamiltonian system (induced by a time-dependent Lagrangian) and also an interesting symplectic preservation property associated with the restriction of the Hamiltonian ﬂow to the ﬁbers of the projection onto the time variable (Theorem 1). Having in mind this geometrical framework, we introduce in Section 5 discrete variational mechanics for time-dependent Lagrangians with ﬁxed time step (compare with Marsden and West, 2001, for variable time step). Moreover, we recover the symplectic character on ﬁbers of the continuous Hamiltonian ﬂow. We show the feasibility of constructing variational integrators using similar techniques to the developed for the autonomous case that, in some interesting cases, are in addition explicit and, consequently, reduce the computational cost. An example of such methods is the second-order diﬀerence equation

xk+1 = xk ηk f(xk) + µk(xk xk 1) ,

a type of momentum-descent method widely studied in the literature and whose origin goes back to Polyak (1964). Momentum methods allow to accelerate gradient descent by taking into account the speed achieved by the method at the last update. However, because of that speed, momentum methods can overpass the minimum. Nesterov s method tries to anticipate future information reducing the typical oscillations of classical momentum methods towards the minimum. In Section 6 we adapt our construction of variational integrators to add external forces using discrete Lagrange-d Alembert s principle (see Marsden and West, 2001). Upon this machinery, we derive in Section 7 two families of momentum methods in mutual bijective correspondence, one of which corresponds to Nesterov s method (see Theorem 6). Finally, for Section 8, many methods and numerical simulations have been implemented in Julia v1.8.2. We optimize several test functions with our methodology and other methods that appeared recently in the literature. One of the test functions is reused afterwards for a machine learning example.

2. From Gradient Descent to Nesterov s Accelerated Gradient

In this section we give a historical perspective of Nesterov s accelerated gradient from gradient descent with a threefold objective: First, properly introduce the methods of interest and their properties, second, give an overall view of the elements to take under consideration, and, third, set some of the notation. Although the ﬁrst method that comes to mind to solve the optimization problem (1.1) is Newton-Raphson, the ﬁrst dynamical one is due to Cauchy (1847). His method, known as Gradient Descent (GD), is the one-step method

xk+1 = xk η f(xk) , (2.1)

where η is the step size parameter or, as it is referred in the machine learning community, the learning rate. It is readily seen that this method is a simple discretization of the ﬁrst order ODE x = f(x) , (2.2)

from which it takes its dynamical nature. What is perhaps not so readily seen is that, given an initial condition x0, the trajectories obtained from both equations, xk and x(t), converge

C.M. Campos, A. Mahillo and D. Mart ın de Diego

to the argument of minima x . In particular, xk converges linearly to x while the function values f(xk) do so to the global minimum f(x ) at a rate of O(1/k) (Polyak, 1964, 1987). An initial improvement over GD was given by Polyak (1964): He introduced a novel two-step method, Polyak s Heavy Ball (PHB), also known as Classical Momentum (CM) after Sutskever et al. (2013). As it was originally presented, PHB/CM takes the form of the two-step method xk+1 = xk ηP(xk) + µ(xk xk 1) , (2.3)

where P is a functional operator for which a root is sought and µ, η are small positive constants that condition the convergence of the method. In comparison with (2.1), (2.3) adds a new term, xk xk 1, the momentum of the discrete motion that incorporates past information in an amount controlled by µ, the so called momentum coeﬃcient. When P is conservative, that is, when P = f, Polyak showed that, although the method s trajectory still converges linearly as GD s, it does so faster than GD s, that is, with a smaller geometric ratio (Polyak, 1964, 1987). The continuous analogue of (2.3) is the second order ODE

x + ν(t) x + η(t)P(x) = 0 , (2.4)

that turns out to be the equation of motion of a Lagrangian system when P = f (Lemma 4). Then x(t) traces the motion of a point mass in a well given by f. We therefore drop P and stick hereon with f. A further an crucial step towards improving GD (and PHB/CM) was given in 1983 by Nesterov, a former student of Polyak. He presented a new method, coined after him as Nesterov s Accelerated Gradient (NAG), similar to PHB/CM but with a slight change of unexpected consequences. A naive derivation from (2.3) is almost immediate: Introduce a new variable yk in (2.3) so it can be easily rewritten as the equivalent method

yk+1 = xk ηk f(xk) , (2.5a)

xk+1 = yk+1 + µk(xk xk 1) , (2.5b)

where discrete-time dependence has been added to the coeﬃcients µ, η for convenience. Replace the x s of the momentum term (right hand side of the second equation, Eq. 2.5b) by y s to get the new and non-equivalent method

yk+1 = xk ηk f( xk) , (2.6a)

xk+1 = yk+1 + µk( yk+1 yk) , (2.6b)

where the bars are added to distinguish more easily both methods and underline that the sequences of points that they deﬁne are in fact diﬀerent. This latter method (2.6) is NAG as it is usually presented. Nesterov showed in turn that his method accelerates the convergence rate of the function values down to O(1/k2) (see Nesterov, 1983, 2018). The original values of ηk, µk given by Nesterov are rather intricate, a simpler and commonly used version is

yk+1 = xk η f( xk) , (2.7a)

xk+1 = yk+1 + k k + 3( yk+1 yk) , (2.7b)

Discrete Variational Calculus for Accelerated Optimization

with η > 0 constant. As it is shown in Su et al. (2016), a continuous analogue of (2.7) is

t x + f(x) = 0 , (2.8)

which is but a particular case of PHB/CM s continuous analogue (2.4). Besides that, Su et al. (2016) also show that the function values converge to the minimum at an inverse quadratic rate, that is, f(x(t)) = f(x ) + O(1/t2). More generally (Remark 13), (2.6) is a natural discretization of a perturbed ODE of the form x + ν(t) x + η(t) f(x) = εF(x, x, t) , (2.9)

which also is the equation of motion of a Lagrangian system (Lemma 4). In fact, it is this variational origin that Wibisono et al. (2016) take as point of departure. Once a particular type of time-dependent Lagrangian functions is considered, a subfamily of the so called Bregman Lagrangians, the variational approach captures acceleration in continuous-time into the derived discrete schemes achieving, in this case, a function value convergence rate of O(t1 n) with n 3 (see also Wibisono, 2016).

3. Bregman Lagrangians

A Bregman Lagrangian is roughly speaking a time-dependent mechanical Lagrangian whose kinetic part is close to be a metric. They are built upon Bregman divergences (Br egman, 1967), a particular case of divergence functions. Bregman Lagrangians allow to deﬁne variational problems whose solutions minimize an objective function at an exponential rate (Betancourt et al., 2018). A divergence function over a manifold Q is a twice diﬀerentiable function B: Q Q R+ such that for all x, y Q we have:

B(x, y) 0 and B(x, x) = 0;

x B(x, x) = y B(x, x); and

2 xy B(x, x) is negative-deﬁnite.

Divergence functions appear as pseudo-distances that are non-negative but are not, in general, symmetric. A typical divergence function over Q = Rn associated with a diﬀerentiable strictly convex function Φ: Rn R is the Bregman divergence:

BΦ(x, y) = Φ(x) Φ(y) dΦ(y), x y .

Observe that it is the remainder of the ﬁrst order Taylor expansion of Φ around y evaluated at x, a sort of Hessian metric. Given a Bregman divergence over Rn, let us consider the time-dependent kinetic energy

K(x, x, t) = BΦ(x + e α(t) x, x) ,

and the time-dependent potential energy

U(x, t) = eβ(t)f(x) ,

C.M. Campos, A. Mahillo and D. Mart ın de Diego

from which we deﬁne the Bregman Lagrangian L: TRn TR R by

L(x, x, t) = eα(t)+γ(t) K(x, x, t) U(x, t)

= eα(t)+γ(t) Φ(x + e α(t) x) Φ(x) e α(t) dΦ(x), x eβ(t)f(x) ,

where the time-dependent functions α(t), β(t), γ(t) are chosen to produce diﬀerent algorithms. These functions verify what Wibisono et al. (2016) refer to as ideal scaling conditions, namely, γ(t) = eα(t) and β(t) eα(t) . (3.1)

The ﬁrst condition greatly simpliﬁes several expressions that can be derived from the Bregman Lagrangian. For instance, when γ(t) = eα(t) is met, the associated Euler-Lagrange equations reduce to

2Φ x + e α(t) x d

x + e α(t) x + eα(t)+β(t) f(x) = 0 .

The second condition ensures convergence of the underlying trajectories to the minimum at a rate not slower than O(e β(t)). In the particular case where Φ(x) = 1

2 x 2, for which BΦ(x, y) = 1

2 x y 2, the Bregman Lagrangian takes the simple form

L(x, x, t) = a(t)1

2 x 2 b(t)f(x) , (3.2)

with a(t) = eγ(t) α(t) and b(t) = eα(t)+β(t)+γ(t).

4. Geometry of the Time-Dependent Lagrangian and Hamiltonian Formalisms

Since Bregman Lagrangians are time-dependent, in this section, we introduce some needed geometric ingredients about non-autonomous mechanics and highlight some of their main invariance properties (see Abraham and Marsden, 1978; Libermann and Marle, 1987; de Le on and R. Rodrigues, 1987). Let Q be a manifold and TQ its tangent bundle. Coordinates (xi) on Q induce coordinates (xi, xi) on TQ. Therefore we have natural coordinates (xi, xi, t) on TQ R which is the velocity phase space for time-dependent systems. Given two instants (time values) a, b R, with a < b, and corresponding positions xa, xb Q, consider the set of curves:

C2 a,b = C2([a, b], xa, xb) = {σ: [a, b] Q | σ C2 with σ(a) = xa, σ(b) = xb} .

Given a time-dependent Lagrangian function L: TQ R R, deﬁne the action functional JL : C2 a,b R

JL(σ) = Z b

a L(σ (t), t) dt , (4.1)

where σ : [a, b] TQ.

Discrete Variational Calculus for Accelerated Optimization

Using variational calculus, the critical points of JL are locally characterized by the solutions of the Euler-Lagrange equations:

xi = 0 , 1 i n = dim Q . (4.2)

For time-dependent Lagrangians it is possible to check that the energy EL : TQ R R,

EL = L L = xi L

where is the Liouville vector ﬁeld on TQ (Libermann and Marle, 1987), is not, in general, preserved since d EL

We now pass to the Hamiltonian formalism using the Legendre transformation

FL: TQ R T Q R ,

where T Q is the cotangent bundle of Q whose natural coordinates are (xi, pi). The Legendre transformation is locally given by

FL(xi, xi, t) = xi, L

We assume that the Legendre transformation is a diﬀeomorphism (that is, the Lagrangian is hyperregular) and deﬁne the Hamiltonian function H : T Q R R by

H = EL (FL) 1 ,

which induces the cosymplectic structure (ΩH, ηR) on T Q R with

dt := pr 2 dt , ΩH = d(pr 1 θQ Hdt) = ΩQ + d H dt ,

where pri, i = 1, 2, are the projections to each Cartesian factor and θQ denotes the Liouville 1-form on T Q (Abraham and Marsden, 1978), given in induced coordinates by θQ = pi dxi. We also denote by ΩQ = d pr 1 θQ the pullback of the canonical symplectic 2-form ωQ = dθQ on T Q. In coordinates, ΩQ = dxi dpi. (Observe that now ΩQ is presymplectic since ker ΩQ = span{ / t}.) Therefore in induced coordinates (xi, pi, t):

ΩH = dxi dpi + d H dt , ηR = dt .

We deﬁne the evolution vector ﬁeld EH X(T Q R) by

i EHΩH = 0 , i EHdt = 1 . (4.3)

In local coordinates the evolution vector ﬁeld is:

C.M. Campos, A. Mahillo and D. Mart ın de Diego

The integral curves of EH are given by:

t = 1 , xi = H

pi , pi = H

The integral curves of EH are precisely the curves of the form t 7 FL(σ (t), t) where σ: I Q is a solution of the Euler-Lagrange equations for L: TQ R R. From Equation (4.3) we deduce that the ﬂow of EH veriﬁes the following preservation properties LEHΩH = LEH(ΩQ + d H dt) = 0 , LEHdt = 0 . (4.5)

Denote by Ψs : U T Q R T Q R the ﬂow of the evolution vector ﬁeld EH, where U is an open subset of T Q R. Observe that

Ψs(αq, t) = (Ψt,s(αq), t + s) , αq T q Q ,

where Ψt,s(αq) = pr1(Ψs(αq, t)). Therefore from the ﬂow of EH we induce a map

Ψt,s : Ut T Q T Q ,

where Ut = {αq T Q | (αq, t) U}. Observe that if we know Ψt,s for all t, we can recover the ﬂow Ψs of EH. From Equations (4.5) we deduce that

Ψ s(ΩQ + d H dt) = ΩQ + d H dt , Ψ s(dt) = dt . (4.6)

The following theorem relates the preservation properties (4.6) with the symplecticity of the map family {Ψt,s : T Q T Q}.

Theorem 1 We have that Ψt,s : Ut T Q T Q is a symplectomorphism, that is, Ψ t,sωQ = ωQ.

Proof First, observe that any vector Y(αq,t) T(αq,t)(T Q R) admits a unique decomposition: Y(αq,t) = Yαq(t) + Yt(αq) ,

where Yαq(t) Tαq T Q and Yt(αq) Tt R. Moreover, we have that dt, Yαq(t) = 0. Therefore, if we restrict ourselves to vectors tangent to the pr2-ﬁbers T(αq,t) pr 1 2 (t) = V(αq,t) pr2 then we have the decomposition

Y(αq,t) = Yαq(t) + 0t = Yαq(t) V(αq,t) pr2 Tαq T Q .

From the second preservation property given in (4.5) we deduce that

0 = (dt)(αq,t), Yαq(t) = (Ψ sdt)(αq,t), Yαq(t) = (dt)Ψs(αq,t), TΨs(Yαq(t)) .

Therefore TΨs(Yαq(t)) V(Ψs,t(αq),t+s)pr2 TΨt,s(αq)T Q and

TΨs(Yαq(t)) = TΨt,s(Yαq(t)) + 0t+s TΨt,s(Yαq(t)) .

Discrete Variational Calculus for Accelerated Optimization

Using the ﬁrst identity in (4.5) we deduce that

(ωQ)αq(Yαq(t), Yαq(t)) = (ΩQ + d H dt)(αq,t)(Yαq(t), Yαq(t))

= (ΩQ + d H dt)(Ψs,t(αq),t+s)(TΨs(Yαq(t)), TΨs( Yαq(t)))

= (ωQ)Ψs,t(αq)(TΨt,s(Yαq(t)), TΨt,s( Yαq(t)))

where Yαq(t), Yαq(t) Tαq T Q T(αq,t) pr 1 2 (t). We conclude that Ψ t,sωQ = ωQ.

5. Discrete Variational Methods for Time-Dependent Lagrangian Systems

Consider the set of discrete paths (or sequences) on Q for a ﬁxed number of steps N N, that is, the set

Cd(Q) = {xd : {0, 1, . . . , N} Q} = Q (N+1) Q .

Then an appropriate discrete interpretation of the velocities are pairs in Q Q, a discrete version of TQ. A discrete time-dependent Lagrangian is a family of maps

Lk d : Q Q R , k N ,

for which we deﬁne the discrete action map on the space of sequences as

k=0 Lk d(xk, xk+1) , xd Cd(Q) .

If we consider variations of xd with ﬁxed end points x0 and x N and extremize Sd over x1, . . . , x N 1, we obtain the discrete Euler-Lagrange equations (DEL for short)

xk Sd(xd) = D1Lk d(xk, xk+1) + D2Lk 1 d (xk 1, xk) = 0 , for all k = 1, . . . , N 1,

where D1 and D2 denote the partial derivatives with respect to the ﬁrst and second components, respectively. If, for all k, Lk d is regular, that is, the matrix

D12Lk d = 2Lk d xk xk+1

is non-singular, then we locally obtain a well deﬁned family of discrete Lagrangian maps:

Fk,k+1 : Q Q Q Q (xk, xk+1) 7 (xk+1, xk+2(xk, xk+1, k)) .

where the value of xk+2 is determined in terms of xk, xk+1 and k. In this setting, we can deﬁne two discrete Legendre transformations associated with Lk d, F Lk d : Q Q T Q, by the expressions

F+Lk d : (xk, xk+1) 7 (xk+1, D2Lk d(xk, xk+1)) ,

F Lk d : (xk, xk+1) 7 (xk, D1Lk d(xk, xk+1)) .

C.M. Campos, A. Mahillo and D. Mart ın de Diego

We can also deﬁne the evolution of the discrete system on the Hamiltonian side, Fk,k+1 : T Q T Q, by any of the formulas:

Fk,k+1 = F+Lk d (F Lk d) 1 = F+Lk d Fk 1,k (F+Lk 1 d ) 1 = F Lk+1 d Fk,k+1 (F Lk d) 1 ,

thanks to the commutativity of the following diagram.

(xk 1, xk) Fk 1,k

(xk, xk+1) F Lk d

Fk,k+1 / (xk+1, xk+2) F Lk+1 d

(xk, pk) Fk,k+1 / (xk+1, pk+1)

Proposition 2 The discrete Hamiltonian map Fk,k+1 : (T Q, ωQ) (T Q, ωQ) is a symplectic transformation, that is,

( Fk,k+1) ωQ = ωQ .

Proof Using similar arguments to the autonomous case (Marsden and West, 2001), we deduce that (F+Lk d) ωQ = (F Lk d) ωQ .

From the deﬁnition of Fk,k+1 : T Q T Q we deduce that

( Fk,k+1) ωQ = (F+Lk d (F Lk d) 1) ωQ = ((F Lk d) ) 1((F+Lk d) ωQ) = ωQ .

Given the map Fk,k+1(qk, pk) = (qk+1, pk+1), we immediately have the map

(xk, pk, kh) 7 (xk+1, pk+1, (k + 1)h)

on T Q R where we now give explicit information of the evolution of discrete time. Let see the relation of these discrete maps Fk,k+1 : Q Q Q Q and Fk,k+1 : T Q T Q with the Euler-Lagrange equations and Hamilton equations of a time-dependent Lagrangian system. Given a regular Lagrangian function L: TQ R R and a suﬃciently small time step h > 0, we are going to deﬁne an h-and-k-dependent family of discrete Lagrangian functions Lk d,h : Q Q R as an inﬁnitesimal approximation to the continuous action JL deﬁned in expression (4.1). As intermediate step, we ﬁrst consider the exact time-dependent discrete Lagrangian associated with a regular Lagrangian L which is given by the expression

Lk,E d,h (x0, x1) = 1

kh L(x0,1(t), x0,1(t), t) dt ,

where x0,1(t) is the unique solution of the Euler-Lagrange equations for L with x0,1(kh) = x0 and x0,1((k + 1)h) = x1, (see Hartman, 2002; Marrero et al., 2016). Then for a suﬃciently

Discrete Variational Calculus for Accelerated Optimization

small h, the solutions of the DEL for Lk,E d,h lie on the solutions of the Euler-Lagrange equations for L (Theorem 1.6.4, Marsden and West, 2001). In practice, Lk,E d,h (x0, x1) will not be available, therefore we take an approximation,

Lk d,h(x0, x1) Lk,E d,h (x0, x1) ,

using some quadrature rule. Then, as we have seen, the scheme derived from the DEL will be geometric integrators for Equations (4.4) preserving the symplectic form in the sense of Theorem 1 (see Patrick and Cuell, 2009).

Remark 3 As we have commented in the Introduction, one of the main advantages of the proposed approach is the possibility to use other options to derive diﬀerent numerical methods for optimization by only discretizing a unique function, the action functional. Of course, there are many diﬀerent ways to do it (Marsden and West, 2001). For instance, we can combine several discrete Lagrangians together to obtain a new discrete Lagrangian with higher order (composition methods) or similarly obtaining splitting methods (Campos and Sanz-Serna, 2017). Also, we can easily derive symplectic partitioned Runge-Kutta methods or symplectic Garlekin methods using polynomial approximations of the trajectories and a numerical quadrature to approximate the action integral (Campos, 2014). Moreover, it is possible to adapt the variational integrators to a non-euclidean setting using appropriate retraction maps.

6. Discretization of Lagrangian Systems with Forces

Our intention here is to continue looking for numerical approximations to the time-dependent Euler-Lagrange equations but considering additionally an external force that decreases jointly with the time-step parameter h. With it, we will obtain a whole family of algorithms whose behavior resembles that of the Nesterov method. Fortunately, discrete mechanics is also adapted to the case of external forces (see Marsden and West, 2001). To this end, in addition to a time-dependent Lagrangian function L: TQ R R, we have an external force given by a ﬁber preserving map F : TQ R T Q given locally by

F(x, x, t) = (xi, Fi(x, x, t)) .

Given the force f, we derive the equations of motion of the forced system modifying Hamilton s principle to the Lagrange-d Alembert principle, which seeks curves σ C2 a,b satisfying

a L(σ (t), t) dt + Z b

a F(σ (t), t)δσ(t) dt = 0 , (6.1)

for all δσ TσC2 a,b. Using integration by parts, we derive the forced Euler-Lagrange equations, which have the following coordinate expression:

To discretize these equations, we consider, as before, a family of Lagrangian functions Lk d : Q Q R and two discrete forces (F k d ) : Q Q T Q, which are ﬁber preserving

C.M. Campos, A. Mahillo and D. Mart ın de Diego

in the sense that πQ (F k d ) = pr , where pr (x , x+) = x . Combing both forces, we obtain F k d : Q Q T (Q Q) by

F k d (xk, xk+1), (δxk, δxk+1) = (F k d ) (xk, xk+1)δxk + (F k d )+(xk, xk+1)δxk+1 .

As in (6.1), we have a discrete version of the Lagrange-d Alembert principle for the discrete forced system given by Lk d and F k d :

k=0 Lk d(xk, xk+1) +

k=0 F k d (xk, xk+1), (δxk, δxk+1) = 0 ,

for all variations {δxk}N k=0 vanishing at the endpoints, that is, δx0 = δx N = 0. This is equivalent to the forced discrete Euler-Lagrange equations:

D1Lk d(xk, xk+1) + D2Lk 1 d (xk 1, xk) + (F k d ) (xk, xk+1) + (F k 1 d )+(xk 1, xk) = 0 , (6.2)

for all k = 1, . . . , N 1.

7. The Variational Derivation of PHB/CM and NAG

As seen in Section 2, NAG can be derived naively from PHB/CM. Besides, under suitable conditions on µ, η and the starting points, both methods converge to a minimum of f, the latter, NAG, doing so faster (Polyak, 1964; Nesterov, 1983). Questions arise: What makes NAG faster than PHB/CM? Can this be exploited to obtain even faster methods? Can it be generalized? Questions that boil down to how NAG is fundamentally derived from PHB/CM. To begin with, note that the NAG equations (2.6) can be rewritten only in terms of the x s, as in (2.3), or only in terms of the y s, yielding the equations

xk = µk xk 1 ηk f( xk) µk ηk f( xk) ηk 1 f( xk 1) , (7.1a)

yk = µk 1 yk 1 ηk f( yk + µk 1 yk 1) . (7.1b)

The ﬁrst, Eq. (7.1a), when compared with (2.3) shows an extra term,

µk ηk f( xk) ηk 1 f( xk 1) ,

that in fact points to the very origin of the method: an additional forcing term. The second, Eq. (7.1b), compared again with (2.3) shows that the y-trajectory is obtained almost as if it was computed by PHB/CM but evaluating f at a future point, yk + µk 1 yk 1, which informs better the method on how to advance towards the minimum. Besides the convergence towards the minimum, it can be shown that both methods are a time discretization of the second order diﬀerential equation (Su et al., 2016)

x + ν(t) x + η(t) f(x) = 0 , (7.2)

a well known fact in the literature, equation that furthermore is variational in general (see Lemma 4). A fact that is not so well known is that NAG discretizes better the equation

Discrete Variational Calculus for Accelerated Optimization

when including a force term proportional to the underlying time step and, moreover, it can be derived, as well as PHB/CM, from a variational approach, in the geometric integration sense (see Sections 5 and 6). We ﬁrst give a rather simple and direct result whose purpose is to establish properly the continuous setting over which the discretizations will be built and from which the methods can be derived.

Lemma 4 Given a vector ﬁeld P : Rn Rn, consider the second order diﬀerential equation

x + ν(t) x + η(t)P(x) = εd

h η(t)P(x) i , (7.3)

where ν, η: R+ R are continuous time-dependent real valued functions and where ε R is a constant. If P is conservative, that is, if P = f, then (7.3) corresponds to the equation of motion of the (forced) time-dependent Lagrangian system:

L(x, x, t) = a(t)1

2 x 2 b(t)f(x) , (7.4a)

F(x, x, t) = εa(t)d

a(t)P(x) , (7.4b)

in which f is the ﬁeld s potential and where

a(t) = exp( R t 0 ν(s)ds) , and b(t) = a(t)η(t) , (7.5)

Proof Assume P is conservative and let f denote its potential. Then the Euler-Lagrange equation for (7.4) is

a(t) x + a (t) x + b(t)P(x) = εa(t)d

a(t)P(x) . (7.6)

Dividing by a(t), and taking into account that, from (7.5) we have that

ν(t) = a (t)

a(t) and η(t) = b(t)

a(t) , (7.7)

we obtain (7.3).

Remark 5 P being conservative is not a necessary condition for (7.3) to be derived from the Lagrange-d Alembert principle. In order to be variational, Equation (7.6) requires a Lagrangian of the form

L(x, x, t) = a(t)1

2 x2 + c(x, t), x + d(x, t)

for some unknown functions c, d, which implies

t + b(t)P(x) .

A vector ﬁeld P satisfying this last relation need not be conservative.

C.M. Campos, A. Mahillo and D. Mart ın de Diego

Next, the result that links the previous continuous equation (7.3) with PHB/CM and NAG, showing, in particular, that NAG is a forced version of PHB/CM, between which the transition is immediate.

Theorem 6 Given a real valued function f : Rn R and a vector ﬁeld P : Rn Rn, consider the time-dependent discrete Lagrangian and forces

Lk d(z0, z1) = ak 1

2 z1 z0 2 b k f(z0) b+ k+1f(z1) , (7.8a)

(F k d ) (z0, z1) = ak 1

ak (b k + b+ k )P(z0) , and (7.8b)

(F k d )+(z0, z1) = (b k + b+ k )P(z0) . (7.8c)

where {ak}k 0, {b k }k 0, {b+ k }k 0, are arbitrary sequences of real numbers. If f is regular enough, so that P = f, and ak is never null, then the free and forced equations of motion for Lk d and (Lk d, (F k d ) , (F k d )+) are, respectively, equivalent to the following recursive schemes

yk+1 = xk ηk P(xk) , yk+1 = xk ηk P( xk) , (7.9a)

xk+1 = yk+1 + µk(xk xk 1) , xk+1 = yk+1 + µk( yk+1 yk) , (7.9b)

µk+1 = ak ak+1 and ηk = b k + b+ k ak , (7.10)

for k 0. Conversely, given a vector ﬁeld P : Rn Rn and two arbitrary sequences of real numbers {µk+1}k 0 and {ηk}k 0, consider the sequences of pairs of points given in equations (7.9a)- (7.9b). If P is conservative and µk+1 is never null, then both schemes are variational. Moreover, they are equivalent to the equations of motions for the free and forced timedependent discrete Lagrangian systems given in (7.8a)- (7.8c), for which f is the ﬁeld s potential and

a0 = 1, ak+1 = ak/µk+1, k 0 , b k = 1

2akηk, k 0 . (7.11)

Proof Partial diﬀerentiation of the Lagrangian gives

D1Lk d(z0, z1) = ak z0 b k f(z0) ,

D2Lk d(z0, z1) = ak z0 b+ k+1 f(z1) ,

from where it is readily seen that the DEL equations with forces (6.2) are

ak+1 z1 + ak z0 (b k+1 + b+ k+1) f(z1) =

= ak ak+1 (b k+1 + b+ k+1) f(z1) (b k + b+ k ) f(z0) ,

where the RHS is null for the non-forced DEL equations. Dividing by ak+1 and using the relations (7.10), we get

z1 µk+1 z0 + ηk+1 f(z1) = µk+1 (ηk+1 f(z1) ηk f(z0)) , (7.12)

Discrete Variational Calculus for Accelerated Optimization

where, again, the right hand side is null for the non-forced case. Replacing zi with xk+i in the non-forced case, and with xk+i in the forced one, taking into account that P = f, and using the equations in (7.9a), we obtain those in (7.9b). The converse is immediate.

Now remarks are in order that will summarize some points that have been mentioned earlier and underline others that haven not been yet.

Remark 7 (One-to-one correspondence) Note that both equations in (7.9a) are formally the same, whereas in (7.9b) there is a diﬀerence in the last term: While PHB/CM uses x s, NAG considers y s. This is a slight change that nonetheless deﬁnes diﬀerent schemes and in which the forcing term is hidden. This result not only shows that NAG is a forced version of PHB/CM, but that the schemes are in a natural bijective correspondence.

Remark 8 (Strategies) A speciﬁc method is deﬁned by a strategy: a pair of coeﬃcients (µ, η). This work and especially Theorem 6 focus on variational strategies, those pairs µ, η: N R that can be derived variationally from a time-dependent Lagrangian of the form (3.2). Being the original strategy of Polyak s method constant, it falls within this class of variational momentum-descent methods, whereas the original method by Nesterov is non-variational and belongs to a broader class of generalized momentum-descent methods where the coeﬃcients might depend on the objective function itself (confer with Polyak, 1964; Nesterov, 1983; see also Eq. 8.12).

Remark 9 (Initial conditions) Usually the initial condition, (x0, v0) or (x0, p0) in phase space, or (x0, x1) in conﬁguration space, is crucial for the proper simulation of the dynamical system. Here, however, the dynamics are a tool and generally an initial condition so that x1 = x0 or y0 = x0, where x0 = x0 is close to the minimum, will suﬃce, which corresponds to sticking the ball to the bowl s wall and leave it roll.

Remark 10 (Natural trajectory) From the schemes deﬁnitions, the sequences {xk} k=0 and { xk} k=0 are the natural (on track) dynamical trajectories towards the minimum of f, while {yk} k=1 and { yk} k=1 are oﬀroad marks that limit these trajectories like slalom ﬂags. The latter are, however, asymptotically close to the former and, hence, to the minimum.

Remark 11 (Discrete ﬂow) If one wants to compare the trajectories obtained from both methods, perhaps Equations (7.9) are better suited. If one is solely interested in a simple implementation to compute the minimum, then Equations (2.3) and (7.1b) are a good alternative since they can easily be rewritten to give discrete ﬂow updates in the form momentum ﬁrst, then position as in Sutskever et al. (2013):

xk = µk 1 xk 1 ηk f(xk) , yk = µk 1 yk 1 ηk f( yk + µk 1 yk 1) , (7.13a)

xk+1 = xk + xk , yk+1 = yk + yk , (7.13b)

where both methods should be initialized with x0 = y0 and x 1 = y 1 = 0. Both approaches, Equations (7.9) and Equations (7.13), have been considered for the simulations of Section 8.

C.M. Campos, A. Mahillo and D. Mart ın de Diego

Remark 12 (Force approximation) The action induced by the discrete forces in (7.8) is a second order approximation to the action induced by the continuous force (7.4b). Indeed, given continuous coeﬃcients a(t), b(t), deﬁne ak = a(kh)/h2 and b k so that b k +b+ k = b(kh), and similarly for a continuous path x(t) and a variation δx(t) of it. Then

k=0 F k d (xk, xk+1) (δxk, δxk+1) =

ak (b k + b+ k ) f(xk) δxk + h

k=1 (b k 1 + b+ k 1) f(xk 1) δxk

a(t) b(t) f(x(t)) + b(t h) f(x(t h)) δx(t) dt + O(h2)

h 2 h a(t h)d

a(t) f(x(t)) δx(t) dt + O(h2)

a(t) f(x(t)) δx(t) dt + O(h2)

where we have considered a mid point quadrature rule to establish the second equality, the rest follows.

Remark 13 (Force evanescence) As remarked, NAG can be viewed as an approximation to a forced continuous Lagrangian system, where the force is proportional to the time step that is used a posteriori for the discretization. Although its contribution is non-null along the whole trajectory, it however vanishes not only locally, when h decreases, but also asymptotically in time, when t or k increase.

8. Simulations

Numerical experiments are performed considering diﬀerent elements, namely:

The time-dependent coeﬃcients that appear in the Lagrangian, a(t) and b(t), for the simple case (3.2);

The discretization scheme used to approximate the Lagrangian action; and obviously,

The objective function to be minimized, f(x).

8.1 Lagrangian Coeﬃcients

We consider three diﬀerent Lagrangians or, more precisely, three diﬀerent pairs of timedependent coeﬃcients (a(t), b(t)) given, as in (3.2), by time-dependent exponents triples (α(t), β(t), γ(t)) satisfying the ideal scaling conditions (3.1), which ensure that argmin f is an attractor of the underlying dynamical system.

Discrete Variational Calculus for Accelerated Optimization

8.1.1 Potential Dilation

The ﬁrst Lagrangian under consideration is a potential dilation of a mechanical Lagrangian, namely,

L(x, x, t) = tn 1

2 x 2 f(x) , (8.1)

whose Euler-Lagrange equation is

t x + f(x) = 0 , (8.2)

which is the equation considered in Su et al. (2016). A naive choice of exponents from which to obtain the time-dependent coeﬃcients a(t) = b(t) = tn is α(t) = 0, β(t) = 0, γ(t) = n log t. (8.3)

However, they do not satisfy the ideal scaling conditions. A triple that does meet this requirement is

α(t) = log p log t, β(t) = 2(log t log p), γ(t) = p log t + log p, (8.4)

where p = n 1, but only when n 3, thus, showing the magic n = 3 in (8.2) of NAG (Su et al., 2016). It is, in fact, the only choice satisfying the scaling conditions and gives an optimal rate of convergence not slower than O(1/t2).

8.1.2 Modified Potential Dilation

The Lagrangian

L(x, x, t) = tn 1

2 x 2 Dtn 3f(x) , (8.5)

whose Euler-Lagrange equation is

t x + Dtn 3 f(x) = 0 ,

is the Lagrangian considered by Wibisono et al. (2016) for the metric case and corresponds to the exponents

α(t) = log p log t, β(t) = p log t + log C, γ(t) = p log t + log p, (8.6)

where C = D/p2 and p = n 1. They satisfy the ideal scaling conditions for n > 1, giving a rate of convergence not slower than O(1/tn 1) (confer with Wibisono et al., 2016, in particular for speciﬁc details on the constant C).

8.1.3 Exponential Dilation

Finally, the exponentially dilated Lagrangian

L(x, x, t) = eλt 1

2 x 2 f(x) , (8.7)

C.M. Campos, A. Mahillo and D. Mart ın de Diego

whose equation of motion is precisely the one of a mechanical system with linear damping,

x + λ x + f(x) = 0 , (8.8)

corresponds to the choice exponents

α(t) = log λ, β(t) = 2 log λ, γ(t) = λt + log λ. (8.9)

We note here three points. First, it is the unique choice for a(t) = b(t) = eλt that satisﬁes the ideal scaling conditions. Second, in principle this choice gives a theoretical convergence rate of O(1), fortunately it can be reduced down to o(1) (Attouch et al., 2021, Th. 2.5). Third and more notably, although the Lagrangian (8.7) is explicitly time-dependent, the Euler-Lagrange equation (8.8) is autonomous and it introduces a linear time-independent damping term in the equation.

8.2 Discretizations

Using the trapezoidal rule to approximate the action of the above Lagrangians, we retrieve common NAG coeﬃcients that appear in the literature. With some abuse of notation, we write in general

a(k) := a(tk) + a(tk+1)

2h2 , and b (k) := b(tk)

where tk = kh.

8.2.1 Bounded Coefficients from the Potential Dilation

For the continuous time-dependent coeﬃcients a(t) = b(t) = tn, with n = p + 1, the trapezoidal rule yields the discrete time-dependent coeﬃcients

a(k) = tn k + tn k+1 2h2 , b (k) = tn k 2 ,

from which we obtain the coeﬃcients

µ(k) = kn + (k 1)n

kn + (k + 1)n , η(k) = 2kn

kn + (k + 1)n h2. (8.10)

Both coeﬃcients are strictly increasing and bounded above by 1 and h2, respectively. To avoid integer overﬂow and slightly reduce computational cost in ﬁnal implementations, these expressions can be simpliﬁed to

µ(k) = 2k n

2k + n + o(1), η(k) = 2k 2k + n + o(1) h2.

For the particular case n = 3, that is p = 2, we get µ(k + 3/2) = k k+3 + O(1/k3) as in (2.7).

Discrete Variational Calculus for Accelerated Optimization

8.2.2 Unbounded Coefficients from the Modified Potential Dilation

In this case, the continuous time-dependent coeﬃcients are a(t) = tn and b(t) = Dt2n 3, as in Wibisono et al. (2016), which yield

µ(k) = kn + (k 1)n

kn + (k + 1)n , η(k) = D 2kn

kn + (k + 1)n tn 3 k h2. (8.11)

As earlier,

µ(k) = 2k n

2k + n + o(1), η(k) = D 2k 2k + n + o(1) tn 3 k h2.

This time, η is unbounded when n 4, and bounded above by Dh2 when n = 3. Similarly, Betancourt et al. (2018) suggest a palindromic split Hamiltonian method with 7 stages that can be recovered from the proposed perspective considering the discrete Lagrangian

Lk d(xk, xk+1) = tn k+1/2

2 Dtn 3 k+1/2 f(xk) + f(xk+1)

Note that it is not obtained by a trapezoidal approximation of the Lagrangian (8.5), but it is still a discretization of it for which

µ(k) = 2k 1

n , η(k) = D(2k + 1)2n 3 + (2k 1)2n 3

2(2k + 1)2n 3 tn 3 k+1/2h2.

η(k) = D 2k 2k + 2n 3 + o(1) tn 3 k+1/2h2 .

8.2.3 Constant Coefficients from the Exponential Dilation

Taking a(t) = b(t) = eλt yields

µ(k) = 1 + e λh

1 + e+λh , η(k) = 2 1 + eλh h2. (8.12)

For λ = 1 and h = 0.1024, µ 0.9 and η 0.01, values that often appear in the literature, as in Sutskever et al. (2013). In general, any pair of constant coeﬃcients µ, η > 0 can be obtained from values λ R, h > 0.

8.2.4 The Actual Method by Wibisono, Wilson, and Jordan

Strictly speaking, the method by Wibisono et al. (2016) is based on NAG, but it diﬀers from it. Although they consider the previous Lagrangian (8.5) and experiment with it in Betancourt et al. (2018); Jordan (2018), what is proposed in Wibisono et al. (2016) is the

C.M. Campos, A. Mahillo and D. Mart ın de Diego

3-phase scheme

xk+1 = p k + pzk + k k + pyk , (8.13a)

yk = argmin y

fp 1(y; xk) + N

php y xk p , (8.13b)

zk = zk 1 Dk

ptp 2 k h2 f(yk) , (8.13c)

with z0 = y0 = x0 and where p = n 1, fp 1(y; xk) is the (p 1)-th Taylor expansion of f about xk and N is a constant related to D and p that ensures convergence. Note that for n = 3 (p = 2), the optimization problem (8.13b) is explicit and reduces to

N h2 f(xk) , (8.14)

but it is implicit in general, which increases the cost of the method, aside of having to compute the Hessian and higher derivatives of f, either explicitly or by autodiﬀerentiation. In Section 8.4, we will refer to method (8.13) by WWJ, after the authors, and consider it as a (modiﬁed) NAG method.

8.3 Objective Functions

Several objective functions are considered: a highly dimensional quadratic function with tridiagonal matrix representation, a generalized Rosenbrock function, yet another test function for momentum-descent methods, and one that combines a generalized logistic function with a mean loss and that is often used in Neural Networks.

8.3.1 Highly Dimensional Quadratic Function

In Betancourt et al. (2018), they consider the quadratic map on Rn

2 x, Σ 1x , (8.15)

where Σ is the matrix whose elements are Σij = ρ|i j|, with ρ = 0.9 and n = 50, and whose inverse Σ 1 is the tridiagonal matrix

Σ 1 = 1 1 ρ2

1 ρ ρ 1 + ρ2 ρ ... ... ... ρ 1 + ρ2 ρ ρ 1

8.3.2 Generalized Rosenbrock Function

The Rosenbrock function (Rosenbrock, 1960/61), whose expression is

f(x, y) = (a x)2 + b(y x2)2 ,

Discrete Variational Calculus for Accelerated Optimization

with a, b > 0, represents a banana-shaped ﬂat-valley surrounded by steep walls with a unique critical point and global minimum at a, a2, whose search by numerical means is diﬃcult, hence its use to test and benchmark optimizers. We consider here its generalization to higher dimensions, n > 2, namely

(1 xi)2 + 100(xi+1 x2 i )2 . (8.16)

As in the two-dimensional case, the function has a global minimum at (1, 1, . . . , 1) but, unlike it, also has a local minimum close to ( 1, 1, . . . , 1) (the higher is the dimension, the closer it gets).

8.3.3 Yet Another Test Function (YATF)

Another example that might hinder the search of a minimum is the following

f(x, y) = sin(2x2 y2 + 3) cos(x + 1 exp(2y)) , (8.17)

which has a local minimum close to (0.32, 1.60).

8.3.4 Multinomial Logistic Regression

In artiﬁcial neural networks (ANN), the activation function of a node (or neuron) deﬁnes the output of that node given an input (or set of inputs). Supervised learning is a learning paradigm of the training process of the ANN. Diﬀerent choices are available for the activation function and the training process, which is a subject that might be in some cases controversial within the ANN community, but that is not the object of this work. We consider a shallow neural network with a single layer for classiﬁcation in n classes or targets given m features. Upon reaching the neurons, the inputs (features) are weighted and possibly biased, which in fact is the model to be determined through the learning process,

w Rm n, b Rn : x Rm 7 w x + b Rn .

The chosen activation function is a classiﬁer, a generalization of the logistic function, the multinomial logistic function (a.k.a. softmax),

σ(z) = ezj P k ezk

As loss function, we choose the cross-entropy or log-loss,

H(ˆy, y) = y, log ˆy ,

where ˆy Rn is the computed output, y {0, 1}n with P j yj = 1 is the expected output (class), and log is applied componentwise. We take as objective function the average loss over a training dataset D of length |D|, namely,

f(w, b) = 1

(x,y) D y, (log σ)(w x + b) . (8.18)

It is important to note here that f is the sum of convex functions and therefore itself convex, however f need not have a global minimum but an asymptotic inﬁmum: 0 as, for instance, exp(t) opposed to cosh(t). This depends on the data fed for the training process.

C.M. Campos, A. Mahillo and D. Mart ın de Diego

8.4 Experiments

Many experiments have been performed, we optimized each test function using the aforementioned methods (and others) set with diﬀerent parameters and initial guesses, from which we present here a small but suggestive sample. In summary, we minimize the test functions (8.16) and (8.17), the quadratic function (8.15), and the loss (8.18) with the PHB/CM and NAG methods (7.9) given by the constant, bounded and unbounded coeﬃcients (8.12), (8.10), and (8.11), respectively, and WWJ s method (8.13), the latter three for n = 3, 4, for a total of seven methods. Each objective is minimized using its own initial guess, time-step, and number of epochs, which are ﬁxed for the seven simulations. We set Rosenbrock s test function with 30 dimensions and seek for its global minimum at (1, . . . , 1) from (0, . . . , 0) at a pace of h = 0.01 for 20000 epochs. In the case of the YATF, there is a local minimum near (0.32, 1.60) which we seek from ( 0.25, 0.35) with time-step h = 0.01 for 3800 epochs. A random point on a sphere of radius 50 is the initial guess for the 50-dimensional quadratic function, whose global minimum, sought for 10000 epochs with h = 0.1, is clearly at the origin. For the convergence tests, the log-loss function is fed with 10 arbitrary samples, with 4 features and 3 targets each, which deﬁnes an optimization problem of dimension 10 with no global minimum, hence we will seek for the inﬁmum for 12000 epochs at a pace of h 0.945 from a random weight distribution with null biases. For an actual ANN test, the log-loss function is fed with the widely used Iris dataset (Dua and Graﬀ, 2017; see below for more details). The methods have been implemented in Julia v1.8.2 (Bezanson et al., 2017), using solely as nonnative libraries NLsolve.jl (Mogensen et al., 2020) to solve the side problem (8.13c), Plots.jl (Breloﬀ, 2021) and PGFPlots X.jl for plotting, and Pluto.jl for an interactive notebook. Methods, functions, and simulations are available online (Campos, 2022a,b). All plots except Figures 1 and 7 represent the objective function residual against the epoch in log-log scale. In Figure 1, we can see how the trajectories computed by PHB/CM and NAG for the YATF pass by its minimum about (0.32, 1.60). They start at the bottom left and go upward until they realize they have overreached the minimum and, about (0.7, 1.9), they back up. Although not shown in the ﬁgure, this motion is repeated successively, but each time they back up, they do so sooner and closer to the minimum, like a heavy ball in a bowl. The trajectories shown in the ﬁgure correspond to the iterations 765 to 1450 of the methods. Similarly we observe in Figure 2 that NAG oscillates less than PHB/CM. Each downward peak corresponds to the trajectory passing by the minimum of the YATF. In fact, the trajectories shown in Fig. 1 correspond to the ﬁrst peak in blue of Fig. 2. These oscillations where the trajectories pass by the minimum of the objective function back and forth, and the fact that NAG oscillates less than PHB/CM slightly outperforming it are common aspects of all the simulations performed, reason why in the remaining ﬁgures we focus solely on NAG methods, where several aspects are worth noting. Figures 3-6 compare the seven methods enumerated above, one ﬁgure for each objective function presented in Section 8.3, and following the same order. Each ﬁgure is made up of two plots: the top one is composed of methods with n = 3, the bottom one of methods with n = 4, whereas both include the NAG method with constant coeﬃcients for proper reference.

Discrete Variational Calculus for Accelerated Optimization

Yet Another Test Function

0.0 0.2 0.4 0.6 0.8

PHB/bounded (n = 3) NAG/bounded (n = 3)

Figure 1: Trajectory slices nearby the local minimum of the YATF using PHB/CM (pale) and NAG (strong) with the bounded coeﬃcients from the Lagrangian s polynomial dilation with n = 3. A nonlinear grayscale gradient indicates the minimum s location in black.

Yet Another Test Function

103.0 103.3

PHB/bounded (n = 3) NAG/bounded (n = 3) PHB/bounded (n = 4) NAG/bounded (n = 4) PHB/constant NAG/constant

Figure 2: YATF residual values along PHB/CM (pale) and NAG (strong) trajectories for coeﬃcients from the polynomially dilated Lagrangian with n = 3 (blue) and n = 4 (violet), and from the exponentially dilated Lagrangian (green).

C.M. Campos, A. Mahillo and D. Mart ın de Diego

Quadratic function (dimension 50)

NAG/constant NAG/bounded (n = 3) NAG/unbounded (n = 3) WWJ (n = 3)

101 102 103 104

NAG/constant NAG/bounded (n = 4) NAG/unbounded (n = 4) WWJ (n = 4)

Figure 3: Quadratic test function values along trajectories computed with NAG for constant (green), bounded (blue), and unbounded (violet) coeﬃcients, and WWJ (red); the latter three set with n = 3 (top) and n = 4 (bottom).

Discrete Variational Calculus for Accelerated Optimization

Rosenbrock s test function (dimension 30)

NAG/constant NAG/bounded (n = 3) NAG/unbounded (n = 3) WWJ (n = 3)

103.0 103.5 104.0

NAG/constant NAG/bounded (n = 4) NAG/unbounded (n = 4) WWJ (n = 4)

Figure 4: Rosenbrock s test function values along trajectories computed with NAG for constant (green), bounded (blue), and unbounded (violet) coeﬃcients, and WWJ (red); the latter three set with n = 3 (top) and n = 4 (bottom).

C.M. Campos, A. Mahillo and D. Mart ın de Diego

Yet Another Test Function (dimension 2)

NAG/constant NAG/bounded (n = 3) NAG/unbounded (n = 3) WWJ (n = 3)

103.0 103.3

NAG/constant NAG/bounded (n = 4) NAG/unbounded (n = 4) WWJ (n = 4)

Figure 5: YATF residual values along trajectories computed with NAG for constant (green), bounded (blue), and unbounded (violet) coeﬃcients, and WWJ (red); the latter three set with n = 3 (top) and n = 4 (bottom).

Discrete Variational Calculus for Accelerated Optimization

Multinomial logistic loss (dimension 10)

NAG/constant NAG/bounded (n = 3) NAG/unbounded (n = 3) WWJ (n = 3)

NAG/constant NAG/bounded (n = 4) NAG/unbounded (n = 4) WWJ (n = 4)

Figure 6: Logistic loss values along trajectories computed with NAG for constant (green), bounded (blue), and unbounded (violet) coeﬃcients, and WWJ (red); the latter three set with n = 3 (top) and n = 4 (bottom).

C.M. Campos, A. Mahillo and D. Mart ın de Diego

Observe ﬁrst that, the three methods with n = 3 behave similarly for the four objective functions. This is not surprising since the Lagrangians considered reduce to a potentially dilated mechanical Lagrangian in which the objective function, in the cases of unbounded NAG and WWJ, is coupled with a further constant coeﬃcient, D = 1/4 when n = 3, which slightly delays convergence when compared to bounded NAG. This coeﬃcient needs to be small in order to ensure convergence according to Wibisono et al. (2016). When n = 4, these three methods clearly show their diﬀerences. Most importantly, unbounded NAG (8.11) blows up in Figure 3 (highly dimensional quadratic function). This is due to the increasing and unbounded learning rate coeﬃcient η that is ignored when the gradient is almost null but brakes the method when it steps at a point where the gradient is not negligible. Therefore we infer that unbounded NAG is not suited for long runs when n 4 and should be restarted to avoid blow-up. Surprisingly WWJ is not aﬀected by this problem even though the same coeﬃcient appears in its deﬁnition, Eq. (8.13). The method is numerically stable thanks to the fast convergence of the trajectory towards the minimum. In fact, WWJ really shows up when n = 4, where it clearly outperforms its NAG counterpart, unbounded NAG, as well as bounded NAG, but may stall when the simulation has advanced, Fig. 3. Besides, there is a trick into its performance, when n 4 the method must solve a side optimization problem, (8.13c), which is solved here using the NLsolve.jl library, something that it somewhat redundant and suﬀers from the curse of dimensionality: it is around 10 times slower than the other methods for the low dimensional case (YATF) and more than 100 times slower in the high dimensional ones (Rosenbrock and quadratic function). Nonetheless, this disadvantage might decrease or even disappear when the Bregman divergence is not the usual Euclidean norm (something worth exploring), since the associated methods are unlikely to be explicit. With regards to constant NAG, the results are somehow paradoxical. Recall that the exponentially dilated Lagrangian is the only case in which a good convergence rate is not ensured, Eq. (8.9), however it is the method that in general performs the best, Figures 3-5. Furthermore, although within the Machine Learning community it is perhaps the most popular method among the analyzed, it is the one that has performed the worst in the purely ML scenario, Figure 6. Fig. 6 is where the theoretical convergence rates are better seen. In this ﬁgure, the values obtained by unbounded NAG and WWJ coalesce for both, n = 3 and n = 4. In addition to the previous optimization problems, we apply the analyzed methods to a simple classiﬁcation problem using Firsher s Iris dataset (Dua and Graﬀ, 2017). The dataset consists of 150 entries (or samples) with 7 ﬁelds: 4 features and 3 targets. Therefore we consider a shallow neural network with a single layer of 3 neurons with 4 inputs. The input data (the features) are weighted and biases upon entry into the network, the computed output is compared with the expected output (targets) using the multinomial logistic function. This process is summarized by the objective function (8.18). For simplicity, we only consider four methods: constant, bounded, and unbounded NAG, and WWJ, the latter three with n = 3. We train the network (or optimize the objective function) at an increasing number of epochs (from 25 up to 250) for 1000 runs. At each run and for each number of epochs, the samples are randomly split in two: 100 samples for training and 50 samples for testing. At the same time, the initial weights of the network are drawn from a normal distribution with null mean and standard deviation σ = 10. This

Discrete Variational Calculus for Accelerated Optimization

Multinomial logistic loss

25 50 75 100 150 250

accuracy (%)

NAG/constant NAG/bounded (n = 4) NAG/unbounded (n = 4) WWJ (n = 4)

Figure 7: Average accuracies for Iris dataset achieved by NAG with constant (green), bounded (blue), and unbounded (violet) coeﬃcients, and WWJ (red); the latter three set with n = 4.

initial guess is kept for the four methods. Then the network is trained and the accuracy of network with the computed weights is measured (percentage of correct matches for the testing data). Figure 7 shows the average accuracy along the 1000 runs for each method versus the number of epochs. This ﬁgure is clearly consistent with what is obtained in Fig. 6.

9. Conclusions and Future work

In this paper, we have studied the relation between accelerated optimization and discrete variational calculus, proving a symplecticity property for the continuous diﬀerential equation in Theorem 1 which is also preserved by the corresponding discrete variational methods. We have derived Classical Momentum (CM) or Polyak s Heavy Ball (PHB), and Nesterov s Accelerated Gradient methods (NAG) from the discrete Hamiltonian and Lagrange-d Alembert principles in Theorem 6 adding forces in the picture and proven a one-to-one correspondence. Several simulations were performed showing the applicability of our techniques to optimization. Among all the methods, NAG with the constant coeﬃcients from the exponentially dilated Lagrangian, aside from being the simplest choice, it also seems to be the best one for general purpose applications according to the simulations. In a future paper, we will study whether the proposed optimization algorithm generated by using Lagrange-d Alembert principle achieves the accelerated rate for minimizing both strongly convex functions and convex functions (Wibisono et al., 2016; Shi et al., 2019). The

C.M. Campos, A. Mahillo and D. Mart ın de Diego

main idea is to discretize, using discrete variational calculus, the continuous Euler-Lagrange equations (with or without forces) while maintaining their convergence rates (see Vaquero et al., 2021, for recent advances in this topic). Moreover, the extension to problems of accelerated optimization in manifolds will be given using discrete variational calculus and well-know optimization techniques with retraction maps (Absil et al., 2008; see also Lee et al., 2021).

Acknowledgments

D. Mart ın de Diego acknowledges ﬁnancial support from the Spanish Ministry of Science and Innovation, under grants PID2019-106715GB-C21, from the Spanish National Research Council (CSIC), through the Ayuda extraordinaria a Centros de Excelencia Severo Ochoa R&D (CEX2019-000904-S). A. Mahillo would like to thank CSIC for its ﬁnancial support through a JAE Intro scholarship.

Ralph Abraham and Jerrold E. Marsden. Foundations of Mechanics. AMS Chelsea Publishing, Redwood City, CA, 2 edition, 1978.

P.-A. Absil, R. Mahony, and R. Sepulchre. Optimization algorithms on matrix manifolds. Princeton University Press, Princeton, NJ, 2008. ISBN 978-0-691-13298-3. doi: 10.1515/ 9781400830244. URL https://doi.org/10.1515/9781400830244. With a foreword by Paul Van Dooren.

Hedy Attouch, Zaki Chbani, and Hassan Riahi. Fast convex optimization via time scaling of damped inertial gradient dynamics. Pure Appl. Funct. Anal., 6(6):1081 1117, 2021. ISSN 2189-3756.

Michael Betancourt, Michael I. Jordan, and Ashia C. Wilson. On symplectic optimization. ar Xiv, 1802.03653, 2018.

JeﬀBezanson, Alan Edelman, Stefan Karpinski, and Viral B Shah. Julia: A fresh approach to numerical computing. SIAM review, 59(1):65 98, 2017. URL https://doi.org/10. 1137/141000671.

Sergio Blanes and Fernando Casas. A concise introduction to geometric numerical integration. Monographs and Research Notes in Mathematics. CRC Press, Boca Raton, FL, 2016. ISBN 978-1-4822-6342-8.

L. M. Br egman. A relaxation method of ﬁnding a common point of convex sets and its application to the solution of problems in convex programming. ˇZ. Vyˇcisl. Mat i Mat. Fiz., 7:620 631, 1967. ISSN 0044-4669.

Tom Breloﬀ. Plots.jl, May 2021. URL https://doi.org/10.5281/zenodo.4776893.

C edric M. Campos. High order variational integrators: a polynomial approach. In Advances in diﬀerential equations and applications, volume 4 of SEMA SIMAI Springer Ser., pages

Discrete Variational Calculus for Accelerated Optimization

249 258. Springer, Cham, 2014. doi: 10.1007/978-3-319-06953-1\ 24. URL https:// doi.org/10.1007/978-3-319-06953-1_24.

C edric M. Campos and J. M. Sanz-Serna. Palindromic 3-stage splitting integrators, a roadmap. J. Comput. Phys., 346:340 355, 2017. ISSN 0021-9991. doi: 10.1016/j.jcp. 2017.06.006. URL https://doi.org/10.1016/j.jcp.2017.06.006.

C edric M. Campos. Personal website, 2022a. URL https://cmcampos.xyz.

C edric M. Campos. Research repository, 2022b. URL https://github.com/cmcampos-xyz.

Beniamino Cappelletti-Montano, Antonio De Nicola, and Ivan Yudin. A survey on cosymplectic geometry. Rev. Math. Phys., 25(10):1343002, 55, 2013. ISSN 0129-055X. doi: 10.1142/S0129055X13430022. URL https://doi.org/10.1142/S0129055X13430022.

Agustin-Louis Cauchy. M ethode g en erale pour la r esolution des syst emes d equations simultan ees. C. R. Acad. Sci., 25:536 -538, 1847. URL https://gallica.bnf.fr/ark: /12148/bpt6k2982c/f540.item.

Elena Celledoni, Matthias J. Ehrhardt, Etmann Christian, Robert I Mc Lachlan, Brynjulf Owren, Carola-Bibiane Sch onlieb, and Sherry Ferdia. Structure preserving deep learning. ar Xiv, 2006.03364, 2020. URL https://arxiv.org/abs/2006.03364.

Manuel de Le on and Paulo R. Rodrigues. Methods of Diﬀerential Geometry in Analytical Mechanics, volume 158. Elsevier, Amsterdam, 1987. ISBN 0-08-087269-7.

Dheeru Dua and Casey Graﬀ. UCI machine learning repository, 2017. URL http:// archive.ics.uci.edu/ml.

Valentin Duruisseaux, Jeremy Schmitt, and Melvin Leok. Adaptive Hamiltonian variational integrators and applications to symplectic accelerated optimization. SIAM J. Sci. Comput., 43(4):A2949 A2980, 2021. ISSN 1064-8275. doi: 10.1137/20M1383835. URL https://doi.org/10.1137/20M1383835.

E. Hairer, C. Lubich, and G. Wanner. Geometric numerical integration, volume 31 of Springer Series in Computational Mathematics. Springer, Heidelberg, 2010. ISBN 978-3642-05157-9. Structure-preserving algorithms for ordinary diﬀerential equations, Reprint of the second (2006) edition.

P. Hartman. Ordinary diﬀerential equations, volume 38 of Classics in Applied Mathematics. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, 2002. ISBN 0-89871-510-5. doi: 10.1137/1.9780898719222. URL http://dx.doi.org/10.1137/1. 9780898719222. Corrected reprint of the second (1982) edition [Birkh auser, Boston, MA; MR0658490 (83e:34002)], With a foreword by Peter Bates.

Michael I. Jordan. Dynamical symplectic and stochastic perspectives on gradient-based optimization. In Proceedings of the International Congress of Mathematicians Rio de Janeiro 2018. Vol. I. Plenary lectures, pages 523 549. World Sci. Publ., Hackensack, NJ, 2018.

C.M. Campos, A. Mahillo and D. Mart ın de Diego

Taeyoung Lee, Molei Tao, and Melvin Leok. Variational symplectic accelerated optimization on lie groups. ar Xiv, 2103.14166, 2021. URL https://arxiv.org/abs/2103.14166.

Paulette Libermann. Sur les automorphismes inﬁnit esimaux des structures symplectiques et des structures de contact. In Colloque G eom. Diﬀ. Globale (Bruxelles, 1958), pages 37 59. Centre Belge Rech. Math., Louvain, 1959.

Paulette Libermann and Charles-Michel Marle. Symplectic geometry and analytical mechanics, volume 35 of Mathematics and its Applications. D. Reidel Publishing Co., Dordrecht, 1987. ISBN 90-277-2438-5. doi: 10.1007/978-94-009-3807-6. URL https: //doi.org/10.1007/978-94-009-3807-6. Translated from the French by Bertram Eugene Schwarzbach.

J. C. Marrero, D. Mart ın de Diego, and E. Mart ınez. On the exact discrete lagrangian function for variational integrators: theory and applications, 2016. URL https://arxiv. org/abs/1608.01586.

J. E. Marsden and M. West. Discrete mechanics and variational integrators. Acta Numer., 10:357 514, 2001. ISSN 0962-4929. doi: 10.1017/S096249290100006X. URL http://dx. doi.org/10.1017/S096249290100006X.

H akon Marthinsen and Brynjulf Owren. Geometric integration of non-autonomous linear Hamiltonian problems. Adv. Comput. Math., 42(2):313 332, 2016. ISSN 1019-7168. doi: 10.1007/s10444-015-9425-0. URL https://doi.org/10.1007/s10444-015-9425-0.

Patrick Kofod Mogensen, Kristoﬀer Carlsson, S ebastien Villemot, Spencer Lyon, Matthieu Gomez, Christopher Rackauckas, Tim Holy, David Widmann, Tony Kelman, Daniel Karrasch, Antoine Levitt, Asbjørn Nilsen Riseth, Carlo Lucibello, Changhyun Kwon, David Barton, Julia Tag Bot, Mateusz Baran, Miles Lubin, Sarthak Choudhury, Simon Byrne, Simon Christ, Takafumi Arakaki, Troels Arnfred Bojesen, benneti, and Miguel Raz Guzm an Macedo. Julianlsolvers/nlsolve.jl: v4.5.1, December 2020. URL https://doi.org/10.5281/zenodo.4404703.

Yu. E. Nesterov. A method for solving the convex programming problem with convergence rate O(1/k2). Dokl. Akad. Nauk SSSR, 269(3):543 547, 1983. ISSN 0002-3264.

Yurii Nesterov. Lectures on convex optimization, volume 137 of Springer Optimization and Its Applications. Springer, Cham, 2018. ISBN 978-3-319-91577-7; 9783-319-91578-4. doi: 10.1007/978-3-319-91578-4. URL https://doi.org/10.1007/ 978-3-319-91578-4. Second edition of [ MR2142598].

G. W. Patrick and C. Cuell. Error analysis of variational integrators of unconstrained Lagrangian systems. Numer. Math., 113(2):243 264, 2009. ISSN 0029-599X. doi: 10. 1007/s00211-009-0245-3. URL http://dx.doi.org/10.1007/s00211-009-0245-3.

Elijah Polak. Optimization, volume 124 of Applied Mathematical Sciences. Springer-Verlag, New York, 1997. ISBN 0-387-94971-2. doi: 10.1007/978-1-4612-0663-7. URL https: //doi.org/10.1007/978-1-4612-0663-7. Algorithms and consistent approximations.

Discrete Variational Calculus for Accelerated Optimization

Boris T. Polyak. Some methods of speeding up the convergence of iterative methods. ˇZ. Vyˇcisl. Mat i Mat. Fiz., 4:791 803, 1964. ISSN 0044-4669.

Boris T. Polyak. Introduction to optimization. Translations Series in Mathematics and Engineering. Optimization Software, Inc., Publications Division, New York, 1987. ISBN 0-911575-14-6. Translated from the Russian, With a foreword by Dimitri P. Bertsekas.

H. H. Rosenbrock. An automatic method for ﬁnding the greatest or least value of a function. Comput. J., 3:175 184, 1960/61. ISSN 0010-4620. doi: 10.1093/comjnl/3.3.175. URL https://doi.org/10.1093/comjnl/3.3.175.

J. M. Sanz-Serna and M. P. Calvo. Numerical Hamiltonian problems, volume 7 of Applied Mathematics and Mathematical Computation. Chapman & Hall, London, 1994. ISBN 0-412-54290-0.

B Shi, Du S. S., Jordan M.I., and Su J.U. Acceleration via symplectic discretization of high-resolution diﬀerential equations. ar Xiv, 2019. URL https://arxiv.org/pdf/1902. 03694.pdf.

Weijie Su, Stephen Boyd, and Emmanuel J. Cand es. A diﬀerential equation for modeling nesterov s accelerated gradient method: Theory and insights. Journal of Machine Learning Research, 17(153):1 43, 2016. URL http://jmlr.org/papers/v17/15-084.html.

Ilya Sutskever, James Martens, George Dahl, and Geoﬀrey Hinton. On the importance of initialization and momentum in deep learning. In Sanjoy Dasgupta and David Mc Allester, editors, Advances in Neural Information Processing Systems, volume 28 of Proceedings of Machine Learning Research, pages 1139 1147, Atlanta, Georgia, USA, 17 19 Jun 2013. PMLR. URL http://proceedings.mlr.press/v28/sutskever13.html.

M. Vaquero, Mestres P., and J: Cort es. Resource-aware discretization of accelerated optimization ﬂows. Preprint, 2021. URL http://carmenere.ucsd.edu/jorge/ publications/data/2020_Va Me Co-tac.pdf.

Andre Wibisono, Ashia C. Wilson, and Michael I. Jordan. A variational perspective on accelerated methods in optimization. Proc. Natl. Acad. Sci. USA, 113(47):E7351 E7358, 2016. ISSN 0027-8424. doi: 10.1073/pnas.1614734113. URL https://doi.org/10.1073/ pnas.1614734113.

Andre Yohannes Wibisono. Variational and Dynamical Perspectives On Learning and Optimization. Pro Quest LLC, Ann Arbor, MI, 2016. ISBN 978-1369-05764-5. Thesis (Ph.D.) University of California, Berkeley.