Journal of Machine Learning Research 26 (2025) 1-55. Submitted 10/23; Revised 11/24; Published 6/25.

Losing Momentum in Continuous-time Stochastic Optimisation

Kexin Jin (kexinj@math.princeton.edu), Department of Mathematics, Princeton University, Princeton, NJ 08544-1000, USA
Jonas Latz (jonas.latz@manchester.ac.uk), Department of Mathematics, The University of Manchester, Manchester, M13 9PL, United Kingdom
Chenguang Liu (C.Liu-13@tudelft.nl), Delft Institute of Applied Mathematics, Technische Universiteit Delft, 2628 Delft, The Netherlands
Alessandro Scagliotti (scag@ma.tum.de), CIT School, Technische Universität München, and Munich Center for Machine Learning (MCML), 85748 Garching bei München, Germany

Editor: Silvia Villa

Abstract

The training of modern machine learning models often consists in solving high-dimensional non-convex optimisation problems that are subject to large-scale data. In this context, momentum-based stochastic optimisation algorithms have become particularly widespread. The stochasticity arises from data subsampling, which reduces computational cost. Both momentum and stochasticity help the algorithm to converge globally. In this work, we propose and analyse a continuous-time model for stochastic gradient descent with momentum. This model is a piecewise-deterministic Markov process that represents the optimiser by an underdamped dynamical system and the data subsampling through a stochastic switching. We investigate longtime limits, the subsampling-to-no-subsampling limit, and the momentum-to-no-momentum limit. We are particularly interested in the case of reducing the momentum over time. Under convexity assumptions, we show convergence of our dynamical system to the global minimiser when reducing momentum over time and letting the subsampling rate go to infinity. We then propose a stable, symplectic discretisation scheme to construct an algorithm from our continuous-time dynamical system.
In experiments, we study our scheme in convex and non-convex test problems. Additionally, we train a convolutional neural network in an image classification problem. Our algorithm attains competitive results compared to stochastic gradient descent with momentum.

Keywords: stochastic optimisation, momentum-based optimisation, piecewise-deterministic Markov processes, stochastic stability, deep learning

©2025 Kexin Jin, Jonas Latz, Chenguang Liu, Alessandro Scagliotti. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v26/23-1396.html.

1. Introduction

Machine learning and artificial intelligence play a fundamental role in modern scientific research and modern life. In many instances, the underlying learning process consists of solving a high-dimensional non-convex optimisation problem with respect to large-scale data. These problems have been approached by practitioners and researchers using a large range of different algorithms. Methods are based on, e.g., stochastic approximation, statistical mechanics, and ideas from biological evolution. Many of these methods deviate from simply solving the optimisation problem: they are actually also used for regularisation or approximate uncertainty quantification. Unfortunately, many successfully employed methods are theoretically hardly understood. In this article, we investigate momentum-based stochastic optimisation methods in a continuous-time framework. We now introduce the optimisation problem, motivate stochastic and momentum-based optimisation, review past work, and summarise our contributions and main results.

1.1 Optimisation and continuous dynamics

We study optimisation problems of the form

    min_{θ ∈ X} Φ(θ),    (1)

on the space X := R^K, where Φ := (1/N) Σ_{i=1}^N Φ_i. We denote the number of terms by N ∈ ℕ := {1, 2, . . .} and assume that the functions Φ_i, with i ∈ I := {1, . . . , N}, are continuously differentiable with Lipschitz gradients.
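For intuition, problem (1) can be pictured with a small synthetic finite-sum objective. The quadratic terms Φ_i below are an illustrative assumption, chosen only so that the averaged objective and its gradient are explicit; they are not the paper's setting.

```python
import numpy as np

# Hypothetical finite-sum objective Phi(theta) = (1/N) * sum_i Phi_i(theta)
# with Phi_i(theta) = 0.5 * ||theta - c_i||^2; each Phi_i is continuously
# differentiable with 1-Lipschitz gradient, as required above.
rng = np.random.default_rng(0)
N, K = 5, 3
centres = rng.normal(size=(N, K))

def grad_phi_i(theta, i):
    # gradient of the i-th term
    return theta - centres[i]

def grad_phi(theta):
    # gradient of the averaged objective; its minimiser is the mean of the c_i
    return theta - centres.mean(axis=0)

theta = np.zeros(K)
per_term = np.mean([grad_phi_i(theta, i) for i in range(N)], axis=0)
# the full gradient is the average of the per-term gradients
assert np.allclose(per_term, grad_phi(theta))
```

Subsampling, as discussed below, replaces `grad_phi` by a randomly chosen `grad_phi_i` at each point in time.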
We assume that (1) is well-defined and denote θ* ∈ argmin Φ. In this work, we study continuous-time dynamical systems that can be used to optimise functions of type Φ and to represent the dynamics of optimisation algorithms. The most basic of these dynamical systems is the gradient flow

    dζ_t/dt = −∇Φ(ζ_t),    ζ_0 ∈ X,    (2)

a continuous-time version of the gradient descent method. The ODE solution (ζ_t)_{t≥0} converges to a stationary point of Φ under, e.g., certain convexity conditions. In practice, especially in modern machine learning, we often encounter optimisation problems that are non-convex, where the solution (ζ_t)_{t≥0} may converge to a saddle point or a local minimiser. When discretising (2), the iterates actually only converge to (local) minimisers, see, e.g., Lee et al. (2016); Panageas and Piliouras (2017). However, as observed in Du et al. (2017), the algorithm could take a lot of iterations to escape a saddle point. The stochastic gradient descent (SGD) method going back to Robbins and Monro (1951) has been a popular alternative to the gradient descent method. In SGD, we randomly select one of the Φ_i (i ∈ I) and optimise with respect to that function before switching to another Φ_j (j ∈ I\{i}). This process is repeated until a stationary state is reached. A continuous-time model for stochastic gradient descent has recently been proposed by Latz (2021) and extended by Jin et al. (2023) and Latz (2022). There, stochastic gradient descent is represented through the dynamical system

    dθ_t/dt = −∇Φ_{i(t)}(θ_t),    θ_0 ∈ X,    (3)

where the index process (i(t))_{t≥0} is a suitable homogeneous continuous-time Markov process on I. The processes (i(t))_{t≥0}, (θ_t)_{t≥0}, and any other stochastic processes and random variables throughout this work are defined on the underlying probability space (Ω, F, P). The process (θ_t)_{t≥0} represents a randomised gradient flow whose potential is randomly replaced by another one when (i(t))_{t≥0} jumps from one state to another.
Both (θ_t)_{t≥0} and (i(t), θ_t)_{t≥0} are called stochastic gradient process. Since Φ often represents the averaging over data subsets, we refer to the process of replacing Φ by a Φ_i as subsampling. SGD has two main advantages: since we consider only one of the Φ_i at a time, we significantly reduce the computational cost when discretising the dynamical system. Moreover, the perturbation through the randomised sampling in stochastic gradient descent can help to overcome saddle points, see Daneshmand et al. (2018); Jin et al. (2021). However, as the method is purely gradient-based, we would expect that escaping saddle points is still slow. Saddle points and local minimisers can sometimes be escaped when using momentum-based optimisation methods. Here, we replace the gradient flow (2) by the underdamped gradient flow given through

    m d²q_t/dt² + α dq_t/dt + ∇Φ(q_t) = 0,    (dq_t/dt)|_{t=0} ∈ X,    q_0 ∈ X.    (4)

This dynamical system describes the motion of a ball with mass m > 0 constrained to slide on the graph of Φ, subject to a constant gravitational field and viscosity friction α > 0, see Attouch et al. (2000) for details. Intuitively, momentum is used to vault the optimiser out of saddle points and local minimisers.

1.2 Momentum

We now give a motivating example where the relevance of momentum in non-convex optimisation can be observed and then summarise relevant literature on momentum-based optimisation. The example is based on a simple one-dimensional non-smooth non-convex optimisation problem inspired by the training of neural networks. As we shall see, the introduction of momentum can prevent the convergence of the gradient flow to a stationary point that is not the global minimiser. We later work in a framework that does not actually cover this introductory example; we present it only to give an intuition for the methodology. Let us consider the function Φ̄ : R → R defined as

    Φ̄(x) := (ReLU(x) − 1)² + x² = { x² + 1,        if x ≤ 0,
                                    2x² − 2x + 1,   if x > 0,    (x ∈ R)    (5)

where ReLU(·) := max{0, ·}.
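The piecewise form of (5) and the location of its minimisers can be checked numerically. The grid search below is a sketch for illustration only, not part of the paper's analysis.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def phi_bar(x):
    # the non-smooth, non-convex potential of eq. (5)
    return (relu(x) - 1.0) ** 2 + x ** 2

xs = np.linspace(-1.0, 1.0, 200001)          # grid containing 0 and 1/2
piecewise = np.where(xs <= 0.0, xs ** 2 + 1.0, 2.0 * xs ** 2 - 2.0 * xs + 1.0)
assert np.allclose(phi_bar(xs), piecewise)   # the two forms in (5) agree

# global minimiser 1/2; minimiser of the restriction to (-inf, 0] is 0
assert abs(xs[np.argmin(phi_bar(xs))] - 0.5) < 1e-6
neg = xs[xs <= 0.0]
assert abs(neg[np.argmin(phi_bar(neg))]) < 1e-6
```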
As shown in Figure 1, the function Φ̄ attains its global minimum at the point x* = 1/2. However, x̄ = 0 is the minimiser of Φ̄ restricted to the non-positive half-line (−∞, 0].

[Figure 1: Plot of the potential Φ̄.]

Therefore, any solution of the gradient flow equation (2) with potential Φ := Φ̄ and initial value ζ_0 < 0 converges to the point x̄. In order to avoid the convergence to x̄, we can replace the gradient flow by the associated underdamped gradient flow (4) with mass m > 0, viscosity friction α > 0, and potential Φ̄. If the initial value q_0 is negative, then the equation of motion reduces to

    m d²q_t/dt² + α dq_t/dt + 2q_t = 0

whenever the dynamical system stays in the negative half-line. At this point the natural question is whether there exist combinations of m and α such that, assuming that q_0 < 0, the particle is not confined in the negative half-line for every time t > 0. Using elementary theory of linear second-order differential equations, it turns out that the particle manages to overcome the false minimiser x̄ = 0 if the following relation is satisfied:

    α² − 8m < 0,    (7)

i.e. the friction α is sufficiently small or the mass m sufficiently large. In summary, momentum can be used to overcome stationary points of the target function. The use of momentum in optimisation dates back to the 1960s, when Polyak formulated the heavy ball method in convex optimisation, see Polyak (1963, 1964). In this context, the aim was to accelerate the convergence of the gradient descent method. Later, Nesterov proposed a momentum-based class of discrete accelerated methods for convex problems in his seminal paper Nesterov (1983). Momentum-based optimisation methods have been investigated thoroughly from the perspective of continuous-time dynamical systems. In-depth studies about the mechanical systems underlying Polyak's method can be found in Attouch et al.
(2000), a work which has sparked considerable interest in the community. We recall, for instance, Attouch and Alvarez (2000), who studied constrained convex optimisation. In Su et al. (2014), the authors have formulated a dissipative second-order ODE that provides a continuous-time counterpart to the Nesterov method. Here, the viscosity friction parameter α is time-dependent and vanishes as t → ∞. This aspect was later further investigated in Attouch et al. (2018); Shi et al. (2019, 2021); Scagliotti and Colli Franzone (2022). In Attouch et al. (2019a,b), the authors combine vanishing viscosity with time-scaling of trajectories, leading to a dynamics similar to (4), but with an extra time-dependent coefficient β(t) appearing in front of the term ∇Φ, i.e.

    d²θ(t)/dt² + α(t) dθ(t)/dt + β(t) ∇Φ(θ(t)) = 0.    (8)

The parameter β satisfies β(t) → ∞ as t → ∞. Finally, we mention more recent advances on second-order dynamics with time scaling that include Hessian-driven damping (Attouch et al., 2022) and averaging (Attouch et al., 2024). Next, we see how momentum is used within stochastic optimisation practice.

1.3 Momentum-based stochastic optimisation and the Adam algorithm

The most basic momentum-based stochastic optimisation method combines a symplectic Euler discretisation of the underdamped gradient flow (4) with subsampling. This stochastic gradient descent method with classical momentum is given by

    v_n = α v_{n−1} − η ∇Φ_{i_n}(θ_{n−1}),
    θ_n = θ_{n−1} + v_n,    (9)

where α, η > 0 are hyperparameters and i_1, i_2, . . . ∼ Unif(I) are independent and identically distributed (i.i.d.) random variables. We refer to η as the learning rate and refer to (9) just as classical momentum. The connection to the underdamped gradient flow is more obvious when considering the timestep size to be equal to 1, η to be a scaling factor of the potential, and α to be a parameter that incorporates the velocity at the previous timestep and the friction.
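The recursion (9) can be sketched in a few lines. The quadratic per-term losses and the hyperparameter values below are assumptions made for illustration, not the paper's experimental setup.

```python
import numpy as np

# Classical momentum, eq. (9):
#   v_n = alpha * v_{n-1} - eta * grad Phi_{i_n}(theta_{n-1}),
#   theta_n = theta_{n-1} + v_n,
# with i.i.d. uniform subsampling of illustrative quadratic terms
# Phi_i(theta) = 0.5 * ||theta - c_i||^2.
rng = np.random.default_rng(1)
N, K = 4, 2
centres = rng.normal(size=(N, K))
target = centres.mean(axis=0)       # minimiser of the averaged objective

alpha, eta = 0.9, 0.02              # momentum parameter and learning rate
theta, v = np.zeros(K), np.zeros(K)
tail = []
for n in range(4000):
    i = rng.integers(N)             # i_n ~ Unif(I), i.i.d.
    v = alpha * v - eta * (theta - centres[i])
    theta = theta + v
    if n >= 3000:
        tail.append(theta.copy())

# with a constant learning rate the iterates only fluctuate around the
# minimiser; their tail average sits close to it
assert np.linalg.norm(np.mean(tail, axis=0) - target) < 0.5
```

The persistent fluctuation, rather than convergence, is exactly the constant-learning-rate behaviour discussed later in Section 1.4.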
Classical momentum has been studied by Rumelhart et al. (1986); Sutskever et al. (2013). In deep learning, more advanced momentum-based descent methods have been widely in use, including Adam (Kingma and Ba, 2014). Adam is one of the most commonly used optimisers when training neural networks. The updating rule is the following:

    u_n = β_1 u_{n−1} + (1 − β_1) ∇Φ_{i_n}(θ_{n−1}),
    v_n = β_2 v_{n−1} + (1 − β_2) ∇Φ_{i_n}(θ_{n−1})²,
    θ_n = θ_{n−1} − α (u_n/(1 − β_1^n)) / (√(v_n/(1 − β_2^n)) + β_3),    (10)

where β_1, β_2, and α are hyperparameters, β_3 is a small constant, and i_1, i_2, . . . ∼ Unif(I) (i.i.d.). Compared with the classical momentum method, Adam uses not only an estimate of the first moment of the gradient but also an estimate of the second moment as part of the adaptive learning rate. The variable u_n is the biased first moment, similar to the one used in the classical momentum method. The variable v_n is the biased second moment estimated using the squared gradient. When updating θ_n, Adam normalises u_n and v_n so that they become unbiased estimators. Moreover, it was shown in Défossez et al. (2022) that Adam converges in the sense of Zinkevich (2003) with speed O(log(t)/√t) as t → ∞. Other adaptive learning rate methods are Adagrad (Duchi et al., 2011), Adadelta (Zeiler, 2012), and RMSprop (Hinton, 2012). While momentum-based stochastic optimisation methods are popular in machine learning practice, they are overall rather poorly understood; see the discussion in Liu et al. (2020). In the present work, we analyse a continuous-time stochastic gradient process with momentum and propose an efficient discretisation technique for this dynamical system. Thus, we improve the understanding of momentum-based stochastic optimisation in a theoretical framework and machine learning practice. Our analysis complements the Langevin-based analyses by Li et al. (2019); Shi (2024), where the authors represent SGD in continuous time as a diffusion process.
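The update (10) admits a similarly short sketch. Again, the quadratic subsampled losses and the chosen hyperparameters are illustrative assumptions, not the paper's setup.

```python
import numpy as np

# Adam, eq. (10): exponential moving averages of the gradient and the
# squared gradient, bias-corrected, then an elementwise-normalised step.
rng = np.random.default_rng(2)
N, K = 4, 2
centres = rng.normal(size=(N, K))
target = centres.mean(axis=0)       # minimiser of the averaged objective

beta1, beta2, alpha, beta3 = 0.9, 0.999, 0.05, 1e-8
theta = np.zeros(K)
u = np.zeros(K)                     # biased first-moment estimate
v = np.zeros(K)                     # biased second-moment estimate
for n in range(1, 3001):
    g = theta - centres[rng.integers(N)]   # subsampled gradient
    u = beta1 * u + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    u_hat = u / (1 - beta1 ** n)           # bias correction
    v_hat = v / (1 - beta2 ** n)
    theta = theta - alpha * u_hat / (np.sqrt(v_hat) + beta3)

# the iterates settle in a neighbourhood of the full minimiser
assert np.linalg.norm(theta - target) < 1.0
```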
Before we summarise our contributions more precisely, we introduce our continuous-time stochastic gradient-momentum process.

1.4 The stochastic gradient-momentum process

Throughout this work, we discuss a continuous-time dynamical system that represents stochastic gradient descent with momentum: the stochastic gradient-momentum process. Due to the continuous-time nature of the system, it can equally represent classical momentum (9) and Adam (10) (although not the adaptive nature of the latter). We obtain this system by combining the underdamped dynamical system (4) with the stochastic gradient process (3). We now introduce the index process, the process that controls the subsampling.

Definition 1 (Index process) Let (i(t))_{t≥0} be a continuous-time Markov process on I with transition rate matrix A_N = Γ_N − Nγ I_N. Here, γ > 0 denotes the jump rate, Γ_N is an N × N matrix whose entries are all equal to γ, and I_N is the identity matrix. We assume that the initial distribution P(i(0) ∈ ·) is the uniform distribution on I. Moreover, when given a value ν > 0 or an appropriate, strictly increasing function β : [0, ∞) → [0, ∞), we define the rescaled index processes i_ν(t) := i(t/ν) and i_β(t) := i(β^{−1}(t)), for t ≥ 0. We refer to each of (i(t))_{t≥0}, (i_ν(t))_{t≥0}, and (i_β(t))_{t≥0} as index process.

The function β^{−1} and the scalar 1/ν play the role of a time-dependent and a constant learning rate, respectively; see Jin et al. (2023). Thus, we sometimes refer to them as learning rate. Next, we make the following regularity assumptions.

Assumption 1 Let Φ_i ∈ C¹(X, R), i.e. it is continuously differentiable, and let ∇_xΦ_i be Lipschitz with Lipschitz constant L > 0, for i ∈ I.

Throughout this work, we study multiple versions of the stochastic gradient-momentum process and introduce them in Definitions 4, 8, and 17. We now introduce the prototype of the dynamical systems studied in this paper. Note that we often choose the more compact form dz_t = f(z_t)dt to represent dz_t/dt = f(z_t).
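Definition 1 admits a simple simulation: since A_N = Γ_N − Nγ I_N, one may equivalently let events occur at rate Nγ and redraw the state uniformly on I at each event (self-jumps are allowed but invisible). The sketch below uses this construction; the concrete parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)

def index_path(N, gamma, T):
    """Simulate (i(t)) of Definition 1 on [0, T]: exponential waiting
    times with rate N*gamma, uniform redraws on I = {0, ..., N-1}."""
    t = 0.0
    i = int(rng.integers(N))            # uniform initial distribution
    times, states = [0.0], [i]
    while True:
        t += rng.exponential(1.0 / (N * gamma))
        if t > T:
            return times, states
        i = int(rng.integers(N))        # redraw uniformly on I
        times.append(t)
        states.append(i)

times, states = index_path(N=5, gamma=2.0, T=50.0)
# the rescaled process i_nu(t) = i(t/nu) simply compresses the event times
nu = 0.1
times_nu = [nu * t for t in times]
assert all(0 <= s < 5 for s in states)
assert times_nu[-1] <= nu * 50.0
assert len(times) > 1                   # roughly N*gamma*T = 500 events
```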
The general stochastic gradient-momentum process (SGMP) is given by (p̃_t, q̃_t, i_β(t))_{t≥0} satisfying

    dq̃_t = p̃_t dt,
    m(t) dp̃_t = −∇Φ_{i_β(t)}(q̃_t) dt − α p̃_t dt,
    p̃(t = 0) = p_0,    q̃(t = 0) = q_0,    (11)

where (p̃_t)_{t≥0} describes the velocity of the particle, (q̃_t)_{t≥0} its position, (m(t))_{t≥0} its mass (that may depend on time), and α > 0 the viscosity friction. The function β controls the switching of the index process. In the following, we explain the importance behind the functions m and β.

Remark 2 (Mass m) The introduction of momentum allows the particle to be vaulted out of stationary points, an effect that is particularly pronounced if m is large or α is small. While this behaviour is convenient when escaping local minimisers and saddle points, it can also occur when approaching global minimisers. We give a very simple example in Figure 2, where we see that the method shows oscillatory behaviour and very slow convergence when the mass m is large and constant. As a tuning of the mass m is difficult in practice, we suggest an alternative strategy: reducing the mass over time. Rationale: a large mass in the beginning leads to a fast escaping of local minimisers, while a small mass later leads to fast convergence once the global minimiser is reached. This intuition is confirmed in Figure 2.

[Figure 2: We consider the minimisation of Φ(θ) := θ²/2. Let α = 1, m = 1, p_0 = 1, q_0 = 0. Then q_t = exp(−t/2) sin(√3 t/2) (t ≥ 0) (solid line) oscillates around the solution and converges ultimately to θ* = 0. If we choose instead m(t) = (1 + t)^{−1} (t ≥ 0) (dashed lines), we have q_t = exp(−t²/2), which does not oscillate, but converges very quickly to θ*.]

In Attouch et al. (2000), the authors have considered the singular perturbation of (4), corresponding to letting the mass m → 0, and they established convergence of these dynamics to the gradient flow trajectories.
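The effect described in Remark 2 can be reproduced numerically. The integrator below is a sketch of a semi-implicit (symplectic-Euler-type) step that treats the velocity implicitly, p_{k+1} = (m_k p_k − h ∇Φ(q_k))/(m_k + αh), q_{k+1} = q_k + h p_{k+1}; it remains stable as m → 0, where it reduces to a gradient step of size h/α. The concrete scheme and step size are assumptions for illustration, not necessarily the discretisation proposed in Section 4.

```python
import numpy as np

def run(mass_fn, h=1e-3, T=10.0, alpha=1.0, p0=1.0, q0=0.0):
    """Integrate the underdamped flow for Phi(theta) = theta^2 / 2."""
    p, q = p0, q0
    for k in range(int(T / h)):
        m = mass_fn(k * h)
        # implicit velocity update: m * (p' - p) / h = -q - alpha * p'
        p = (m * p - h * q) / (m + alpha * h)
        q = q + h * p                     # explicit position update
    return q

q_const = run(lambda t: 1.0)              # constant mass m = 1 (oscillates)
q_decay = run(lambda t: 1.0 / (1.0 + t))  # decreasing mass m(t) = (1+t)^{-1}
# both approach theta* = 0, but the decreasing mass is far closer at T = 10
assert abs(q_decay) < abs(q_const)
assert abs(q_decay) < 1e-2
```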
In the present work, we also consider a vanishing mass and carefully bound the Euclidean distance between underdamped gradient flow and gradient flow. Moreover, we let m depend on time and investigate the case m(t) → 0 as t → ∞. This setting is conceptually similar to the dynamics studied in Attouch et al. (2019a,b) that we give in (8). However, β(t) → ∞ as t → ∞ is not equivalent to the mass going to zero. We can rewrite the underdamped gradient flow as

    d²θ(t)/dt² + (α/m(t)) dθ(t)/dt + (1/m(t)) ∇Φ(θ(t)) = 0,    (12)

where 1/m(t) plays the role of β(t), and the factor multiplying the velocity explodes as well as m(t) → 0. Thus, we cannot rely on the results obtained in Attouch et al. (2019a,b) for the dynamics (8). Moreover, we shall not actually focus our attention on the deterministic evolution, but rather investigate a stochastic process related to the ODE (12). However, an interesting similarity between our approach and Attouch et al. (2019a,b) arises when discretising the respective dynamics. Indeed, the facts that β(t) → ∞ in (8) and m(t) → 0 in (12) make explicit schemes unsuitable for deriving discrete approximations. In Attouch et al. (2019a,b) this issue is addressed by considering a proximal discretisation, while in Section 4 we opt for a symplectic (semi-implicit) scheme.

Remark 3 (Learning rate and β) The switching between the potentials leads to a similar effect as the momentum. As the flow approaches the minimisers of different target functions in each step, we are unlikely to converge to a single point. While this helps to escape local minimisers, it prohibits the convergence to a global minimiser. In several stochastic optimisation methods, we need to reduce the learning rate or step-size throughout the algorithm to reduce the time in-between switches of data sets. We can attain this reduction of learning rate by rescaling the time in the index process (i(t))_{t≥0} through an appropriate function β.
The rescaling will lead the waiting times between two switches to decrease over time. We now list and explain more assumptions that are required throughout this work. Indeed, some results of this work require the (Φ_i)_{i∈I} to be (µ-)strongly convex, i.e.

    (x − x′)ᵀ(∇Φ_i(x) − ∇Φ_i(x′)) ≥ µ ‖x − x′‖²    (x, x′ ∈ X, i ∈ I),

for some strong convexity parameter µ > 0. Others do not rely on strong convexity, but still need some restriction on the potential function, as given in the following assumption.

Assumption 2 Let Φ_i ∈ C¹(X, R) admit a global minimiser θ_i*, for i ∈ I, and let there be constants λ ∈ (0, 1/4] and α > 0 such that

    (x − θ_i*)ᵀ ∇Φ_i(x)/2 ≥ λ (Φ_i(x) − Φ_i(θ_i*) + α² ‖x − θ_i*‖²/4),    (x ∈ X, i ∈ I).    (13)

We use A_{λ,α} as a shorthand to refer to (13) with parameters λ and α.

We note that Assumption 2 is also known as quasi-strong convexity, see, e.g., Aujol et al. (2022); Eberle et al. (2017); Necoara et al. (2019). Assumption 2 is implied by strong convexity. We prove this assertion in Lemma 28 in the appendix. In the appendix, we also give a counterexample showing that the two assumptions are not equivalent, see Example 2. In stochastic optimisation, it is often necessary to decrease the learning rate over time to obtain a method that converges to the minimiser. In practice, however, this is sometimes purposefully disregarded: convergence to, e.g., a probability distribution is preferred. In stochastic optimisation this leads to an implicit regularisation of the optimisation problem. Particularly in machine learning, this implicit regularisation can be necessary to get good generalisation results, see, for example, Sekhari et al. (2021); Keskar et al. (2017); Neyshabur et al. (2015); Zhang et al. (2017); Zou et al. (2021). Thus, we also study this constant learning rate case in the present work. Similarly, whilst we have seen that a decreasing mass can be beneficial, we also study the case of a constant mass. Indeed, when the mass m does not depend on t or the function β is linear, i.e.
β(u) = νu, we refer to the mass or learning rate as homogeneous, respectively. Since we are otherwise always interested in reducing them over time, we speak of decreasing mass or learning rate, respectively. When decreasing mass and learning rate, an additional assumption, Assumption 4, is needed. Since this assumption is rather technical, we delay its statement and discuss it instead along with the associated convergence result in Subsection 3.2. In order to establish well-posedness of the system, the mass (m(t))_{t≥0} must not decrease too quickly. Here, we have the following assumption.

Assumption 3 There exists a constant λ ∈ (0, 1] such that |m′(t)| ≤ λ m(t), for t > 0. Moreover, we assume that m(0) =: m_0 ≤ 1.

It is easy to verify that m(t) = m_0 (1 + t)^{−λ} and m(t) = m_0 e^{−λt} satisfy Assumption 3. Setting m_0 = λ = 1, we retrieve the example in Figure 2. More generally, Assumption 3 implies m(t) ≥ m_0 e^{−λt}. Hence, the mass decreases at most exponentially, with rate λ ≤ 1.

1.5 Contributions and main results

We have introduced the stochastic gradient-momentum process above. In this work, we analyse this dynamical system, discuss its discretisation, and employ it in numerical experiments. We are interested in both convergence to global minimisers and implicit regularisation. Thus, we study three different stochastic gradient-momentum processes: (1) homogeneous mass and homogeneous learning rate, (2) decreasing mass and homogeneous learning rate, and (3) decreasing mass and decreasing learning rate, respectively. We now summarise our contributions: We study the connection between SGMP and the gradient flow, the underdamped gradient flow, as well as the stochastic gradient process. These connections allow us to investigate the longtime behaviour of the stochastic gradient-momentum process.
We obtain those connections by considering momentum-to-no-momentum asymptotics (letting a homogeneous mass m approach 0) and random-to-deterministic asymptotics (for a homogeneous β(u) = νu, letting ν approach 0). Understanding the longtime behaviour eventually allows us to study the convergence of the underlying algorithm. We summarise our three main results below and also give informal statements of the corresponding theorems; the complete versions are given later in the text. In the fully homogeneous case, we show that the stochastic gradient-momentum process is close to the stochastic gradient process when the mass is small. This can be understood as the stochastic version of the deterministic result established in Attouch et al. (2000).

Theorem 6 (informal) Let (q^m_t)_{t≥0} be the stochastic gradient-momentum process (14) with constant mass m and (θ^m_t)_{t≥0} be the stochastic gradient process (16). Under Assumptions 1 and 2, we have

    E[ sup_{0≤t≤T} ‖q^m_t − θ^m_t‖ ] → 0    (m → 0).

If the mass decreases over time, the stochastic gradient-momentum process converges to the stochastic gradient process in the longtime limit.

Theorem 15 (informal) Let (q̃_t)_{t≥0} be the decreasing-mass stochastic gradient-momentum process (19) with time-dependent mass (m(t))_{t≥0} and (θ_t)_{t≥0} be the stochastic gradient process (3). Under strong convexity and Assumptions 1 and 3, we have, almost surely and in expectation,

    ‖q̃_t − θ_t‖ → 0    (t → ∞).

If mass and learning rate decrease over time, the stochastic gradient-momentum process converges to the minimiser of the full target function in the longtime limit.

Corollary 19 (informal) Let (q̂_t)_{t≥0} be the decreasing-mass, decreasing-learning-rate stochastic gradient-momentum process with time-dependent mass (m(t))_{t≥0} and learning rate (β(t))_{t≥0}. Let θ* be the minimiser of the full target function. Under strong convexity and Assumptions 1, 3, and 4, we have

    ‖q̂_t − θ*‖ → 0 in probability    (t → ∞).
The connection to the stochastic gradient process is especially interesting, as its longtime behaviour in convex settings is well-understood, see Jin et al. (2023); Latz (2021). Thus, we can argue that SGMP leads to a similar implicit regularisation when the mass is small or converging to zero. In a wider sense, the connections to the other methods allow us to interpret the stochastic gradient-momentum process as a method that can freely interpolate between gradient flow, underdamped gradient flow, and stochastic gradient process through adjusting the learning rate and the mass m. We depict this relationship in Figure 3.

[Figure 3: Schematic comparing gradient flow, underdamped gradient flow, stochastic gradient process, and stochastic gradient-momentum process in terms of the particle mass m and learning rate ν.]

Surprisingly, the continuous relationship shown in Figure 3 is lost when, e.g., comparing the discrete-time algorithms SGD and classical momentum. This is mainly due to the symplectic Euler discretisation that is employed in classical momentum. Especially problematic is the instability of classical momentum when the mass is small: the step-size needs to be chosen as O(m) as m → 0. From this computational perspective, our contributions are the following: We propose a discretisation strategy for the stochastic gradient-momentum process. The strategy is a semi-implicit method that is explicit in the position (q_t)_{t≥0} and implicit in the velocity (p_t)_{t≥0}. This discretisation technique allows us to choose the step-size independently of the mass m. Thus, the method is stable for small masses as well as masses converging to zero. Similarly to the symplectic Euler method that is conventionally used in momentum-based optimisation, our scheme is symplectic.
Thus, it respects (local) conservation laws in the underdamped dynamics. We test our discretised algorithm in numerical experiments. We start with academic convex and non-convex optimisation problems in low and high dimensions. Then we study the training of a convolutional neural network (CNN), which we use to classify images from the CIFAR-10 dataset. The train and test accuracy achieved with our stable discretisation is comparable to stochastic gradient descent with classical momentum. Some of these experiments go beyond the convex setting that we study in our analysis.

1.6 Organisation of the work

This work is organised as follows. We introduce and investigate the homogeneous-in-time and heterogeneous-in-time stochastic gradient-momentum processes in Sections 2 and 3, respectively. In Section 4 we propose discretisation techniques, which we then employ in the mentioned academic and deep learning problems in Section 5. A conclusion is given in Section 6. Throughout the main part of the work, we require certain results about underdamped gradient flows, which we discuss in Appendices A and B. Other auxiliary results needed throughout this work are presented in Appendix C. Appendix D recalls and summarises a result by Kushner (1990).

2. Homogeneous-in-time

In this section, we study the stochastic gradient-momentum process with homogeneous momentum and learning rate. In the first part, Subsection 2.1, we investigate the interplay between the underdamped gradient flow and the stochastic gradient-momentum process. In the second part, Subsection 2.2, we compare the stochastic gradient-momentum process and the stochastic gradient process. We first introduce the dynamical system central to this section.
Definition 4 The homogeneous stochastic gradient-momentum process (hSGMP) is a solution of the following stochastic differential equation:

    dq^{ν,m}_t = p^{ν,m}_t dt,
    m dp^{ν,m}_t = −∇Φ_{i_ν(t)}(q^{ν,m}_t) dt − α p^{ν,m}_t dt,
    p^{ν,m}_0 = p_0 ∈ X,    q^{ν,m}_0 = q_0 ∈ X,    (14)

where α, m, ν > 0 are the viscosity friction parameter, the mass, and the learning rate, respectively, all kept constant. Finally, Φ_j satisfies Assumption 1, for every j ∈ I, and the stochastic process i_ν(t) = i(t/ν) is given as in Definition 1.

In the following, we denote

(i) (p^ν_t, q^ν_t)_{t≥0} := (p^{ν,1}_t, q^{ν,1}_t)_{t≥0}, i.e. choosing (without loss of generality) a constant unit mass m := 1 and varying the learning rate ν;

(ii) (p^m_t, q^m_t)_{t≥0} := (p^{m^δ,m}_t, q^{m^δ,m}_t)_{t≥0}, for some δ ∈ [0, 1), i.e. varying the mass, and choosing a constant or a mass-dependent learning rate ν := m^δ; in this case, we additionally choose (without loss of generality) a unit viscosity friction α := 1.

The Lipschitz condition in Assumption 1 guarantees the well-posedness of this system, which can be shown similarly as in Jin et al. (2023); Latz (2021). In Subsections 2.1 and 2.2, we consider the hSGMPs (p^ν_t, q^ν_t)_{t≥0} and (p^m_t, q^m_t)_{t≥0}, respectively.

2.1 Momentum with and without subsampling

First, we study the interplay between subsampling and not subsampling in the momentum dynamics. Indeed, we show that the stochastic gradient-momentum process can approximate the underdamped gradient flow at any accuracy. More precisely, we are interested in the limiting behaviour of the hSGMP (p^ν_t, q^ν_t)_{t≥0} as the learning rate approaches zero, i.e. we take ν → 0 in (14). We prove that the hSGMP converges to the solution of the underdamped gradient flow, which is given by

    dq_t = p_t dt,
    dp_t = −∇Φ(q_t) dt − α p_t dt,
    p_0, q_0 ∈ X,    (15)

where we implicitly set the mass m := 1. In this case, we show that the hSGMP is a stochastic approximation to the underdamped gradient flow.
We now formulate the statement about the convergence of the hSGMP to the underdamped gradient flow more precisely. In principle, we show convergence of a random path to a deterministic path; both are contained in the space of continuous functions from [0, ∞) to X², which we denote by C([0, ∞), X²) and equip with the metric

    ρ((ζ_t)_{t≥0}, (ξ_t)_{t≥0}) := ∫_0^∞ e^{−t} (1 ∧ sup_{0≤s≤t} ‖ζ_s − ξ_s‖) dt,

where as usual a ∧ b := min{a, b} for a, b ∈ R. Probabilistically, we show convergence in the weak sense. We state and prove the convergence result below.

Theorem 5 Let (q^ν_t, p^ν_t)_{t≥0} and (q_t, p_t)_{t≥0} solve (14) and (15). Then (q^ν_t, p^ν_t)_{t≥0} converges weakly to (q_t, p_t)_{t≥0} in C([0, ∞), X²) as ν → 0, i.e. for any bounded continuous function F : C([0, ∞), X²) → R,

    E[F((p^ν_t, q^ν_t)_{t≥0})] → E[F((p_t, q_t)_{t≥0})] = F((p_t, q_t)_{t≥0}).

Proof We apply Theorem 4.3 in Kushner (1990, Chapter 7), which studies tightness and weak convergence properties of solutions of a certain type of stochastic differential equations. The form of the differential equation is given in (2.2) (Kushner, 1990, Chapter 7); we also recall this result in Appendix D. This corresponds to equation (14) with ξ_t = i(t), G(x, y) = (y, −∇_xΦ(x) − αy), Ḡ(x, y, i) = (0, −∇_xΦ_i(x) + ∇_xΦ(x)), and F ≡ 0. In Kushner (1990, Chapter 7, Theorem 4.3), the differential operator A is, in our case, given by Ah(x, y) = ∇h(x, y) · (y, −∇_xΦ(x) − αy) for any twice-differentiable h. If Assumptions (A4.2) to (A4.6) from Kushner (1990) were satisfied, applying Theorem 4.3 in (Kushner, 1990, Chapter 7) would imply that (q^ν_t, p^ν_t)_{t≥0} is tight and that the limit of any weakly convergent subsequence solves (15). Since q^ν_0 = q_0 and p^ν_0 = p_0, we have that (q^ν_t, p^ν_t)_{t≥0} converges weakly to (q_t, p_t)_{t≥0}. Therefore, we just need to verify Assumptions (A4.2) to (A4.6) from Kushner (1990). Assumption (A4.2) follows directly from Assumption 1. Assumption (A4.3) holds since (i(t))_{t≥0} is càdlàg and bounded.
Assumptions (A4.4) and (A4.6) trivially hold since $F \equiv 0$. The only non-trivial part is to verify Assumption (A4.5), that is, as $t, \tau \to \infty$,
\[
\frac{1}{\tau}\int_t^{t+\tau} \mathbb{E}\big[-\nabla_x\Phi_{i(s)}(x) + \nabla_x\Phi(x) \,\big|\, (i(s'))_{s'\le t}\big]\,\mathrm{d}s \to 0, \quad \text{almost surely.}
\]
By the Markov property of $(i(t))_{t\ge 0}$ and since $i(0)$ is stationary, we have, on the event $\{i(t) = k\}$,
\[
\Big\|\int_t^{t+\tau} \mathbb{E}\big[-\nabla_x\Phi_{i(s)}(x) + \nabla_x\Phi(x) \,\big|\, (i(s'))_{s'\le t}\big]\,\mathrm{d}s\Big\| = \Big\|\int_0^{\tau} \mathbb{E}_k\big[\nabla_x\Phi_{i(s)}(x)\big] - \nabla_x\Phi(x)\,\mathrm{d}s\Big\| \le \int_0^{\tau} e^{-Ns}\big\|\nabla_x\Phi_k(x) - \nabla_x\Phi(x)\big\|\,\mathrm{d}s \le \frac{1}{N}\big\|\nabla_x\Phi_k(x) - \nabla_x\Phi(x)\big\|,
\]
where we used that $\mathbb{E}_k[\nabla_x\Phi_{i(s)}(x)] = (1 - e^{-Ns})\frac{1}{N}\sum_{j\in I}\nabla_x\Phi_j(x) + e^{-Ns}\nabla_x\Phi_k(x)$. The right-hand side is bounded uniformly in $\tau$; dividing by $\tau$, the expression vanishes as $\tau \to \infty$, so (A4.5) holds.

Hence, the hSGMP $(q^\nu_t, p^\nu_t)_{t\ge 0}$ is a stochastic approximation of the underdamped gradient flow $(q_t, p_t)_{t\ge 0}$. Assumption 2, on the other hand, implies that the position of the underdamped gradient flow $q_t$ converges to the minimiser of $\Phi$ as $t \to \infty$. This already gives us reason to believe that an SGMP is able to identify the minimiser of $\Phi$. We discuss the longtime behaviour of the underdamped gradient flow in detail in Appendix A.

2.2 Subsampling with and without momentum

In a similar way to the last section, we now study the relation between the homogeneous stochastic gradient-momentum process and the stochastic gradient process. Indeed, we let the momentum in the hSGMP go to zero by letting the mass $m$ go to zero. We show that, in the limit, we converge to the stochastic gradient process. We do this while either keeping a constant learning rate or allowing the learning rate to depend on the mass and, thus, go to zero as well. Specifically, we consider the process $(p^m_t, q^m_t)_{t\ge 0}$ introduced in Definition 4(ii). Namely, we set $\delta \in [0,1)$, learning rate $\nu := m^\delta$, and $\alpha := 1$. We then aim to show that $(q^m_t)_{t\ge 0}$ converges uniformly to a stochastic gradient process as $m \downarrow 0$. The associated stochastic gradient process is the following:
\[
\mathrm{d}\theta^m_t = -\nabla\Phi_{i(t/m^\delta)}(\theta^m_t)\,\mathrm{d}t, \qquad \theta^m_0 = \theta_0. \tag{16}
\]
Throughout our discussion, we need several auxiliary results regarding the momentum-to-no-momentum limit in the deterministic case. We have collected these results, which may be of independent interest, in Appendix B.
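The deterministic core of this momentum-to-no-momentum limit can be checked numerically. A minimal sketch, assuming a frozen index and the quadratic $\Phi(q) = q^2/2$ (our illustrative choice), compares the underdamped system with mass $m$ to the gradient flow; the supremum gap should shrink roughly linearly in $m$:

```python
# Underdamped system with mass m versus the gradient flow, for the frozen-index
# quadratic Phi(q) = 0.5*q^2 (grad Phi(q) = q) and friction alpha = 1; we track
# the supremum distance between the two trajectories, which should be O(m).
def sup_gap(m, T=10.0, dt=1e-4, q0=2.0, p0=0.0):
    q, p = q0, p0          # underdamped: dq = p dt, m dp = (-q - p) dt
    theta = q0             # gradient flow: d theta = -theta dt
    gap = 0.0
    for _ in range(int(T / dt)):
        q, p = q + dt * p, p + dt * (-q - p) / m
        theta -= dt * theta
        gap = max(gap, abs(q - theta))
    return gap

# Halving the mass should roughly halve the supremum gap.
gaps = [sup_gap(m) for m in (0.1, 0.05, 0.025)]
```

This matches the linear-in-$m$ rate stated after Theorem 6 for the stochastic setting.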
We establish the connection between $(q^m_t)_{t\ge 0}$ and $(\theta^m_t)_{t\ge 0}$ in the following theorem by iterating over the jump times of the index process. In each of the iteration steps, we employ a result regarding the underdamped gradient flow that we summarise and prove in Proposition 27 in Appendix B.

Theorem 6 Let $\Phi_i$ be convex and satisfy Assumptions 1 and 2 with $A_{\lambda_i,1}$ and critical point $\theta^*_i$ for $i \in I$. Let $(p^m_t, q^m_t)_{t\ge 0}$ and $(\theta^m_t)_{t\ge 0}$ solve (14) and (16), respectively. For $0 \le \delta < 1$ and $T > 0$, we have
\[
\mathbb{E}\Big[\sup_{0\le t\le T}\|q^m_t - \theta^m_t\|\Big] \le \big(1 + \tilde C_L m(1+T)\big)\Big[e^{\tilde C_L m^{1-\delta}T(1+T)\gamma N}\big(\|q_0 - \theta_0\| + m\|p_0\|\big) + \big(e^{\tilde C_L m^{1-\delta}T(1+T)\gamma N} - 1 + \tilde C_L m(T+1)\big)K_{\Phi,T,\theta_0}\Big],
\]
where $\tilde C_L, K_{\Phi,T,\theta_0} > 0$ are constants.

Proof We denote by $(\tau_n)_{n\ge 0}$ and $(\tau^{m,\delta}_n)_{n\ge 0}$ the sequences of the jump times of the processes $(i(t))_{t\ge 0}$ and $(i^{m,\delta}(t))_{t\ge 0} := (i(t/m^\delta))_{t\ge 0}$, respectively. We set $\tau_0 = 0$. From the definition, we know that the last jump time before $t$ is $\tau_{N_t}$, where $(N_t)_{t\ge 0}$ is a Poisson process with rate $\gamma N$. Moreover, since $i^{m,\delta}(t) = i(t/m^\delta)$, we have $\tau^{m,\delta}_n = m^\delta\tau_n$. By Proposition 27, for $\tau^{m,\delta}_n < t \le \tau^{m,\delta}_{n+1}$, we have
\[
\|q^m_t - \theta^m_t\| \le \|q^m_{\tau^{m,\delta}_n} - \theta^m_{\tau^{m,\delta}_n}\| + C^{(0)}_L m\big(1 + t - \tau^{m,\delta}_n\big)\Big(\|q^m_{\tau^{m,\delta}_n} - \theta^m_{\tau^{m,\delta}_n}\| + \|p^m_{\tau^{m,\delta}_n}\| + \|\theta^m_{\tau^{m,\delta}_n} - \theta^*_{i^{m,\delta}(\tau^{m,\delta}_n)}\|\Big),
\]
and from the first inequality in Lemma 25, we obtain
\[
\|p^m_t\| \le (1 + C_L m)\|p^m_{\tau^{m,\delta}_n}\| + C_L\Big(\|q^m_{\tau^{m,\delta}_n} - \theta^m_{\tau^{m,\delta}_n}\| + \|\theta^m_{\tau^{m,\delta}_n} - \theta^*_{i^{m,\delta}(\tau^{m,\delta}_n)}\|\Big).
\]
From Lemma 29, we know that
\[
\|\theta^m_{\tau^{m,\delta}_n} - \theta^*_{i^{m,\delta}(\tau^{m,\delta}_n)}\| \le \|\theta_0\| + C_\Phi T + K_\Theta =: K_{\Phi,T,\theta_0}.
\]
Hence,
\[
\sup_{0\le t\le \tau^{m,\delta}_{n+1}}\|q^m_t - \theta^m_t\| \le \max\Big\{\sup_{0\le t\le \tau^{m,\delta}_n}\|q^m_t - \theta^m_t\|,\ \sup_{\tau^{m,\delta}_n \le t\le \tau^{m,\delta}_{n+1}}\|q^m_t - \theta^m_t\|\Big\} \le \sup_{0\le t\le \tau^{m,\delta}_n}\|q^m_t - \theta^m_t\| + C^{(0)}_L m\big(1 + \tau^{m,\delta}_{n+1} - \tau^{m,\delta}_n\big)\Big(\|q^m_{\tau^{m,\delta}_n} - \theta^m_{\tau^{m,\delta}_n}\| + \|p^m_{\tau^{m,\delta}_n}\| + K_{\Phi,T,\theta_0}\Big).
\]
For $T > 0$, if we assume $\tau^{m,\delta}_{n+1} \le T$, we have
\[
\sup_{0\le t\le \tau^{m,\delta}_{n+1}}\|q^m_t - \theta^m_t\| \le \sup_{0\le t\le \tau^{m,\delta}_n}\|q^m_t - \theta^m_t\| + C^{(0)}_L m(1+T)\Big(\sup_{0\le t\le \tau^{m,\delta}_n}\|q^m_t - \theta^m_t\| + \sup_{0\le t\le \tau^{m,\delta}_n}\|p^m_t\| + K_{\Phi,T,\theta_0}\Big),
\]
\[
\sup_{0\le t\le \tau^{m,\delta}_{n+1}}\|p^m_t\| \le (1 + C_L m)\sup_{0\le t\le \tau^{m,\delta}_n}\|p^m_t\| + C_L\Big(\sup_{0\le t\le \tau^{m,\delta}_n}\|q^m_t - \theta^m_t\| + K_{\Phi,T,\theta_0}\Big).
\]
We define
\[
A^{m,\delta}_n := \sup_{0\le t\le \tau^{m,\delta}_n}\|q^m_t - \theta^m_t\|, \qquad B^{m,\delta}_n := \sup_{0\le t\le \tau^{m,\delta}_n}\|p^m_t\|, \qquad D^{m,\delta}_n := A^{m,\delta}_n + m\,B^{m,\delta}_n,
\]
and then rewrite the inequalities as
\[
A^{m,\delta}_{n+1} \le A^{m,\delta}_n + m\,C^{(0)}_L(1+T)\big(A^{m,\delta}_n + B^{m,\delta}_n + K_{\Phi,T,\theta_0}\big), \qquad B^{m,\delta}_{n+1} \le (1 + C_L m)B^{m,\delta}_n + C_L\big(A^{m,\delta}_n + K_{\Phi,T,\theta_0}\big).
\]
Since $m < 1$, there exists a constant $\tilde C_L$ such that
\[
D^{m,\delta}_{n+1} \le \big(1 + \tilde C_L m(1+T)\big)D^{m,\delta}_n + \tilde C_L m(1+T)K_{\Phi,T,\theta_0}. \tag{17}
\]
Hence, for $n \ge 0$, we have
\[
D^{m,\delta}_n \le \big(1 + \tilde C_L m(1+T)\big)^n\big(\|q_0 - \theta_0\| + m\|p_0\|\big) + \Big(\big(1 + \tilde C_L m(1+T)\big)^n - 1\Big)K_{\Phi,T,\theta_0}. \tag{18}
\]
Let $n = N_{T/m^\delta}$. From (18) and by using the moment-generating function of the Poisson distribution, we conclude:
\[
\mathbb{E}\big[D^{m,\delta}_{N_{T/m^\delta}}\big] \le \big(\|q_0 - \theta_0\| + m\|p_0\|\big)\mathbb{E}\Big[\big(1 + \tilde C_L m(1+T)\big)^{N_{T/m^\delta}}\Big] + K_{\Phi,T,\theta_0}\mathbb{E}\Big[\big(1 + \tilde C_L m(1+T)\big)^{N_{T/m^\delta}} - 1\Big] = e^{\tilde C_L m^{1-\delta}T(1+T)\gamma N}\big(\|q_0 - \theta_0\| + m\|p_0\|\big) + \big(e^{\tilde C_L m^{1-\delta}T(1+T)\gamma N} - 1\big)K_{\Phi,T,\theta_0}.
\]
For any $T \ge 0$, there exists an integer $n \ge 0$ such that $\tau^{m,\delta}_n \le T \le \tau^{m,\delta}_{n+1}$ (i.e. $n = N_{T/m^\delta}$). Hence,
\[
\mathbb{E}\Big[\sup_{0\le t\le T}\|q^m_t - \theta^m_t\|\Big] \le \mathbb{E}\Big[\sup_{0\le t\le \tau^{m,\delta}_{n+1}}\|q^m_t - \theta^m_t\|\Big] \le \mathbb{E}\big[D^{m,\delta}_{N_{T/m^\delta}+1}\big] \overset{(17)}{\le} \big(1 + \tilde C_L m(1+T)\big)\mathbb{E}\big[D^{m,\delta}_{N_{T/m^\delta}}\big] + \tilde C_L m(1+T)K_{\Phi,T,\theta_0} \le \big(1 + \tilde C_L m(1+T)\big)\Big[e^{\tilde C_L m^{1-\delta}T(1+T)\gamma N}\big(\|q_0 - \theta_0\| + m\|p_0\|\big) + \big(e^{\tilde C_L m^{1-\delta}T(1+T)\gamma N} - 1 + \tilde C_L m(T+1)\big)K_{\Phi,T,\theta_0}\Big].
\]

Thus, if we let the mass $m$ go down to zero and the learning rate parameter $\nu$ is independent of $m$ or goes to zero at a rate slower than $m$, the stochastic gradient-momentum process converges to the stochastic gradient process. The speed of convergence is linear in $m$, which is also the case in the deterministic setting. Note that we show convergence in a possibly weaker sense as compared with the learning-rate result in Theorem 5. We finish our discussion of the homogeneous-in-time setting with a final remark regarding reducing mass and learning rate at the same time.

Remark 7 From Latz (2021, Theorem 1) and Latz (2024), we already know that $(\theta^m_t)_{t\ge 0} \to (\zeta_t)_{t\ge 0}$ as $m \downarrow 0$, i.e. the stochastic gradient process converges to the gradient flow as the learning rate approaches zero. Convergence holds in the weak sense probabilistically and with respect to the weighted supremum metric
\[
\rho\big((\zeta_t)_{t\ge 0}, (\xi_t)_{t\ge 0}\big) := \sum_{n=1}^\infty 2^{-n}\Big(1 \wedge \sup_{0\le s\le n}\|\zeta_s - \xi_s\|\Big)
\]
in space.
Combined with Theorem 6, we can easily see that also $(q^m_t)_{t\ge 0} \to (\zeta_t)_{t\ge 0}$ in the case $\delta \in (0,1)$, where the convergence is in the sense of Theorem 6.

3. Heterogeneous-in-time

In the previous section, we studied the effect of losing momentum and reducing the learning rate in a uniform fashion. Having a homogeneous mass and learning rate is certainly very popular in practice. However, while we could show convergence of the deterministic dynamics to the minimiser, this is not clear for the stochastic dynamics. To obtain convergence of the SGMP, we need to reduce both learning rate and momentum over time. Hence, we need to discuss a heterogeneous version of the SGMP. Indeed, we now allow the mass $m$ to be a non-constant function of time, in particular one that is reduced to zero over time, and the index process $(i_\beta(t))_{t\ge 0}$ to have waiting times that get smaller as time progresses. Thus, we study the case where the stochastic gradient-momentum process converges to the stochastic gradient process and the gradient flow, respectively, at runtime. We have already seen the advantage of reducing the mass over time in the deterministic setting in Figure 2. We split this section into two parts. In Subsection 3.1, we let only the momentum depend on and decrease over time. We introduce the dynamical system, show well-posedness, and discuss the longtime behaviour of the new dynamical system. We see that, asymptotically, the dynamical system behaves like the stochastic gradient process. Then, in Subsection 3.2, we decrease mass and learning rate over time. In this case, we discuss a setting in which we see convergence of the stochastic dynamical system to the minimiser $\theta^*$ of the target function $\Phi$.

3.1 Losing momentum over time

We consider the stochastic gradient-momentum process with a time-dependent, decreasing mass.
Definition 8 The decreasing-mass stochastic gradient-momentum process (dmSGMP) is a solution of the following stochastic differential equation,
\[
\mathrm{d}\tilde q_t = \tilde p_t\,\mathrm{d}t, \qquad m(t)\,\mathrm{d}\tilde p_t = -\nabla\Phi_{i(t)}(\tilde q_t)\,\mathrm{d}t - \tilde p_t\,\mathrm{d}t, \qquad \tilde p_0, \tilde q_0 \in X, \tag{19}
\]
where $\Phi_j$ satisfies Assumption 1, $j = 1,\dots,N$. The stochastic process $(i(t))_{t\ge 0}$ is defined in Definition 1. The mass $m(t) > 0$ is strictly decreasing and differentiable with $\lim_{t\to\infty} m(t) = 0$.

The formal limit of (19) is the stochastic gradient process without momentum, as given in (3). As in the previous section, we start our discussion by considering the decreasing-mass case with a fixed index, i.e. the following dynamical system:
\[
\mathrm{d}q^i_t = p^i_t\,\mathrm{d}t, \qquad m(t)\,\mathrm{d}p^i_t = -\nabla\Phi_i(q^i_t)\,\mathrm{d}t - p^i_t\,\mathrm{d}t, \qquad p^i_0, q^i_0 \in X, \tag{20}
\]
where the mass $(m(t))_{t\ge 0}$ is chosen as before. In addition, we denote $E(t) := \int_0^t 1/m(s)\,\mathrm{d}s$. The associated limiting equation is given by (43). Next, we discuss the well-posedness of the system (20).

Well-posedness We study the well-posedness of the deterministic dynamical system (20). Well-posedness of the stochastic dynamical system (19) then follows from, e.g., Proposition 1 in Latz (2021).

Proposition 9 For any fixed $i \in I$, let $\Phi_i$ satisfy Assumption 1 with constant $L$ and let $m(t)$ satisfy Assumption 3. Then (20) admits a unique solution.

Proof We first rewrite equation (20) as
\[
\mathrm{d}q^i_t = p^i_t\,\mathrm{d}t, \qquad \mathrm{d}p^i_t = -\nabla\Phi_i(q^i_t)/m(t)\,\mathrm{d}t - p^i_t/m(t)\,\mathrm{d}t, \qquad p^i_0, q^i_0 \in X.
\]
We define a map $T_0: C([0,t_1],\mathbb{R}) \times C([0,t_1],\mathbb{R}) \to C([0,t_1],\mathbb{R}) \times C([0,t_1],\mathbb{R})$ by
\[
T_0(x,y) = \Big(q_0 + \int_0^t y_s\,\mathrm{d}s,\ p_0 - \int_0^t \nabla\Phi_i(x_s)/m(s)\,\mathrm{d}s - \int_0^t y_s/m(s)\,\mathrm{d}s\Big),
\]
where $0 \le t \le t_1$. Next, we show that $T_0$ is a contraction for $t_1$ small enough. Since $m(t)$ is strictly decreasing and positive and using Assumption 3, we can conclude that $(e^{\lambda t}m(t))' \ge 0$ and $m(t) \ge m_0 e^{-\lambda t}$. We then conclude
\[
\|T_0(q,p) - T_0(\hat q,\hat p)\| \le t_1\|p - \hat p\| + \big(L\|q - \hat q\| + \|p - \hat p\|\big)\int_0^{t_1}\frac{1}{m(s)}\,\mathrm{d}s \le \big(\|q - \hat q\| + \|p - \hat p\|\big)\frac{L+1}{\lambda m_0}\big(e^{\lambda t_1} - 1 + t_1\big),
\]
where in the last step we may assume, without loss of generality, that $\lambda m_0 \le L + 1$.
Jin, Latz, Liu, Scagliotti Set t1 = λm0 2(L+1)(2λ+1) =: C(λ, m0, L), and since eλx 1 2λx when λx 1, we have λm0 (eλt1 1 + t1) L + 1 λm0 (2λ + 1)t1 1 Then T0 is contracting with constant smaller than 1/2. By the Banach Fixed Point Theorem, we conclude that (20) admits a unique solution on the interval [0, t1] with initial value (q0, p0). Let Tn : C([tn, tn+1], R) C([tn, tn+1], R) C([tn, tn+1], R) C([tn, tn+1], R), with Tn(x, y) = qtn + Z t tn ys/m(s)ds, ptn Z t tn Φi(xs)/m(s)ds Z t tn ys/m(s)ds . Then we have Tn(q, p) Tn(ˆq, ˆp) (tn+1 tn) p ˆp + L q ˆq ( q ˆq + p ˆp )L + 1 λm0 eλtn(eλhn 1 + hn), where hn = tn+1 tn. Set hn = C(λ, m0, L)e λtn = λm0e λtn 2(L+1)(2λ+1), then λm0 eλtn(eλhn 1 + hn) L + 1 λm0 eλtn(2λ + 1)hn 1 This implies Tn is a contraction, and by the Banach Fixed Point Theorem, we conclude that (20) admits a unique solution on the interval [tn, tn+1] with initial value ( qtn, ptn). Therefore, by Lemma 31, we have tn , and, thus, (20) admits a unique solution globally. As the unique solution to the dynamical system (20) is a strong solution, the solution is continuously differentiable. Thus, the map (pi 0, pi 0) 7 (pi t, pi t) is continuous for any t 0. Hence, the differential equations (19) and (20) are well-posed. Longtime behavior Having understood the well-posedness of dm SGMP, we are now interested in its longtime behaviour. By reducing the mass over time, we aim to reduce the momentum and prove convergence to the stochastic gradient process (3). To show this convergence result, we now collect a number of auxiliary results that mainly concern the deterministic system (20) that considers a single, fixed sample. Some of these results may remind the reader of similar results shown in Appendix B regarding the fixed sample, homogeneous momentum dynamics; we make these connections clear in the titles of the following results. The intermediate goal is to show a bound between the system (20) and its limiting equation (43). 
From there we then go back to the randomised setting and discuss the longtime behaviour and convergence of dm SGMP. We first show that the deterministic system converges at exponential speed to its unique stationary point. Losing Momentum in Stochastic Optimisation Lemma 10 (cf. Lemma 24) Let (pi t, qi t)t 0 solve (20). Let θi be the critical point of Φi for i I. Under the Assumptions 1, 2 with Aλi,2, and 3 with λ = mini I λi, we have qi t θi 2 8e λt V i(0, qi 0, pi 0) (t 0), (21) where V i(t, x, y) = m(t)(Φi(x) Φi(θi )) + 1 4 x θi + m(t)y 2 + m(t)y 2 . Proof By Lemma 32, we have qi t θi 2 8V i(t, qi t, pi t) 8e λt V i(0, qi 0, pi 0). In the next auxiliary result, we show Lipschitz continuity of the position (qi t)t 0 and boundedness of the velocity (pi t)t 0. Lemma 11 (cf. Lemma 25) Let (pi t, qi t)t 0 solve (20). Let θi be the critical point of Φi for i I. Under the Assumptions 1, 2 with Aλi,2, and 3 with λ = mini I λi, for any 0 s t, we have qi t qi s C(1) L (t s) qi 0 θi + (m0 + e E(s)) pi 0 , pi t e E(t) pi 0 + C(1) L ( qi 0 θi + m0 pi 0 ) (t 0), where C(1) L = 1 + 8L2 + 8L. Proof From (20), we have d(e E(t)pi t) dt = e E(t) Φi(qi t)/m(t), which implies pi t = e E(t)pi 0 e E(t) Z t 0 e E(s) Φi(qi s)/m(s)ds. (22) pi t e E(t) pi 0 + e E(t) Z t Φi(qi s) ds =e E(t) pi 0 + e E(t) Z t Φi(qi s) Φi(θi ) ds e E(t) pi 0 + Le E(t) Z t (21) e E(t) pi 0 + 8Le E(t)(V i(0, qi 0, pi 0)) 1 2 Z t e E(t) pi 0 + 8L(V i(0, qi 0, pi 0)) 1 2 . The last inequality holds since R t 0 e E(s) m(s)e λs 2 ds e E(t), which can be seen by taking the derivative on both sides and comparing the initial values. And since V i(0, qi 0, pi 0) (1 + L)( qi 0 θi 2 + m2 0 pi 0 2), we have pi t e E(t) pi 0 + (1 + 8L2 + 8L)( qi 0 θi + m0 pi 0 ). Jin, Latz, Liu, Scagliotti We set C(1) L = (1 + 8L2 + 8L) and have qi t qi s Z t pi m dm C(1) L (t s)( qi 0 θi + m0 pi 0 ) + pi 0 Z t C(1) L (t s) qi 0 θi + (m0 + e E(s)) pi 0 . In the third lemma, we give a bound on the system s acceleration (dpi t/dt)t 0. 
Lemma 12 (cf. Lemma 26) Let (pi t, qi t)t 0 solve (20). Let θi be the critical point of Φi for i I. Under the Assumptions 1, 2 with Aλi,2, and 3 with λ = mini I λi, we have ( pi 0 + qi 0 θi ) (t 0), (23) where C(2) L = 8(C(1) L )2 = 8(1 + 8L2 + 8L)2. Proof From (22), we have dpi t dt = e E(t)pi 0 m(t) Φi(qi t) m(t) + e E(t) 0 e E(s) Φi(qi s)/m(s)ds. Notice that Φi(qi t) m(t) = e E(t) 0 e E(s) Φi(qi t)/m(s)ds + e E(t) Φi(qi t) m(t) . Then, we can rewrite dpi t dt as dpi t dt = e E(t)(pi 0 + Φi(qi t)) m(t) + e E(t) 0 e E(s)( Φi(qi s) Φi(qi t))/m(s)ds. Losing Momentum in Stochastic Optimisation Hence, dpi t dt pi 0 + Φi(qi t) Φi(θi ) + Z t 0 e E(s) Φi(qi s) Φi(qi t) /m(s)ds (1 + L)e E(t) pi 0 + qi t θi + Z t 0 e E(s) qi s qi t /m(s)ds 4C(1) L e E(t) qi 0 θi + pi 0 0 (t s)e E(s)/m(s)ds qi 0 θi + (m0 + e E(s)) pi 0 = 4C(1) L e E(t) qi 0 θi 1 + C(1) L 0 (t s)e E(s)/m(s)ds + 4C(1) L e E(t) pi 0 1 + C(1) L 0 (t s)e E(s)(m0 + e E(s))/m(s)ds 4C(1) L e E(t) m(t) + 2C(1) L qi 0 θi + 4C(1) L e E(t) m(t) + 4C(1) L pi 0 , where the last step follows since Z t 0 (t s)e E(s)/m(s)ds 2m(t)e E(t) and the inequality (a2) follows from Lemmas 10 and 11 and the fact that V i(0, qi 0, pi 0) (1 + L)( qi 0 θi 2 + m2 0 pi 0 2). Hence, dpi t dt 4C(1) L e E(t) m(t) + 4C(1) L ( pi 0 + qi 0 θi ) 8(C(1) L )2 e E(t) m(t) + 2 ( pi 0 + qi 0 θi ). We set C(2) L = 8(C(1) L )2 = 8(1 + 8L2 + 8L)2, which completes the proof. We have now attained the aforementioned intermediate goal where we show a bound between the system (20) and its limiting equation (43). We need to additionally ask for some underlying convexity; strong convexity in this case: Indeed, we aim to show that qτn θτn is contractive. Assumption 2 would only lead to contractivity of qτn θ . Proposition 13 (cf. Proposition 27) Let (pi t, qi t)t 0 solve (20). Let θi be the critical point of Φi for i I. We assume Φi is strongly convex with constant µ and let Φi satisfy Assumption 1 and (m(t))t 0 satisfy Assumption 3 with λ = (µ/4) (1/4). 
Then we have qi t θi t e µt/2 qi 0 θi 0 + C(1) L,µm 1 2 0 pi 0 + qi 0 θi (t 0), where C(1) L,µ = 2C(2) L µ 1/2(1 + µ 1/2 + µ1/2) + µ1/2, which is larger than µ1/2. Jin, Latz, Liu, Scagliotti Proof By Lemma 12 and Lemma 28, we have 1 2 d qi t θi t 2 dt = qi t θi t dqi t dt dθi t dt = qi t θi t Φi(qi t) Φi(θi t) m(t) qi t θi t dpi t dt µ qi t θi t 2 + µ qi t θi t 2 + m2(t) qi t θi t 2 + (C(2) L )2(µ) 1m2(t) e 2E(t) m2(t) + 4 pi 0 2 + qi 0 θi 2 qi t θi t 2 + (2C(2) L )2(µ) 1 e 2E(t) + m2(t) pi 0 2 + qi 0 θi 2 , which implies qi t θi t 2 e µt qi 0 θi 0 2 + (C(2) L )2(µ) 1e µt Z t e 2E(s) + m2(s) eµsds pi 0 2 + qi 0 θi 2 . From Lemma 33, we know that e µt R t 0 m2(s)eµsds 2µ 1m2(t) 2µ 1m2 0 2µ 1m0, and e µt R t 0 e 2E(s)eµsds (1 + µ)m0, which implies qi t θi t 2 e µt qi 0 θi 0 2 + (C(1) L,µ)2m0 pi 0 2 + qi 0 θi 2 . From Lemma 11 and Proposition 13, we immediately have the following corollary. Corollary 14 Under the same condition as Proposition 13. For 0 < m0 1 and any fixed i I, we have qi t θi t e µt/2 qi 0 θi 0 + C(1) L,µm0 pi 0 + qi 0 θi 0 + θi 0 θi , pi t (e E(t) + C(1) L m0) pi 0 + C(1) L qi 0 θi 0 + θi 0 θi (t 0). We can now show the main statement of this section. The convergence of dm SGMP to the stochastic gradient process in the longtime limit. Importantly, we assume a coupling between the processes through the index process (i(t))t 0 that is identical in both dynamical systems. Throughout this section, we collected evidence for a contractive behaviour in between the deterministic processes (20) and (43). We now use these results to study the randomised version piece-by-piece. This, of course, is a very similar strategy to that used in Theorem 6. Theorem 15 Let ( pt, qt)t 0 and (θt)t 0 solve (19) and (3), respectively. Let θi be the critical point of Φi for i I. We assume that Φi is strongly convex with constant µ and let Φi satisfy Assumption 1 and m(t) satisfy Assumption 3 with λ = (µ/4) (1/4) for i I. 
Losing Momentum in Stochastic Optimisation Let {τn}n 1 be the sequences of the jump times of the process (i(t))t 0. If we assume that m0 (C(2) L,µ) 2µ2 4(2γN+µ)2 , where C(2) L,µ = C(1) L,µ + C(1) L , then E h qτn θτn i 0 (n ). Furthermore, if we additionally assume that m decays at least exponentially, i.e. there exist C, c > 0 such that m(t) Ce ct, we have qt θt 0 (t ) almost surely and in expectation. Remark 16 After the statement of Assumption 3, we have concluded that m(t) m0e λt. Thus, the constant c mentioned above needs to be smaller than λ. Proof For any n 0, by Corollary 14, and since the initial value of m(t) in the interval [τn, τn+1) is exactly m(τn), we have qτn+1 θτn+1 e µ(τn+1 τn)/2 qτn θτn + C(1) L,µm(τn) qτn θτn + pτn + θτn θi(τn) and pτn+1 e E(τn+1 τn) + C(1) L m(τn) pm τn + C(1) L qτn θτn + θτn θi(τn) . Lemma 30 implies that θτn θi(τn) θ0 + KΦ = KΦ,θ0. We denote an := e µ(τn+1 τn)/2 and obtain qτn+1 θτn+1 an qτn θτn + C(1) L,µm(τn) qτn θτn + pτn + KΦ,θ0 e E(τn+1 τn) e (τn+1 τn)/m0 e (τn+1 τn)(C(2) L,µ)2 e µ(τn+1 τn)/2 := an. Hence, we also have pτn+1 (an + C(1) L m(τn)) pτn + C(1) L qτn θτn + KΦ,θ0 . We denote An := qτn θτn , Bn := pτn and then have the following iteration inequalities: An+1 an An + C(1) L,µm(τn) An + Bn + KΦ,θ0 , Bn+1 (an + C(1) L m(τn))Bn + C(1) L An + KΦ,θ0 . 
Jin, Latz, Liu, Scagliotti Moreover, denoting Dn = An + m 1 2 (τn)Bn, C(2) L,µ = C(1) L,µ + C(1) L , we have Dn+1 (an + C(1) L,µm(τn))An + m 1 2 (τn+1)(an + C(1) L m(τn))Bn + C(1) L,µm(τn)Bn (24) + m 1 2 (τn+1)C(1) L An + C(1) L,µm(τn)KΦ,θ0 + C(1) L m 1 2 (τn+1)KΦ,θ0 an + C(2) L,µm 1 2 (τn) Dn + C(2) L,µm 1 2 (τn)KΦ,θ0 an + C(2) L,µm 1 2 0 Dn + C(2) L,µm 1 2 (τn)KΦ,θ0 (25) and since τn+1 τn are exponentially distributed with parameter γN is independent of Fτn, we have E[Dn+1] E[e µ(τn+1 τn)/2 + C(2) L,µ m0]E[Dn] + C(2) L,µKΦ,θ0E[m 1 2 (τn)], (26) E[D2 n+1] E[(e µ(τn+1 τn)/2 + C(2) L,µ m0)2]E[D2 n] + (C(2) L,µKΦ,θ0)2E[m(τn)] (27) + C(3) L,µ m0KΦ,θ0E[Dn], where C(3) L,µ = 2C(2) L,µ(1 + C(2) L,µ). Since m0 (C(2) L,µ) 2µ2 4(2γN+µ)2 , we have E[e µ(τn+1 τn)/2 + C(2) L,µ m0] c1, where c1 = µ+4γN 2(2γN+µ) < 1. From the iteration inequality and Lemma 34, we know that limn E[Dn] = 0, which finish the proof of the first part. If we assume m(t) Ce ct, we then have E[m 1 2 (τn)] CE[e cτn 2 ] = CE h n Y i=1 e c(τi τi 1) i=1 E h e c(τi τi 1) 2 i = C(c2)n where c2 := E h e c(τi τi 1) 2 i = 2γN c+2γN which is a constant does not depend on i and that is smaller than 1. We then rewrite (26) as E[Dn+1] c1E[Dn] + CC(3) L,µ m0KΦ,θ0(c2)n, which follows from the second part of Lemma 34. We then have E[Dn] Be bn. This also implies E[D2 n+1] c3E[D2 n] + CC(3) L,µ m0KΦ,θ0(c4)n. Furthermore, by using Lemma 34, we have E[D2 n] Be bn for some B, b > 0. The Markov inequality implies that we have n=1 P(|Dn| δ) δ 1 + X n=1 E[Dn] δ 1 + X n=1 E[Dn] δ 1 B n=1 e bn < + , Losing Momentum in Stochastic Optimisation for any δ > 0. This implies that Dn 0 almost surely as n , since Nt almost surely as t . We then have DNt 0 almost surely as t . By Corollary 14, for any t 0, we have qt θt e µ(t τNt)/2 qτNt θτNt + C(1) L,µm(τNt) qτNt θτNt + θτNt θ i(τNt) (1 + C(3) L,µ)DNt + C(1) L,µm(τNt)KΦ,θ0. This implies that qt θt 0 almost surely as t since DNt 0 and m(τNt) 0 almost surely. 
Also, in order to prove $\mathbb{E}[\|\tilde q_t - \theta_t\|] \to 0$, it is sufficient to show $\mathbb{E}[D_{N_t}] + \mathbb{E}[m(\tau_{N_t})] \to 0$ as $t \to +\infty$. Since
\[
\sup_{t\ge 0}\mathbb{E}\big[D^2_{N_t}\big] = \sup_{t\ge 0}\sum_{k=0}^\infty \mathbb{E}\big[D^2_k \mathbb{1}_{N_t = k}\big] \le \sum_{k=0}^\infty \mathbb{E}\big[D^2_k\big] < +\infty,
\]
the family $\{D_{N_t}\}_{t\ge 0}$ is uniformly integrable. Then, $\mathbb{E}[D_{N_t}] \to 0$ since $D_{N_t} \to 0$ almost surely. Note that $\lim_{t\to+\infty} m(\tau_{N_t}) = 0$. By the Dominated Convergence Theorem, we get $\lim_{t\to+\infty}\mathbb{E}[m(\tau_{N_t})] = 0$. This completes the proof.

3.2 Losing both momentum and randomness over time

In the previous subsection, we have seen that the stochastic gradient-momentum process with decreasing mass approaches the stochastic gradient process. Asymptotically, the stochastic gradient process converges to a stationary distribution, not necessarily to a single point. As discussed before, to attain convergence to a single point, we usually need to decrease the learning rate over time, see Latz (2021). We now study the following model.

Definition 17 The decreasing-mass, decreasing-learning-rate stochastic gradient-momentum process (ddSGMP) is a solution of the following stochastic differential equation,
\[
\mathrm{d}\hat q_t = \hat p_t\,\mathrm{d}t, \qquad m(t)\,\mathrm{d}\hat p_t = -\nabla\Phi_{i_\beta(t)}(\hat q_t)\,\mathrm{d}t - \hat p_t\,\mathrm{d}t, \qquad \hat p_0, \hat q_0 \in X, \tag{28}
\]
where $\Phi_j$ satisfies Assumption 1, $j = 1,\dots,N$. The mass $m(t) > 0$ is strictly decreasing and differentiable with $\lim_{t\to\infty} m(t) = 0$. The stochastic process $(i(t))_{t\ge 0}$ is defined in Definition 1. For the re-scaled index process $(i_\beta(t))_{t\ge 0} = (i(\beta(t)))_{t\ge 0}$, we assume that $\beta(t) = \int_0^t \eta(s)\,\mathrm{d}s$, $t \ge 0$, with $\eta: [0,\infty) \to (0,\infty)$ being a non-decreasing, continuously differentiable function.

As opposed to the previous section, we do not consider the stochastic gradient process with constant learning rate as the limiting system, but the associated stochastic gradient process with decreasing learning rate, given by
\[
\mathrm{d}\xi_t = -\nabla\Phi_{i_\beta(t)}(\xi_t)\,\mathrm{d}t, \qquad \xi_0 \in X. \tag{29}
\]
Next, we define the sequence $\{\tau_n\}_{n\ge 1}$ to be the jump times of the process $(i(t))_{t\ge 0}$ and $\tau^\beta_n = \beta^{-1}(\tau_n)$ to be the jump times of $(i_\beta(t))_{t\ge 0}$.
We denote by
\[
\Omega^\alpha_n := \Big\{\frac{\mu}{2\eta(\tau^\beta_{n+1})} \ge \frac{\alpha_1}{\sqrt{n}} \ \text{and}\ m(\tau^\beta_n) \le \alpha_2 e^{-\alpha_3\sqrt{n}}\Big\}, \qquad n \in \mathbb{N},
\]
an event that is used to impose a growth condition on $\beta$ and $m$. Regarding this event, we now state the following assumption.

Assumption 4 For $n \ge k$, let $W^{\alpha,n}_k = \bigcap_{i=k}^n \Omega^\alpha_i$. There exist $\alpha_1, \alpha_2, \alpha_3 > 0$ such that $\lim_{k\to+\infty} \mathbb{P}(W^{\alpha,\infty}_k) = 1$.

The event $W^{\alpha,n}_k$ is increasing in $k$ and decreasing in $n$. Assumption 4 implies that the complement of $\Omega^\alpha_n$ is eventually small. We describe a setting for which this assumption holds in Example 1 below, after stating and proving the main results of this section: the convergence of the ddSGMP to the stochastic gradient process with decreasing learning rate, and the convergence of the ddSGMP to the minimiser of the target function. In neither of these cases do we obtain a convergence rate. We later study the speed of convergence through numerical experiments in Section 5. But now we start with the first of the two aforementioned results.

Theorem 18 Let $(\hat p_t, \hat q_t)_{t\ge 0}$ and $(\xi_t)_{t\ge 0}$ solve (28) and (29), respectively. For $i \in I$, we assume that $\Phi_i$ is strongly convex with constant $\mu$ and let $\Phi_i$ satisfy Assumption 1. Let $(m(t))_{t\ge 0}$ satisfy Assumption 3 with $\lambda = (\mu/4) \wedge (1/4)$. In addition, we assume that $(m(t))_{t\ge 0}$ and $(\beta(t))_{t\ge 0}$ satisfy Assumption 4. Then, we have $\|\hat q_t - \xi_t\| \to 0$ almost surely, as $t \to \infty$.

Proof For any $n \ge 0$, by Corollary 14, we have the following iteration inequality,
\[
\|\hat q_{\tau^\beta_{n+1}} - \xi_{\tau^\beta_{n+1}}\| \le e^{-\mu(\tau^\beta_{n+1} - \tau^\beta_n)/2}\|\hat q_{\tau^\beta_n} - \xi_{\tau^\beta_n}\| + C^{(1)}_{L,\mu} m(\tau^\beta_n)\Big(\|\hat q_{\tau^\beta_n} - \xi_{\tau^\beta_n}\| + \|\hat p_{\tau^\beta_n}\| + \|\xi_{\tau^\beta_n} - \theta^*_{i(\tau^\beta_n)}\|\Big)
\]
and
\[
\|\hat p_{\tau^\beta_{n+1}}\| \le \Big(e^{-\mu(\tau^\beta_{n+1} - \tau^\beta_n)/2} + C^{(1)}_L m(\tau^\beta_n)\Big)\|\hat p_{\tau^\beta_n}\| + C^{(1)}_{L,\mu}\Big(\|\hat q_{\tau^\beta_n} - \xi_{\tau^\beta_n}\| + \|\xi_{\tau^\beta_n} - \theta^*_{i(\tau^\beta_n)}\|\Big).
\]
Since $\frac{\mathrm{d}\beta^{-1}}{\mathrm{d}t}(t) = \frac{1}{\eta(\beta^{-1}(t))}$ and $\eta$ is non-decreasing, we have
\[
\tau^\beta_{n+1} - \tau^\beta_n = \beta^{-1}(\tau_{n+1}) - \beta^{-1}(\tau_n) \ge (\tau_{n+1} - \tau_n)/\eta(\beta^{-1}(\tau_{n+1})).
\]
Hence, under the event W α,n k , for n k, we have ˆqτ β n+1 ξτ β n+1 e µ(τ β n+1 τ β n )/2 ˆqτ β n ξτ β n + C(1) L,µm(τ β n ) ˆqτ β n ξτ β n + ξτ β n θi(τ β n ) e α1(τn+1 τn) n + C(1) L,µα2e α3 n ˆqτ β n ξτ β n + C(1) L,µα2e α3 n ˆpτ β n + KΦ,ξ0α2e α3 n and ˆpτ β n e α1(τn+1 τn) n + C(1) L,µα2e α3 n ˆpτ β n + C(1) L,µ ˆqτ β n ξτ β n + ξτ β n θi(τ β n ) Since τn+1 τn is independent of ˆqτ β n ξτ β n and W α,n 1 k , for n k, we have E h ˆqτ β n+1 ξτ β n+1 i γN γN + α1 n + C(1) L,µα2e α3 n E h ˆqτ β n ξτ β n 11W α,n 1 k +C(1) L,µα2e α3 n E h ˆpτ β n 11W α,n 1 k i + KΦ,ξ0α2e α3 n E h ˆpτ β n+1 i γN γN + α1 n + C(1) L,µα2e α3 n E h ˆpτ β n 11W α,n 1 k + C(1) L,µE h ˆqτ β n ξτ β n 11W α,n 1 k Aβ n := E h ˆqτ β n ξτ β n i , Bβ n := E h ˆpτ β n Let Dβ n = Aβ n + α2e α3 n Bβ n. We can find some constant 0 < Cγ,N,L,µ < 1 such that for n k and k large enough Dβ n+1 1 Cγ,N,L,µ n Dβ n + CΦ,ξ0,Le α3 n. (30) Let Gn k := Qn i=k 1 Cγ,N,L,µ with Gn n = 1 Cγ,N,L,µ n 1 2. From (30), we have Dβ n+1 Gn k Dβ k + 2CΦ,ξ0,L Jin, Latz, Liu, Scagliotti From Lemma 35, we know that there exist some constant c > 0 such that Gn i e c( n i). This implies Dβ n+1 Ce c n Dβ k + 2CΦ,ξ0,L Ce c n Dβ k + 2CΦ,ξ0,L Ce c n Dβ k + 2CΦ,ξ0,Lne c α3 n. n=k P ˆqτ β n ξτ β n 11W α,n 1 k ε n=k ne c α3 n < + . Then, the Borel Cantelli Lemma implies that ˆqτ β n ξτ β n 11W α,n 1 k = 0, almost surely. Under the Assumption 4, we have lim k 11(W α, k )c = 0, almost surely. Since ˆqτ β n ξτ β n = ˆqτ β n ξτ β n 11W α,n 1 k + ˆqτ β n ξτ β n 11(W α,n k )c ˆqτ β n ξτ β n 11W α,n 1 k + ˆqτ β n ξτ β n 11(W α, k )c, for k large enough, we have ˆqτ β n ξτ β n almost surely. Similarly, we have lim n e α3 n ˆpτ β n almost surely. Under the event W α,n k , m(τ β n ) is smaller than α2e α3 n. Hence, we have lim n m(τ β n ) ˆpτ β n almost surely. 
Since $\lim_n \tau^\beta_n = +\infty$ and, for $\tau^\beta_{N_t} \le t \le \tau^\beta_{N_t+1}$,
\[
\|\hat q_t - \xi_t\| \le e^{-\mu(t - \tau^\beta_{N_t})/2}\|\hat q_{\tau^\beta_{N_t}} - \xi_{\tau^\beta_{N_t}}\| + C^{(1)}_{L,\mu} m(\tau^\beta_{N_t})\Big(\|\hat q_{\tau^\beta_{N_t}} - \xi_{\tau^\beta_{N_t}}\| + \|\hat p_{\tau^\beta_{N_t}}\| + \|\xi_{\tau^\beta_{N_t}} - \theta^*_{i(\tau^\beta_{N_t})}\|\Big),
\]
we have
\[
\|\hat q_t - \xi_t\| \le \big(1 + C^{(1)}_{L,\mu} m_0\big)\|\hat q_{\tau^\beta_{N_t}} - \xi_{\tau^\beta_{N_t}}\| + C^{(1)}_{L,\mu} m(\tau^\beta_{N_t})\Big(\|\hat p_{\tau^\beta_{N_t}}\| + \|\xi_{\tau^\beta_{N_t}} - \theta^*_{i(\tau^\beta_{N_t})}\|\Big).
\]
This implies that $\|\hat q_t - \xi_t\| \to 0$, almost surely, as $t \to \infty$.

Using this result and prior knowledge about the stochastic gradient process, we can now finally show convergence to the minimiser of $\Phi$.

Corollary 19 Under the same conditions as Theorem 18, we have $\hat q_t \to \theta^*$ in probability, as $t \to \infty$, where $\theta^*$ is the unique minimiser of the function $\Phi$.

Proof From Latz (2021, Theorem 4), we know that $\xi_t$ converges to $\theta^*$ weakly, as $t \to \infty$. This is equivalent to $\|\xi_t - \theta^*\| \to 0$ in probability, since $\theta^*$ is deterministic. Then the result follows by applying Theorem 18.

Hence, we have shown that when losing momentum and decreasing the learning rate over time, we have convergence of the stochastic gradient-momentum process to the minimiser $\theta^*$. We finish this section by discussing the non-trivial Assumption 4. Indeed, we give an example below for how $(\beta(t))_{t\ge 0}$ and $(m(t))_{t\ge 0}$ can be chosen to satisfy the assumption.

Example 1 Let $\beta(t) = t^2$ and $m(t) = m_0 e^{-\lambda t}$, where $0 < m_0 < 1$ is a constant. In this case, $\eta(t) = 2t$, $\beta^{-1}(t) = \sqrt{t}$, and
\[
\Omega^\alpha_n = \Big\{\frac{\mu}{2\eta(\beta^{-1}(\tau_{n+1}))} \ge \frac{\alpha_1}{\sqrt{n}} \ \text{and}\ m(\tau^\beta_n) \le \alpha_2 e^{-\alpha_3\sqrt{n}}\Big\} = \Big\{\frac{\mu}{4\sqrt{\tau_{n+1}}} \ge \frac{\alpha_1}{\sqrt{n}} \ \text{and}\ m_0 e^{-\lambda\sqrt{\tau_n}} \le \alpha_2 e^{-\alpha_3\sqrt{n}}\Big\}.
\]
By a concentration inequality, we have
\[
\mathbb{P}\Big(\tau_n \le \frac{n}{2\gamma N}\Big) = \mathbb{P}\big(e^{-2\gamma N\tau_n} \ge e^{-n}\big) \le e^{n}\,\mathbb{E}\big[e^{-2\gamma N\tau_n}\big] = e^{n} 3^{-n} = (e/3)^n, \tag{31}
\]
\[
\mathbb{P}\Big(\tau_n \ge \frac{2n}{\gamma N}\Big) = \mathbb{P}\big(e^{\gamma N\tau_n/2} \ge e^{n}\big) \le e^{-n}\,\mathbb{E}\big[e^{\gamma N\tau_n/2}\big] = e^{-n} 2^{n} = (2/e)^n. \tag{32}
\]
When $n$ is sufficiently large, $\frac{n}{2\gamma N} \le \tau_n$ and $\tau_{n+1} \le \frac{2(n+1)}{\gamma N}$ already imply that the event $\Omega^\alpha_n$ occurs for a suitable choice of $\alpha_1 > 0$, $\alpha_2 = m_0$, and $\alpha_3 = \lambda/\sqrt{2\gamma N}$. Hence, when $n$ is sufficiently large,
\[
(\Omega^\alpha_n)^c \subseteq \Big\{\tau_n \le \frac{n}{2\gamma N}\Big\} \cup \Big\{\tau_{n+1} \ge \frac{2(n+1)}{\gamma N}\Big\}.
\]
Therefore, from (31) and (32), we have $\sum_{n=1}^\infty \mathbb{P}\big((\Omega^\alpha_n)^c\big) < \infty$. The Borel-Cantelli Lemma then implies that $\lim_{k\to+\infty}\mathbb{P}\big(\bigcup_{i=k}^\infty (\Omega^\alpha_i)^c\big) = 0$, which is equivalent to Assumption 4.
4. Discretisation of the continuous-time system

After the previous theoretical study of stochastic gradient-momentum processes, we now propose a numerical scheme to discretise those dynamical systems. This is crucial for turning the continuous-time dynamics into practical algorithms, see the discussion in Scagliotti and Colli Franzone (2022); Shi et al. (2019); Wang and Sirignano (2021, 2022, 2024): the continuous-time analysis given in Sections 2 and 3 is only worthwhile when the discrete algorithm behaves similarly to the continuous-time dynamics. The most natural discretisation of the SGMP, with an explicit symplectic Euler method, would yield an algorithm very similar to stochastic gradient descent with classical momentum. However, this is hardly appropriate, as explicit schemes are unstable when the mass is small or when it decreases over time. In the following, we aim to obtain a stable, symplectic, and efficient discretisation strategy that retains the correct longtime behaviour for the deterministic parts of the SGMP. We note that, even in the decreasing-mass case, we are committed to using a symplectic solver to retain local conservation laws of the dynamical system, see McLachlan and Stern (2024). First, we discuss the discretisation of the underdamped gradient flow with constant or decreasing mass. For convenience, we briefly recall the ODE of interest:
\[
\mathrm{d}q_t = p_t\,\mathrm{d}t, \qquad m(t)\,\mathrm{d}p_t = -\nabla\Phi(q_t)\,\mathrm{d}t - \alpha p_t\,\mathrm{d}t, \qquad p_0, q_0 \in X. \tag{33}
\]
Since our theoretical analysis investigates the case $m(t) \to 0$ as $t \to \infty$, we need the discretisation scheme to be stable especially for small values of $m(t)$. In this framework, a fully explicit method, such as the forward or symplectic Euler method, would not be suitable for the purpose.
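The instability of a fully explicit scheme for small masses can be seen already on a quadratic toy problem. A minimal sketch (the potential $\Phi(q) = q^2/2$, $\alpha = 1$, and all step-size choices below are ours for illustration):

```python
# Forward-Euler discretisation of the underdamped system
#   dq = p dt,  m dp = (-grad Phi(q) - alpha*p) dt
# for the toy potential Phi(q) = 0.5*q^2 (grad Phi(q) = q) and alpha = 1.
def explicit_final_q(m, h=0.01, steps=200, q0=1.0, p0=0.0):
    q, p = q0, p0
    for _ in range(steps):
        # fully explicit (simultaneous) update of position and velocity
        q, p = q + h * p, p + h * (-q - p) / m
    return q

q_moderate = explicit_final_q(m=1.0)   # moderate mass: trajectory stays bounded
q_tiny = explicit_final_q(m=1e-3)      # tiny mass: the iteration blows up
```

The velocity update multiplies $p$ by a factor of roughly $1 - h/m$, which exceeds one in modulus as soon as $h > 2m$; this is precisely what the semi-implicit scheme below avoids.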
In particular, since the evolution of $t \mapsto p_t$ is directly affected by the value of the mass $m(t)$, it is natural to consider an implicit discretisation for this variable:
\[
m_n\,\frac{p_{n+1} - p_n}{h} = -\big(\nabla\Phi(q_n) + \alpha p_{n+1}\big), \quad n \ge 0, \qquad \text{yielding} \qquad p_{n+1} = \frac{1}{m_n/h + \alpha}\Big(\frac{m_n}{h}\,p_n - \nabla\Phi(q_n)\Big), \tag{34}
\]
where $h > 0$ denotes the discretisation step-size and $m_n := m(nh)$. For the $q$-variable in (33), we use the scheme
\[
q_{n+1} = q_n + h\,p_{n+1} \tag{35}
\]
for every $n \ge 0$. Finally, combining (34) and (35), we obtain the following update rule for the discretised dynamical model:
\[
\begin{cases} p_{n+1} = \dfrac{1}{m_n/h + \alpha}\Big(\dfrac{m_n}{h}\,p_n - \nabla\Phi(q_n)\Big), \\[2mm] q_{n+1} = q_n + h\,p_{n+1}, \end{cases} \tag{36}
\]
for every $n \ge 0$. As we discretise $q$ explicitly and $p$ implicitly, this discretisation scheme is a semi-implicit method. Moreover, the discretisation scheme employed here is symplectic (Hairer et al., 2003, Theorem VI.3.3), so that the discretised dynamics satisfy (local) conservation laws. For more details, we refer the reader to the discussion in (Hairer et al., 2003, Section VI.7).

Remark 20 As suggested in the previous sections and depicted in Figure 3, the parameter $m$ can be interpreted as an interpolation variable between the gradient flow and the underdamped gradient flow. We observe that this is still the case for the discretised dynamical system (36). Indeed, if we set $\alpha = 1$ and let $m \downarrow 0$, (36) turns into
\[
\begin{cases} p_{n+1} = -\nabla\Phi(q_n), \\ q_{n+1} = q_n + h\,p_{n+1}, \end{cases}
\]
which is exactly the gradient descent method.

We finish this section with a comparison of the semi-implicit discretisation of the underdamped gradient flow with other momentum-based optimisation methods. First, we note that we can rephrase the update rule (36) to obtain a recursive definition of $q_n$ that does not involve the velocity variable $p_n$. Namely, from (36) we deduce that
\[
q_{n+1} = q_n + \frac{m_n}{m_n/h + \alpha}\,p_n - \frac{h}{m_n/h + \alpha}\,\nabla\Phi(q_n).
\]
Then, recalling that $h\,p_n = q_n - q_{n-1}$, we obtain that
\[
q_{n+1} = q_n + \frac{m_n}{m_n + \alpha h}\,(q_n - q_{n-1}) - \frac{h}{m_n/h + \alpha}\,\nabla\Phi(q_n). \tag{37}
\]
This allows us to easily compare our scheme with other momentum-based optimisation schemes.
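A minimal implementation sketch of the semi-implicit update rule (36), applied to an illustrative quadratic objective (the objective, mass, and step-size below are our own choices):

```python
# Semi-implicit scheme (36): implicit in p, explicit in q. Sketched for the
# illustrative quadratic Phi(q) = 0.5*(q - 1)^2 with minimiser q* = 1; the
# tiny mass below would make a fully explicit scheme blow up.
def semi_implicit_step(q, p, grad, m, h, alpha=1.0):
    p_new = (m / h * p - grad(q)) / (m / h + alpha)   # implicit p-update (34)
    q_new = q + h * p_new                             # explicit q-update (35)
    return q_new, p_new

grad = lambda q: q - 1.0
q, p = 5.0, 0.0
m, h = 1e-4, 0.1
for _ in range(200):
    q, p = semi_implicit_step(q, p, grad, m, h)
```

In contrast to the explicit scheme, the division by $m/h + \alpha$ keeps the update bounded even for vanishing mass, and for $m \to 0$ the iteration reduces to gradient descent with step $h$, as noted in Remark 20.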
In particular, we recall that Polyak's Heavy Ball method has constant coefficients in front of the momentum term $q_n - q_{n-1}$ and in front of $\nabla\Phi(q_n)$. Moreover, when writing Nesterov's update rule in single-variable form, we obtain:
\[
q_{n+1} = q_n + \xi_n(q_n - q_{n-1}) - h\nabla\Phi(q_n) - \xi_n h\big(\nabla\Phi(q_n) - \nabla\Phi(q_{n-1})\big),
\]
with $\xi_n \equiv \frac{1 - \sqrt{\mu h}}{1 + \sqrt{\mu h}}$ when $\Phi$ is $\mu$-strongly convex, or $\xi_n = \frac{n}{n+3}$ when $\Phi$ is just convex. Our scheme in (37) has a structure that is reminiscent of the one considered in Attouch et al. (2019a,b), where a time rescaling was introduced in the continuous-time dynamics, yielding a non-constant (and diverging) multiplicative factor in front of the gradient of the objective. However, it is important to note that in Attouch et al. (2019a,b) the gradient is evaluated at $q_{n+1}$, while in (37) we have $\nabla\Phi(q_n)$.

4.1 Step-size choice

The first natural question about the discrete-time optimisation method (36) concerns the choice of the discretisation step-size $h > 0$. We derive a heuristic rule, assuming for simplicity that $m_n \equiv m$ for every $n \ge 1$. From (36), we obtain
\[
q_{n+1} = q_n + \frac{m}{m + \alpha h}\,h\,p_n - \frac{h^2}{m + \alpha h}\,\nabla\Phi(q_n)
\]
for every $n \ge 0$. With a backward induction argument, assuming that $p_0 = 0$, we deduce that
\[
q_{n+1} = q_n - \frac{h^2}{m + \alpha h}\sum_{k=0}^{n}\Big(\frac{m}{m + \alpha h}\Big)^{n-k}\nabla\Phi(q_k).
\]
The previous identity suggests that the position at the step $n+1$ is obtained through $n+1$ evaluations of the gradient at the previous points of the discrete trajectory. Moreover, each evaluation is weighted by a power of the forgetting coefficient $\frac{m}{m + \alpha h} < 1$. Therefore, it is reasonable to ask that the sum of the weights is of order $\frac{1}{L}$, where $L > 0$ is the Lipschitz constant of $\nabla\Phi$, see Assumption 1. In particular, we require
\[
\frac{h^2}{m + \alpha h}\sum_{k=0}^{n}\Big(\frac{m}{m + \alpha h}\Big)^{k} \approx \frac{1}{L}.
\]
Taking the limit as $n \to \infty$ in the previous sum, we obtain
\[
\frac{h^2}{m + \alpha h}\cdot\frac{m + \alpha h}{\alpha h} = \frac{h}{\alpha} \approx \frac{1}{L}, \qquad \text{or equivalently} \qquad h \approx \frac{\alpha}{L}. \tag{38}
\]
We first note that the expression on the right-hand side of (38) does not depend on the parameter $m$. Thus, according to the empirical argument presented above, there is no need to adjust the magnitude of the step-size according to $m$, also not if $m$ changes over time.
This fact deviates from what we would observe for the usual explicit discretisation of the t ↦ p_t variable in (33). The second observation is that we not only see convergence of our discrete method to gradient descent as m → 0 (see Remark 20); we also recover the correct step-size for the gradient descent scheme we converge to. We end this section with a remark on a rescaling of the potential Φ.

Remark 21 Let us assume that the potential Φ is multiplied by a positive constant ρ > 0, i.e. Φ′ = ρΦ. A natural question is how we should rescale the parameters of the method (36) such that the sequence of positions (q′_n)_{n≥0} coincides with the sequence (q_n)_{n≥0} corresponding to the original objective. It turns out that it is sufficient to set h′ = h/ρ and m′_n = m_n/ρ. Indeed, if we consider the sequences (q′_n)_{n≥0} and (p′_n)_{n≥0} obtained with

p′_{n+1} = (m′_n/h′ + α)^{-1} ( (m′_n/h′) p′_n − ∇Φ′(q′_n) ),  p′_0 = 0,
q′_{n+1} = q′_n + h′ p′_{n+1},  q′_0 = q_0,

then a direct computation yields q′_n = q_n and p′_n = ρ p_n for every n ≥ 1, where the sequences (q_n)_{n≥0} and (p_n)_{n≥0} are obtained using (36) with the original objective Φ and the parameters h and m_n.

4.2 Randomised version of the method

The stochastic dynamical systems discussed throughout this work consist of piecewise-deterministic ODEs, where the pieces are determined by random waiting times and a subsampling process. After having discussed the ODE discretisation in the previous subsections, we now incorporate the stochasticity into the discrete-time setting. In this regard, the random waiting time for the switching of the objective, despite being a cornerstone of the theoretical analysis, does not seem natural in view of practical implementations. The use of random waiting times in practice within this framework has been discussed by Jin et al. (2023); Latz (2021). In the present work, we replace those random waiting times by deterministic waiting times that coincide with the time steps of the algorithm.
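The rescaling in Remark 21 can be verified directly: running (36) on ρΦ with h′ = h/ρ and m′ = m/ρ reproduces the original positions exactly, while the velocities are multiplied by ρ. A quick check on a toy quadratic (the potential and all values are our own illustrative choices):

```python
def step(q, p, grad, m, h, alpha):
    # one iteration of the semi-implicit scheme (36)
    p = ((m / h) * p - grad(q)) / (m / h + alpha)
    return q + h * p, p

rho, h, m, alpha = 3.0, 0.1, 0.5, 1.0
grad = lambda q: q                 # Phi(q) = q**2 / 2
grad_scaled = lambda q: rho * q    # gradient of the rescaled potential rho * Phi

q1, p1, q2, p2 = 2.0, 0.0, 2.0, 0.0
for _ in range(50):
    q1, p1 = step(q1, p1, grad, m, h, alpha)                     # original problem
    q2, p2 = step(q2, p2, grad_scaled, m / rho, h / rho, alpha)  # rescaled problem
# q2 equals q1 up to round-off, and p2 equals rho * p1
```

The key observation in the computation is that m′_n/h′ = m_n/h, so the implicit velocity update is unchanged up to the factor ρ.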
This is exactly the well-established paradigm of the classical stochastic gradient descent method, as well as of the classical momentum method. The waiting times may either be constant-in-time (homogeneous) or decreasing-in-time (heterogeneous). Throughout this work, we have (implicitly) considered the case in which we choose one potential Φ_{i(t)} at any time t ≥ 0. Of course, we can also employ batch subsampling. There, we have

p_{n+1} = (m_n/h + α)^{-1} ( (m_n/h) p_n − G_n ),  q_{n+1} = q_n + h p_{n+1},    (39)

where

G_n := (1/ℓ) Σ_{r=1}^{ℓ} ∇Φ_{i_r}(q_n),    (40)

and {i_1, . . . , i_ℓ} is a subset of I that is sampled uniformly without replacement, i.e. we do not choose a single potential but rather a set of ℓ potentials at once and optimise with respect to their sample mean. This setting is very useful in practice, as it reduces the subsampling error. This setting is also already fully contained in our theory. To see this, we can just define a new set of potentials (Φ′_K)_{K ∈ I′}, where I′ := {K ⊆ I : #K = ℓ} and Φ′_K := (1/ℓ) Σ_{i ∈ K} Φ_i for K ∈ I′. Then, the mean of the (Φ′_K)_{K ∈ I′} is Φ. What is not contained in our theory, but still helpful in practice, is batch subsampling in epochs, where we do not replace the elements from which we sample in I before we have picked each index once. This setting is not trivially contained in our theory due to its non-Markovian nature.

5. Numerical Experiments

We now present numerical experiments in which we test the discretisation strategy proposed in Section 4. We start with academic convex and non-convex examples. Then, we employ SGMP for the training of a convolutional neural network for the classification of the CIFAR-10 data set. We aim to show how the method compares with the classical momentum method and standard stochastic gradient descent.

5.1 One-dimensional non-smooth stationary point

The first numerical experiment involves the one-dimensional example discussed in Subsection 1.2.
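A minimal sketch of the randomised scheme (39)-(40), drawing ℓ indices uniformly without replacement at every step. The toy potentials Φ_i(q) = ½(q − c_i)², the seed, and all parameters are our own choices for illustration; the mean objective is minimised at the average of the c_i.

```python
import random

random.seed(0)

def sgmp_minibatch(grads, q0, m, h, alpha, batch, n_steps):
    """Scheme (39)-(40): each step averages the gradients of a batch of
    potentials sampled uniformly without replacement."""
    q, p = q0, 0.0
    for _ in range(n_steps):
        idx = random.sample(range(len(grads)), batch)   # without replacement
        g = sum(grads[i](q) for i in idx) / batch       # G_n in (40)
        p = ((m / h) * p - g) / (m / h + alpha)
        q = q + h * p
    return q

centers = [random.gauss(1.0, 0.5) for _ in range(20)]
grads = [lambda q, c=c: q - c for c in centers]   # gradient of 0.5*(q - c)**2
c_bar = sum(centers) / len(centers)               # minimiser of the mean objective
q_end = sgmp_minibatch(grads, q0=5.0, m=0.5, h=0.1, alpha=1.0, batch=10, n_steps=1000)
```

With a constant step-size the iterate fluctuates around the minimiser; increasing the batch size ℓ shrinks the fluctuation, which is the subsampling-error reduction mentioned above.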
We recall that we want to minimise the function

Φ(x) := (ReLU(x) − 1)^2 + x^2 = { x^2 + 1 if x ≤ 0;  2x^2 − 2x + 1 if x > 0 }  (x ∈ R),

which attains its global minimum at the point x* = 1/2. The point x̂ = 0 is a non-smooth stationary point, since Φ(x) ≥ Φ(x̂) for every x ≤ x̂. In Subsection 1.2, we were able to show that the underdamped gradient flow can overcome the point x̂ if α^2 − 8m < 0. We now test this property using our discrete-time method (36) with the step-size kept constant along the iterations. The discretisation scheme asks us to choose the learning rate h according to the Lipschitz constant of Φ′. The derivative of Φ is, in fact, not continuous; in this framework, by Lipschitz constant of Φ′ we mean L_0 = max{ Lip(Φ′|_{x≤0}), Lip(Φ′|_{x≥0}) }. We test the scheme in this example using both the precise Lipschitz constant and overestimated Lipschitz constants. The results are presented in Figure 4. We see that the condition α^2 − 8m < 0 is also relevant for the discrete dynamical system. The inequality appears to be sharp whenever we overestimate the Lipschitz constant, unsurprisingly, as a smaller step-size leads to a more accurate discretisation of the underdamped gradient flow. Using the correct Lipschitz constant makes α^2 − 8m < 0 appear quite conservative: indeed, a much larger range of α, m allows for convergence to the global minimiser. A larger step-size thus makes the method more robust.

Figure 4: The plots depict for which combinations of m, α the discrete-time version of (4) derived in (36) manages to overcome the false minimiser. The blue crosses represent convergence to the global minimiser of (5), the red ones to the origin. The black curve divides the m, α that do and do not satisfy (7). As the step-size h = 1/L gets smaller, the theoretical prediction (7) becomes more accurate. Finally, we observe that the gradient method (corresponding to m = 0) never converges to the global minimiser.
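The escape condition α² − 8m < 0 can be probed directly with the discrete scheme (36) on this example. The sketch below (the step-size, starting point, and parameter pairs are our own choices) distinguishes runs that reach the global minimiser x* = 1/2 from runs trapped by the stationary point at 0:

```python
def grad_phi(x):
    # Phi(x) = (ReLU(x) - 1)**2 + x**2, with derivative 2x for x <= 0
    # and 4x - 2 for x > 0 (discontinuous at 0)
    return 2.0 * x if x <= 0 else 4.0 * x - 2.0

def reaches_global_min(m, alpha, h=0.05, x0=-2.0, n_steps=20_000):
    q, p = x0, 0.0
    for _ in range(n_steps):
        p = ((m / h) * p - grad_phi(q)) / (m / h + alpha)
        q = q + h * p
    return abs(q - 0.5) < 0.1   # global minimiser x* = 1/2

heavy = reaches_global_min(m=1.0, alpha=1.0)    # alpha**2 - 8m = -7 < 0: escapes
light = reaches_global_min(m=0.05, alpha=1.0)   # alpha**2 - 8m = 0.6 > 0: trapped at 0
```

With a large mass the iterates oscillate through x̂ = 0 and settle at 1/2, while with a small mass the dynamics are overdamped on the left branch and decay monotonically to 0, mirroring the blue/red split in Figure 4.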
5.2 Strongly convex example: quadratic objective

We now consider a target function of the form Φ = (1/N) Σ_{i=1}^{N} Φ_i, where

Φ_i(x) = (N/2) x^⊤ A_i x + N b_i^⊤ x,

and where A_i ∈ R^{K×K} is a symmetric and positive definite matrix and b_i ∈ R^K, for i = 1, . . . , N. Importantly, all of the (Φ_i)_{i∈I} are strongly convex. We consider K = 500 (the dimension of the domain) and N = 100. For every i = 1, . . . , N, we have sampled the eigenvalues of the matrix A_i from a uniform distribution on [0.05, 15], and we have obtained b_i using a normal distribution centred at the origin with standard deviation σ = 2. We look at a total of 100 different randomly generated problems and later average over the results. We compare SGD and our method (36). At each iteration we use a mini-batch of ℓ = 10 elements of {Φ_1, . . . , Φ_N} to compute the stochastic approximation of the gradient of Φ. Finally, the discretisation step-size has been chosen polynomially decreasing, namely at the n-th iteration we set h_n = h_0/n. This provides a numerical simulation of the theoretical framework analysed in Section 3.2, where the re-scaling of the process (i(t))_{t≥0} has the effect of progressively decreasing the learning rate. The results are reported in Figure 5. As we can see, SGMP shows faster convergence than the classical SGD scheme. In this case, decreasing the mass parameter m leads to a deterioration of the performance: indeed, consistently with the theoretical predictions, the behaviour of SGMP gets closer to the stochastic gradient process as m diminishes. We also study the case where the mass is non-constant and decreases over time as m_k = m_0 (0.995)^k. In that situation, the decrement of the mass leads to slower convergence as well.
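The set-up of this experiment can be sketched as follows. This is a scaled-down illustration, not the authors' code: we use K = 20, N = 10, a fixed seed, and 2000 iterations so that it runs quickly, and we drop the N-dependent scaling of Φ_i (which does not affect the minimiser).

```python
import numpy as np

rng = np.random.default_rng(1)
K, N, batch = 20, 10, 5

def rand_spd():
    # random SPD matrix with eigenvalues uniform in [0.05, 15], as in Section 5.2
    Q, _ = np.linalg.qr(rng.normal(size=(K, K)))
    return Q @ np.diag(rng.uniform(0.05, 15, K)) @ Q.T

A = [rand_spd() for _ in range(N)]
b = [rng.normal(scale=2.0, size=K) for _ in range(N)]
A_bar, b_bar = np.mean(A, axis=0), np.mean(b, axis=0)
x_star = np.linalg.solve(A_bar, -b_bar)   # minimiser of the mean quadratic

def run_sgmp(m, alpha, h0, n_iter=2000):
    q, p = np.zeros(K), np.zeros(K)
    for n in range(1, n_iter + 1):
        h = h0 / n                                     # polynomially decreasing step
        idx = rng.choice(N, size=batch, replace=False) # mini-batch without replacement
        g = np.mean([A[i] @ q + b[i] for i in idx], axis=0)
        p = ((m / h) * p - g) / (m / h + alpha)
        q = q + h * p
    return np.linalg.norm(q - x_star)

err = run_sgmp(m=1.0, alpha=5.0, h0=0.3)
```

The initial step h_0 = 0.3 respects the heuristic h ≤ α/L from Subsection 4.1 with L = 15 and α = 5.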
Figure 5: Convergence rate comparison for the quadratic, strongly convex objective. The plots show the decreasing distance from the true minimiser achieved by SGD and by the SGMP introduced in (39), with α = 5 and m ∈ {1, 0.5, 0.25}. We consider the case where the mass is constant but the step-size (h_n)_{n=0}^{∞} is decreasing (left), and the case where the mass is reduced as m_k = m_0 (0.995)^k with the step-size decreasing as before (right, see Subsection 3.1). The experiments are repeated 100 times (always resampling the potentials), and we report the mean distance achieved by each method, together with the corresponding standard deviation.

5.3 Non-convex example: polynomial function

We studied the behaviour of our SGMP in the case of a non-convex polynomial function given as the sum Φ = (1/N) Σ_{i=1}^{N} Φ_i, where, for every i = 1, . . . , N,

Φ_i(x) = −(N/2) x^⊤ A_i x + N b_i^⊤ x + (1/4) ‖x‖^4,

and where A_i ∈ R^{K×K} is a symmetric and positive definite matrix and b_i ∈ R^K. The problem is non-convex, since the Hessian of Φ (as well as that of each Φ_1, . . . , Φ_N) is negative definite at the origin x = 0. On the other hand, outside a large enough compact set, the objective function Φ is locally convex. We considered K = 500 (the dimension of the domain) and N = 100. For every i = 1, . . . , N, we have sampled the eigenvalues of the matrix A_i from a uniform distribution on [0.05, 15], and we obtained b_i using a normal distribution centred at the origin with standard deviation 2. At each iteration we use a mini-batch of ℓ = 5 elements of {Φ_1, . . . , Φ_N} to compute the stochastic approximation of the gradient of Φ, and we keep the learning rate constant along the iterations. We have compared the SGMP with SGD. The results are reported in Figure 6.
In this case, it seems that the stochastic momentum method tends to stabilise at lower values of the objective function. Interestingly, we observe that the implementations with smaller m have a faster decay in the initial iterations, as opposed to the results obtained in the previous subsection.

Figure 6: Decay of objective comparison for the non-convex polynomial objective. The plots show the decreasing objective value achieved by SGD and by the SGMP introduced in (39), with α = 0.2 and m ∈ {1, 0.5, 0.25}. We consider the constant-mass regime (left) and the exponentially decreasing case m_k = m_0 (0.995)^k (right). Experiments are run 100 times, and we report the mean objective decrease achieved by the methods, together with the respective standard deviations.

5.4 Convolutional Neural Network (CNN)

We now employ the discrete SGMP method (36) to solve the CIFAR-10 (Krizhevsky, 2009) image classification task with a convolutional neural network (CNN). The CIFAR-10 data set consists of 6 × 10^4 colour images (32 × 32 pixels), which are split into 5 × 10^4 training images and 10^4 test images. CIFAR-10 has 10 classes (e.g. airplane, dog, frog, ...) with 6000 images per class. In the classification task, images with known class are used to train the CNN to automatically recognise the class of any image. We use a VGG-like CNN architecture (Simonyan and Zisserman, 2015). More precisely, we use 3 × 3 kernels with depth 32, 64, and 128. The network contains 6 convolutional layers, each of them followed by ReLU activation, batch normalisation, max-pooling, and drop-out layers. The training data is augmented as in Lee et al. (2015), that is, a random horizontal flip and a 32 × 32 random crop after a 4 × 4 padding.
The training is done with Google Colab using GPUs (often a Tesla V100, sometimes a Tesla A100). We compare SGD, classical momentum, and SGMP (36) with different parameters, with constant mass and with decreasing mass. The experiments are set up in the following way. We train for 800 epochs with batch size ℓ = 100 and no weight decay. We use a constant learning rate η = 0.01 for SGD and for classical momentum. In classical momentum, we set the momentum hyperparameter ρ = 0.9. See Figure 7 for the plots of the training loss for constant mass. In Figure 7, note that with fixed α and h, h SGMP performs better with a small mass m. We have discussed the stability of (36) with small m and a (relatively) large step-size in Subsection 4.1. In addition, when m → 0, the red line (h SGMP) converges to the blue line (SGD), which aligns with Theorem 6. Furthermore, we observe that with fixed step-size h, a small ratio of α to m gives better results, suggesting a possible rule for choosing hyperparameters in practice. For decreasing mass, see Figure 8 for the plots of the training loss for dm SGMP. The decreasing rate is set to be m_k = m_0 (0.995)^k, where k is the number of iterations and m_0 = 0.1 is the initial mass. The training loss for each epoch is calculated by averaging over batches. In Figure 8, dm SGMP achieves a training loss similar to that of the classical momentum method. They might arrive at different minimisers, but this shows that dm SGMP is a practical algorithm for machine learning optimisation for which we are able to lay a solid analytical foundation. For this particular task, dm SGMP has attained a desirable result without decreasing the learning rate; from Corollary 19 we would expect a behaviour similar to SGD. For larger datasets and more complicated tasks, on which we are not able to experiment due to a lack of resources, we conjecture that dm SGMP may be an effective optimisation method with a proper choice of hyperparameters. See Table 1 for the train and test accuracy.
Accuracy is measured using the model obtained after 800 epochs of training. Overall, we conclude that h SGMP achieves competitive test accuracy compared with classical momentum. When decreasing the mass, dm SGMP achieves a competitive training loss compared to classical momentum.

6. Conclusions

In this work, we have proposed and analysed the stochastic gradient-momentum process, a continuous-time dynamics representing momentum-based stochastic optimisation. We have especially analysed the limiting behaviour when reducing the learning rate and/or the particle mass. In this context, the learning rate and the particle mass can either be reduced homogeneously or decrease over time. We have shown pathwise or longtime convergence to the underlying gradient flow or the stochastic gradient process, respectively. We have then proposed a stable discretisation strategy for the stochastic gradient-momentum process and tested the strategy in several numerical examples. In those, we especially saw that the stable discretisation of the stochastic gradient-momentum process can achieve a similar accuracy in CNN training compared with (the possibly unstable) stochastic gradient descent with classical momentum.

Figure 7: Train loss comparison for SGD (η = 0.01), classical momentum (η = 0.01, ρ = 0.9), and h SGMP for CNN on CIFAR-10. We vary α ∈ {1, 10, 15} (top row), m ∈ {0.15, 0.1, 10} (middle row), and h ∈ {0.1, 1, 2} (bottom row); the values are specified in the title of each plot.

Most of the theoretical results we have obtained throughout this work refer to the setting of convex optimisation. Convex optimisation is vital in, e.g., image reconstruction. The training of neural networks, however, usually requires non-convex optimisation, and momentum-based methods are especially popular in non-convex settings. Hence, a natural future research direction is non-convex optimisation problems. The stochastic gradient-momentum process does not represent the adaptivity in the Adam algorithm. To represent the adaptivity, we would need to study a two-sided dependence between (i(t))_{t≥0} and (p(t), q(t))_{t≥0} and a non-linear weighting in front of the gradient. Both these additions to the stochastic gradient-momentum process are a very interesting and challenging direction for future research.

Figure 8: Train loss comparison for SGD (η = 0.01), classical momentum (η = 0.01, ρ = 0.9), h SGMP (α = 10, m = 0.1, h = 1), and dm SGMP (α = 10, m_0 = 0.1, m_k = m_0 (0.995)^k, h = 1) for CNN on CIFAR-10.

Method             | Parameters                  | Train Acc. | Test Acc.
SGD                | η = 0.01                    | 96.902     | 90.98
Classical momentum | η = 0.01, ρ = 0.9           | 97.236     | 91.33
h SGMP             | α = 1, m = 0.1, h = 1       | 97.096     | 91.09
h SGMP             | α = 10, m = 0.1, h = 1      | 96.986     | 91.25
h SGMP             | α = 15, m = 0.1, h = 1      | 96.822     | 91.09
h SGMP             | α = 10, m = 10, h = 1       | 96.556     | 90.75
h SGMP             | α = 10, m = 0.15, h = 1     | 97.35      | 91.35
h SGMP             | α = 10, m = 0.1, h = 2      | 96.984     | 90.87
h SGMP             | α = 10, m = 0.1, h = 0.1    | 97.108     | 91.26
dm SGMP            | α = 10, m_0 = 0.1, h = 1    | 97.288     | 91.32

Table 1: Comparison of train and test accuracy (%) with different hyperparameters on the CIFAR-10 dataset.

Acknowledgments

A.S.
acknowledges partial support from INdAM GNAMPA.

Appendix A. Longtime behaviour of the underdamped gradient flow

We recall that the underdamped gradient flow (15) is defined as the following dynamical system:

dq_t = p_t dt,  dp_t = −∇Φ(q_t) dt − α p_t dt,  p_0, q_0 ∈ X.

We now show that the position in the underdamped gradient flow converges to the global minimiser of the objective function Φ and that the velocity converges to 0. We shall assume that Φ satisfies Assumption 2 with A_{λ,α}, for some λ and the α defined in (15). First, we note that since Φ achieves a global minimum at θ*, θ* is a critical point of Φ. Then, (13) implies that θ* is the unique critical point of Φ. We use a technique similar to that of, e.g., Aujol et al. (2022, 2023); Eberle et al. (2017); Sebbouh et al. (2020). Indeed, we introduce the Lyapunov function V : R^n × R^n → R given by

V(x, y) := Φ(x) − Φ(θ*) + (α^2/4) ( ‖x − θ* + α^{-1} y‖^2 + ‖α^{-1} y‖^2 − λ ‖x − θ*‖^2 ),    (41)

which we employ below. Using V, we can show Lyapunov exponential global stability for the underdamped gradient flow (15).

Proposition 22 Assume that Φ satisfies Assumptions 1 and 2 with A_{λ,α}. Let (p_t, q_t)_{t≥0} be the solution to (15) and let V be the Lyapunov function defined in (41). Then we have the following inequality:

V(q_t, p_t) ≤ V(q_0, p_0) e^{−αλt}  (t ≥ 0).

Proof Notice that (13) implies that dV(q_t, p_t)/dt ≤ −αλ V(q_t, p_t). By Grönwall's inequality, we obtain V(q_t, p_t) ≤ V(q_0, p_0) e^{−αλt}.

In the previous proposition, we show that the Lyapunov function V(q_t, p_t) is bounded above by a quantity that decreases exponentially in t > 0. V(q_t, p_t) is zero if q_t = θ* and p_t = 0, i.e. the particle is positioned at the unique minimiser of Φ and the particle velocity is 0: the dynamical system has reached a stationary state. However, V can have multiple zeros. Thus, we need some more work to show exponential convergence of the dynamical system.

Corollary 23 Under the conditions of Proposition 22, we have

‖p_t‖^2 + ‖q_t − θ*‖^2 ≤ C_{α,λ} e^{−αλt} V(q_0, p_0)  (t ≥ 0).
Proof Since λ ≤ 1/4, we have, by the ε-Young inequality,

V(x, y) ≥ (α^2/4) ( (1 − λ) ‖x − θ*‖^2 + 2 ‖α^{-1} y‖^2 + 2 (x − θ*) · (α^{-1} y) )
       ≥ (α^2/4) ( (1 − λ) ‖x − θ*‖^2 + 2 ‖α^{-1} y‖^2 − 2 ‖α^{-1} y‖^2 − (1/2) ‖x − θ*‖^2 )
       = (α^2 (1 − 2λ)/8) ‖x − θ*‖^2,

which implies that

‖q_t − θ*‖^2 ≤ (8/(α^2 (1 − 2λ))) V(q_t, p_t) ≤ (8/(α^2 (1 − 2λ))) V(q_0, p_0) e^{−αλt}.

Similarly, again by the ε-Young inequality, we have for p_t that

V(x, y) ≥ (α^2/4) ( (1 − λ) ‖x − θ*‖^2 + 2 ‖α^{-1} y‖^2 + 2 (x − θ*) · (α^{-1} y) )
       ≥ (α^2/4) ( 2 ‖α^{-1} y‖^2 − (1/(1 − λ)) ‖α^{-1} y‖^2 )
       = ((1 − 2λ)/(4(1 − λ))) ‖y‖^2,

which implies

‖p_t‖^2 ≤ (4(1 − λ)/(1 − 2λ)) e^{−αλt} V(q_0, p_0).

Therefore, we conclude that ‖q_t − θ*‖^2 + ‖p_t‖^2 ≤ C_{α,λ} e^{−αλt} V(q_0, p_0), where C_{α,λ} = 4(1 − λ)/(1 − 2λ) + 8/(α^2 (1 − 2λ)).

Appendix B. Decreasing momentum: the fixed-sample case

We now consider the SGMP dynamics subject to a fixed sample i ∈ I. More specifically, we define

dq^{i,m}_t = p^{i,m}_t dt,  m dp^{i,m}_t = −∇Φ_i(q^{i,m}_t) dt − p^{i,m}_t dt,  p^{i,m}_0 = p^i_0 ∈ X,  q^{i,m}_0 = q^i_0 ∈ X,    (42)

with mass m > 0. When m → 0, we have the following formal limiting equation:

dθ^i_t = −∇Φ_i(θ^i_t) dt,  θ^i_0 ∈ X.    (43)

This is the gradient flow with respect to the potential Φ_i, for some i ∈ I. To prove the described limiting behaviour, we require three auxiliary results. In the first one we study, similarly to Corollary 23, the longtime behaviour of the deterministic underdamped dynamical system (42), this time with emphasis on the influence of the mass m. Interestingly, we can see that the convergence rate is independent of the mass m, if it is sufficiently small. We note that these results are similar to previous results by Attouch et al. (2000).

Lemma 24 Let (p^{i,m}_t, q^{i,m}_t)_{t≥0} be the solution to (42). Let Φ_i satisfy Assumptions 1 and 2 with A_{λ_i,1} and critical point θ*_i. We set λ := min_{i∈I} λ_i. Then, for 0 < m ≤ 1, we have the following inequality:

‖q^{i,m}_t − θ*_i‖^2 ≤ 16(L + 2) e^{−λt} ( ‖q^i_0 − θ*_i‖^2 + m^2 ‖p^i_0‖^2 )  (t ≥ 0),    (44)

where L is the Lipschitz constant of the ∇Φ_i.

Proof We recall Assumption 2 for Φ_i with α = 1:

(x − θ*_i) · ∇Φ_i(x)/2 ≥ λ ( Φ_i(x) − Φ_i(θ*_i) + ‖x − θ*_i‖^2/4 ),

which implies that

(x − θ*_i) · ∇Φ_i(x)/(2m) ≥ λ ( m^{-1} Φ_i(x) − m^{-1} Φ_i(θ*_i) + m^{-1} ‖x − θ*_i‖^2/4 ).
And since 0 < m ≤ 1, we have

λ ( m^{-1} Φ_i(x) − m^{-1} Φ_i(θ*_i) + m^{-1} ‖x − θ*_i‖^2/4 ) = λm ( m^{-2} Φ_i(x) − m^{-2} Φ_i(θ*_i) + m^{-2} ‖x − θ*_i‖^2/4 )
≥ λm ( m^{-1} Φ_i(x) − m^{-1} Φ_i(θ*_i) + m^{-2} ‖x − θ*_i‖^2/4 ).

So Assumption 2 implies that

(x − θ*_i) · ∇Φ_i(x)/(2m) ≥ λm ( m^{-1} Φ_i(x) − m^{-1} Φ_i(θ*_i) + m^{-2} ‖x − θ*_i‖^2/4 ),

which exactly means that Φ_i(x)/m satisfies Assumption 2 with A_{λm, m^{-1}}. We define

V^{i,m}(x, y) = (Φ_i(x) − Φ_i(θ*_i))/m + (1/(4m^2)) ( ‖x − θ*_i + my‖^2 + ‖my‖^2 − λm ‖x − θ*_i‖^2 ).

From Proposition 22, we have V^{i,m}(q^{i,m}_t, p^{i,m}_t) ≤ V^{i,m}(q^i_0, p^i_0) e^{−λt}. Next, we are going to show that

(1/(16 m^2)) ‖x − θ*_i‖^2 ≤ V^{i,m}(x, y) ≤ (L + 2) ( m^{-2} ‖x − θ*_i‖^2 + ‖y‖^2 ).

Since Φ_i(x) − Φ_i(θ*_i) ≥ 0 and ‖x − θ*_i + my‖^2 + ‖my‖^2 ≥ ‖x − θ*_i‖^2/2, we have

V^{i,m}(x, y) ≥ (1/(4m^2)) ( ‖x − θ*_i‖^2/2 − λm ‖x − θ*_i‖^2 ) ≥ (1/(16 m^2)) ‖x − θ*_i‖^2,

using λ ≤ 1/4 and m ≤ 1. Notice moreover that

Φ_i(x) − Φ_i(θ*_i) = ∫_0^1 (x − θ*_i) · [ ∇Φ_i( (x − θ*_i)s + θ*_i ) − ∇Φ_i(θ*_i) ] ds ≤ L ‖x − θ*_i‖^2/2.

This implies that

V^{i,m}(x, y) ≤ L ‖x − θ*_i‖^2/(2m) + (1/(4m^2)) ( 2‖x − θ*_i‖^2 + 3 m^2 ‖y‖^2 ) ≤ (L + 2) ( m^{-2} ‖x − θ*_i‖^2 + ‖y‖^2 ).

Hence, we immediately get

m^{-2} ‖q^{i,m}_t − θ*_i‖^2 ≤ 16(L + 2) e^{−λt} ( m^{-2} ‖q^i_0 − θ*_i‖^2 + ‖p^i_0‖^2 ),

which implies that ‖q^{i,m}_t − θ*_i‖^2 ≤ 16(L + 2) e^{−λt} ( ‖q^i_0 − θ*_i‖^2 + m^2 ‖p^i_0‖^2 ).

In the next auxiliary result, we show boundedness of the velocity, i.e. of (p^{i,m}_t)_{t≥0}, depending on the mass m. Moreover, we show Lipschitz continuity of the particle position with respect to time. Note that the Lipschitz constant can be chosen independently of m.

Lemma 25 Let (p^{i,m}_t, q^{i,m}_t)_{t≥0} be the solution to (42). Let Φ_i satisfy Assumptions 1 and 2 with A_{λ_i,1} and critical point θ*_i. Then, for 0 < m ≤ 1, we have, for any 0 ≤ s ≤ t,

‖p^{i,m}_t‖ ≤ ( e^{−t/m} + C_L m ) ‖p^i_0‖ + C_L ‖q^i_0 − θ*_i‖,
‖q^{i,m}_t − q^{i,m}_s‖ ≤ C_L (t − s) ( ‖q^i_0 − θ*_i‖ + ‖p^i_0‖ ),

where C_L = 16(L + 2)L + 1 and L is the Lipschitz constant of ∇Φ_i.

Proof We take the derivative of e^{t/m} p^{i,m}_t; from (42), we get

p^{i,m}_t = e^{−t/m} p^i_0 − m^{-1} e^{−t/m} ∫_0^t e^{s/m} ∇Φ_i(q^{i,m}_s) ds.    (45)

Hence, we have

‖p^{i,m}_t‖ ≤ e^{−t/m} ‖p^i_0‖ + m^{-1} e^{−t/m} ∫_0^t e^{s/m} ‖∇Φ_i(q^{i,m}_s) − ∇Φ_i(θ*_i)‖ ds
           ≤ e^{−t/m} ‖p^i_0‖ + L m^{-1} e^{−t/m} ∫_0^t e^{s/m} ‖q^{i,m}_s − θ*_i‖ ds
           ≤ e^{−t/m} ‖p^i_0‖ + 16(L + 2)L m^{-1} e^{−t/m} ( ‖q^i_0 − θ*_i‖ + m ‖p^i_0‖ ) ∫_0^t e^{s/m} ds
           ≤ e^{−t/m} ‖p^i_0‖ + 16(L + 2)L m ‖p^i_0‖ + 16(L + 2)L ‖q^i_0 − θ*_i‖,

where we used (44) in the third step. Let C_L := 16(L + 2)L + 1.
Then we have ‖p^{i,m}_t‖ ≤ C_L ( ‖p^i_0‖ + ‖q^i_0 − θ*_i‖ ). From (42), we immediately get

‖q^{i,m}_t − q^{i,m}_s‖ = ‖ ∫_s^t p^{i,m}_u du ‖ ≤ ∫_s^t ‖p^{i,m}_u‖ du ≤ C_L (t − s) ( ‖q^i_0 − θ*_i‖ + ‖p^i_0‖ ).

The third auxiliary result is a bound on the time derivative of the velocity (p^{i,m}_t)_{t≥0}, i.e. a bound on the particle's acceleration.

Lemma 26 Let (p^{i,m}_t, q^{i,m}_t)_{t≥0} be the solution to (42). Let Φ_i satisfy Assumptions 1 and 2 with A_{λ_i,1} and critical point θ*_i. Then, for any 0 < m ≤ 1, we have

‖dp^{i,m}_t/dt‖ ≤ C^{(0)}_L ( 1 + m^{-1} e^{−t/m} ) ( ‖q^i_0 − θ*_i‖ + ‖p^i_0‖ )  (t ≥ 0),    (46)

where C^{(0)}_L = 2 + 4LC_L and L is the Lipschitz constant of ∇Φ_i.

Proof We first recall (45), from which we have

dp^{i,m}_t/dt = −m^{-1} e^{−t/m} p^i_0 − m^{-1} ∇Φ_i(q^{i,m}_t) + m^{-2} e^{−t/m} ∫_0^t e^{s/m} ∇Φ_i(q^{i,m}_s) ds,    (47)

and notice that

m^{-1} ∇Φ_i(q^{i,m}_t) = m^{-2} e^{−t/m} ∫_0^t e^{s/m} ∇Φ_i(q^{i,m}_t) ds + m^{-1} e^{−t/m} ∇Φ_i(q^{i,m}_t).    (48)

Combining (47) and (48), we get

‖dp^{i,m}_t/dt‖ ≤ m^{-1} e^{−t/m} ‖p^i_0‖ + m^{-1} e^{−t/m} ‖∇Φ_i(q^{i,m}_t) − ∇Φ_i(θ*_i)‖ + m^{-2} e^{−t/m} ∫_0^t e^{s/m} ‖∇Φ_i(q^{i,m}_s) − ∇Φ_i(q^{i,m}_t)‖ ds
≤ m^{-1} e^{−t/m} ‖p^i_0‖ + L m^{-1} e^{−t/m} ‖q^{i,m}_t − θ*_i‖ + L C_L m^{-2} e^{−t/m} ∫_0^t e^{s/m} (t − s) ds ( ‖q^i_0 − θ*_i‖ + ‖p^i_0‖ )
≤ m^{-1} e^{−t/m} ‖p^i_0‖ + 4L(L + 2) ( ‖q^i_0 − θ*_i‖ + ‖p^i_0‖ ) + L C_L ( ‖q^i_0 − θ*_i‖ + ‖p^i_0‖ )    (a1)
≤ C^{(0)}_L ( 1 + m^{-1} e^{−t/m} ) ( ‖q^i_0 − θ*_i‖ + ‖p^i_0‖ ),

where (a1) holds due to (44), Lemma 25, and the fact that ∫_0^{t/m} e^{−s} s ds < 1.

The previous lemmas provide the boundedness results needed for proving the following proposition. We show that the difference between the fixed-sample SGMP (42) and the fixed-sample stochastic gradient process (43) can be bounded by the sum of the distance between their initial values and a term that is linear in m. Hence, in the fixed-subsample case, we show that the solution of (42) converges to the solution of the limiting equation (43). In addition to the previous assumptions, we now also need to assume Φ_i to be convex.

Proposition 27 Let Φ_i be convex and satisfy Assumptions 1 and 2 with A_{λ_i,1} and critical point θ*_i. Let (p^{i,m}_t, q^{i,m}_t)_{t≥0} be the solution to (42).
Then, for 0 < m ≤ 1, we have

‖q^{i,m}_t − θ^i_t‖ ≤ ‖q^i_0 − θ^i_0‖ + C^{(0)}_L m (1 + t) ( ‖p^i_0‖ + ‖q^i_0 − θ^i_0‖ + ‖θ^i_0 − θ*_i‖ )  (t ≥ 0),

where C^{(0)}_L = 2 + 4LC_L and L is the Lipschitz constant of ∇Φ_i.

Proof We take the derivative of ‖q^{i,m}_t − θ^i_t‖^2 and obtain

d‖q^{i,m}_t − θ^i_t‖^2/dt = 2 (q^{i,m}_t − θ^i_t) · ( dq^{i,m}_t/dt − dθ^i_t/dt ) = 2 (q^{i,m}_t − θ^i_t) · ( p^{i,m}_t + ∇Φ_i(θ^i_t) ).

From (42), p^{i,m}_t = −∇Φ_i(q^{i,m}_t) − m dp^{i,m}_t/dt, so that, by the convexity of Φ_i,

d‖q^{i,m}_t − θ^i_t‖^2/dt = −2 (q^{i,m}_t − θ^i_t) · ( ∇Φ_i(q^{i,m}_t) − ∇Φ_i(θ^i_t) ) − 2m (q^{i,m}_t − θ^i_t) · dp^{i,m}_t/dt
≤ 2m ‖q^{i,m}_t − θ^i_t‖ ‖dp^{i,m}_t/dt‖
≤ 2 C^{(0)}_L m ( 1 + m^{-1} e^{−t/m} ) ( ‖q^i_0 − θ*_i‖ + ‖p^i_0‖ ) ‖q^{i,m}_t − θ^i_t‖,

where the last step uses (46). This implies

d‖q^{i,m}_t − θ^i_t‖/dt = ( 2 ‖q^{i,m}_t − θ^i_t‖ )^{-1} d‖q^{i,m}_t − θ^i_t‖^2/dt ≤ C^{(0)}_L m ( 1 + m^{-1} e^{−t/m} ) ( ‖q^i_0 − θ*_i‖ + ‖p^i_0‖ ).

Integrating both sides, we get

‖q^{i,m}_t − θ^i_t‖ ≤ ‖q^i_0 − θ^i_0‖ + C^{(0)}_L ( mt + m(1 − e^{−t/m}) ) ( ‖p^i_0‖ + ‖q^i_0 − θ*_i‖ )
≤ ‖q^i_0 − θ^i_0‖ + C^{(0)}_L m (1 + t) ( ‖p^i_0‖ + ‖q^i_0 − θ*_i‖ )
≤ ‖q^i_0 − θ^i_0‖ + C^{(0)}_L m (1 + t) ( ‖p^i_0‖ + ‖q^i_0 − θ^i_0‖ + ‖θ^i_0 − θ*_i‖ ).

Appendix C. Other auxiliary results

Example 2 Let Φ be a function on R satisfying

Φ(x) = { x^2 for x ∈ (−∞, 1];  2x − 1 for x ∈ (1, 2];  (1/2) x^2 + 1 for x ∈ (2, +∞) }.

Φ satisfies Assumption 2 with λ ∈ (0, 2/(8 + α^2)]. As shown in Figure 9, Φ is convex but not strongly convex, since it is linear on (1, 2].

Figure 9: Plot of the function Φ in Example 2.

Lemma 28 Let Φ ∈ C^1(X, R) be strongly convex with constant µ. Then Φ satisfies Assumption 2 with A_{(µ/α^2) ∧ (1/4), α}.

Proof Since Φ is strongly convex, we have (x − θ*) · ( ∇Φ(x) − ∇Φ(θ*) ) ≥ µ ‖x − θ*‖^2, and since ∇Φ(θ*) = 0, also (x − θ*) · ∇Φ(x) ≥ µ ‖x − θ*‖^2. By Lagrange's mean value theorem, there exists some ξ lying on the segment between x and θ*, i.e. ξ = θ* + t(x − θ*) for some 0 ≤ t ≤ 1, such that Φ(x) − Φ(θ*) = (x − θ*) · ∇Φ(ξ). By strong convexity, we have (x − ξ) · ( ∇Φ(x) − ∇Φ(ξ) ) ≥ µ ‖x − ξ‖^2. Replacing ξ by θ* + t(x − θ*), we get (1 − t)(x − θ*) · ( ∇Φ(x) − ∇Φ(ξ) ) ≥ µ ‖x − ξ‖^2 ≥ 0. This implies

(x − θ*) · ∇Φ(x) ≥ (x − θ*) · ∇Φ(ξ) = Φ(x) − Φ(θ*).

Averaging this with the strong convexity bound, we obtain

(x − θ*) · ∇Φ(x)/2 ≥ ( Φ(x) − Φ(θ*) )/4 + µ ‖x − θ*‖^2/4.

Then, we choose λ = (µ/α^2) ∧ (1/4) and the proof is completed.

Lemma 29 We assume that Φ_i is convex and satisfies Assumption 1 for all i ∈ I. Let (θ^m_t)_{t≥0} be the solution to system (16). Then we have ‖θ^m_t‖ ≤ ‖θ_0‖ + C_Φ t (t ≥ 0).
Proof We differentiate ‖θ^m_t‖^2 and obtain

d‖θ^m_t‖^2/dt = 2 (θ^m_t) · dθ^m_t/dt = −2 (θ^m_t) · ∇Φ_{i_{m,δ}(t)}(θ^m_t)
= −2 ⟨θ^m_t, ∇Φ_{i_{m,δ}(t)}(θ^m_t) − ∇Φ_{i_{m,δ}(t)}(0)⟩ − 2 (θ^m_t) · ∇Φ_{i_{m,δ}(t)}(0)
≤ 2 ‖θ^m_t‖ ‖∇Φ_{i_{m,δ}(t)}(0)‖ ≤ 2 C_Φ ‖θ^m_t‖,

where the first inequality follows from the convexity of Φ_{i_{m,δ}(t)}. This implies d‖θ^m_t‖/dt ≤ C_Φ and, hence, ‖θ^m_t‖ ≤ ‖θ_0‖ + C_Φ t.

Lemma 30 We assume that Φ_i is strongly convex with constant µ > 0 for all i ∈ I. Let (θ_t)_{t≥0} be the solution to (3). Then, we have

‖θ_t‖ ≤ e^{−µt/2} ‖θ_0‖ + C_Φ  (t ≥ 0).

Proof We differentiate ‖θ_t‖^2:

d‖θ_t‖^2/dt ≤ −2µ ‖θ_t‖^2 + 2 ‖θ_t‖ ‖∇Φ_{i(t)}(0)‖ ≤ −µ ‖θ_t‖^2 + (1/µ) ‖∇Φ_{i(t)}(0)‖^2,

where the second step uses the ε-Young inequality. By Grönwall's inequality, we have

‖θ_t‖^2 ≤ e^{−µt} ‖θ_0‖^2 + (1/µ^2) sup_{i=1,...,N} ‖∇Φ_i(0)‖^2,

and the claim follows by choosing C_Φ = (1/µ) sup_{i=1,...,N} ‖∇Φ_i(0)‖.

Lemma 31 Let C and λ be two positive constants and let (t_n)_{n≥0} be a sequence with t_0 ≥ 0 that satisfies the recurrence relation t_{n+1} = t_n + C e^{−λ t_n}. Then lim_{n→+∞} t_n = +∞.

Proof It is obvious that (t_n)_{n≥0} is a strictly increasing sequence. Assume, for contradiction, that lim_{n→+∞} t_n < +∞, and let A be the least upper bound of (t_n)_{n≥0}. Then, for any m > 0, there exists n_0 such that, for any n ≥ n_0, t_n ≥ A − m. We have

t_{n+1} = t_n + C e^{−λ t_n} ≥ A − m + C e^{−λA}.

Set m ≤ C e^{−λA}/2; then t_{n+1} > A, contradicting the fact that t_n ≤ A for all n ∈ N. Hence lim_{n→+∞} t_n = +∞.

Lemma 32 Let (p^i_t, q^i_t)_{t≥0} solve (20). Let θ*_i be the critical point of Φ_i for i ∈ I. Under Assumptions 1, 2 with A_{λ_i,2}, and 3, with λ = min_{i∈I} λ_i, we have

V^i(t, q^i_t, p^i_t) ≤ e^{−λt} V^i(0, q^i_0, p^i_0)  (t ≥ 0),

where V^i(t, x, y) = m(t) [ Φ_i(x) − Φ_i(θ*_i) ] + (1/4) ( ‖x − θ*_i + m(t) y‖^2 + ‖m(t) y‖^2 ) is a Lyapunov function.
Proof First, we differentiate V^i(t, q^i_t, p^i_t) and obtain, by (20),

dV^i(t, q^i_t, p^i_t)/dt = d( m(t)( Φ_i(q^i_t) − Φ_i(θ*_i) ) )/dt + (1/4) d( ‖q^i_t − θ*_i + m(t) p^i_t‖^2 + ‖m(t) p^i_t‖^2 )/dt
= m′(t)( Φ_i(q^i_t) − Φ_i(θ*_i) ) + m(t) ∇Φ_i(q^i_t) · p^i_t
  + (1/2) ( q^i_t − θ*_i + m(t) p^i_t ) · ( p^i_t + m′(t) p^i_t + m(t) dp^i_t/dt )
  + (1/2) m(t) p^i_t · ( m′(t) p^i_t + m(t) dp^i_t/dt ).

By (20), m(t) dp^i_t/dt = −∇Φ_i(q^i_t) − p^i_t, so that the above reduces to

dV^i(t, q^i_t, p^i_t)/dt = m′(t)( Φ_i(q^i_t) − Φ_i(θ*_i) ) − (1/2) (q^i_t − θ*_i) · ∇Φ_i(q^i_t) + m(t)( m′(t) − 1/2 ) ‖p^i_t‖^2 + (m′(t)/2) (q^i_t − θ*_i) · p^i_t.

Since m′(t) ≤ 0 and Φ_i(q^i_t) − Φ_i(θ*_i) ≥ 0, we have

m′(t)( Φ_i(q^i_t) − Φ_i(θ*_i) ) + m(t) m′(t) ‖p^i_t‖^2 ≤ 0.

By the ε-Young inequality,

(m′(t)/2) (q^i_t − θ*_i) · p^i_t ≤ (λ/4) ‖q^i_t − θ*_i‖^2 + ( (m′(t))^2/(4λ) ) ‖p^i_t‖^2,

and, by Assumption 2 with A_{λ,2},

−(1/2) (q^i_t − θ*_i) · ∇Φ_i(q^i_t) ≤ −λ ( Φ_i(q^i_t) − Φ_i(θ*_i) + ‖q^i_t − θ*_i‖^2 ).

In particular, Assumption 3 implies (m′(t))^2 ≤ λ m(t)/4, so that, combining the previous estimates,

dV^i(t, q^i_t, p^i_t)/dt ≤ −λ ( Φ_i(q^i_t) − Φ_i(θ*_i) ) − (3λ/4) ‖q^i_t − θ*_i‖^2 − (m(t)/4) ‖p^i_t‖^2.

Moreover, ‖q^i_t − θ*_i + m(t) p^i_t‖^2 ≤ 2 ‖q^i_t − θ*_i‖^2 + 2 ‖m(t) p^i_t‖^2, which, together with m(t) ≤ 1 and λ ≤ 1/4, yields

dV^i(t, q^i_t, p^i_t)/dt ≤ −λ V^i(t, q^i_t, p^i_t).

Finally, by Grönwall's inequality, we have V^i(t, q^i_t, p^i_t) ≤ e^{−λt} V^i(0, q^i_0, p^i_0).

Lemma 33 Let m(t) > 0 be a strictly decreasing differentiable function that satisfies Assumption 3 with λ = µ/4, where µ > 0. Then we have the following inequalities:

∫_0^t m^2(s) e^{µs} ds ≤ 2 µ^{-1} e^{µt} m^2(t),  ∫_0^t e^{−2E(s)} e^{µs} ds ≤ (1 + µ) m_0 e^{µt}.

Proof For the first inequality, we have, using integration by parts,

∫_0^t m^2(s) e^{µs} ds = µ^{-1} ∫_0^t m^2(s) de^{µs} = µ^{-1} [ m^2(t) e^{µt} − m_0^2 − 2 ∫_0^t m(s) m′(s) e^{µs} ds ].

Under Assumption 3,

−2 ∫_0^t m(s) m′(s) e^{µs} ds ≤ (µ/2) ∫_0^t m^2(s) e^{µs} ds,

so that

∫_0^t m^2(s) e^{µs} ds ≤ µ^{-1} e^{µt} m^2(t) + (1/2) ∫_0^t m^2(s) e^{µs} ds.

This implies that ∫_0^t m^2(s) e^{µs} ds ≤ 2 µ^{-1} e^{µt} m^2(t).
For the second inequality, we use integration by parts (recall that E(t) = ∫_0^t m(s)^{-1} ds) and notice that

∫_0^t e^{−2E(s)} e^{µs} ds = −(1/2) ∫_0^t m(s) e^{µs} de^{−2E(s)}
= (m_0/2) − (1/2) m(t) e^{µt} e^{−2E(t)} + (1/2) ∫_0^t e^{−2E(s)} e^{µs} ( µ m(s) + m′(s) ) ds
≤ (m_0/2) + (3µ m_0/2) e^{µt} ∫_0^∞ e^{−2s/m_0} ds ≤ (1 + µ) m_0 e^{µt},

where we used that m is decreasing, so that E(s) ≥ s/m_0. This completes the proof.

Lemma 34 Let (m_n)_{n≥0} be a non-negative decreasing sequence with lim_{n→∞} m_n = 0. Let (d_n)_{n≥0} be a positive sequence satisfying d_{n+1} ≤ c d_n + m_n, where 0 < c < 1. Then lim_{n→∞} d_n = 0. Furthermore, if m_n ≤ B e^{−bn} for some constants B, b > 0, then there exist constants B̃, b̃ > 0 such that d_n ≤ B̃ e^{−b̃ n}.

Proof It is obvious that (m_n)_{n≥0} is bounded, and we denote an upper bound by C. Hence (d_n)_{n≥0} is bounded by D := d_0 + C/(1 − c). Since lim_{n→+∞} m_n = 0, for any ε > 0 there exists N_0 such that, for any n ≥ N_0, m_n ≤ ε. Hence, for any n ≥ N_0, we have d_{n+1} ≤ c d_n + ε, which implies that

d_n ≤ c^{n−N_0} d_{N_0} + ε/(1 − c).

Since ε is arbitrary, we get lim_{n→∞} d_n = 0, which finishes the proof of the first part. Now, assuming that m_n ≤ B e^{−bn}, the relation d_{n+1} ≤ c d_n + m_n implies that

d_{n+1} ≤ c^{n+1} d_0 + Σ_{i=0}^{n} c^{n−i} m_i ≤ c^{n+1} d_0 + B c^n Σ_{i=0}^{n} e^{−(log c + b) i}.

If b + log c ≥ 0, then d_{n+1} ≤ c^{n+1} d_0 + B (n + 1) c^n and the result is obvious. For b + log c < 0, we have

Σ_{i=0}^{n} e^{−(log c + b) i} ≤ e^{−(log c + b)(n+1)} / ( e^{−(log c + b)} − 1 ),

so that

d_{n+1} ≤ c^{n+1} d_0 + B c^n e^{−(log c + b)(n+1)} / ( e^{−(log c + b)} − 1 ) ≤ B̃ e^{−b(n+1)}.

Lemma 35 Let G^n_k = Π_{i=k}^{n} ( 1 − b/√(i+1) ), where 0 < b < 1. Then there exists c > 0 such that, for any 0 ≤ k < n, we have G^n_k ≤ e^{−c(√(n+1) − √(k+1))}.

Proof For any 0 ≤ x < 1 we have log(1 − x) ≤ −x. This implies that

log G^n_k = Σ_{i=k}^{n} log( 1 − b/√(i+1) ) ≤ −b Σ_{j=k+1}^{n+1} j^{-1/2} ≤ −b ∫_{k+1}^{n+1} x^{-1/2} dx = −2b ( √(n+1) − √(k+1) ).

Finally, taking exponentials on both sides and letting c = b/2, we finish the proof.

Appendix D. Kushner's Perturbed Test Function Theory

In Kushner (1990, Chapter 7), the author considers a dynamical system with state space R^d of the form

dx^γ/dt = G(x^γ) + Ḡ(x^γ, ξ^γ) + F(x^γ, ξ^γ)/γ,    (50)

where γ > 0 is a parameter and (ξ^γ_t)_{t≥0} is a stochastic process. The author then studies the limiting behaviour of this dynamical system towards, e.g., ODEs of the type dx/dt = G(x). We now cite their main result.
Theorem 36 (Kushner (1990))  Assume that Assumptions (A4.2) to (A4.6) (see below) hold, and let $(x^\gamma(0))_{\gamma > 0}$ be tight. Then $(x^\gamma(t))_{t \ge 0}$ is tight with respect to $\gamma > 0$, and the limit of any weakly convergent subsequence solves the martingale problem with associated operator $A$ defined by
\[
Af(x) = \big\langle \nabla_x f(x), G(x) \big\rangle + B_0 f(x),
\]
for appropriate test functions $f\colon \mathbb R^d \to \mathbb R$.

We apply this theorem only in the case $F = 0$, which simplifies the discussion in the following. For instance, we skip the definition of $B_0$, as then $B_0 = 0$. In this case, the operator $A$ is the infinitesimal generator of $\frac{\mathrm dx}{\mathrm dt} = G(x)$, and we obtain convergence. We now list Assumptions (A4.2), (A4.3), and (A4.5); we skip Assumptions (A4.4) and (A4.6), as they are redundant if $F = 0$.

(A4.2): $G$, $\bar G$, $F$, and $\nabla_x F$ are continuous.

(A4.3): The process $(\xi^\gamma_t)_{t \ge 0}$ satisfies $\xi^\gamma_t = \xi_{t/\gamma^2}$ $(t \ge 0)$ for a bounded and right-continuous process $(\xi_t)_{t \ge 0}$.

(A4.5): For each $x \in \mathbb R^d$, as $t, \tau \to \infty$,
\[
\frac{1}{\tau}\int_t^{t+\tau} \mathbb E\big[\bar G(x, \xi(s)) \,\big|\, (\xi(s'))_{s' \le t}\big]\,\mathrm ds \to 0, \quad \text{almost surely.}
\]

We employ Theorem 36 in the proof of Theorem 5. There, we have $\xi_t = i(t)$, $\gamma = \nu^{-1/2}$, $G(x, y) = (y, -\nabla\Phi(x) - \alpha y)$, $\bar G(x, y, i) = (0, -\nabla\Phi_i(x) + \nabla\Phi(x))$, and $F = 0$.

References

H. Attouch and F. Alvarez. The heavy ball with friction dynamical system for convex constrained minimization problems. In Optimization: Proceedings of the 9th Belgian–French–German Conference on Optimization, Namur, September 7–11, 1998, pages 25–35. Springer, 2000.

H. Attouch, X. Goudou, and P. Redont. The heavy ball with friction method, I. The continuous dynamical system: global exploration of the local minima of a real-valued function by asymptotic analysis of a dissipative dynamical system. Communications in Contemporary Mathematics, 2(1):1–34, 2000.

H. Attouch, Z. Chbani, J. Peypouquet, and P. Redont. Fast convergence of inertial dynamics and algorithms with asymptotic vanishing viscosity. Mathematical Programming, 168:123–175, 2018.

H. Attouch, Z. Chbani, and H. Riahi. Fast proximal methods via time scaling of damped inertial dynamics.
SIAM Journal on Optimization, 29(3):2227–2256, 2019a.

H. Attouch, Z. Chbani, and H. Riahi. Fast convex optimization via time scaling of damped inertial gradient dynamics. Pure and Applied Functional Analysis, 2019b.

H. Attouch, A. Balhag, Z. Chbani, and H. Riahi. Fast convex optimization via inertial dynamics combining viscous and Hessian-driven damping with time rescaling. Evolution Equations and Control Theory, 11(2):487–514, 2022. doi: 10.3934/eect.2021010.

H. Attouch, R. I. Boț, and D.-K. Nguyen. Fast convex optimization via time scale and averaging of the steepest descent. Mathematics of Operations Research, 2024. doi: 10.1287/moor.2023.0186. URL https://doi.org/10.1287/moor.2023.0186.

J.-F. Aujol, C. Dossal, and A. Rondepierre. Convergence Rates of the Heavy Ball Method for Quasi-strongly Convex Optimization. SIAM Journal on Optimization, 32(3):1817–1842, 2022.

J.-F. Aujol, C. Dossal, and A. Rondepierre. Convergence rates of the heavy-ball method with Łojasiewicz property. Mathematical Programming, 198:195–254, 2023.

H. Daneshmand, J. Kohler, A. Lucchi, and T. Hofmann. Escaping Saddles with Stochastic Gradients. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1155–1164. PMLR, 10–15 Jul 2018.

A. Défossez, L. Bottou, F. Bach, and N. Usunier. A Simple Convergence Proof of Adam and Adagrad. Transactions on Machine Learning Research, 2022. ISSN 2835-8856.

S. S. Du, C. Jin, J. D. Lee, M. I. Jordan, A. Singh, and B. Póczos. Gradient Descent Can Take Exponential Time to Escape Saddle Points. In Advances in Neural Information Processing Systems, volume 30, 2017.

J. Duchi, E. Hazan, and Y. Singer. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research, 12(61):2121–2159, 2011.

A. Eberle, A. Guillin, and R. Zimmer. Couplings and quantitative contraction rates for Langevin dynamics.
The Annals of Probability, 47, 2017.

E. Hairer, C. Lubich, and G. Wanner. Geometric numerical integration illustrated by the Störmer–Verlet method. Acta Numerica, pages 1–51, 2003.

G. Hinton. Online course in Neural Networks for Machine Learning, 2012.

C. Jin, P. Netrapalli, R. Ge, S. M. Kakade, and M. I. Jordan. On Nonconvex Optimization for Machine Learning: Gradients, Stochasticity, and Saddle Points. J. ACM, 68(2), February 2021. ISSN 0004-5411.

K. Jin, J. Latz, C. Liu, and C.-B. Schönlieb. A Continuous-time Stochastic Gradient Descent Method for Continuous Data. Journal of Machine Learning Research, 24:1–48, 2023.

N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. In International Conference on Learning Representations, 2017.

D. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. In International Conference on Learning Representations, 2015.

A. Krizhevsky. Learning Multiple Layers of Features from Tiny Images, 2009.

H. J. Kushner. Weak Convergence Methods and Singularly Perturbed Stochastic Control and Filtering Problems. Systems & Control: Foundations & Applications. Birkhäuser Boston, MA, 1990.

J. Latz. Analysis of stochastic gradient descent in continuous time. Statistics and Computing, 31, 2021.

J. Latz. Gradient flows and randomised thresholding: sparse inversion and classification. Inverse Problems, 38(12), 2022.

J. Latz. Correction to: analysis of stochastic gradient descent in continuous time. Statistics and Computing, 34(5):146, 2024. doi: 10.1007/s11222-024-10450-4.

C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu. Deeply-Supervised Nets. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, volume 38 of Proceedings of Machine Learning Research, pages 562–570. PMLR, 9–12 May 2015.

J. D. Lee, M. Simchowitz, M. I. Jordan, and B. Recht.
Gradient Descent Only Converges to Minimizers. In 29th Annual Conference on Learning Theory, volume 49 of Proceedings of Machine Learning Research, pages 1246–1257. PMLR, 23–26 Jun 2016.

Q. Li, C. Tai, and W. E. Stochastic Modified Equations and Dynamics of Stochastic Gradient Algorithms I: Mathematical Foundations. Journal of Machine Learning Research, 20(40):1–47, 2019.

Y. Liu, Y. Gao, and W. Yin. An Improved Analysis of Stochastic Gradient Descent with Momentum. In Advances in Neural Information Processing Systems, volume 34, 2020.

R. I. McLachlan and A. Stern. Functional Equivariance and Conservation Laws in Numerical Integration. Foundations of Computational Mathematics, 24(1):149–177, 2024. doi: 10.1007/s10208-022-09590-8. URL https://doi.org/10.1007/s10208-022-09590-8.

I. Necoara, Y. Nesterov, and F. Glineur. Linear convergence of first order methods for non-strongly convex optimization. Mathematical Programming, 175:69–107, 2019.

Y. Nesterov. A method for solving the convex programming problem with convergence rate O(1/k^2). Soviet Mathematics Doklady, 27:372–376, 1983.

B. Neyshabur, R. Tomioka, and N. Srebro. In Search of the Real Inductive Bias: On the Role of Implicit Regularization in Deep Learning, 2015. URL https://arxiv.org/abs/1412.6614.

I. Panageas and G. Piliouras. Gradient Descent Only Converges to Minimizers: Non-Isolated Critical Points and Invariant Regions. In 8th Innovations in Theoretical Computer Science Conference. ITCS, 2017.

B. Polyak. Gradient methods for the minimisation of functionals. USSR Computational Mathematics and Mathematical Physics, 3(4):864–878, 1963.

B. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.

H. Robbins and S. Monro. A Stochastic Approximation Method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.

D. E. Rumelhart, G. E. Hinton, and R. J. Williams.
Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986. doi: 10.1038/323533a0. URL https://doi.org/10.1038/323533a0.

A. Scagliotti and P. Colli Franzone. A Piecewise Conservative Method for Unconstrained Convex Optimization. Computational Optimization and Applications, 81(1):251–288, 2022. doi: 10.1007/s10589-021-00332-0.

O. Sebbouh, C. Dossal, and A. Rondepierre. Convergence Rates of Damped Inertial Dynamics under Geometric Conditions and Perturbations. SIAM Journal on Optimization, 30(3):1850–1877, 2020.

A. Sekhari, K. Sridharan, and S. Kale. SGD: The Role of Implicit Regularization, Batch-size and Multiple-epochs. In Advances in Neural Information Processing Systems, 2021.

B. Shi. On the Hyperparameters in Stochastic Gradient Descent with Momentum. Journal of Machine Learning Research, 25(236):1–40, 2024. URL http://jmlr.org/papers/v25/22-1189.html.

B. Shi, S. S. Du, W. Su, and M. I. Jordan. Acceleration via Symplectic Discretization of High-Resolution Differential Equations. In Advances in Neural Information Processing Systems, volume 32, 2019.

B. Shi, S. Du, M. Jordan, and W. Su. Understanding the acceleration phenomenon via high-resolution differential equations. Mathematical Programming, 195:79–148, 2021.

K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. In International Conference on Learning Representations, 2015.

W. Su, S. Boyd, and E. Candès. A Differential Equation for Modeling Nesterov's Accelerated Gradient Method: Theory and Insights. In Advances in Neural Information Processing Systems, volume 27, 2014.

I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In S. Dasgupta and D. McAllester, editors, Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 1139–1147, Atlanta, Georgia, USA, 17–19 Jun 2013. PMLR.
URL https://proceedings.mlr.press/v28/sutskever13.html.

Z. Wang and J. Sirignano. Global Convergence of the ODE Limit for Online Actor-Critic Algorithms in Reinforcement Learning, 2021. URL https://arxiv.org/abs/2108.08655.

Z. Wang and J. Sirignano. A Forward Propagation Algorithm for Online Optimization of Nonlinear Stochastic Differential Equations, 2022. URL https://arxiv.org/abs/2207.04496.

Z. Wang and J. Sirignano. Continuous-time stochastic gradient descent for optimizing over the stationary distribution of stochastic differential equations. Mathematical Finance, 34(2):348–424, 2024.

M. D. Zeiler. ADADELTA: An Adaptive Learning Rate Method, 2012. URL https://arxiv.org/abs/1212.5701.

C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations, 2017.

M. Zinkevich. Online Convex Programming and Generalized Infinitesimal Gradient Ascent. In Proceedings of the Twentieth International Conference on Machine Learning, pages 928–935, 2003.

D. Zou, J. Wu, V. Braverman, Q. Gu, D. Foster, and S. M. Kakade. The Benefits of Implicit Regularization from SGD in Least Squares Problems. In Advances in Neural Information Processing Systems, 2021.