# on_learning_rates_and_schrödinger_operators__3bb4ec13.pdf

Journal of Machine Learning Research 24 (2023) 1-53 Submitted 4/20; Revised 12/23; Published 12/23

On Learning Rates and Schr odinger Operators

Bin Shi shibin@lsec.cc.ac.cn Academy of Mathematics and Systems Science Chinese Academy of Sciences Beijing, 100190, China

School of Mathematical Sciences University of Chinese Academy of Sciences Beijing, 100049, China

Weijie J. Su suw@wharton.upenn.edu Department of Statistics and Data Science University of Pennsylvania Philadelphia, PA 19104, USA

Michael I. Jordan jordan@cs.berkeley.edu Department of Electrical Engineering and Computer Sciences Department of Statistics University of California Berkeley, CA 94720, USA

Editor: Zaid Harchaoui

Understanding the iterative behavior of stochastic optimization algorithms for minimizing nonconvex functions remains a crucial challenge in demystifying deep learning. In particular, it is not yet understood why certain simple techniques are remarkably eﬀective for tuning the learning rate in stochastic gradient descent (SGD), arguably the most basic optimizer for training deep neural networks. This class of techniques includes learning rate decay, which begins with a large initial learning rate and is gradually reduced. In this paper, we present a general theoretical analysis of the eﬀect of the learning rate in SGD. Our analysis is based on the use of a learning-rate-dependent stochastic diﬀerential equation (LR-dependent SDE) as a tool that allows us to set SGD distinctively apart from both gradient descent and stochastic gradient Langevin dynamics (SGLD). In contrast to prior research, our analysis builds on the analysis of a partial diﬀerential equation that models the evolution of probability densities, drawing insights from Wainwright and Jordan (2006); Jordan (2018). From this perspective, we derive the linear convergence rate of the probability densities, highlighting its dependence on the learning rate. Moreover, we obtain an explicit expression for the optimal linear rate by analyzing the spectrum of the Witten-Laplacian, a special case of the Schr odinger operator associated with the LR-dependent SDE. This expression clearly reveals the dependence of the linear convergence rate on the learning rate the linear rate decreases rapidly to zero as the learning rate tends to zero for a broad class of nonconvex functions, whereas it stays constant for strongly convex functions. Based on this sharp distinction between nonconvex and convex problems, we provide a mathematical interpretation of the beneﬁts of using learning rate decay for nonconvex optimization.

c 2023 Bin Shi and Weijie J. Su and Michael I. Jordan.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v24/20-364.html.

Shi, Su, and Jordan

Keywords: Deep Neural Networks, Nonconvex Optimization, Stochastic Gradient Descent, LR-dependent SDE, Fokker Planck Smoluchowski equation, Schr odinger Operator, Witten-Laplacian, Learning Rate.

1. Introduction

Gradient-based optimization has been the workhorse algorithm powering recent developments in statistical machine learning. Many of these developments involve solving nonconvex optimization problems, which raises new challenges for theoreticians, given that classical theory has often been restricted to the convex setting. A particular focus in machine learning is the class of gradient-based methods referred to as stochastic gradient descent (SGD), given its desirable runtime properties, and its desirable statistical performance in a wide range of nonconvex problems. Consider the minimization of a (nonconvex) function f deﬁned in terms of an expectation:

f(x) = Eζf(x; ζ),

where the expectation is over the randomness embodied in ζ. A simple example is empirical risk minimization, where the loss function,

is averaged over n data points, where the datapoint-speciﬁc losses, fi(x), are indexed by i and where x denotes a parameter. When n is large, it is computationally prohibitive to obtain the full gradient of the objective function, and SGD provides a compelling alternative. SGD is a gradient-based update based on a (noisy) gradient evaluated from a single data point or a mini-batch: e f(x) := 1

i B fi(x) = f(x) + ξ,

where the set B of size B is sampled uniformly from the n data points and therefore the noise term ξ has mean zero. Starting from an initial point x0, SGD updates the iterates according to xk+1 = xk se f(xk) = xk s f(xk) sξk, (1)

where ξk denotes the noise term at the kth iteration. Note that the step size s > 0, also known as the learning rate, can either be constant or vary with the iteration Bottou (2010). The learning rate plays an essential role in determining the performance of SGD and many of the practical variants of SGD Bengio (2012).1 The overall eﬀect of the learning rate can be complex. In convex optimization problems, theoretical analysis can explain many aspects of this complexity, but in the nonconvex setting the eﬀect of the learning rate is yet more complex and theory is lacking Zeiler (2012); Kingma and Ba (2014). As a numerical illustration of this complexity, Figure 1 plots the error of SGD with a piecewise constant learning rate in the training of a neural network on the CIFAR-10 dataset. With

1. Note that the mini-batch size as another parameter can be, to some extent, incorporated into the learning rate. See the discussion later in this section.

On Learning Rates and Schr odinger Operators

0 100 200 300 400 0.15

Figure 1: Training error using SGD with mini-batch size 32 to train an 8-layer convolutional neural network on CIFAR-10 Krizhevsky (2009). The ﬁrst 90 epochs use a learning rate of s = 0.006, the next 120 epochs use s = 0.003, and the ﬁnal 190 epochs use s = 0.0005. Note that the training error decreases as the learning rate s decreases and a smaller s leads to a larger number of epochs for SGD to reach a plateau. See He et al. (2016) for further investigation of this phenomenon.

a constant learning rate, SGD quickly reaches a plateau in terms of training error, and whenever the learning rate decreases, the plateau decreases as well, thereby yielding better optimization performance. This illustration exempliﬁes the idea of learning rate decay, a technique that is used in training deep neural networks (see, e.g., He et al., 2016; Bottou et al., 2018; Sordello and Su, 2019). Despite its popularity and the empirical evidence of its success, however, the literature stops short of providing a general and quantitative approach to understanding how the learning rate impacts the performance of SGD and its variants in the nonconvex setting You et al. (2019); Li et al. (2019b). Accordingly, strategies for setting learning rate decay schedules are generally ad hoc and empirical.

In the current paper, we provide theoretical insight into the dependence of SGD on the learning rate in nonconvex optimization. Our approach builds on a recent line of work in which optimization algorithms are studied via the analysis of their behavior in continuoustime limits Su et al. (2016); Jordan (2018); Shi et al. (2018). Speciﬁcally, in the case of SGD, we study stochastic diﬀerential equations (SDEs) as surrogates for discrete stochastic optimization methods (see, e.g., Kushner and Yin, 2003; Li et al., 2017; Krichene and Bartlett, 2017; Chaudhari et al., 2018; Diakonikolas and Jordan, 2019). The construction is roughly as follows. Taking a small but nonzero learning rate s, let tk = ks denote a time step and deﬁne xk = Xs(tk) for some suﬃciently smooth curve Xs(t). Applying a Taylor

Shi, Su, and Jordan

expansion in powers of s, we obtain:

xk+1 = Xs(tk+1) = Xs(tk) + Xs(tk)s + O(s2).

Let W be a standard Brownian motion, where we assume that the noise term ξk is approximately normally distributed with unit variance. Informally, this leads to2

sξk = W(tk+1) W(tk) = s d W(tk)

dt + O(s2).

Plugging the last two displays into (1), we get

Xs(tk) + O(s) = f(Xs(tk)) + sd W(tk)

dt + O s 3 2 .

Retaining both O(1) and O( s) terms but ignoring smaller terms, we obtain a learningrate-dependent stochastic diﬀerential equation (LR-dependent SDE) that approximates the discrete-time SGD algorithm:

d Xs = f(Xs)dt + sd W, (2)

where the initial condition is the same value x0 as its discrete counterpart. More generally, Li et al. (2019a); Chaudhari and Soatto (2018) consider SDEs with variable-dependent noise covariance as approximating surrogates for SGD. The LR-dependent SDE (2) is a convenient simple model that allows for a ﬁne-grained analysis, as we will show in this paper. As an indication of the generality of this formulation, we note that it can seamlessly take account of the mini-batch size B; in particular, the eﬀective learning rate scales as O(s/B) in the mini-batch setting (see more discussion in Smith et al. (2017)). Throughout this paper we focus on (2) and regard s alone as the eﬀective learning rate.3

Intuitively, a larger learning rate s gives rise to more stochasticity in the LR-dependent SDE (2), and vice versa. Accordingly, the learning rate must have a substantial impact on the dynamics of SGD in its continuous-time formulation. In stark contrast, this parameter plays a fundamentally diﬀerent role on gradient descent (GD) and stochastic gradient Langevin dynamics (SGLD) when one considers their approximating diﬀerential equations. In particular, consider GD:

xk+1 = xk s f(xk),

which can be modeled by the following ordinary diﬀerential equation (ODE):

and the SGLD algorithm, which adds Gaussian noise ξk to the GD iterates:

xk+1 = xk s f(xk) + sξk,

2. Although a Brownian motion is not diﬀerentiable, the formal notation d W(t)/dt can be given a rigorous interpretation Evans (2012); Villani (2006). 3. Recognizing that the variance of ξk is inversely proportional to the mini-batch size B, we assume that the noise term ξk has variance σ2/B. Under this assumption the resulting SDE reads d Xs = f(Xs)dt + σ p

s/Bd W. In light of this, the eﬀective learning rate through incorporating the mini-batch size is O(σ2s/B).

On Learning Rates and Schr odinger Operators

and its SDE model: d X = f(X)dt + d W.

These diﬀerential equations are derived in the same way as (2), namely by the Taylor expansion and retaining O(1) and O( s) terms.4 While the SDE for modeling SGD sets the square root of the learning rate to be its diﬀusion coeﬃcient, both the GD and SGLD counterparts are completely free of this parameter. This distinction between SGD and the other two methods is reﬂected in their diﬀerent numerical performance as revealed in Figure 2. The right plot of this ﬁgure shows that the behaviors of both GD and SGLD in the time t = ks scale are almost invariant in terms of optimization error with respect to the learning rate. In striking contrast, the stationary optimization error of SGD decreases signiﬁcantly as the learning rate decays. As a consequence of this distinction, GD and SGLD do not exhibit the phenomenon that is shown in Figure 1.

0 250 500 750 1000 10-3

0 25 50 75 100 10-3

Figure 2: Illustrative examples showing distinct behaviors of GD, SGD, and SGLD. The y-axis displays the optimization error f(xk) f(x ), where f(x ) denotes the minimum value of the objective and in the case of SGD and SGLD f(xk) denotes an average over 1000 replications. The objective function is f(x1, x2) = 5 10 2x2 1 + 2.5 10 2x2 2, with an initial point (8, 8), and the noise ξk in the gradient follows a standard normal distribution. Note that SGD with s = 1 is identical to SGLD with s = 1. As shown in the right panel, taking time t = ks as the x-axis, the learning rate has little to no impact on GD and SGLD in terms of optimization error.

1.1 Overview of contributions

The discussion thus far suggests that one may examine the eﬀect of the learning rate in SGD using the LR-dependent SDE (2). In particular, this SDE distinguishes SGD from GD and SGLD. Accordingly, in the current paper, we study the LR-dependent SDE and make the following contributions.

1. LR-dependent Fokker Planck Smoluchowski equation. The perspective of considering the evolution behavior of probability distributions over points instead of

4. The coeﬃcients of the O( s) terms turn out to be zero in both diﬀerential equations. See more discussion in Appendix A.1 and particularly Figure 12 therein.

Shi, Su, and Jordan

a single point is proposed in Wainwright and Jordan (2006); Jordan (2018). In this paper, we instantiate this perspective for SGD via the LR-dependent SDE and use existing techniques to derive the governing LR-dependent Fokker Planck Smoluchowski equation for the evolution of the probability densities. By utilizing the error decomposition in Raginsky et al. (2017), we show that, for a large class of (nonconvex) objectives, the continuous-time formulation of SGD converges to its stationary distribution at a linear rate.5 In particular, the solution Xs(t) to the LR-dependent SDE obeys Ef(Xs(t)) f ϵ(s) + C(s)e λst, (3)

where f denotes the global minimum of the objective function f, ϵ(s) denotes the risk at stationarity, and C(s) depends on both the learning rate and the distribution of the initial x0. Notably, we show that ϵ(s) decreases monotonically to zero as s 0, which is conducted from the temperature parameter in Raginsky et al. (2017). For any ﬁxed time T > 0, this bound can be carried over to the discrete case by a uniform approximation between SGD and the LR-dependent SDE (2). Speciﬁcally, the term C(s)e λst becomes C(s)e λsks, showing that the convergence is linear as well in the discrete regime. This is consistent with the numerical evidence from Figure 1 and Figure 2.

This convergence result sheds light on why SGD performs so well in many practical nonconvex problems. In particular, while GD can be trapped in a local minimum, SGD can eﬃciently escape it provided that the linear rate λs is not too small (this is the case if s is suﬃciently large; see the second contribution). This superiority of SGD in the nonconvex setting must be attributed to the noise in the gradient and this implication is consistent with earlier work showing that stochasticity in gradients signiﬁcantly accelerates the escape of saddle points for gradient-based methods Jin et al. (2017); Lee et al. (2016).

2. Eﬀect of learning rate on the nonconvex functions. The ﬁrst contribution stops short of saying anything about how λs depends on the learning rate s and the geometry of the objective f. Such an analysis is fundamental to an explanation of the diﬀering eﬀects of the learning rate in deep learning (nonconvex optimization) and convex optimization. In the current paper we show that if the objective f is a nonconvex function and satisﬁes certain regularity conditions, we have:6

for a certain value Hf > 0 that only depends on f. This expression for λs enables a concrete interpretation of the eﬀect of learning rate in Figure 1. In brief, in the nonconvex setting, λs decreases to zero quickly as the learning rate s tends to zero. As a consequence, with a large learning rate s at the beginning, SGD converges rapidly to stationarity and the rate becomes smaller as the learning rate decreases.

5. Roughly speaking, stationarity refers to the distribution of Xs(t) in the limit t . See a more precise deﬁnition in Figure 3. 6. We write am bm if there exist positive constants c and c such that cbm am c bm for all m.

On Learning Rates and Schr odinger Operators

For comparison, λs is equal to µ if f is µ-strongly convex for µ > 0, regardless of the learning rate s. (In this case, the solution to the SDE converges to the global minimum with a learning rate of 1/t Hazan et al. (2008).) As such, the convergence behaviors of SGD are necessarily diﬀerent between convex and nonconvex objectives. To appreciate this implication, we refer to Figure 3. Note that all four plots show

0 500 1000 1500 2000 10-2

0 25 50 75 100 10-2

0 10000 20000 30000 40000 50000 10-2

0 500 1000 1500 2000 2500 10-2

Figure 3: The dependence of the optimization dynamics of SGD on the learning rate diﬀers between convex objectives and nonconvex objectives. The learning rate is set to either s = 0.1 or s = 0.05. The two top plots consider minimizing a convex function f(x1, x2) = 5 10 2x2 1 + 2.5 10 2x2 2, with an initial point (8, 8), and the bottom plots consider minimizing a nonconvex function f(x1, x2) = [(x1 + 0.7)2 + 0.1](x1 0.7)2 + (x2 +0.7)2[(x2 0.7)2 +0.1], with an initial point ( 0.9, 0.9). The gradient noise is drawn from the standard normal distribution. All results are averaged over 10000 independent replications.

that a larger learning rate gives rise to a larger stationary risk, as predicted by the monotonically increasing nature of ϵ with respect to s in (3). The most salient part of this ﬁgure is, however, shown in the right panel. Speciﬁcally, the right panel, which uses time t as the x-axis, shows that in the (strongly) convex setting the linear rate of the convergence is roughly the same between the two choices of learning rate, which is consistent with the result that λs is constant in the case of a strongly convex objective. In the nonconvex case (bottom right), however, the rate of convergence is more rapid

Shi, Su, and Jordan

with the larger learning rate s = 0.1, which is implied by the fact that λ0.1 > λ0.05. In stark contrast, the two plots in the left panel, which use the number k of iterations for the x-axis, are observed to have a larger rate of linear convergence with a larger learning rate. This is because in the k scale the rate λss of linear convergence always increases as s increases no matter if the objective is convex or nonconvex.

The mathematical tools that we bring to bear in analyzing the LR-dependent SDE (2) are as follows. We establish the linear convergence via a Poincar e-type inequality that is due to Villani Villani (2009). The asymptotic expression for the rate λs is proved by making use of the spectral theory of the Schr odinger operator or, more concretely, the Witten-Laplacian associated with the Fokker Planck Smoluchowski equation that governs the LR-dependent SDE. Diﬀerent from the traditional probabilistic analysis, functional approaches are based on couplings, and the analysis based on the Schr odinger operator is based on the spectral theory of the operator, which is essentially an inﬁnite-dimensional generalization of the ﬁnite-dimensional matrix. In other words, the analysis from operators reposes on an inﬁnite-dimensional system, and generalizes the classical convergence analysis for a ﬁnite-dimensional dynamical system via the eigenvalues of the matrix Hirsch et al. (2012). Additionally, the spectral theory can be easily generalized to the momentum case. We believe that these tools will prove to be useful in theoretical analyses of other stochastic approximation methods.

1.2 Related work

Recent years have witnessed a surge of research devoted to explanations of the eﬀectiveness of deep neural networks, with a particular focus on understanding how the learning rate aﬀects the behavior of stochastic optimization. In Smith et al. (2017); Keskar et al. (2016), the authors uncovered various tradeoﬀs linking the learning rate and the mini-batch size. Moreover, Jastrzebski et al. (2017, 2018) related the learning rate to the generalization performance of neural networks in the early phase of training. This connection has been further strengthened by the demonstration that learning rate decay encourages SGD to learn features of increasing complexity Li et al. (2019b); You et al. (2019). From a topological perspective, Davis et al. (2019) establish connections between the learning rate and the sharpness of local minima. Empirically, deep learning models work well with non-decaying schedules such as cyclical learning rates Loshchilov and Hutter (2016); Smith (2017) (see also the review Sun (2019)), with recent theoretical justiﬁcation Li and Arora (2019). In a diﬀerent direction, there has been a ﬂurry of activity in using dynamical systems to analyze discrete optimization methods. For example, Su et al. (2016); Wibisono et al. (2016); Shi et al. (2018) derived ODEs for modeling Nesterov s accelerated gradient methods and used the ODEs to understand the acceleration phenomenon (see the review Jordan (2018)). In the stochastic setting, this approach has been recently pursued by various authors Chaudhari et al. (2018); Chaudhari and Soatto (2018); Mandt et al. (2016); Lee et al. (2016); Caluya and Halder (2019); Li et al. (2017) to establish various properties of stochastic optimization. As a notable advantage, the continuous-time perspective allows us to work without assumptions on the boundedness of the domain and gradients, as opposed to older analyses of SGD (see, for example, Hazan et al. (2008)).

On Learning Rates and Schr odinger Operators

Our work is motivated in part by the recent progress on Langevin dynamics, in particular in nonconvex settings Villani (2009); Pavliotis (2014); Helﬀer et al. (2004); Bovier et al. (2005). In relating to Langevin dynamics, s in the LR-dependent SDE can be thought of as the temperature parameter and, under certain conditions, this SDE has a stationary distribution given by the Gibbs measure, which is proportional to exp( 2f/s). Of particular relevance to the present paper from this perspective is a line of work that has considered the optimization properties of SGLD and analyzed its convergence rates Hwang (1980); Raginsky et al. (2017); Zhang et al. (2017). The LR-dependent SDE is formally similar to SGLD, in particular they both share the same Gibbs invariant distribution Raginsky et al. (2017). Linear convergence can be established for SGLD via the technique of the synchronous coupling Eberle (2016). Our approach provides an alternative to this line of work. The LR-dependent SDE is derived in our work as a surrogate for approximating SGD, and our analysis makes use of the Poincar e inequality under the Villani condition to obtain the L2-distance of the probability densities instead of the 2-Wasserstein distance Raginsky et al. (2017). The advantages of our analysis hinge on the fact that it provides a concise and sharp delineation of the convergence rate based on the geometric properties of the objective function.

1.3 Organization

The remainder of the paper is structured as follows. In Section 2 we introduce basic assumptions and techniques employed throughout the paper. Section 3 develops our main theorems. In Section 4, we use the results of Section 3 to oﬀer insights into the beneﬁt of taking a larger initial learning rate followed by a sequence of decreasing learning rates in training neural networks. Section 5 formally proves the linear convergence (3) and Section 6 further speciﬁes the rate of convergence (4). Technical details of the proofs are deferred to the appendices. We conclude the paper in Section 7 with a few directions for future research.

2. Preliminaries

Throughout this paper, we assume that the objective function f is inﬁnitely diﬀerentiable in Rd; that is, f C (Rd). We use to denote the standard Euclidean norm.

Deﬁnition 1 (Conﬁning condition Pavliotis (2014); Markowich and Villani (1999)) A function f is said to be conﬁning if it is inﬁnitely diﬀerentiable and satisﬁes lim x + f(x) = + and exp( 2f/s) is integrable for all s > 0: Z

This condition is quite mild; it essentially requires that the function grows suﬃciently rapidly when x is far from the origin. This condition is met, for example, when an ℓ2 regularization term is added to the objective function f or, equivalently, weight decay is employed in the SGD update. Next, we need to show that the LR-dependent SDE (2) with an arbitrary learning rate s > 0 admits a unique global solution under mild conditions on the objective f. We will

Shi, Su, and Jordan

show in Section 3.3 that the solution to this SDE approximates the SGD iterates well. The formal description is shown rigorously in Proposition 8. Recall that the LR-dependent SDE (2) is d Xs = f(Xs)dt + sd W,

where the initial point Xs(0) is distributed according to a probability density function ρ in Rd, independent of the standard Brownian motion W. It is well known that the probability density ρs(t, ) of Xs(t) evolves according to the LR-dependent Fokker Planck Smoluchowski equation ρs

t = (ρs f) + s

with the boundary condition ρs(0, ) = ρ. Here, is the Laplacian. For completeness, in Appendix A.2 we derive this LR-dependent Fokker Planck Smoluchowski equation from the LR-dependent SDE (2) by Itˆo s formula. If the objective f satisﬁes the conﬁning condition, then this equation admits a unique invariant Gibbs distribution that takes the form

The proof of uniqueness is shown in Appendix A.3. The normalization factor is Zs = R

s dx. Taking any initial probability density ρs(0, ) ρ in L2(µ 1 s ) (a measurable

function g is said to belong to L2(µ 1 s ) if g µ 1 s := R

Rd g2µ 1 s dx 1

2 < + ), we have the following guarantee:

Lemma 2 (Existence and uniqueness of the weak solution) For any conﬁning function f and any initial probability density ρ L2(µ 1 s ), the LR-dependent SDE (2) admits a weak solution whose probability density in C1 [0, + ), L2(µ 1 s ) is the unique solution to the LR-dependent Fokker Planck Smoluchowski equation (5).

The proof of Lemma 2 can be obtained by Harnack s inequality, a classical approach using a second-order elliptic operator, as described in Bogachev et al. (2009). We present an alternative proof of Lemma 2 based on the spectral theory of the Schr odinger operator in Appendix A.4. We also present a companion result in Lemma 10 in Section 5, which shows that the probability density ρs(t, ) converges to the Gibbs distribution as t . Finally, we need a condition that is due to Villani for the development of our main results in the next section.

Deﬁnition 3 (Villani condition Villani (2009)) A conﬁning function f is said to satisfy the Villani condition if f(x) 2/s f(x) + as x + for all s > 0.

This condition amounts to saying that the gradient has a suﬃciently large squared norm compared with the Laplacian of the function. Strictly speaking, some loss functions used for training neural networks might not satisfy this condition. However, the Villani condition does not look as stringent as it appears since this condition is essentially concerned with the function at inﬁnity. In this paper, we use the Villani condition to derive the Poincar e inequality and the discrete spectrum of the Witten-Laplacian. There are alternatives to the

On Learning Rates and Schr odinger Operators

Villani condition; see (Bakry et al., 2008, Corollary 1.6). For example, we can replace the Villani condition with the following condition

as x + . However, it is unknown whether these conditions lead to the result that the spectrum of Witten-Laplacian is discrete.

3. Main Results

In this section, we state our main results. In brief, in Section 3.1 we show linear convergence to stationarity for SGD in its continuous formulation, the LR-dependent SDE. In Section 3.2, we derive a quantitative expression of the rate of linear convergence and study the diﬀerence in the behavior of SGD in the convex and nonconvex settings. This distinction is further elaborated in Section 3.3 by carrying over the continuous-time convergence guarantees to the discrete case. Finally, Section 3.4 oﬀers an exposition of the theoretical results in the univariate case. Proofs of the results presented in this section are deferred to Section 5 and Section 6.

3.1 Linear convergence

In this subsection we are concerned with the expected excess risk, Ef(Xs(t)) f . Recall that f = infx f(x).

Theorem 1 Let f satisfy both the conﬁning condition and the Villani condition. Then there exists λs > 0 for any learning rate s > 0 such that the expected excess risk satisﬁes

Ef(Xs(t)) f ϵ(s) + D(s)e λst, (7)

for all t 0. Here ϵ(s) = ϵ(s; f) 0 is a strictly increasing function of s depending only on the objective function f, and D(s) = D(s; f, ρ) 0 depends only on s, f, and the initial distribution ρ.

Brieﬂy, the proof of this theorem is based on the following decomposition of the excess risk: Ef(Xs(t)) f = Ef(Xs(t)) Ef(Xs( )) + Ef(Xs( )) f ,

where we informally use Ef(Xs( )) to denote EX µsf(X) in light of the fact that Xs(t) converges weakly to µs as t + (see Lemma 10). The question is thus to quantify how fast Ef(Xs(t)) Ef(Xs( )) vanishes to zero as t and how the excess risk at stationarity Ef(Xs( )) f depends on the learning rate. The following two propositions address these two questions. Recall that ρ L2(µ 1 s ) is the probability density of the initial iterate in SGD.

Proposition 4 Under the assumptions of Theorem 1, there exists λs > 0 for any learning rate s such that

|Ef(Xs(t)) Ef(Xs( ))| C(s) ρ µs µ 1 s e λst,

Shi, Su, and Jordan

for all t 0, where the constant C(s) > 0 depends only on s and f, and where

ρ µs µ 1 s = Z

Rd (ρ µs)2 µ 1 s dx 1

measures the gap between the initialization and the stationary distribution.

Loosely speaking, it takes O(1/λs) time to converge to stationarity. In relating to Theorem 1, D(s) can be set to C(s) ρ µs µ 1 s . Notably, the proof of Proposition 4 shall reveal that C(s) increases as s increases. Turning to the analysis of the second term, Ef(Xs( )) f , we henceforth write ϵ(s) := Ef(Xs( )) f .

Proposition 5 Under the assumptions of Theorem 1, the excess risk at stationarity, ϵ(s), is a strictly increasing function of s. Moreover, for any S > 0, there exists a constant A that depends only on S and f and satisﬁes

ϵ(s) Ef(Xs( )) f As,

for any learning rate 0 < s S.

The two propositions are proved in Section 5. The proof of Theorem 1 is a direct consequence of Proposition 4 and Proposition 5. More precisely, the two propositions taken together give Ef(Xs(t)) f O(s) + C(s)e λst, (8)

for a bounded learning rate s. Note that a dimension-dependent upper bound of O(ds) is provided for ϵ(s) in (Raginsky et al., 2017, Section 3.5). This estimate is obtained by evaluating both the second moment of the invariant distribution via the Euclidean 2-Wasserstein distance and the integral constant based on the global gradient Lipschitz condition. However, it is worth noting that the global gradient Lipschitz condition may not be practical in real-world scenarios. In line with Theorem 2, the constant A in this case is also dependent on the geometry of f; for more details see Section 5.2. Taken together, these results oﬀer insights into the phenomena observed in Figure 1. In particular, Proposition 4 states that, from the continuous-time perspective, the risk of SGD with a constant learning rate applied to a (nonconvex) objective function converges to stationarity at a linear rate. Moreover, Proposition 5 demonstrates that the excess risk at stationarity decreases as the learning rate s tends to zero. This is in agreement with the numerical experiments illustrated in Figure 1, Figure 2, and Figure 3. For comparison, this property is not observed in GD and SGLD. The following result gives the time complexity of SGD in its continuous-time formulation.

Corollary 6 Under the assumptions of Proposition 5, for any ϵ > 0, if the learning rate

s min{ϵ/(2A), S} and t 1 λs log 2C(s) ρ µs µ 1 s ϵ , then

Ef(Xs(t)) f ϵ.

On Learning Rates and Schr odinger Operators

3.2 The rate of linear convergence

We now turn to the key issue of understanding how the linear rate λs depends on the learning rate. In this subsection, we show that for certain objective functions, λs admits a simple expression that allows us to interpret how the convergence rate depends on the learning rate. We begin by considering a strongly convex function. Recall the deﬁnition of strong convexity: for µ > 0, a function f is µ-strongly convex if

f(y) f(x) + f(x), y x + µ

for all x, y. Equivalently, f is µ-strong convex if all eigenvalues of its Hessian 2f(x) are greater than or equal to µ for all x (note that here f is assumed to be inﬁnitely diﬀerentiable). As is clear, a strongly convex function satisﬁes the conﬁning condition. In Appendix B.1, we prove the following proposition by making use of a Poincar e-type inequality, the Bakry Emery theorem Bakry et al. (2013).

Proposition 7 In addition to the assumptions of Theorem 1, assume that the objective f is a µ-strongly convex function. Then, λs in (7) satisﬁes λs = µ.

We turn to the more challenging setting where f is nonconvex. Let us refer to the objective f as a Morse function if its Hessian has full rank at any critical point x (that is, f(x) = 0).7

Theorem 2 In addition to the assumptions of Theorem 1, assume that the objective f is a Morse function and has at least two local minima.8 Then the constant λs in (7) satisﬁes

λs = (α + o(s))e 2Hf

for 0 < s s0, where s0 > 0, α > 0, and Hf > 0 are constants that all depend only on f.

The proof of this result relies on tools in the spectral theory of Schr odinger operators and is deferred to Section 6. From now on, we call λs in (7) the exponential decay constant. To obviate any confusion, o(s) in Theorem 2 stands for a quantity that tends to zero as s 0, and the precise expression for Hf shall be given in Section 6, with a simple example provided in Section 3.4. To leverage Theorem 2 for understanding the phenomena discussed in Section 1, however, it suﬃces to recognize the fact that Hf is completely determined by f. Moreover, we remark that while Theorem 1 shows that λs exists for any learning rate, the present theorem assumes a bounded learning rate. The key implication of this result is that the rate of convergence is highly contingent upon the learning rate s: the exponential decay constant increases as the learning rate s increases. Accordingly, the linear convergence to stationarity established in Section 3.1 is faster if s is larger, and, by recognizing the exponential dependence of λs on s, the convergence would

7. See Section 6.2 for a discussion of Morse functions. Note that (inﬁnitely diﬀerentiable) strongly convex functions are Morse functions. 8. We call x a local minimum of f if f(x) = 0 and the Hessian 2f(x) is positive deﬁnite. By convention, in this paper a global minimum is also considered a local minimum.

Shi, Su, and Jordan

be very slow if the learning rate s is very small. For example, if Hf = 0.05, setting s = 0.1 and s = 0.001 gives

λ0.1 λ0.001 e 1

e 100 = 9.889 1042.

Moreover, as we will see clearly in Section 6, λs is completely determined by the geometry of f. In particular, it does not depend on the probability distribution of the initial point or the dimension d given that the constant Hf has no direct dependence on the dimension d. For comparison, the linear rate in the nonconvex case is shown by Theorem 2 to depend on the learning rate s, while the linear rate of convergence stays constant regardless of s if the objective is strongly convex. This fundamental distinction between the convex and nonconvex settings enables an interpretation of the observation brought up in Figure 1, in particular the right panel of Figure 3. More precisely, with time t being the x-axis, SGD with a larger learning rate leads to a faster convergence rate in the nonconvex setting, while for the (strongly) convex setting the convergence rate is independent of the learning rate. For further in-depth discussion of the implications of Theorem 2, see Section 4.

3.3 Discretization

In this subsection, we carry over the results developed from the continuous perspective to the discrete regime. In addition to assuming that the objective function f satisﬁes the Villani condition, satisﬁes the conﬁning condition, and is a Morse function, we also now assume f to be L-smooth; that is, f has L-Lipschitz continuous gradients in the sense that f(x) f(y) L x y for all x, y. Moreover, we restrict the learning rate s to be no larger than 1/L. The following proposition is the key theoretical tool that allows translation to the discrete regime.

Proposition 8 For any L-smooth objective f and any initialization Xs(0) drawn from a probability density ρ L2(µ 1 s ), the LR-dependent SDE (2) has a unique global solution Xs in expectation; that is, EXs(t) as a function of t in C1([0, + ); Rd) is unique. Moreover, there exists B(T) > 0 such that the SGD iterates xk satisfy

max 0 k T/s |Ef(xk) Ef(Xs(ks))| B(T)s,

for any ﬁxed T > 0.

We note that there exists a sharp bound on B(T) in Bally and Talay (1996). For completeness, we also remark that the convergence can be strengthened to the strong sense:

max 0 k T/s E xk Xs(ks) B (T)s.

This result has appeared in Mil shtein (1975); Talay (1982); Pardoux and Talay (1985); Talay (1984); Kloeden and Platen (1992) and we provide a self-contained proof in Appendix B.2. We now state the main result of this subsection.

On Learning Rates and Schr odinger Operators

Theorem 3 In addition to the assumptions of Theorem 1, assume that f is L-smooth. Then, for any T > 0, the iterates of SGD with learning rate 0 < s 1/L satisfy

Ef(xk) f (A + B(T))s + C ρ µs µ 1 s e sλsk, (10)

for all k T/s, where λs is the exponential decay constant in (7), A as in Proposition 5 depends only on 1/L and f, C = C1/L is as in Proposition 4, and B(T) depends only on the time horizon T and the Lipschitz constant L.

Theorem 3 follows as a direct consequence of Theorem 1 and Proposition 8. Note that if f is a Morse function with at least two local minima, then λs appearing in (10) is given by (9), and if f is µ-strongly convex then λs = µ. As earlier in the continuous-time formulation, we also mention that the dimension parameter d is not an essential parameter for characterizing the rate of linear convergence. In relating to Figure 3, note that its left panel with k being the x-axis shows a faster linear convergence of SGD when using a larger learning rate, regardless of convexity or nonconvexity of the objective. This is because the linear rate sλs in (10) is always an increasing function of s even for the strongly convex case, where λs itself is constant.

3.4 A one-dimensional example

In this section we provide some intuition for the theoretical results presented in the preceding subsections. Our priority is to provide intuition rather than rigor. Consider the simple example of f presented in Figure 4, which has a global minimum x , a local minimum x , and a local maximum x .9 We use this toy example to gain insight into the expression (9) for the exponential decay constant λs; deferring the rigorous derivation of this number in the general case to Section 6. From (7) it appears that the LR-dependent SDE (2) takes about O(1/λs) time to achieve approximate stationarity. Intuitively, for the speciﬁc function in Figure 4, the bottleneck in achieving stationarity is to pass through the local maximum x . Now, we show that it takes about O(1/λs) time to pass x from the local minimum x . For simplicity, write

2(x x )2 + g(x),

where g(x) = f(x ) stays constant if x x ν for a very small positive ν and θ > 0. Accordingly, the LR-dependent SDE (2) is reduced to the Ornstein Uhlenbeck process,

d Xs = θ(Xs x )dt + sd W,

before hitting x . Denote by τx the ﬁrst time the Ornstein Uhlenbeck process hits x . It is well known that the hitting time obeys

Eτx πs (x x )θ

2 θ(x x )2 πs (x x )θ

9. We can also regard x as a saddle point in the sense that the Hessian at this point has one negative eigenvalue. See Section 6.2 for more discussion.

Shi, Su, and Jordan

Figure 4: A one-dimensional nonconvex function f. The height diﬀerence between x and x in this special case is the Morse saddle barrier Hf. See the formal deﬁnition in Deﬁnition 20.

where Hf := f(x ) f(x ) f(x ) g(x ) = 1 2θ(x x )2. This number, which we refer to as the Morse saddle barrier, is the diﬀerence between the function values at the local maximum x and the local minimum x in our case. As an implication of (11), the

continuous-time formulation of SGD takes time (at least) of the order e(1+o(1)) 2Hf

s to achieve approximate stationarity. This is consistent with the exponential decay constant λs given in (9). In passing, we remark that the discussion above can be made rigorous by invoking the theory of the Kramers escape rate, which shows that for this univariate case the hitting time satisﬁes Eτx = (1 + o(1)) π p

f (x )f (x ) e

See, for example, Freidlin and Wentzell (2012); Pavliotis (2014). Furthermore, we demonstrate the view from the theory of viscosity solutions and singular perturbations in Appendix B.3.

4. Why Learning Rate Decay?

As a widely used technique for training neural networks, learning rate decay refers to taking a large learning rate initially and then progressively reducing it during the training process. This technique has been observed to be highly eﬀective especially in the minimization of nonconvex objective functions using stochastic optimization methods, with a very recent strand of theoretical eﬀort aiming at understanding its beneﬁts You et al. (2019); Li et al. (2019b). In this section, we oﬀer a new and crisp explanation by leveraging the results in Section 3. To highlight the intuition, we primarily work with the continuous-time formulation of SGD. For purposes of illustration, Figure 5 presents numerical examples for this technique where the learning rate is set to 0.1 or 0.05. This ﬁgure clearly demonstrates that SGD

On Learning Rates and Schr odinger Operators

-1 -0.5 0 0.5 1 -1

-1 -0.5 0 0.5 1 -1

-1 -0.5 0 0.5 1 -1

-1 -0.5 0 0.5 1 -1

-1 -0.5 0 0.5 1 -1

-1 -0.5 0 0.5 1 -1

Figure 5: Scatter plots of the iterates xk R2 of SGD for minimizing the nonconvex function in Figure 3. This function has four local minima, of which the bottom right one is the global minimum. Each column corresponds to the same value of t = ks, and the ﬁrst row and second row correspond to learning rates 0.1 and 0.05, respectively. The gradient noise is drawn from the standard normal distribution. Each plot is based on 10000 independent SGD runs using the noise generator state 1-10000 in Matlab2019b, starting from an initial point ( 0.9, 0.9).

with a larger learning rate converges much faster to the global minimum than SGD with a smaller learning rate. This comparison reveals that a large learning rate would render SGD able to quickly explore the landscape of the objective function and eﬃciently escape bad local minima. On the other hand, a larger learning rate would prevent SGD iterates from concentrating around a global minimum, leading to substantial suboptimality. This is clearly illustrated in Figure 6. As suggested by the heuristic work on learning rate decay, we see that it is important to decrease the learning rate to achieve better optimization performance whenever the iterates arrive near a local minimum of the objective function.

Despite its intuitive plausibility, the exposition above stops short of explaining why nonconvexity of the objective is crucial to the eﬀectiveness of learning rate decay. Our results in Section 3, however, enable a concrete and crisp understanding of the vital importance of nonconvexity in this setting. Motivated by (8), we consider an idealized risk function of the form R(t) = as + (b as)e λst, with λs set to e c/s, where a, b, and c are positive constants for simplicity as opposed to the non-constants in the upper bound in (7). This function is plotted in Figure 7, with two quite diﬀerent learning rates, s1 = 0.1 and s2 = 0.001, as an implementation of learning rate decay. When the learning rate is s1 = 0.1, from the right panel of Figure 7, we see that rough stationarity is achieved at time t = ks 25; thus, the number of iterations k0.1 25/s = 250. In the case of s = 0.001, from the left panel of Figure 7, we see now it requires ks 2.5 1044 to reach rough stationarity, leading to

Shi, Su, and Jordan

-1 -0.5 0 0.5 1 -1

-1 -0.5 0 0.5 1 -1

Figure 6: The same setting as in Figure 5. Both plots correspond to the same value of t = ks = 1000.

k0.001 2.5 1047. This gives k0.001

In contrast, the sharp dependence of ks on the learning rate s is not seen for strongly convex functions, because λs = µ stays constant as the learning rate s varies. Following the preceding example, we have k0.001

0 1 2 3 4 5 6 7 8 9 10

0 10 20 30 40 50 60 70 80 90 100 10-2

Figure 7: Idealized risk function of the form R(t) = as+(b as)e e c s t with the identiﬁcation t = ks, which is adapted from (8). The parameters are set as follows: a = 1, b = 100, c = 0.1, and the learning rate is s = 0.1 or 0.001. The right plot is a locally enlarged image of the left.

While a large initial learning rate helps speed up the convergence, Figure 7 also demonstrates that a larger learning rate leads to a larger value of the excess risk at stationarity,

On Learning Rates and Schr odinger Operators

ϵ(s) Ef(Xs( )) f , which is indeed the claim of Proposition 5. Leveraging Proposition 4, we show below why annealing the learning rate at some point would improve the optimization performance. To this end, for any ﬁxed learning rate s, consider a stopping time T δ s that is deﬁned as

T δ s := inf t {|Ef(Xs(t)) Ef(Xs( ))| δϵ(s)} ,

for a small δ > 0. In words, the LR-dependent SDE (2) at time T δ s is approximately stationary since its risk Ef(Xs(t)) f is mainly comprised of the excess risk at stationarity ϵ(s), with a total risk of no more than (1+δ)ϵ(s). From Proposition 4 it follows that (recall that ρ is the initial distribution):

λs log C(s) ρ µs µ 1 s δϵ(s) = e

s γ + o(s) log C(s) ρ µs µ 1 s δϵ(s) . (12)

In addition to taking a large s, an alternative way to make T δ s small is to have an initial distribution ρ that is close to the stationary distribution µs. This can be achieved by using the technique of learning rate decay. More precisely, taking a larger learning rate s1 for a while, at the end the distribution of the iterates is approximately the stationary distribution µs1, which serves as the initial distribution for SGD with a smaller learning rate s2 in the second phase. Taking ρ µs1, the factor ρ µs µ 1 s in (12) for the second phase of learning rate decay is approximately

µs1 µs2 µ 1 s2 = Z (µs1 µs2)2µ 1 s2 dx 1

2 = Z µ2 s1 µs2 dx 1 1

Both µs1 and µs2 are decreasing functions of f and, therefore, have the same modes. As s1 0, both µs1 and µs2 tend to δ(x x ), thereby implying µs1/µs2 1. As a consequence, the integral of µ2 s1/µs2 minus one is small by appealing to the rearrangement inequality, thereby leading to fast convergence of SGD with learning rate s2 to the stationary risk ϵ(s2). In contrast, ρ µs2 µ 1 s2 would be much larger for a general random initialization ρ. Put simply, SGD with learning rate s2 cannot achieve a risk of approximately ϵ(s2) given the same number of iterations without the warm-up stage using learning rate s1. See Figure 8 for an illustration.

ρ µs1 µs2 large learning rate s1 small learning rate s2

Figure 8: Learning rate decay. The ﬁrst phase uses a larger learning rate s1, at the end of which the SGD iterates are approximately distributed as µs1. The second phase uses a smaller learning rate s2 and at the end the distribution of the SGD iterates roughly follows µs2.

In Chiang et al. (1987), the concept of simulated annealing is introduced to the diﬀusion process. It is equivalent to the time-decay learning rate as s = c/ log t in the LR-dependent SDE. Through probabilistic analysis, Chiang et al. (1987) derives that the linear rate is t c

Shi, Su, and Jordan

and the invariant distribution is exp( 2f(x) log t/c).10 Although the invariant distribution concentrates on the global minima as t approaches inﬁnity, the linear rate decays rapidly, resulting in extremely slow convergence of the distribution. However, it is worth noting that the process described in Chiang et al. (1987) is a single-phase process. In practice, the phenomenon generated by the learning rate decaying piecewise is likely a two-phase process. The ﬁrst phase involves global convergence with the learning rate exp( 2Hf/s1), while the second phase is likely the local convergence with the learning rate µ. This is because most of the invariant distribution exp( 2f(x)/s1) concentrates on the neighborhood of the global minima. We note also that Kushner (1987) provides a derivation of both the mean escape time and the mean transition time using the theory of large deviations for the discrete case.

5. Proof of the Linear Convergence

In this section, we prove Proposition 4 and Proposition 5, leading to a complete proof of Theorem 1.

5.1 Proof of Proposition 4

To better appreciate the linear convergence of the LR-dependent SDE (2), as established in Proposition 4, we start by showing the convergence to stationarity without a rate. In fact, this intermediate result constitutes a necessary step in the proof of Proposition 4. The techniques presented in this section are standard in the literature (see, for example, Villani (2009); Pavliotis (2014)).

Convergence without a rate. Recall that we use ρ to denote the initial probability density in L2(µ 1 s ). Superﬁcially, it seems that the most natural space for probability densities is L1(Rd). However, it is mathematically convenient to work on an inner product space as opposed to a general Banach space to prove convergence results for the LR-dependent SDE. Indeed, studying densities in L2(µ 1 s ) is a common strategy. Formally, the following result says that any (nonnegative) function in L2(µ 1 s ) can be normalized to be a density function. The proof of this simple lemma is shown in Appendix C.1.

Lemma 9 Let f satisfy the conﬁning condition. Then, L2(µ 1 s ) is a subset of L1(Rd).

The following result shows that the solution to the LR-dependent SDE converges to stationarity in terms of the dynamics of its probability densities over time.

Lemma 10 Let f satisfy the conﬁning condition and denote the initial distribution as ρ L2(µ 1 s ). Then, the unique solution ρs(t, ) C1 [0, + ), L2(µ 1 s ) to the Fokker Planck Smoluchowski equation (5) converges in L2(µ 1 s ) to the Gibbs invariant distribution µs, which is speciﬁed by (6).

10. The constant c mentioned here is used to distinguish it from the previous constant c. It serves as a separate identiﬁer to avoid confusion or ambiguity.

On Learning Rates and Schr odinger Operators

Proof [Proof of Lemma 10] We have

d dt ρs(t, ) µs 2 µ 1 s = d

Rd (hs(t, x) 1)2 dµs

Rd(hs 1)Ls(hs 1)dµs,

where the last equality is due to (15). Next, we proceed by making use of Lemma 11:

Rd(hs 1)Ls(hs 1)dµs = s Z

Rd (hs 1) (hs 1)dµs

Rd hs 2dµs 0. (14)

Thus, ρs(t, ) µs 2 µ 1 s is a strictly decreasing function, decreasing asymptotically towards the equilibrium state Z

Rd hs 2dµs = 0.

This equality holds, however, only if hs(t, ) is constant. Because both ρs(t, ) and µs are probability densities, this case must imply that hs(t, ) 1; that is, ρs(t, ) µs. Therefore, ρs(t, ) C1 [0, + ), L2(µ 1 s ) converges to the Gibbs invariant distribution µs in L2(µ 1 s ).

Note that the existence and uniqueness of ρs(t, ) is ensured by Lemma 2. The convergence guarantee on ρs(t, ) in Lemma 10 relies heavily on the following lemma (Lemma 11). This preparatory lemma introduces the transformation

hs(t, ) = ρs(t, )µ 1 s C1 [0, + ), L2(µs) ,

which allows us to work in the space L2(µs) in place of L2(µ 1 s ) (a measurable function g

is said to belong to L2(µs) if g µs := R

2 < + ).11 It is not hard to show that hs satisﬁes the following equation

t = f hs + s

with the initial distribution hs(0, ) = ρµ 1 s L2(µs). The linear operator

has a crucial property, as stated in the following lemma, whose proof is provided in Appendix C.2.

Lemma 11 The linear operator Ls in (16) is self-adjoint and nonpositive in L2(µs). Explicitly, for any g1, g2, this operator obeys Z

Rd(Lsg1)g2dµs = Z

Rd g1Lsg2dµs = s

Rd g1 g2dµs.

11. Here, dµs stands for the probability measure dµs µsdx = 1 Zs exp( 2f/s)dx.

Shi, Su, and Jordan

Linear convergence. We turn to the proof of linear convergence. We ﬁrst state a lemma which serves as a fundamental tool for us to prove a linear rate of convergence for Proposition 4.

Lemma 12 (Theorem A.1 in Villani (2009)) If f satisﬁes both the conﬁning condition and the Villani condition, then there exists λs > 0 such that the measure dµs satisﬁes the following Poincar e-type inequality

for any h such that the integrals are well deﬁned.

For completeness, we provide a proof of this Poincar e-type inequality in Appendix C.3. For comparison, the usual Poincar e inequality is put into use for a bounded domain, as opposed to the entire Euclidean space as in Lemma 12. In addition, while the constant in the Poincar e inequality in general depends on the dimension (see, for example, (Evans, 2010, Theorem 1, Chapter 5.8)), λs in Lemma 12 is completely determined by geometric properties of the objective f. See details in Section 6. Importantly, Lemma 12 allows us to obtain the following lemma, from which the proof of Proposition 4 follows readily. The proof of this lemma is given at the end of this subsection.

Lemma 13 Under the assumptions of Proposition 4, ρs(t, ) converges to the Gibbs invariant distribution µs in L2(µ 1 s ) at the rate

ρs(t, ) µs µ 1 s e λst ρ µs µ 1 s . (17)

Proof [Proof of Proposition 4] Using Lemma 13, we get

|Ef(Xs(t)) Ef(X( ))| =

Rd(f(x) f ) (ρs(t, x) µs(x)) dx

Rd(f(x) f )2µs(x)dx 1

Rd (ρs(t, x) µs(x))2 µ 1 s dx 1

C(s)e λst ρ µs µ 1 s ,

where the ﬁrst inequality applies the Cauchy-Schwarz inequality and

Rd(f f )2µsdx 1

is an increasing function of s.

We conclude this subsection with the proof of Lemma 13, which is well known and can be found in Bakry et al. (2014) for instance.

On Learning Rates and Schr odinger Operators

Proof [Proof of Lemma 13] It follows from (14) that

d dt ρs(t, ) µs 2 µ 1 s = s Z

Rd hs 2dµs.

Next, using Lemma 12 and recognizing the equality R

Rd hsdµs = R

Rd ρs(t, x)dx = 1, we get

d dt ρs(t, ) µs 2 µ 1 s 2λs

Rd h2 sdµs 1

= 2λs ρs(t, ) µs 2 µ 1 s .

Integrating both sides yields (17), as desired.

5.2 Proof of Proposition 5

Next, we turn to the proof of Proposition 5. We ﬁrst state a technical lemma, deferring its proof to Appendix C.4.

Lemma 14 Under the assumptions of Proposition 5, the excess risk at stationarity ϵ(s) satisﬁes dϵ(0)

Using Lemma 14, we now ﬁnish the proof of Proposition 5. Proof [Proof of Proposition 5] Letting g = f f , we write the excess risk at stationarity as

ϵ(s) = Ef(Xs( )) f =

which yields the following derivative:

Making use of the Cauchy-Schwarz inequality, the derivative satisﬁes dϵ(s)

ds 0 for all s > 0. In fact, the equality holds only in the case of a constant f is a constant, which contradicts both the conﬁning condition and the Villani condition. Hence, the inequality can be strengthened to dϵ(s)

for s > 0. Consequently, we have proven that the excess risk ϵ(s) at stationarity is a strictly increasing function of s [0, + ).

Shi, Su, and Jordan

Next, from Fatou s lemma we get

ϵ(0) lim sup s 0+ ϵ(s) Z

Rd lim s 0+ gµsdx = f f = 0

ϵ(0) lim inf s 0+ ϵ(s) Z

Rd lim s 0+ gµsdx = f f = 0.

As a consequence, ϵ(0) = 0. Lemma 14 shows that for any S > 0, there exists A = AS such that 0 dϵ(s)

ds A for all 0 s S. This fact, combined with ϵ(0) = 0, immediately gives ϵ(s) As for all 0 s S.

6. Geometrizing the Exponential Decay Constant

Having established the linear convergence to stationarity for the LR-dependent SDE, we now oﬀer a quantitative characterization of the exponential decay constant λs for a class of nonconvex objective functions. This is crucial for us to obtain a clear understanding of the dynamics of SGD and especially its dependence on the learning rate in the nonconvex setting.

6.1 Connection with a Schr odinger operator

We begin by deriving a relationship between the LR-dependent SDE (2) and a Schr odinger operator.12 Recall that the probability density ρs(t, ) of the SDE solution is assumed to be in L2(µ 1 s ). Consider the transformation

ψs(t, ) = ρs(t, ) µs L2(Rd).

This transformation allows us to equivalently write the Fokker Planck Smoluchowski equation (5) as ψs

ψs = s + Vs

with the initial condition ψs(0, ) = ρ µs L2(Rd). s + Vs is a Schr odinger perator, where the potential

is positive for suﬃciently large x due to the Villani condition. Now, we collect some basic facts concerning the spectrum of the Schr odinger operator s + Vs. First, it is a positive semideﬁnite operator, as shown below. Recognizing the uniqueness of the Gibbs distribution (6), it is not hard to show that µs is the unique

12. The theory of Schr odinger operators is a major component of classical spectral theory; please see the references Hislop and Sigal (2012); Helﬀer (2013); Reed and Simon (1978).

On Learning Rates and Schr odinger Operators

eigenfunction of s + Vs with a corresponding eigenvalue of zero. Using this fact, from the proof of Lemma 13, we get

( s + Vs)ψs(t, ), ψs(t, ) = ( s + Vs)(ψs(t, ) µs), ψs(t, ) µs

dt ψs(t, ) µs, ψs(t, ) µs

dt ρs(t, ) µs 2 µ 1 s

Rd (ρs(t, )µ 1 s ) 2dµs

where , denotes the standard inner product in L2(Rd). In fact, this inequality can be extended to ( s + Vs)g, g 0 for any g. This veriﬁes the positive semideﬁniteness of the Schr odinger operator s + Vs. Next, making use of the fact that 1

s Vs(x) + as x + , we state the following well-known result in spectral theory that the Schr odinger operator has a purely discrete spectrum in L2(Rd) Hislop and Sigal (2012). A spectrum is said to be discrete if it takes on distinct eigenvalues, with gaps between one value and the next (see, for example, (Hislop and Sigal, 2012, Deﬁnition 1.4)).

Lemma 15 (Theorem 10.7 in Hislop and Sigal (2012)) Assume that V is continuous, and V (x) + as x + . Then the operator + V has a purely discrete spectrum.

Taken together, the positive semideﬁniteness of s + Vs and Lemma 15 allow us to order the eigenvalues of s + Vs in L2(Rd) as

0 = ζs,0 < ζs,1 ζs,ℓ < + .

Let {ψs,i(x)} i=0 represent the eigenfunctions of the Schr odinger operator s + Vs in L2(Rd). The solution to the equivalent form of the Fokker Planck Smoluchowski equation (18) can be expressed in the following form:

i=0 ci(t)ψs,i(x). (19)

By substituting (19) into (18), we obtain the following equality:

i=0 ci(t)ψs,i(x) = 1

i=0 ci(t)( s + Vs)ψs,i(x) = 1

i=0 ζs,ici(t)ψs,i(x).

Additionally, we know that the coeﬃcients decay exponentially in t:

ci(t) = e 1

2 ζs,itci(0).

Shi, Su, and Jordan

Therefore, the closed-form solution to (18) is

2 ζs,itci(0)ψs,i( ).

A crucial fact from this representation is that the exponential decay constant λs in Lemma 13 can be set to

2ζs,1. (20)

To see this, note that ψs(t, ) µs also satisﬁes (18) and is orthogonal to the null eigenfunction µs. Therefore, the norm of ψs(t, ) µs must decay exponentially at a rate determined by half of the smallest positive eigenvalue of Hs.13 That is, we have

ψs(t, ) µs, ψs(t, ) µs e 2 ζs,1

2 t ψs(0, ) µs, ψs(0, ) µs

= e ζs,1t ψs(0, ) µs, ψs(0, ) µs ,

which is equivalent to

ρs(t, ) µs µ 1 s e ζs,1

2 t ρ µs µ 1 s .

As such, we can take λs = 1

2ζs,1 in the proof of Lemma 13. As a consequence of this discussion, we seek to study the Fokker Planck Smoluchowski equation (5) by analyzing the spectrum of the linear Schr odinger operator (18), especially its smallest positive eigenvalue δs,1. To facilitate the analysis, a crucial observation is that this Schr odinger operator is equivalent to the Witten-Laplacian,

s f := s( s + Vs) = s2 + f 2 s f, (21)

by a simple scaling. Denoting by the eigenvalues of the Witten-Laplacian as 0 = δs,0 < δs,1 δs,ℓ < + , we obtain the simple relationship

δs,ℓ= sζs,ℓ,

for all ℓ. The spectrum of the Witten-Laplacian has been the subject of a large literature Helﬀer and Nier (2005); Bovier et al. (2005); Nier (2004); Arnol d and Khesin (1999), and in the next subsection, we exploit this literature to derive a closed-from expression for the ﬁrst positive eigenvalue of the Witten-Laplacian, thereby obtaining the dependence of the exponential decay constant on the learning rate for a certain class of nonconvex objective functions H erau et al. (2011); Michel (2019).

13. Here, the norm of ψs(t, ) µs is induced by the inner product in L2(Rd). That is,

ψ(t, ) µs L2(Rd) = q

ψ(t, ) µs, ψ(t, ) µs .

On Learning Rates and Schr odinger Operators

6.2 The spectrum of the Witten-Laplacian: nonconvex Morse functions

We proceed by imposing the mild condition on the objective function that its ﬁrst-order and second-order derivatives cannot be both degenerate anywhere. Put diﬀerently, the objective function is a Morse function. This allows us to use the theory of Morse functions to provide a geometric interpretation of the spectrum of the Witten-Laplacian.

Basics of Morse theory. We give a brief introduction to Morse theory at the minimum level that is necessary for our analysis. Let f be an inﬁnitely diﬀerentiable function deﬁned on Rn. A point x is called a critical point if the gradient f(x) = 0. A function f is said to be a Morse function if for any critical point x, the Hessian 2f(x) at x is nondegenerate; that is, all the eigenvalues of the Hessian are nonzero. The objective f is assumed to be a Morse function throughout Section 6.2. Note also that we refer to a point x as a local minimum if x is a critical point and all eigenvalues of the Hessian at x are positive. Next, we deﬁne a certain type of saddle point. To this end, let η1(x) η2(x) ηd(x) be the eigenvalues of the Hessian 2f(x) at x.14 A critical point x is said to be an index-1 saddle point if the Hessian at x has exactly one negative eigenvalue, that is, η1(x) ηd 1(x) > 0, ηd(x) < 0. Of particular importance to this paper is a special kind of index-1 saddle point that will be used to characterize the exponential decay constant. Letting Kν := x Rd : f(x) < ν denote the sublevel set at level ν, for an index-1 saddle point x, it is intuitive to imagine that the set Kf(x) {x : x x < r} can be partitioned into two connected components, say C1(x, r) and C2(x, r), if the radius r is suﬃciently small. The following deﬁnition rigorously diﬀerentiates index-1 separating saddle points from the other saddle points.

Deﬁnition 16 Let x be an index-1 saddle point and r > 0 be suﬃciently small. If C1(x, r) and C2(x, r) are contained in two diﬀerent (maximal) connected components of the sublevel set Kf(x), we call x an index-1 separating saddle point.

The remainder of this section aims to relate index-1 separating saddle points to the convergence rate of the LR-dependent SDE. For ease of reading, the remainder of the paper uses x to denote an index-1 separating saddle point and writes X for the set of all these points. To give a geometric interpretation of Deﬁnition 16, let x 1 and x 2 denote local minima in the two maximal connected components of Kf(x ), respectively. Intuitively speaking, the index-1 separating saddle point x is the bottleneck of any path connecting the two local minima. More precisely, along a path connecting x 1 and x 2, by deﬁnition the function f must attain a value that is at least as large as f(x ). In this regard, the function value at x plays a fundamental role in determining how long it takes for the LR-dependent SDE initialized at x 1 to arrive at x 2. See an illustration in Figure 9. As is assumed in this section, f is a Morse function and satisﬁes both the conﬁning and the Villani conditions; in this case, it can be shown that the number of the critical points of f is ﬁnite. Thus, denote by n the number of index-1 separating saddle points of f and let n denote the number of local minima.

14. Note that here we order the eigenvalues from the largest to the smallest, as opposed to the case of the Schr odinger operator previously.

Shi, Su, and Jordan

Figure 9: The landscape of a two-dimensional nonconvex Morse function. Here, x 1 and x 2 denote two local minima. Both x and x+ are index-1 saddle points, but only the former is an index-1 separating saddle point since f(x ) < f(x ). In the two bottom plots, the deep blue regions form the sublevel sets at f(x ) or f(x ). Note that the sublevel set induced by x is the union of two connected components.

H erau Hitrik Sj ostrand s generic case. To describe the labeling procedure, consider the set of the objective values at index-1 separating saddle points V = {f(x ) : x X }. This is a ﬁnite set and we use I to denote the cardinality of this set. Write V = {ν1, . . . , νI} and sort these values as + = ν0 > ν1 > > νI, (22)

where by convention ν0 = + corresponds to a ﬁctive saddle point at inﬁnity. Next, we follow H erau et al. (2011) and deﬁne a type of connected components of sublevel set.

Deﬁnition 17 A connected component E of the sublevel set Kν for some ν V is called a critical component if either E X = or E = Rd, where E is the boundary of E.

In this deﬁnition, the case of E = Rd applies only if ν = ν0 = + . If ν = νi for some 1 i I is only attained by one index-1 separating saddle point, the sublevel set Kνi has two critical components. See Deﬁnition 16 for more details.

On Learning Rates and Schr odinger Operators

With these preparatory notions in place, we describe the following procedure for labeling index-1 separating saddle points and local minima H erau et al. (2011). See Figure 10 for an illustration of this process.

E2 1 E2 2 E2 3

Figure 10: A generic one-dimensional Morse function. The labeling process gives rise to a one-toone correspondence between the local minimum x ij and the index-1 separating saddle point x i,j (which are also local maxima) for all i, j.

1. Let E0 1 := Rd. Note that the global minimum x is contained in E0 1 and denote

x 0 := x = argmin x E0 1 f(x).

Let X 0 denote the singleton set {x }.

2. Let E1 j for j = 1, . . . , m1 be the critical components of the sublevel set Kν1. Note that E1 1 E1 m1 is a (proper) subset of Kν1. Without loss of generality, assume x E1 m1. Then, we select x 1,j1 as

x 1,j1 = argmin x E1 j1 f(x).

Deﬁne X 1 := {x 1,1, . . . , x 1,m1 1}.

3. For i = 2, . . . , I, let Ei j for j = 1, . . . , mi be the critical components of the sublevel set Kνi. Without loss of generality, we assume that the critical components are ordered

Shi, Su, and Jordan

such that there exists an integer ki mi satisfying

Ei j \ i 1 [

for any j = ki + 1, . . . , mi. Set x i,j to

x i,j = argmin x Ei j f(x),

for j = 1, . . . , ki. Deﬁne X i := {x i,1, . . . , x i,ki}.

To make the labeling process above valid, however, we need to impose the following assumption on the objective. This assumption is generic in the sense that it should be satisﬁed by a generic Morse function.

Assumption 18 (Generic case H erau et al. (2011)) For every critical component Ei j selected in the labeling process above, where i = 0, 1, . . . , I, we assume that

The minimum x i,j of f in any critical component Ei j is unique.

If Ei j X = , there exists a unique x i,j Ei j X such that f(x i,j) = max x Ei j X f(x).

In particular, Ei j Kf(x i,j) is the union of two distinct critical components.

The ﬁrst condition in this assumption requires that there exists a unique minimum of the objective f in every critical component Ei j. In particular, the global minimum x is unique under this assumption. In addition, the second condition requires that among all index-1 separating saddle points in Ei j, if any, f attains the maximum at exactly one of these points. Under Assumption 18, the above labeling process includes all the local minima of f. Moreover, it reveals a remarkable result: there exists a bijection between the set of local minima and the set of index-1 separating saddle points (including the ﬁctive one) X { }. As shown in the labeling process, for any local minimum x i,j, we can relate it to the index-1 separating saddle point at which f attains the maximum in the critical component Ei j. See Figure 10 for an illustrative example. Interestingly, this shows that the number of local minima is always larger than the number of index-1 separating saddle points by one; that is, n = n 1. In light of these facts, we can relabel the index-1 separating saddle points x ℓfor ℓ= 0, 1, . . . , n with x 0 = , and the local minima x ℓfor ℓ= 0, 1, . . . , n 1 with x 0 = x , such that f(x 0) f(x 0) > f(x 1) f(x 1) . . . f(x n 1) f(x n 1), (23)

where f(x 0) f(x 0) = f( ) f(x ) = + . A detailed description of this bijection is given in (H erau et al., 2011, Proposition 5.2).

On Learning Rates and Schr odinger Operators

With the pairs (x ℓ, x ℓ) in place, we readily state the following fundamental result concerning the ﬁrst n 1 smallest positive eigenvalues of the Witten-Laplacian s f in (21). Recall that the nonconvex Morse function f satisﬁes the conﬁning condition and the Villani condition.

Proposition 19 (Theorem 1.2 in H erau et al. (2011)) Under Assumption 18 and the assumptions of Theorem 2, there exists s0 > 0 such that for any s (0, s0], the ﬁrst n 1 smallest positive eigenvalues of the Witten-Laplacian s f associated with f satisfy

δs,ℓ= s (γℓ+ o(s)) e 2(f(x ℓ) f(x ℓ)) s

for ℓ= 1, 1, . . . , n 1, where

γℓ= |ηd(x ℓ)| π

det( 2f(x ℓ)) det( 2f(x ℓ))

and ηd(x ℓ) is the unique negative eigenvalue of 2f(x ℓ).

Using Proposition 19 in conjunction with the simple relationship between the exponential decay constant and the spectrum of the Schr odinger operator/Witten-Laplacian (20), it is a stone s throw to prove Theorem 2 when f is generic. First, we give the deﬁnition of the Morse saddle barrier.

Deﬁnition 20 Let f satisfy the assumptions of Theorem 2. We call Hf = f(x 1) f(x 1) the Morse saddle barrier of f.

Proof [Proof of Theorem 2 in the generic case] By Proposition 19, we can set the exponential decay constant to

|ηd(x 1)| 2π

det( 2f(x 1)) det( 2f(x 1))

in Theorem 2. Taking α = 1

2 |ηd(x 1)| 2π det( 2f(x 1)) det( 2f(x 1)) 1

2 in (9), we complete the proof when f falls into the generic case.

However, the generic assumption for the labeling process is complex, leading to the lack of a geometric interpretation of the objective function required for the labeling process. To gain further insight, we present a simplifying assumption that is a special case of Assumption 18. This simpliﬁcation is due to Nier (2004).

Assumption 21 (Simpliﬁed generic case Nier (2004)) The objective functions f takes diﬀerent values at its local minima and index-1 separating saddle points. That is, letting x1 be a local minimum or an index-1 separating saddle point, and x2 likewise, then f(x1) = f(x2). Furthermore, the diﬀerences f(x ℓ1) f(x ℓ2) are distinct for any ℓ1 and ℓ2.

The following result follows immediately from Proposition 19.

Corollary 22 (Theorem 3.1 in Nier (2004)) Under Assumption 21 and the assumptions of Theorem 2, Proposition 19 holds. Therefore, Theorem 2 holds in this case.

Shi, Su, and Jordan

Michel s degenerate case. We say that a Morse function is degenerate if it satisﬁes the assumptions of Theorem 2 but not Assumption 18. To violate the generic assumption, for example, we can change the objective value f(x 3,1) to f(x 1,1) or change f(x 3,2) to f(x 2,3) in Figure 10. In this situation, the ﬁrst condition in Assumption 18 is not satisﬁed. Alternatively, if the objective value at x 3,1 is changed to f(x 2,1), the second condition in Assumption 18 is not met. Figure 11 presents an example of a degenerate Morse function. The main challenge in the degenerate case is the lack of uniqueness of the pairs (x ℓ, x ℓ) derived from the labeling process. Nevertheless, the uniqueness can be maintained if we work on the function values. Explicitly, the labeling process can be adapted to the degenerate case and still yields unique pairs (f(x ℓ), f(x ℓ)) obeying

f( ) f(x ) = f(x 0) f(x 0) > f(x 1) f(x 1) . . . f(x n 1) f(x n 1).

In particular, the number of local minima remains larger than that of index-1 separating saddle points by one in this case. The following result extends Proposition 19 to the degenerate case, which is adapted from Theorem 2.8 of Michel (2019).

E2 1 E2 2 E2 3 E2 4

Figure 11: A degenerate one-dimensional Morse function. The labeling of its index-1 separating saddle points x i,j and local minima x i,j is not unique. Nevertheless, the labeling process gives a unique one-to-one correspondence between the function values at the two types of points. See Figure 10 for a comparison.

Proposition 23 (Theorem 2.8 in Michel (2019)) Assume that the assumptions of Theorem 2 are satisﬁed but not Assumption 18. Then, there exists s0 > 0 such that for any

On Learning Rates and Schr odinger Operators

s (0, s0], the ﬁrst n 1 smallest positive eigenvalues of the Witten-Laplacian s f associated with f satisfy

δs,ℓ= s (γℓ+ o(s)) e 2Hf,ℓ

for ℓ= 1, . . . , n 1, where f(x ℓ) f(x ℓ) Hf,ℓ f(x 1) f(x ). The constants Hf,ℓand γℓall depend only on the function f.

Taken together, Proposition 19 and Proposition 23 yield a full proof of Theorem 2. As is clear, the Morse saddle barrier in Deﬁnition 20 for the degenerate case is set to Hf = Hf,1. For completeness, we remark that this result applies to Assumption 18, in which case we conclude that Hf,ℓ= f(x ℓ) f(x ℓ) and γℓis given the same as (24). As such, Proposition 19 is implied by Proposition 23.

7. Discussion

In this paper, we have presented a theoretical perspective on the convergence of SGD in nonconvex optimization as a function of the learning rate. Introducing the notion of an LRdependent SDE, we have leveraged advanced tools from the study of diﬀusions, in particular the spectral theory of diﬀusion operators, to analyze the dynamics of SGD in a continuoustime model. Our ﬁndings demonstare that, under certain regularity conditions, the solution to the SDE converges linearly to stationarity. Additionally, we have presented a concise expression for the linear rate of convergence, which transparently depend on the learning rate for nonconvex Morse functions. Our results show that the linear rate is a constant in the strongly convex case, whereas it decreases rapidly as the learning rate decreases in the nonconvex setting. We have thus uncovered a fundamental distinction between convex and nonconvex problems. As one implication, we note that noise in the gradients plays a more determinative role in stochastic optimization with nonconvex objectives as opposed to convex objectives. We also note that our results provide a justiﬁcation for the use of a large initial learning rate in training neural networks. We suggest several avenues for future research to enhance and extend the framework for analyzing stochastic optimization methods via SDEs. One area of particular interest is to explore optimization problems where the objective is not L-smooth.15 It would be intriguing to extend convex quadratic optimization to the inﬁnite-dimensional case, which involves unbounded linear operators. Such an extension would provide valuable insights into the convergence behaviors of the discrete SGD based on the dynamics of the LR-dependent SDE. It is worth noting that in the ﬁnite-dimensional case, gradient descent converges linearly to the convex quadratic function. This result is derived by taking the continuous gradient ﬂow as a perspective rather than relying on the error estimate from the numerical method (Proposition 8). With this in mind, it is reasonable to ask whether Theorem 3 can be improved to Ef(xk) f O(s + (1 λss)k).

It is important to highlight that for any learning rate s, there exists some τ > 0 such that when λ > τ, the sequence {(1 λs)k} k=0 diverges as k increases. This divergence exhibits a diﬀerent iteration behavior compared to e λt, so this direct analogy may not hold. However,

15. For a rigorous deﬁnition of L-smooth objective functions, please refer to (Shi et al., 2022, Section 1.4).

Shi, Su, and Jordan

in the linear case, considering the implicit discretization still works, it is possible to obtain the following bound in the inﬁnite-dimensional case:

Ef(xk) f O(s + (1 + λss) k).

More generally, it would be of interest to extend our results to SDEs with variable-dependence noise variance Dieuleveut et al. (2017); Chaudhari and Soatto (2018); Li et al. (2019a). To widen the scope of this framework, it is important to extend our results to the setting where the gradient noise is heavy-tailed Simsekli et al. (2019). Additionally, from a diﬀerent angle, it is noteworthy that (s/2) ρs in the Fokker Planck Smoluchowski equation (5) corresponds to vanishing viscosity in ﬂuid mechanics. Appendix B.3 presents several open problems from this viewpoint. Another potential direction that go beyond the scope of the L-smooth condition is to explore the optimization problems involving ﬁnite-dimensional objective functions with stronger nonlinearity such as the quartic function f = x 4. It is worth noting that while the continuous gradient ﬂow always converges, we cannot guarantee the convergence of the discrete gradient descent from arbitrary initial x0 Rd. From a diﬀerent perspective, we can view the quartic function f = x 4 as 2f(x) 2 L0 + L1 x 2 with L0 = 0 and L1 = 12, which can be seen as another natural generalization of L-smooth objective functions. Behind the remarkable success of deep learning in the industry, the variants of SGD widely used in practice are the Ada-series, including Adagrad Duchi et al. (2011), Adam Tieleman and Hinton (2012) and RMSProp Kingma and Ba (2014). Let us take the Adagrad as a representative example. In the deterministic case, the Adagrad can be written as follows:

xk+1 = xk s f(xk) s

i=0 f(xi) 2 . (25)

With the above generalized L-smooth condition, it is not hard to show the average of iterates of the Adagrad (25) converges as

f x0 + + xk 1

Furthermore, by performing some basic transformations, we can rewrite this equation (25) as xk+1 xk

s2 = f(xk) s

i=0 s2 f(xi) 2 .

Then, by taking the lowest-order continuous limit, we obtain the Adagrad ﬂow as

X = f(X) q R t 0 f(X) 2ds . (26)

By considering the error, f(X) f(x ), as a Lyapunov function, it becomes evident that the error decreases in (26). Furthermore, it is also possible to explore the evolution of the

On Learning Rates and Schr odinger Operators

probability density by introducing noise to the nonconvex objective function. In practical terms, it seems promising to extend our SDE-based analysis to various learning rate schedules used in practice in training deep neural networks, such as diminishing learning rate and cyclical learning rates Bottou et al. (2018); Smith (2017). We note also that our results could be useful in guiding the choice of hyperparameters of deep neural networks from an optimization viewpoint. For instance, recognizing the essence of the exponential decay constant λs in determining the convergence rate of SGD, it is of interest to consider how to choose the neural network architecture and the loss function so as to get a small value of the Morse saddle barrier Hf. Indeed, Nelson (1966) proposed a stochastic interpretation of quantum mechanics, indicating that the non-deterministic nature of quantum particles could be explained by a stochastic process similar to Brownian motion in classical mechanics. Moreover, the concept of stochastic quantization is introduced in (Parisi and Wu, 1981) to simulate the classical ﬁeld theory. Currently, based on the reverse viewpoint of stochastic quantization, the quantum algorithm has been explored to speed up the computation compared to local update Metropolis sampling as the ratio of the barrier height over the temperature ratio increases (Mazzola, 2021). Finally, we wonder if the LR-dependent SDE might give insights into generalization properties of neural networks such as local elasticity He and Su (2020) and implicit regularization Zhang et al. (2016); Gunasekar et al. (2018).

Acknowledgments

We would like to thank Zhuang Liu and Yu Sun for helpful conversations about the practical aspects of deep learning. We thank the action editor and three referees for their constructive comments that helped improve the presentation of this work. Bin Shi was supported by grant no.YSBR-034 of CAS. This work was also supported in part by NSF through CAREER DMS-1847415, CCF-1763314, CCF-1934876, and the Wharton Dean s Research Fund. In addition, we recognize support from the Mathematical Data Science program of the Oﬃce of Naval Research under grant number N00014-18-1-2764.

V. Arnol d. Geometrical Methods in the Theory of Ordinary Diﬀerential Equations. Springer Science & Business Media, 2012.

V. Arnol d. Mathematical Methods of Classical Mechanics. Springer Science & Business Media, 2013.

V. I. Arnol d and B. A. Khesin. Topological Methods in Hydrodynamics, volume 125. Springer Science & Business Media, 1999.

D. Bakry, I. Gentil, and M. Ledoux. Analysis and Geometry of Markov Diﬀusion Operators, volume 348. Springer Science & Business Media, 2013.

Shi, Su, and Jordan

Dominique Bakry, Franck Barthe, Patrick Cattiaux, and Arnaud Guillin. A simple proof of the Poincar e inequality for a large class of probability measures. Electronic Communications in Probability, 13:60 66, 2008.

Dominique Bakry, Ivan Gentil, and Michel Ledoux. Analysis and geometry of Markov diﬀusion operators, volume 103. Springer, Cham, 2014.

V. Bally and D. Talay. The law of the Euler scheme for stochastic diﬀerential equations: Ii. convergence rate of the density. Monte Carlo Methods and Applications, 2(2):93 128, 1996.

Y. Bengio. Practical recommendations for gradient-based training of deep architectures. In Neural Networks: Tricks of the Trade, pages 437 478. Springer, 2012.

Vladimir I Bogachev, Nikolai V Krylov, and Michael R ockner. Elliptic and parabolic equations for measures. Russian Mathematical Surveys, 64(6):973, 2009.

L. Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT 2010, pages 177 186. Springer, 2010.

L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223 311, 2018.

A. Bovier, V. Gayrard, and M. Klein. Metastability in reversible diﬀusion processes II: Precise asymptotics for small eigenvalues. Journal of the European Mathematical Society, 7(1):69 99, 2005.

K. Caluya and A. Halder. Gradient ﬂow algorithms for density propagation in stochastic systems. IEEE Transactions on Automatic Control, 2019.

P. Cannarsa and C. Sinestrari. Semiconcave Functions, Hamilton-Jacobi Equations, and Optimal Control. Springer Science & Business Media, 2004.

P. Chaudhari and S. Soatto. Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks. In 2018 Information Theory and Applications Workshop (ITA), pages 1 10. IEEE, 2018.

P. Chaudhari, A. Oberman, S. Osher, S. Soatto, and G. Carlier. Deep relaxation: partial diﬀerential equations for optimizing deep neural networks. Research in the Mathematical Sciences, 5(3):30, 2018.

G.-Q. Chen and H. Frid. Vanishing viscosity limit for initial-boundary value problems for conservation laws. Contemporary Mathematics, 238:35 51, 1999.

T.-S. Chiang, C.-R. Hwang, and S. J. Sheu. Diﬀusion for global optimization in Rn. SIAM Journal on Control and Optimization, 25(3):737 753, 1987.

A. Chorin and J. Marsden. A Mathematical Introduction to Fluid Mechanics. Springer, 1990.

On Learning Rates and Schr odinger Operators

M. Crandall and P.-L. Lions. Viscosity solutions of Hamilton-Jacobi equations. Transactions of the American Mathematical Society, 277(1):1 42, 1983.

M. Crandall, L. Evans, and P.-L. Lions. Some properties of viscosity solutions of Hamilton Jacobi equations. Transactions of the American Mathematical Society, 282(2):487 502, 1984.

D. Davis, D. Drusvyatskiy, and V. Charisopoulos. Stochastic algorithms with geometric step decay converge linearly on sharp functions. ar Xiv preprint ar Xiv:1907.09547, 2019.

J.-D. Deuschel and D. Stroock. Large deviations, volume 342. American Mathematical Society, 2001.

J. Diakonikolas and M. I. Jordan. Generalized momentum-based methods: A Hamiltonian perspective. ar Xiv preprint ar Xiv:1906.00436, 2019.

A. Dieuleveut, A. Durmus, and F. Bach. Bridging the gap between constant step size stochastic gradient descent and Markov chains. ar Xiv preprint ar Xiv:1707.06386, 2017.

John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of machine learning research, 12(7), 2011.

Andreas Eberle. Reﬂection couplings and contraction rates for diﬀusions. Probability theory and related ﬁelds, 166:851 886, 2016.

L. Evans. On solving certain nonlinear partial diﬀerential equations by accretive operator methods. Israel Journal of Mathematics, 36(3-4):225 247, 1980.

L. Evans. Partial Diﬀerential Equations (Second Edition), volume 19. American Mathematical Society, 2010.

L. Evans. An Introduction to Stochastic Diﬀerential Equations, volume 82. American Mathematical Society, 2012.

M. Freidlin and A. Wentzell. Random Perturbations of Dynamical Systems, volume 260. Springer Science & Business Media, 2012.

S. Gasiorowicz. Quantum Physics. John Wiley & Sons, 2007.

S. Gunasekar, J. Lee, D. Soudry, and N. Srebro. Characterizing implicit bias in terms of optimization geometry. In International Conference on Machine Learning, pages 1832 1841, 2018.

E. Hazan, A. Rakhlin, and P. Bartlett. Adaptive online gradient descent. In Advances in Neural Information Processing Systems, pages 65 72, 2008.

H. He and W. J. Su. The local elasticity of neural networks. In International Conference on Learning Representations (ICLR), 2020.

K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770 778, 2016.

Shi, Su, and Jordan

B. Helﬀer and F. Nier. Hypoelliptic estimates and spectral theory for Fokker-Planck operators and Witten Laplacians. Springer, 2005.

B. Helﬀer, M. Klein, and F. Nier. Quantitative analysis of metastability in reversible diﬀusion processes via a Witten complex approach. Mat. Contemp., 26:41 85, 2004.

Bernard Helﬀer. Spectral theory and its applications. Number 139. Cambridge University Press, 2013.

F. H erau, M. Hitrik, and J. Sj ostrand. Tunnel eﬀect and symmetries for Kramers Fokker Planck type operators. Journal of the Institute of Mathematics of Jussieu, 10(3):567 634, 2011.

Morris W Hirsch, Stephen Smale, and Robert L Devaney. Diﬀerential equations, dynamical systems, and an introduction to chaos. Academic press, 2012.

P. Hislop and I. Sigal. Introduction to Spectral Theory: With Applications to Schr odinger Operators, volume 113. Springer Science & Business Media, 2012.

C.-R. Hwang. Laplace s method revisited: weak convergence of probability measures. The Annals of Probability, pages 1177 1182, 1980.

S. Jastrzebski, Z. Kenton, D. Arpit, N. Ballas, A. Fischer, Y. Bengio, and A. Storkey. Three factors inﬂuencing minima in SGD. ar Xiv preprint ar Xiv:1711.04623, 2017.

S. Jastrzebski, Z. Kenton, N. Ballas, A. Fischer, Y. Bengio, and A. Storkey. On the relation between the sharpest directions of DNN loss and the SGD step length. ar Xiv preprint ar Xiv:1807.05031, 2018.

C. Jin, R. Ge, P. Netrapalli, S. M. Kakade, and M. I. Jordan. How to escape saddle points eﬃciently. In Proceedings of the 34th International Conference on Machine Learning, pages 1724 1732. JMLR. org, 2017.

M. I. Jordan. Dynamical, symplectic and stochastic perspectives on gradient-based optimization. In Proceedings of the International Congress of Mathematicians, Rio de Janeiro, volume 1, pages 523 550, 2018.

N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, P. Tang, and P. Tak. On largebatch training for deep learning: Generalization gap and sharp minima. ar Xiv preprint ar Xiv:1609.04836, 2016.

D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. ar Xiv preprint ar Xiv:1412.6980, 2014.

P. E. Kloeden and E. Platen. The approximation of multiple stochastic integrals. Stochastic Analysis and Applications, 10(4):431 441, 1992.

W. Krichene and P. L. Bartlett. Acceleration and averaging in stochastic descent dynamics. In Advances in Neural Information Processing Systems, pages 6796 6806, 2017.

On Learning Rates and Schr odinger Operators

A Krizhevsky. Learning multiple layers of features from tiny images. Master s thesis, University of Toronto, 2009.

P. Kundu, I. Cohen, and D. Dowling. Fluid Mechanics (Fourth Edition). Elsevier, 2008.

H. Kushner and G. G. Yin. Stochastic Approximation and Recursive Algorithms and Applications. Springer Science & Business Media, 2003.

H. J. Kushner. Asymptotic global behavior for stochastic approximation and diﬀusions with slowly decreasing noise eﬀects: global minimization via Monte Carlo. SIAM Journal on Applied Mathematics, 47(1):169 185, 1987.

J. D. Lee, M. Simchowitz, M. I. Jordan, and B. Recht. Gradient descent only converges to minimizers. In Conference on Learning Theory, pages 1246 1257, 2016.

Q. Li, C. Tai, and W. E. Stochastic modiﬁed equations and adaptive stochastic gradient algorithms. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 2101 2110. JMLR. org, 2017.

Q. Li, C. Tai, and W. E. Stochastic modiﬁed equations and dynamics of stochastic gradient algorithms i: Mathematical foundations. The Journal of Machine Learning Research, 20 (1):1474 1520, 2019a.

Y. Li, C. Wei, and T. Ma. Towards explaining the regularization eﬀect of initial large learning rate in training neural networks. In Advances in Neural Information Processing Systems, pages 11669 11680, 2019b.

Z. Li and S. Arora. An exponential learning rate schedule for deep learning. ar Xiv preprint ar Xiv:1910.07454, 2019.

P.-L. Lions. Generalized Solutions of Hamilton-Jacobi Equations, volume 69. London Pitman, 1982.

I. Loshchilov and F. Hutter. SGDR: Stochastic gradient descent with warm restarts. ar Xiv preprint ar Xiv:1608.03983, 2016.

S. Mandt, M. Hoﬀman, and D. Blei. A variational analysis of stochastic gradient algorithms. In International Conference on Machine Learning, pages 354 363, 2016.

P. A. Markowich and C. Villani. On the trend to equilibrium for the Fokker-Planck equation: An interplay between physics and functional analysis. In Physics and Functional Analysis, Matematica Contemporanea (SBM) 19, pages 1 29, 1999.

Guglielmo Mazzola. Sampling, rates, and reaction currents through reverse stochastic quantization on quantum computers. Physical Review A, 104(2):022431, 2021.

L. Michel. About small eigenvalues of Witten Laplacian. Pure and Applied Analysis, 1(2), 2019.

G. N. Mil shtein. Approximate integration of stochastic diﬀerential equations. Theory of Probability & Its Applications, 19(3):557 562, 1975.

Shi, Su, and Jordan

G. N. Mil shtein. Weak approximation of solutions of systems of stochastic diﬀerential equations. Theory of Probability & Its Applications, 30(4):750 766, 1986.

Edward Nelson. Derivation of the schr odinger equation from newtonian mechanics. Physical Review, 150(4):1079, 1966.

F. Nier. Quantitative analysis of metastability in reversible diﬀusion processes via a Witten complex approach. Journ ees Equations aux D eriv ees Partielles, pages 1 17, 2004.

E. Pardoux and D. Talay. Discretization and simulation of stochastic diﬀerential equations. Acta Applicandae Math, 3:23 47, 1985.

G. Parisi and Y. S. Wu. Perturbation theory without gauge ﬁxing. Scientia. Sinica, 24(4): 483 496, 1981.

G. Pavliotis. Stochastic Processes and Applications: Diﬀusion Processes, the Fokker Planck and Langevin Equations, volume 60. Springer, 2014.

M. Raginsky, A. Rakhlin, and M. Telgarsky. Non-convex learning via stochastic gradient Langevin dynamics: A nonasymptotic analysis. In Conference on Learning Theory, pages 1674 1703, 2017.

Michael Reed and Barry Simon. Methods of Modern Mathematical Physics IV: Analysis of Operators, volume 4. Elsevier, 1978.

B. Shi, S. Du, M. Jordan, and W. J. Su. Understanding the acceleration phenomenon via high-resolution diﬀerential equations. ar Xiv preprint ar Xiv:1810.08907, 2018.

Bin Shi, Simon S Du, Michael I Jordan, and Weijie J Su. Understanding the acceleration phenomenon via high-resolution diﬀerential equations. Mathematical Programming, 195 (1-2):79 148, 2022.

U. Simsekli, L. Sagun, and M. Gurbuzbalaban. A tail-index analysis of stochastic gradient noise in deep neural networks. ar Xiv preprint ar Xiv:1901.06053, 2019.

L. N. Smith. Cyclical learning rates for training neural networks. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 464 472. IEEE, 2017.

S. L. Smith, P.-J. Kindermans, C. Ying, and Q. V. Le. Don t decay the learning rate, increase the batch size. ar Xiv preprint ar Xiv:1711.00489, 2017.

M. Sordello and W. J. Su. Robust learning rate selection for stochastic optimization via splitting diagnostic. ar Xiv preprint ar Xiv:1910.08597, 2019.

W. J. Su, S. Boyd, and E. Cand es. A diﬀerential equation for modeling Nesterov s accelerated gradient method: theory and insights. The Journal of Machine Learning Research, 17(1):5312 5354, 2016.

R. Sun. Optimization for deep learning: theory and algorithms. ar Xiv preprint ar Xiv:1912.08957, 2019.

On Learning Rates and Schr odinger Operators

D. Talay. Analyse num erique des equations diﬀ erentielles stochastiques. Ph D thesis, Universit e Aix-Marseille I, 1982.

D. Talay. Eﬃcient numerical schemes for the approximation of expectations of functionals of the solution of a SDE and applications. In Filtering and Control of Random Processes, pages 294 313. Springer, 1984.

T. Tieleman and G. Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2):26 31, 2012.

C. Villani. Hypocoercive diﬀusion operators. In International Congress of Mathematicians, volume 3, pages 473 498, 2006.

C. Villani. Hypocoercivity. Memoirs of the American Mathematical Society, 202(950), 2009.

Martin J Wainwright and Michael I Jordan. A variational principle for graphical models. New Directions in Statistical Signal Processing, page 155, 2006.

A. Wibisono, A. C. Wilson, and M. I. Jordan. A variational perspective on accelerated methods in optimization. proceedings of the National Academy of Sciences, 113(47): E7351 E7358, 2016.

K. You, M. Long, J. Wang, and M. I. Jordan. How does learning rate decay help modern neural networks? ar Xiv preprint ar Xiv:1908.01878, 2019.

M. D. Zeiler. Adadelta: An adaptive learning rate method. ar Xiv preprint ar Xiv:1212.5701, 2012.

C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization. ar Xiv preprint ar Xiv:1611.03530, 2016.

Y. Zhang, P. Liang, and M. Charikar. A hitting time analysis of stochastic gradient Langevin dynamics. In Conference on Learning Theory, pages 1980 2022, 2017.

Appendix A. Technical Details for Sections 1 and 2

A.1 Approximating diﬀerential equations

Figure 12 presents a diagram that shows approximating surrogates for GD, SGD, and SGLD at multiple scales. In the case of SGD, for example, the inclusion of only O(1) terms leads to the ODE X = f(X), whereas the inclusion of up to O( s) terms leads to the LRdependent SDE (2). For GD and SGLD, O( s) terms are not found in the expansion as in the derivation of (2). The O( s)-approximation, therefore, leads to the same diﬀerential equation as the O(1)-approximation for both GD and SGLD.

Shi, Su, and Jordan

GD: xk+1 = xk s f(xk)

SGD: xk+1 = xk s f(xk) + sξk

SGLD: xk+1 = xk s f(xk) + sξk

Gradient ﬂow: X = f(X)

LR-dependent SDE: d X = f(X)dt + sd W

SDE: d X = f(X)dt + d W

O(1)-approximation

O( s)-approximation

O(1)-approximation

O( s)-approximation

O(1)-approximation

O( s)-approximation

Figure 12: Diagram showing the relationship between three discrete algorithms and their O(1)- approximating and O(1) + O( s)-approximating diﬀerential equations. Note that the inclusion of only O(1)-terms does not distinguish between GD and SGD.

A.2 Derivation of the Fokker Planck Smoluchowski equation

To derive the LR-dependent Fokker Planck Smoluchowski equation (5), we ﬁrst state the following lemma.

Lemma 24 (Itˆo s lemma) For any f C (Rd) and g C ([0, + ) Rd), let Xs(t) be the solution to the LR-dependent SDE (2). Then, we have

dg(t, Xs(t)) = g

From this lemma, we get

d E[g(t, Xs(t))|Xs(t )]

dt = E[g(t, Xs(t))|Xs(t )]

t f E[g(t, Xs(t))|Xs(t )]

2 E[g(t, Xs(t))|Xs(t )], (28)

for t t . Setting vs(t , x) = E[g(t, Xs(t))|Xs(t ) = x]. Since E[g(t, Xs(t))|Xs(t ) = x] is invariant with time t, from (28) we see that vs(t , x) satisﬁes the following diﬀerential equation: vs

2 vs, vs(t, x) = g(t, x). (29)

On Learning Rates and Schr odinger Operators

Recognizing the invariance of translation of time and letting us(t t , x) = vs(t , x), we can reduce (29) to the following backward Fokker Planck Smoluchowski equation:

t = f us + s

2 us, us(0, x) = g(t, x). (30)

Next, from the Chapman Kolmogorov equation, we get

ρs(t, x) = Z

Rd ρs(t, x|0, y)ρs(0, y)dy,

where ρs(t, x|0, y) = ρs(X(t) = x|X(0) = y) and by switching the order of the integration, we obtain Z

Rd us(0, x)ρs(t, x)dx = Z

Rd g(t, x)ρs(t, x)dx

Rd g(t, x) Z

Rd ρs(t, x|0, y)ρs(0, y)dy dx

Rd ρs(0, y)us(t, y)dy = Z

Rd ρs(0, x)us(t, x)dx. (31)

Making use of the backward Fokker Planck Smoluchowski equation (30) and switching the order of integration (31), we get Z

Rd us(0, x) ρs(t, x)

Rd us(t, x)

t=0 ρs(0, x)dx

Rd us(0, x) (ρs(0, x) f(x)) + s

2 ρs(0, x) dx.

Hence, we derive the forward Fokker Planck Smoluchowski equation at t = 0 for an arbitrary smooth function us(0, x) = g(t, x). Noting that t = 0 can be replaced by any time t, we complete the derivation of the Fokker Planck Smoluchowski equation.

A.3 The uniqueness of Gibbs invariant distribution

We begin by proving that the probability density µs is an invariant distribution of (5). Plugging

into (5) gives

(µs f) = µs f + µs f = 2

s f 2µs + ( f)µs (32)

sµs f. (33)

Combining (32) and (33) yields

Shi, Su, and Jordan

We now proceed to show that the probability density µs is unique. To derive a contradiction, we assume that there exists another distribution ϑs satisfying the Fokker Planck Smoluchowski equation: (ϑs f) + s

2 ϑs = 0. (34)

Write ϖs = ϑsµ 1 s and recall the operator Ls deﬁned in Section 5.1. We can rewrite (34) as Lsϖs = 0.

Using Lemma 11, we have

Rd(Lsϖs)ϖsdµs = s

Rd ϖs 2dµs 0.

Hence, ϖs must be a constant on Rd. Furthermore, since both µs and ϑs are probability densities, it must be the case that ϖs 1. In other words, ϑs is identical to µs. The proof is complete.

A.4 Proof of Lemma 2

Recall that Section 6.1 shows that the transition probability density ρs(t, x) in C1([0, + ), L2(µ 1 s )) governed by the Fokker Planck Smoluchowski equation (5) is equivalent to the function ψs(t, x) in C1([0, + ), L2(Rd)) governed by (18). Moreover, in Section 6.1, we have shown that the spectrum of the Schr odinger operator s + Vs satisﬁes

0 = ζs,0 < ζs,1 ζs,ℓ < + .

Since L2(Rd) is a Hilbert space, there exists a standard orthogonal basis corresponding to the spectrum of s + Vs:

µs = φs,0, φs,1, . . . , φs,ℓ, . . . L2(Rd).

Then, for any initialization ψs(0, x) L2(Rd), there exist constants cℓ(ℓ= 1, 2, . . .) such that

ψs(0, ) = µs +

ℓ=1 cℓφs,ℓ.

Thus, the solution to the partial diﬀerential equation (18) is

ψs(t, ) = µs +

ℓ=1 cℓe ζs,ℓtφs,ℓ.

Recognizing the transformation ψs(t, ) = ρs(t, )/ µs, we recover

ρs(t, ) = µs +

ℓ=1 cℓe ζs,ℓtφs,ℓ µs.

Note that ζs,ℓis positive for ℓ 1. Thus, the proof is ﬁnished.

On Learning Rates and Schr odinger Operators

Appendix B. Technical Details for Section 3

B.1 Proof of Lemma 7

Here, we prove Lemma 7 using the Bakry Emery theorem, which is a Poincar e-type inequality for µ-strongly convex functions. As a direct consequence of this lemma, the exponential decay constant for strongly convex objectives does not depend on the learning rate s and the ambient dimension d.

Lemma 25 (Bakry Emery theorem) Let f be an inﬁnitely diﬀerentiable function deﬁned on Rd. If f is µ-strongly convex, then the measure dµs satisﬁes the Poincar e-type inequality as in Lemma 12 with λs = µ; that is, for any smooth function h with a compact support, Z

Lemma 25 serves as the main technical tool in the proof of Lemma 7. Its proof is in Appendix B.1.1. Now, we prove the following result using Lemma 25.

Lemma 26 Under the same assumptions as in Lemma 7, ρs(t, ) converges to the Gibbs distribution µs in L2(µ 1 s ) at the rate

ρs(t, ) µs µ 1 s e µt ρs µs µ 1 s . (35)

Proof [Proof of Lemma 26] It follows from (14) that

d dt ρs(t, ) µs 2 µ 1 s = s Z

Rd hs 2dµs.

Next, using Lemma 25 and recognizing the equality R

Rd hsdµs = R

Rd ρs(t, x)dx = 1, we get

d dt ρs(t, ) µs 2 µ 1 s 2µ Z

Rd(hs 1)2dµs = 2µ ρs(t, ) µs 2 µ 1 s .

Integrating both sides yields (35), as desired.

Leveraging Lemma 26, we proceed to complete the proof of Lemma 7. Proof [Proof of Lemma 7] Using Lemma 26, we get

C(s)e µt ρ µs µ 1 s ,

where the ﬁrst inequality applies the Cauchy-Schwarz inequality and

Rd(f f )2µsdx 1

is an increasing function of s.

Shi, Su, and Jordan

B.1.1 Proof of Lemma 25

We introduce two operators Γs and Γs,2 that are built on top of the linear operator Ls deﬁned in (16). For any g1, g2 L2(µs), let

Γs(g1, g2) = 1

2 [Ls(g1g2) g1Lsg2 g2Lsg1] (36)

and Γs,2(g1, g2) = 1

2 [LsΓs(g1, g2) Γs(g1, Lsg2) Γs(g2, Lsg1)] . (37)

A simple relationship between the two operators is described in the following lemma.

Lemma 27 Under the same assumptions as in Lemma 25, for any g L2(µs) we have

Γs,2(g, g) µΓs(g, g).

Proof [Proof of Lemma 27] Note that

Ls(g1g2) = g1( f g2) g2( f g1) + s

2 (g1 g2 + g2 g1 + 2 g1 g2)

and g1Lsh2 = g1 f g2 + s

2g1 g2, g2Lsg1 = g2 f g1 + s

Then, the operator Γs must satisfy

Γs(g, g) = s

2 ( g g) . (38)

Next, together with the equality

1 2 ( g 2) = g ( g) + Tr[( 2g)T ( 2g)],

we obtain that the operator Γs,2 satisﬁes

Γs,2(g, g) = s

2( g)T 2f( g) + s2

4 Tr[( 2g)T ( 2g)], (39)

where Tr is the standard trace of a squared matrix. Recognizing that the objective f is µ-strongly convex, a comparison between (38) and (39) completes the proof.

Recall that hs(t, ) L2(µs) is the solution to the partial diﬀerential equation (15), with the initial condition hs(0, ) = h. Deﬁne

Λ1,s(t) = Z

Rd h2 s(t, )dµs. (40)

The following lemma considers the derivatives of Λ1,s(t).

On Learning Rates and Schr odinger Operators

Lemma 28 Under the same assumptions as in Lemma 25, we have

Λ1,s(t) = 2 Z

Rd Γs(hs, hs)dµs, Λ1,s(t) = 4 Z

Rd Γs,2(hs, hs)dµs. (41)

Proof [Proof of Lemma 28] Taking together (14) and (38), we have Z

Rd Γs(hs, hs)µsdµs = Z

Rd hs Lshsdµs.

Since hs(t, ) L2(µs) is the solution to the partial diﬀerential equation (15), we get

Λ1,s(t) = 2 Z

Rd hs Lshsdµs = 2 Z

Rd Γs(hs, hs)dµs.

Furthermore, by the deﬁnition of Γs,2 and integration by parts, we have16 Z

Rd Γs,2(hs, hs)dµs = Z

Rd(Lshs)2dµs.

From Lemma 11, we know that the linear operator Ls is self-adjoint. Then, we obtain the second derivative as

Λ1,s(t) = 2 Z

Rd(Lshs)2dµs + 2 Z

Rd hs L 2 s hsdµs = 4 Z

Rd Γs,2(hs, hs)dµs.

Finally, we complete the proof of Lemma 25. Proof [Proof of Lemma 25] Using Lemma 27 and Lemma 28, we obtain the following inequality: Λ1,s(t) 2µ Λ1,s(t). (42)

From the deﬁnition of Λ1,s(t), we have

Λ1,s(0) Λ1,s( ) = Z

where the second term on the right-hand side follows from Lemma 10 and Z

Rd hdµs = Z

Rd ρdx = 1.

By Lemma 10, we get hs( , ) 1, which together with (41) gives

Λ1,s(0) Λ1,s( ) = 2 Z

Rd Γs(h, h)dµs = s Z

The ﬁnal equality follows from (38). Integrating both sides of the inequality (42), we have

2µ (Λ1,s(0) Λ1,s( )) Λ1,s(0) Λ1,s( ),

which completes the proof.

16. See the calculation in Bakry et al. (2013).

Shi, Su, and Jordan

B.2 Proof of Proposition 8

By Lemma 2, let ρs(t, ) C1([0, + ), L2(µ 1 s )) denote the unique transition probability density of the solution to the LR-dependent SDE. Taking an expectation, we get

E[Xs(t)] = Z

Rd xρs(t, x)dx.

Hence, the uniqueness has been proved. Using the Cauchy Schwarz inequality and Lemma 13, we obtain:

Rd x(ρs(t, ) µs)dx +

2 e λst ρ µs µ 1 s + 1 < + ,

where the integrability R

Rd x 2µs(x)dx is due to the fact that the objective f satisﬁes the Villani condition. The existence of a global solution to the LR-dependent SDE (2) is thus established. For the strong convergence, the LR-dependent SDE (2) corresponds to the Milstein scheme in numerical methods. The original result is obtained by Milstein Mil shtein (1975) and Talay Talay (1982); Pardoux and Talay (1985), independently. We refer the readers to (Kloeden and Platen, 1992, Theorem 10.3.5 and Theorem 10.6.3), which studies numerical schemes for stochastic diﬀerential equation. For the weak convergence, we can obtain numerical errors by using both the Euler-Maruyama scheme and Milstein scheme. The original result is obtained by Milstein Mil shtein (1986) and Talay Pardoux and Talay (1985); Talay (1984) independently and (Kloeden and Platen, 1992, Theorem 14.5.2) is also a well-known reference. Furthermore, there exists a more accurate estimate of B(T) shown in Bally and Talay (1996). The original proofs in the aforementioned references only assume ﬁnite smoothness such as C6(Rd) for the objective function.

B.3 Connection with vanishing viscosity

Taking s = 0, the zero-viscosity steady-state equation of the Fokker Planck Smoluchowski equation (5) reads (µ0 f) = 0. (43)

A solution to this zero-viscosity steady-state equation takes the form

i=1 ciδ(x xi), with

i=1 ci = 1, (44)

where xi s are critical points of the objective f. As is clear, the solution is not unique. However, we have shown previously that the invariant distribution µs is unique and converges to µs 0(x) = δ(x x )

in the sense of distribution, which is a special case of (44). Clearly, when there exists more than one critical point, µs 0(x) is diﬀerent from µ0(x) in general. In contrast, µs 0(x)

On Learning Rates and Schr odinger Operators

and µ0(x) must be the same for (strictly) convex functions. In light of this comparison, the correspondences between the case s > 0 and the case s = 0 are fundamentally diﬀerent in nonconvex and convex problems. Next, we consider the rate of convergence in the convex setting. Let

where θ > 0. Plugging into the Fokker-Planck-Smoluchowski equation (5), we have

t = θ (xρs)

ρ(0, ) = ρ L2( p

sπ/θeθx2/s)). (45)

The solution to (45) is

θ πs (1 e 2θt) exp

For any φ(x) L2( p

sπ/θeθx2/s), we have

θ πs (1 e 2θt) exp

2θ x + x0e θt !+

φ x0e θt = D δ(x x0e θt), φ(x) E

as s 0, where δ(x x0e θt) denotes the solution to the following zero-viscosity equation

t = (ρ0 f) . (47)

Furthermore, using the following inequality φ

2θ x + x0e θt !

φ x0e θt O s ,

we get ρ(t, x), ψ(x) δ(x x0e θt), ψ(x) at the rate O ( s) for a test function ψ. The phenomenon presented above is called singular perturbation. It appears in mathematical models of boundary layer phenomena (Chorin and Marsden, 1990, Chapter 2.2, Example 1 and Example 2), WKB theory for Schr odinger equations (Gasiorowicz, 2007, Supplement 4A), KAM theory for circle diﬀeomorphisms (Arnol d, 2012, Chapter 2, Section 11) and that for Hamilton systems (Arnol d, 2013, Appendix 8). Moreover, the singular perturbation phenomenon shows that there exists a fundamental distinction between the O(1)-approximating ODE for SGD and the LR-dependent SDE (2). In particular, the learning rate s 0 in the Fokker Planck Smoluchowski equation (5) corresponds to vanishing viscosity. The vanishing viscosity phenomenon was originally observed in ﬂuid mechanics Chorin and Marsden (1990); Kundu et al. (2008), particularly in the degeneration of

Shi, Su, and Jordan

the Navier Stokes equation to the Euler equation Chen and Frid (1999). As a milestone, the vanishing viscosity method has been used to study the Hamilton Jacobi equation Crandall and Lions (1983); Evans (1980); Crandall et al. (1984). In fact, the Fokker Planck Smoluchowski equation (5) and its stationary equation are a form of Hamilton Jacobi equation with a viscosity term, for which the Hamiltonian is

H(x, ρ, ρ) = fρ + f ρ. (48)

The Hamiltonian (48) is diﬀerent from the classical case Lions (1982); Cannarsa and Sinestrari (2004); Evans (2010), which is generally nonlinear in ρ (cf. Burger s equation). Although the Hamiltonian depends linearly on ρ and ρ, the coeﬃcients depend on f and f. Hence, it is not reasonable to apply directly the well-established theory of Hamilton Jacobi equations Crandall and Lions (1983); Evans (1980); Crandall et al. (1984); Lions (1982); Cannarsa and Sinestrari (2004); Evans (2010) to the Fokker Planck Smoluchowski equation (5) and its stationary equation. Furthermore, for the aforementioned example, which proves the O( s) convergence for the Fokker Planck Smoluchowski equation with the quadratic potential f(x) = θ

2x2, is also a viscosity solution to the Hamilton Jacobi equation Crandall and Lions (1983), since the Hamiltonian (48) for the quadratic potential degenerates to

H(x, ρ, ρ) = 2tr(A)ρ + 2Ax ρ,

where f(x) = x T Ax and A is positive deﬁnite and symmetric. Thus, we remark that the general theory of viscosity solutions to Hamilton Jacobi equations cannot be used directly to prove the theorems in the main body of this paper.

In closing, we present several open problems.

Consider the stationary solution µs(x) to the Fokker Planck Smoluchowski equation (5). For a convex or strongly convex objective f with Lipschitz gradients, can we quantify the rate of convergence? Does the rate of convergence remain O( s)?

Let T > 0 be ﬁxed and consider the solution ρs(t, x) to the Fokker Planck Smoluchowski equation (5) in [0, T]. For a convex or strongly convex objective f with Lipschitz gradients, does the solution to the Fokker Planck Smoluchowski equation (5) converge to the solution to its zero-viscosity equation (47)? Is the rate of convergence still O( s)?

Consider the solution ρs(t, x) to the Fokker Planck Smoluchowski equation (5) in [0, + ). For a convex or strongly convex objective f with Lipschitz gradients, does the global solution to the Cauchy problem of the Fokker Planck Smoluchowski equation (5) converge to the solution of its zero-viscosity equation (47)? Is the rate of convergence still O( s)?

On Learning Rates and Schr odinger Operators

Appendix C. Technical Details for Section 5

C.1 Proof of Lemma 9

From the Cauchy Schwarz inequality, we get

Rd |g(x)|dx Z

Rd g2(x)e 2f(x)

This completes the proof.

C.2 Proof of Lemma 11

Recall that the linear operator Ls in (16) is deﬁned as

Note that we have Z

Rd (Lsg1) g2dµs = s

Rd( g1 g2)e 2f

Rd( g1 g2)dµs.

Therefore, Ls is self-adjoint in L2(µs) and is non-positive.

C.3 Proof of Lemma 12

For completeness, we show below the original proof of Theorem 12 from Villani (2009) in detail. Let Vs = f 2/s f, then for any h C c (Rd) with mean-zero condition Z

Rd hdµs = 0, (49)

we can obtain the following key inequality Deuschel and Stroock (2001) Z

Rd Vsh2dµs s Z

Rd h 2dµs. (50)

To show (50), note that

Rd(h2 f)e 2f

Rd h2 f 2e 2f

Recognizing µs e 2f

s , this proves (50). Let R0,s > 0 be large enough such that Vs(x) > 0 for x R0,s. For Rs > R0,s, we can deﬁne ϵs as

ϵs(Rs) := 1 inf{Vs(x) : x Rs}. (51)

Shi, Su, and Jordan

Then ϵ(Rs) 0 as Rs . Furthermore, we assume the Rs is large enough such that Z

From the key inequality (50), we obtain that Z

|x| Rs h2dµs ϵ(Rs) s Z

Rd h 2dµs ( inf x Rd Vs(x)) Z

Let BRs be the ball centered at the origin of radius Rs in Rd and deﬁne

# 1 µs1|x| Rs.

Using the Poincar e inequality in a bounded domain (Evans, 2010, Theorem 1, Chapter 5.8), we get Z

x Rd h2dµs,Rs s C(Rs) Z

x Rd h 2µs,Rsdµs,Rs + Z

x Rd hdµs,Rs

where C(Rs) is a constant depending on Rs. Furthermore, using the inequality (52), we obtain Z

x Rs h2dµs s C(Rs) Z

x Rs h 2dµs + 2

Making use of the mean-zero property of h, we have Z

x >Rs h2dµs. (55)

Combining (54) and (55), we get Z

x Rd h2dµs s C(Rs) Z

x Rd h 2dµs + 3 Z

x Rs h2dµs. (56)

Taking (53) and (56) together, we obtain Z

Rd h2dµs s[C(Rs) + 3ϵ(Rs)] Z

Rd h 2dµs 3( inf x Rd Vs(x)).ϵs(Rs) Z

x Rd h2dµs (57)

Apparently, from the deﬁnition of ϵs(x), we can select Rs > 0 large enough such that 1 + 3s( inf x Rd Vs(x))ϵ(Rs) > 0. Then, we can rewrite (57) as

2 2(C(Rs) + 3ϵ(Rs)) 1 + 3s( inf x Rd Vs(x))ϵ(Rs)

x Rd h 2dµs. (58)

Finally, using h R

Rd hdµs instead of h in the inequality (58), we prove the desired Poincar e inequality by taking

λs = 1 + 3s( inf x Rd Vs(x))ϵ(Rs)

2(C(Rs) + 3ϵ(Rs)) .

On Learning Rates and Schr odinger Operators

C.4 Proof of Lemma 14

For convenience, we introduce a shorthand:

Then, we can rewrite the derivative as

Next, we assume that ζk(x) = xke xα, where α < 1 is a ﬁxed positive constant and k = 1, 2. The facts that ζk(0) = 0 and lim x + ζk(x) = 0 give

lim s 0+ ζk g

Then, by Fatou s lemma, we get

0 lim inf s 0+ dϵ(s)

ds lim sup s 0+ dϵ(s)

= 2 lim sup s 0+

dx 2 lim inf s 0+

Rd lim sup s 0+

Rd lim inf s 0+ g s Π g

The proof is complete.