FLUCTUATION-DISSIPATION RELATIONS FOR STOCHASTIC GRADIENT DESCENT

Sho Yaida
Facebook AI Research, Facebook Inc.
Menlo Park, California 94025, USA
shoyaida@fb.com

ABSTRACT

The notion of the stationary equilibrium ensemble has played a central role in statistical mechanics. In machine learning as well, training serves as generalized equilibration that drives the probability distribution of model parameters toward stationarity. Here, we derive stationary fluctuation-dissipation relations that link measurable quantities and hyperparameters in the stochastic gradient descent algorithm. These relations hold exactly for any stationary state and can in particular be used to adaptively set training schedules. We can further use the relations to efficiently extract information pertaining to a loss-function landscape, such as the magnitudes of its Hessian and anharmonicity. Our claims are empirically verified.

1 INTRODUCTION

Equilibration rules the long-term fate of many macroscopic dynamical systems. For instance, as we pour water into a glass and let it be, the stationary state of tranquility is eventually attained. Zooming into the tranquil water with a microscope would reveal, however, a turmoil of stochastic fluctuations that maintain the apparent stationarity in balance. This is vividly exemplified by Brownian motion (Brown, 1828): a pollen grain immersed in water is constantly bombarded by jittery molecular movements, resulting in the macroscopically observable diffusive motion of the solute. Out of the effort to bridge microscopic and macroscopic realms through the Brownian movement came a prototype of fluctuation-dissipation relations (Einstein, 1905; Von Smoluchowski, 1906). These relations quantitatively link degrees of noisy microscopic fluctuations to smooth macroscopic dissipative phenomena and have since been codified in the linear response theory for physical systems (Onsager, 1931; Green, 1954; Kubo, 1957), a cornerstone of statistical mechanics.

Machine learning begets another form of equilibration. As a model learns patterns in data, its performance first improves and then plateaus, again reaching apparent stationarity. This dynamical process naturally comes equipped with stochastic fluctuations as well: often given data too gigantic to consume at once, training proceeds in small batches, and random selections of these mini-batches consequently give rise to the noisy dynamical excursion of the model parameters in the loss-function landscape, reminiscent of Brownian motion. It is thus natural to wonder if there exist analogous fluctuation-dissipation relations that quantitatively link the noise in mini-batched data to the observable evolution of the model performance and that in turn facilitate the learning process.

Here, we derive such fluctuation-dissipation relations for the stochastic gradient descent algorithm. The only assumption made is stationarity of the probability distribution that governs the model parameters at sufficiently long time. Our results thus apply to generic cases with non-Gaussian mini-batch noise and nonconvex loss-function landscapes. Practically, the first relation (FDR1) offers a metric for assessing equilibration and yields an adaptive algorithm that sets the learning-rate schedule on the fly.
The second relation (FDR2) further helps us determine the properties of the loss-function landscape, including the strength of its Hessian and the degree of anharmonicity, i.e., the deviation from the idealized harmonic limit of a quadratic loss surface and a constant noise matrix.

Our approach should be contrasted with recent attempts to import the machinery of stochastic differential calculus into the study of the stochastic gradient descent algorithm (Mandt et al., 2015; Li et al., 2015; Mandt et al., 2017; Li et al., 2017; Smith & Le, 2018; Chaudhari & Soatto, 2017; Jastrzebski et al., 2017; Zhu et al., 2018; An et al., 2018). This line of work assumes Gaussian noise and sometimes additionally employs the quadratic harmonic approximation for loss-function landscapes. The more severe drawback, however, is the use of the analogy with continuous-time stochastic differential equations, which is inconsistent in general (see Section 2.3.3). Instead, the stochastic gradient descent algorithm can be properly treated within the framework of the Kramers-Moyal expansion (Van Kampen, 1992; Gardiner, 2009; Risken, 1984; Radons et al., 1990; Leen & Moody, 1993).

The paper is organized as follows. In Section 2, after setting up notation and deriving a stationary fluctuation-dissipation theorem (FDT), we derive two specific fluctuation-dissipation relations. The first relation (FDR1) can be used to check stationarity and the second relation (FDR2) to delineate the shape of the loss-function landscape, as empirically borne out in Section 3. An adaptive scheduling method is proposed and tested in Section 3.3. We conclude in Section 4 with future outlooks.

2 FLUCTUATION-DISSIPATION RELATIONS

A model is parametrized by a weight coordinate, $\theta = \{\theta_i\}_{i=1,\ldots,P}$. The training set of $N_s$ examples is utilized by the model to learn patterns in the data, and the model's overall performance is evaluated by a full-batch loss function,
\[
f(\theta) \equiv \frac{1}{N_s} \sum_{\alpha=1}^{N_s} f_{\alpha}(\theta) ,
\]
with $f_{\alpha}(\theta)$ quantifying the performance of the model on a particular sample $\alpha$: the smaller the loss is, the better the model is expected to perform. The learning process can thus be cast as an optimization problem of minimizing the loss function. One of the most commonly used optimization schemes is the stochastic gradient descent (SGD) algorithm (Robbins & Monro, 1951), in which a mini-batch $\mathcal{B} \subset \{1, 2, \ldots, N_s\}$ of size $|\mathcal{B}|$ is stochastically chosen for training at each time step. Specifically, the update equation is given by
\[
\theta(t+1) = \theta(t) - \eta \nabla f^{\mathcal{B}}[\theta(t)] , \tag{1}
\]
where $\eta > 0$ is a learning rate and the mini-batch loss $f^{\mathcal{B}}(\theta) \equiv \frac{1}{|\mathcal{B}|} \sum_{\alpha \in \mathcal{B}} f_{\alpha}(\theta)$. Note that
\[
\langle\langle \nabla f^{\mathcal{B}}(\theta) \rangle\rangle_{\mathrm{m.b.}} = \nabla f(\theta) , \tag{2}
\]
with $\langle\langle \ldots \rangle\rangle_{\mathrm{m.b.}}$ denoting the average over mini-batch realizations. For later purposes, it is convenient to define a full two-point noise matrix $\widetilde{C}$ through [1]
\[
\widetilde{C}_{i,j}(\theta) \equiv \langle\langle \partial_i f^{\mathcal{B}}(\theta) \, \partial_j f^{\mathcal{B}}(\theta) \rangle\rangle_{\mathrm{m.b.}} \tag{3}
\]
and, more generally, higher-point noise tensors
\[
\widetilde{C}_{i_1, i_2, \ldots, i_k}(\theta) \equiv \langle\langle \partial_{i_1} f^{\mathcal{B}}(\theta) \, \partial_{i_2} f^{\mathcal{B}}(\theta) \cdots \partial_{i_k} f^{\mathcal{B}}(\theta) \rangle\rangle_{\mathrm{m.b.}} . \tag{4}
\]
Below, we shall not make any assumptions on the distribution of the noise vector $\nabla f^{\mathcal{B}}$ other than that mini-batches are independently and identically drawn from the $N_s$ training samples at each time step; the noise distribution is therefore allowed to have nontrivial higher connected moments indicative of non-Gaussianity.

[1] A connected noise covariant matrix, $C_{i,j}(\theta) \equiv \widetilde{C}_{i,j}(\theta) - [\partial_i f(\theta)][\partial_j f(\theta)]$, will not appear in the fluctuation-dissipation relations below, but it scales nicely with the mini-batch size as $1/|\mathcal{B}|$ (Li et al., 2017).
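To make the objects in equations (1)-(4) concrete, the following is a minimal numpy sketch (ours, not from the paper; the toy least-squares loss, variable names, and hyperparameter values are illustrative assumptions). It performs the SGD update (1) and estimates the two-point noise trace $\mathrm{Tr}\,\widetilde{C}$ of equation (3), using the identity $\mathrm{Tr}\,\widetilde{C}(\theta) = \langle\langle \|\nabla f^{\mathcal{B}}(\theta)\|^2 \rangle\rangle_{\mathrm{m.b.}}$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares problem: f_alpha(theta) = 0.5 * (x_alpha . theta - y_alpha)^2
Ns, P = 1000, 10                      # number of samples, number of parameters
X = rng.normal(size=(Ns, P))
y = X @ rng.normal(size=P) + 0.1 * rng.normal(size=Ns)

def grad_minibatch(theta, batch):
    """Gradient of the mini-batch loss f^B at theta."""
    Xb, yb = X[batch], y[batch]
    return Xb.T @ (Xb @ theta - yb) / len(batch)

def sgd_step(theta, eta, batch_size=100):
    """One SGD update, equation (1)."""
    batch = rng.choice(Ns, size=batch_size, replace=False)
    return theta - eta * grad_minibatch(theta, batch)

def trace_C_tilde(theta, batch_size=100, n_draws=500):
    """Monte Carlo estimate of Tr C~(theta) = <<|grad f^B|^2>>_m.b., equation (3)."""
    draws = [grad_minibatch(theta, rng.choice(Ns, size=batch_size, replace=False))
             for _ in range(n_draws)]
    return np.mean([g @ g for g in draws])

theta = rng.normal(size=P)
for t in range(2000):
    theta = sgd_step(theta, eta=0.01)
print("Tr C~ near the optimum:", trace_C_tilde(theta))
```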
It is empirically often observed that the performance of the model plateaus after some training through SGD. It is thus natural to hypothesize the existence of a stationary-state distribution, $p_{\mathrm{ss}}(\theta)$, that dictates the SGD sampling at long time (see Section 2.3.4 for a discussion of this assumption). For any observable quantity $\mathcal{O}(\theta)$ (something that can be measured during training, such as $\theta^2$ and $f(\theta)$), its stationary-state average is then defined as
\[
\langle \mathcal{O}(\theta) \rangle \equiv \int \mathrm{d}\theta \, p_{\mathrm{ss}}(\theta) \, \mathcal{O}(\theta) . \tag{5}
\]
In general, the probability distribution of the model parameters evolves as
\[
p(\theta, t+1) = \int \mathrm{d}\theta' \, p(\theta', t) \, \langle\langle \delta\left[\theta - \theta' + \eta \nabla f^{\mathcal{B}}(\theta')\right] \rangle\rangle_{\mathrm{m.b.}} ,
\]
and in particular for the stationary state
\[
\int \mathrm{d}\theta \, p_{\mathrm{ss}}(\theta) \, \mathcal{O}(\theta) = \int \mathrm{d}\theta \int \mathrm{d}\theta' \, p_{\mathrm{ss}}(\theta') \, \langle\langle \delta\left[\theta - \theta' + \eta \nabla f^{\mathcal{B}}(\theta')\right] \rangle\rangle_{\mathrm{m.b.}} \, \mathcal{O}(\theta) = \int \mathrm{d}\theta' \, p_{\mathrm{ss}}(\theta') \, \langle\langle \mathcal{O}\left[\theta' - \eta \nabla f^{\mathcal{B}}(\theta')\right] \rangle\rangle_{\mathrm{m.b.}} . \tag{6}
\]
Thus follows the master equation
\[
\langle \mathcal{O}(\theta) \rangle = \left\langle \langle\langle \mathcal{O}\left[\theta - \eta \nabla f^{\mathcal{B}}(\theta)\right] \rangle\rangle_{\mathrm{m.b.}} \right\rangle . \tag{FDT}
\]
In the next two subsections, we apply this general formula to simple observables in order to derive various stationary fluctuation-dissipation relations. Incidentally, the discrete version of the Fokker-Planck equation can be derived through the Kramers-Moyal expansion, considering the more general nonstationary version of the above equation and performing the Taylor expansion in $\eta$ and repeated integrations by parts (Van Kampen, 1992; Gardiner, 2009; Risken, 1984; Radons et al., 1990; Leen & Moody, 1993).

2.1 FIRST FLUCTUATION-DISSIPATION RELATION

Applying the master equation (FDT) to the linear observable $\theta$,
\[
\langle \theta \rangle = \left\langle \langle\langle \theta - \eta \nabla f^{\mathcal{B}}(\theta) \rangle\rangle_{\mathrm{m.b.}} \right\rangle = \langle \theta \rangle - \eta \langle \nabla f(\theta) \rangle . \tag{7}
\]
We thus have
\[
\langle \nabla f \rangle = 0 . \tag{8}
\]
This is natural because there is no particular direction that the gradient picks on average, as the model parameter stochastically bounces around the local minimum or, more generally, wanders around the loss-function landscape according to the stationary distribution.

Performing similar algebra for the quadratic observable $\theta_i \theta_j$ yields
\[
\langle \theta_i (\partial_j f) + (\partial_i f) \theta_j \rangle = \eta \langle \widetilde{C}_{i,j} \rangle . \tag{9}
\]
In particular, taking the trace of this matrix-form relation, we obtain
\[
\langle \theta \cdot \nabla f \rangle = \frac{\eta}{2} \langle \mathrm{Tr} \, \widetilde{C} \rangle . \tag{FDR1}
\]
More generally, in the case of SGD with momentum $\mu$ and dampening $\nu$, whose update equation is given by
\[
v(t+1) = \mu v(t) - (1 - \nu) \nabla f^{\mathcal{B}}[\theta(t)] , \tag{10}
\]
\[
\theta(t+1) = \theta(t) + \eta v(t+1) , \tag{11}
\]
a similar derivation yields (see Appendix A)
\[
\langle \theta \cdot \nabla f \rangle = \frac{(1 + \mu)}{2(1 - \nu)} \, \eta \, \langle v^2 \rangle . \tag{FDR1'}
\]
The last equation reduces to the equation (FDR1) when $\mu = \nu = 0$ with $v = -\nabla f^{\mathcal{B}}$. Also note that $\langle \theta \cdot \nabla f \rangle = \langle (\theta - \theta_c) \cdot \nabla f \rangle$ for an arbitrary constant vector $\theta_c$ because of the equation (8).

This first fluctuation-dissipation relation is easy to evaluate on the fly during training, holds exactly without any approximation if sampled well from the stationary distribution, and can thus be used as a standard metric to check if learning has plateaued, just as similar relations can be used to check equilibration in Monte Carlo simulations of physical systems (Santen & Krauth, 2000). [It should be cautioned, however, that fluctuation-dissipation relations are necessary but not sufficient to ensure stationarity (Odriozola & Berthier, 2011).] Such a metric can in turn be used to schedule changes in hyperparameters, as shall be demonstrated in Section 3.3.
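As a sanity check of (FDR1), here is a minimal numpy sketch (our illustration, continuing the self-contained toy least-squares setup above; all names and values are assumptions). It runs plain SGD until stationarity and compares the long-run averages of $\theta \cdot \nabla f^{\mathcal{B}}$ against $\frac{\eta}{2} \|\nabla f^{\mathcal{B}}\|^2$; their ratio should be close to one once equilibrated:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy least-squares problem (self-contained; same form as the previous sketch).
Ns, P, B, eta = 1000, 10, 100, 0.05
X = rng.normal(size=(Ns, P))
y = X @ rng.normal(size=P) + 0.1 * rng.normal(size=Ns)

def grad_minibatch(theta, batch):
    Xb, yb = X[batch], y[batch]
    return Xb.T @ (Xb @ theta - yb) / len(batch)

theta = rng.normal(size=P)
OL_vals, OR_vals = [], []
for t in range(50000):
    batch = rng.choice(Ns, size=B, replace=False)
    g = grad_minibatch(theta, batch)
    # O_L = theta . grad f^B;  O_R = (eta/2) |grad f^B|^2  (mu = nu = 0, v = -grad f^B)
    OL_vals.append(theta @ g)
    OR_vals.append(0.5 * eta * (g @ g))
    theta = theta - eta * g                     # SGD update, equation (1)

# Discard the first half as transient (cf. the half-running average of Section 3).
half = len(OL_vals) // 2
OL_bar = np.mean(OL_vals[half:])
OR_bar = np.mean(OR_vals[half:])
print("O_L_bar / O_R_bar =", OL_bar / OR_bar)   # close to 1 at stationarity
```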
2.2 SECOND FLUCTUATION-DISSIPATION RELATION

Applying the master equation (FDT) to the full-batch loss function and Taylor-expanding it in the learning rate $\eta$ yields the closed-form expression
\[
\langle f(\theta) \rangle = \left\langle \langle\langle f\left[\theta - \eta \nabla f^{\mathcal{B}}(\theta)\right] \rangle\rangle_{\mathrm{m.b.}} \right\rangle = \langle f \rangle - \eta \left\langle (\nabla f)^2 \right\rangle + \sum_{k=2}^{\infty} \frac{(-\eta)^k}{k!} \sum_{i_1, i_2, \ldots, i_k = 1}^{P} \left\langle F_{i_1, i_2, \ldots, i_k} \widetilde{C}_{i_1, i_2, \ldots, i_k} \right\rangle , \tag{12}
\]
where we recalled the equation (4) and introduced
\[
F_{i_1, i_2, \ldots, i_k}(\theta) \equiv \partial_{i_1} \partial_{i_2} \cdots \partial_{i_k} f(\theta) . \tag{13}
\]
In particular, $H_{i,j}(\theta) \equiv F_{i,j}(\theta)$ is the Hessian matrix. Reorganizing terms, we obtain
\[
\left\langle (\nabla f)^2 \right\rangle = \frac{\eta}{2} \left\langle \mathrm{Tr}\left( H \widetilde{C} \right) \right\rangle - \eta^2 \sum_{k=3}^{\infty} \frac{(-\eta)^{k-3}}{k!} \sum_{i_1, i_2, \ldots, i_k = 1}^{P} \left\langle F_{i_1, i_2, \ldots, i_k} \widetilde{C}_{i_1, i_2, \ldots, i_k} \right\rangle . \tag{FDR2}
\]
In the case of SGD with momentum and dampening, the left-hand side is replaced by $(1 - \nu) \langle (\nabla f)^2 \rangle - \mu \langle v \cdot \nabla f \rangle$ and $\widetilde{C}_{i_1, i_2, \ldots, i_k}$ by more hideous expressions (see Appendix A).

We can extract at least two types of information about the loss-function landscape by evaluating the dependence of the left-hand side, $G(\eta) \equiv \langle (\nabla f)^2 \rangle$, on the learning rate $\eta$. First, in the small-learning-rate regime, the value of $2G(\eta)/\eta$ approximates $\langle \mathrm{Tr}(H \widetilde{C}) \rangle$ around a local ravine. Second, nonlinearity of $G(\eta)$ at higher $\eta$ indicates discernible effects of anharmonicity. In such a regime, the Hessian matrix $H$ cannot be approximated as constant (which also implies that $\{F_{i_1, i_2, \ldots, i_k}\}_{k > 2}$ are nontrivial) and/or the noise two-point matrix $\widetilde{C}$ cannot be regarded as constant. Such nonlinearity in particular indicates the breakdown of the harmonic approximation, that is, the quadratic truncation of the loss-function landscape, often used to analyze the regime explored at small learning rates.
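To illustrate the leading-order content of (FDR2), here is a minimal numpy sketch (ours, not the paper's code; all names and values are assumptions). For a quadratic toy loss, $H$ is constant and $F_{i_1,\ldots,i_k} = 0$ for $k \geq 3$, so (FDR2) truncates exactly; the sketch compares the stationary average of $(\nabla f)^2$ with $\frac{\eta}{2} \langle \mathrm{Tr}(H \widetilde{C}) \rangle$, estimating the latter from the trajectory's own mini-batch draws via $\langle\langle g_{\mathcal{B}}^{\mathsf{T}} H g_{\mathcal{B}} \rangle\rangle_{\mathrm{m.b.}} = \mathrm{Tr}(H \widetilde{C})$:

```python
import numpy as np

rng = np.random.default_rng(2)

# Quadratic toy loss: H is constant and F_{k>2} = 0, so (FDR2) truncates exactly.
Ns, P, B, eta = 1000, 10, 100, 0.05
X = rng.normal(size=(Ns, P))
y = X @ rng.normal(size=P) + 0.1 * rng.normal(size=Ns)
H = X.T @ X / Ns                       # constant Hessian of the full-batch loss

def grad(theta, idx):
    Xs, ys = X[idx], y[idx]
    return Xs.T @ (Xs @ theta - ys) / len(idx)

theta = rng.normal(size=P)
lhs_vals, rhs_vals = [], []
for t in range(50000):
    batch = rng.choice(Ns, size=B, replace=False)
    g = grad(theta, batch)
    gf = grad(theta, np.arange(Ns))    # full-batch gradient (cheap in this toy)
    lhs_vals.append(gf @ gf)           # (grad f)^2
    rhs_vals.append(0.5 * eta * (g @ H @ g))   # (eta/2) g^T H g estimates (eta/2) Tr(H C~)
    theta = theta - eta * g

half = len(lhs_vals) // 2              # discard the transient first half
print("<(grad f)^2>      =", np.mean(lhs_vals[half:]))
print("(eta/2)<Tr(H C~)> =", np.mean(rhs_vals[half:]))
```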
2.3 REMARKS

2.3.1 INTUITION WITHIN THE HARMONIC APPROXIMATION

In order to gain some intuition about the fluctuation-dissipation relations, let us momentarily employ the harmonic approximation, i.e., assume that there is a local minimum of the loss function at $\theta = \theta^{\star}$ and retain only up to quadratic terms of the Taylor expansion around it:
\[
f(\theta) \approx f_0 + \frac{1}{2} \sum_{i,j=1}^{P} h_{i,j} (\theta_i - \theta^{\star}_i)(\theta_j - \theta^{\star}_j) .
\]
Within this approximation, $\langle \theta \cdot \nabla f \rangle = \langle (\theta - \theta^{\star}) \cdot \nabla f \rangle = 2 \langle f - f_0 \rangle$. The relation (FDR1) then becomes
\[
\langle f \rangle = f_0 + \frac{\eta}{4} \langle \mathrm{Tr} \, \widetilde{C} \rangle ,
\]
linking the height of the noise ball to the noise amplitude. This is in line with, for instance, Theorem 4.6 of the reference Bottou et al. (2018) and substantiates the analogy between SGD and simulated annealing, with the learning rate $\eta$ multiplied by $\mathrm{Tr} \, \widetilde{C}$ playing the role of temperature (Bottou, 1991).

2.3.2 HIGHER-ORDER RELATIONS

Additional relations can be derived by repeating similar calculations for higher-order observables. For example, at the cubic order,
\[
\langle \theta_i \theta_j (\partial_k f) + \theta_i (\partial_j f) \theta_k + (\partial_i f) \theta_j \theta_k \rangle = \eta \left\langle \theta_i \widetilde{C}_{j,k} + \theta_j \widetilde{C}_{k,i} + \theta_k \widetilde{C}_{i,j} \right\rangle - \eta^2 \left\langle \widetilde{C}_{i,j,k} \right\rangle . \tag{14}
\]
The systematic investigation of higher-order relations is relegated to future work.

2.3.3 SGD ≠ SDE

There is no limit in which SGD asymptotically reduces to the stochastic differential equation (SDE). In order to take such a limit with continuous time differential $\mathrm{d}t \to 0^+$, each SGD update must become infinitesimal. One may thus try $\mathrm{d}t \propto \eta \to 0^+$, as in recent work adopting the view that SGD = SDE (Mandt et al., 2015; Li et al., 2015; Mandt et al., 2017; Li et al., 2017; Smith & Le, 2018; Chaudhari & Soatto, 2017; Jastrzebski et al., 2017; Zhu et al., 2018; An et al., 2018). But this in turn forces the noise vector with zero mean, $\nabla f^{\mathcal{B}} - \nabla f$, to be multiplied by $\mathrm{d}t$. This is in contrast to the $\sqrt{\mathrm{d}t}$ scaling needed for the standard machinery of SDEs (Itô-Stratonovich calculus and all that) to apply; the additional factor of $\mathrm{d}t^{1/2}$ makes the effective noise covariance be suppressed by $\mathrm{d}t$, and the resulting equation in the continuous-time limit, if anything, would just be an ordinary differential equation without noise [2] [unless noise with the proper scaling is explicitly added, as in stochastic gradient Langevin dynamics (Welling & Teh, 2011; Teh et al., 2016) and natural Langevin dynamics (Marceau-Caron & Ollivier, 2017; Nado et al., 2018)]. In short, the recent work in effect identifies $\eta$ with $\mathrm{d}t$ and sends $\mathrm{d}t \to 0^+$ while pretending that $\eta$ is finite, which is inconsistent. This is not just a technical subtlety. When unjustifiably passing onto the continuous-time Fokker-Planck equation, the diffusive term is incorrectly governed by the connected two-point noise matrix $C_{i,j}(\theta) \equiv \widetilde{C}_{i,j}(\theta) - [\partial_i f(\theta)][\partial_j f(\theta)]$ rather than the full two-point noise matrix $\widetilde{C}_{i,j}(\theta)$ that appears herein. [3] We must instead employ the discrete-time version of the Fokker-Planck equation derived in the references Van Kampen (1992); Gardiner (2009); Risken (1984); Radons et al. (1990); Leen & Moody (1993), as has been followed in the equation (6).

[2] One may try to evade this by employing the $1/|\mathcal{B}|$-scaling of the connected noise covariant matrix, but that would then enforce $|\mathcal{B}| \to 0^+$ as $\mathrm{d}t \to 0^+$, which is unphysical.

[3] Heuristically, $(\nabla f)^2 \sim \eta H \widetilde{C}$ for small $\eta$ due to the relation (FDR2), and one may thus neglect the difference between $\widetilde{C}$ and $C$, and hence justify the naive use of the SDE, when $\eta H \ll 1$ and the Gaussian-noise assumption holds. In a similar vein, the reference Li et al. (2015) proves faster convergence between SGD and SDE when a correction term proportional to $\eta \nabla (\nabla f)^2$ is added to the gradient.

2.3.4 ON STATIONARITY

In contrast to statistical mechanics, where an equilibrium state is dictated by a handful of thermodynamic variables, in machine learning a stationary state generically depends not only on hyperparameters but also on a part of its learning history. The stationarity assumption made herein, which is codified in the equation (6), is weaker than the typicality assumption underlying statistical mechanics and can hold even in the presence of lingering memory. In the full-batch limit $|\mathcal{B}| = N_s$, for instance, any distribution delta-peaked at a local minimum is stationary. For sufficiently small learning rates $\eta$ as well, it is natural to expect multiple stationary distributions that form disconnected ponds around these minima, which merge upon increasing $\eta$ and fragment upon decreasing $\eta$.

It is beyond the scope of the present paper to formulate conditions under which stationary distributions exist. Indeed, if the formulation were too generic, there could be counterexamples to such a putative existence statement. A case in point is a model with the unregularized cross-entropy loss, whose model parameters keep cascading toward infinity in order to sharpen its softmax output (Neyshabur et al., 2014; 2017), with logarithmically diverging $\theta^2$ (Soudry et al., 2018). It would be interesting to see if there are any other nontrivial caveats.

3 EMPIRICAL TESTS

In this section we empirically bear out the theoretical claims of the last section. To this end, two simple models of supervised learning are used (see Appendix B for full specifications): a multilayer perceptron (MLP) learning patterns in the MNIST training data (LeCun et al., 1998) through SGD without momentum, and a convolutional neural network (CNN) learning patterns in the CIFAR-10 training data (Krizhevsky & Hinton, 2009) through SGD with momentum $\mu = 0.9$. For both models, the mini-batch size is set to $|\mathcal{B}| = 100$, and the training data are shuffled at each epoch, $t = \frac{N_s}{|\mathcal{B}|} \hat{t}_{\mathrm{epoch}}$ with $\hat{t}_{\mathrm{epoch}} \in \mathbb{N}$ (a minimal sketch of this sampling protocol follows below). In order to avoid the overfitting cascade mentioned in Section 2.3.4, the L2-regularization term $\frac{1}{2} \lambda \theta^2$ with the weight decay $\lambda = 0.01$ is included in the loss function $f$.
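As referenced above, the epoch-based shuffling can be sketched as follows (a minimal illustration of the sampling protocol, not the paper's code; the function and variable names are assumptions): each epoch permutes the $N_s$ sample indices once and consumes them in consecutive mini-batches of size $|\mathcal{B}|$.

```python
import numpy as np

rng = np.random.default_rng(3)

def epoch_minibatches(Ns, batch_size):
    """Yield mini-batch index arrays covering one epoch of shuffled data."""
    perm = rng.permutation(Ns)
    for start in range(0, Ns, batch_size):
        yield perm[start:start + batch_size]

# One epoch corresponds to Ns / |B| SGD time steps over a reshuffled training set.
Ns, B = 60000, 100
steps = sum(1 for _ in epoch_minibatches(Ns, B))
print("time steps per epoch:", steps)   # -> 600 for MNIST with |B| = 100
```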
Before proceeding further, let us define the half-running average of an observable $\mathcal{O}$ as
\[
\bar{\mathcal{O}}(t) \equiv \frac{1}{t - t_0} \sum_{t' = t_0 + 1}^{t} \mathcal{O}(t') \quad \text{with} \quad t_0 = \lfloor t/2 \rfloor . \tag{15}
\]
This is the average of the observable up to the time step $t$, with the initial half discarded as containing transients (a minimal code sketch of this estimator appears at the end of Section 3.1). If SGD drives the distribution of the model parameters to stationarity at long time, then
\[
\lim_{t \to \infty} \bar{\mathcal{O}}(t) = \langle \mathcal{O} \rangle . \tag{16}
\]

3.1 FIRST FLUCTUATION-DISSIPATION RELATION AND EQUILIBRATION

In order to assess the proximity to stationarity, define
\[
\mathcal{O}_{\mathrm{L}} \equiv \theta \cdot \nabla f^{\mathcal{B}} \quad \text{and} \quad \mathcal{O}_{\mathrm{R}} \equiv \frac{(1 + \mu)}{2(1 - \nu)} \, \eta \, v^2 \tag{17}
\]
(with $v$ replaced by $-\nabla f^{\mathcal{B}}$ for SGD without momentum). [4] Both of these observables can easily be measured on the fly at each time step during training and, according to the relation (FDR1'), the running averages of these two observables should converge to each other upon equilibration.

[4] If the model parameter $\theta$ happens to fluctuate around large values, for numerical accuracy one may want to replace $\mathcal{O}_{\mathrm{L}} = \theta \cdot \nabla f^{\mathcal{B}}$ by $(\theta - \theta_c) \cdot \nabla f^{\mathcal{B}}$, where a constant vector $\theta_c$ approximates the vector around which $\theta$ fluctuates at long time.

In order to verify this claim, we first train the model with the learning rate $\eta = 0.1$ for $\hat{t}^{\mathrm{total}}_{\mathrm{epoch}} = 100$ epochs, that is, for $t_{\mathrm{total}} = \frac{N_s}{|\mathcal{B}|} \hat{t}^{\mathrm{total}}_{\mathrm{epoch}} = 100 \frac{N_s}{|\mathcal{B}|}$ time steps. As shown in the figure 1, the observables $\bar{\mathcal{O}}_{\mathrm{L}}(t)$ and $\bar{\mathcal{O}}_{\mathrm{R}}(t)$ converge to each other.

Figure 1: Approaches toward stationarity during the initial trainings for the MLP on the MNIST data (a, $\mu = 0$) and for the CNN on the CIFAR-10 data (b, $\mu = 0.9$). Top panels depict the half-running average $\bar{f}^{\mathcal{B}}(t)$ (dark green) and the instantaneous value $f^{\mathcal{B}}(t)$ (light green) of the mini-batch loss. Bottom panels depict the convergence of the half-running averages of the observables $\mathcal{O}_{\mathrm{L}} = \theta \cdot \nabla f^{\mathcal{B}}$ and $\mathcal{O}_{\mathrm{R}} = \frac{(1+\mu)}{2(1-\nu)} \eta v^2$, whose stationary-state averages should agree according to the relation (FDR1').

We then take the model at the end of the initial 100-epoch training and sequentially train it further at various learning rates $\eta$ (see Appendix B). The observables $\bar{\mathcal{O}}_{\mathrm{L}}(t)$ and $\bar{\mathcal{O}}_{\mathrm{R}}(t)$ again converge to each other, as plotted in the figure 2. Note that the smaller the learning rate is, the longer it takes to equilibrate.

Figure 2: Approaches toward stationarity during the sequential runs for various learning rates $\eta$, seen through the half-running averages of the observables $\mathcal{O}_{\mathrm{L}} = \theta \cdot \nabla f^{\mathcal{B}}$ (solid) and $\mathcal{O}_{\mathrm{R}} = \frac{(1+\mu)}{2(1-\nu)} \eta v^2$ (dotted light-colored). They agree at sufficiently long times, but the relaxation time to reach such a stationary regime increases as the learning rate $\eta$ decreases.
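As referenced above, a minimal numpy helper (our illustration; names are assumptions) implementing the half-running average of equation (15): given the full history of an observable up to step $t$, it discards the first half and averages the rest.

```python
import numpy as np

def half_running_average(history):
    """Equation (15): average O(t') over t' = t0+1, ..., t with t0 = floor(t/2)."""
    t = len(history)
    t0 = t // 2
    return np.mean(history[t0:])

# Example: an observable with a decaying transient toward the stationary value 1.0.
steps = np.arange(1, 2001)
o = 1.0 + 50.0 / steps
print(half_running_average(o))   # close to 1.0, since the transient half is discarded
```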
3.2 SECOND FLUCTUATION-DISSIPATION RELATION AND SHAPE OF LOSS-FUNCTION LANDSCAPE

In order to assess the loss-function landscape information from the relation (FDR2), define
\[
\mathcal{O}_{\mathrm{FB}} \equiv (1 - \nu) (\nabla f)^2 - \mu \, v \cdot \nabla f^{\mathcal{B}} \tag{18}
\]
(with the second term nonexistent for SGD without momentum). [5] Note that $(\nabla f)^2$ is a full-batch, not mini-batch, quantity. Given its computational cost, here we measure this first term only at the end of each epoch and take the half-running average over these sparse sample points, discarding the initial half of the run.

[5] For the second term, in order to ensure that $\lim_{t \to \infty} \overline{v \cdot \nabla f^{\mathcal{B}}}(t) = \lim_{t \to \infty} \overline{v \cdot \nabla f}(t)$, we measure the half-running average of $v(t) \cdot \nabla f^{\mathcal{B}}[\theta(t)]$ and not $v(t+1) \cdot \nabla f^{\mathcal{B}}[\theta(t)]$.

The half-running average of the full-batch observable $\mathcal{O}_{\mathrm{FB}}$ at the end of sufficiently long training, which is a good proxy for $\langle \mathcal{O}_{\mathrm{FB}} \rangle$, is plotted in the figure 3 as a function of the learning rate $\eta$. As predicted by the relation (FDR2), at small learning rates $\eta$ the observable $\mathcal{O}_{\mathrm{FB}}$ depends on $\eta$ linearly and approaches zero; its slope, divided by $\langle \mathrm{Tr} \, \widetilde{C} \rangle$ if preferred, measures the magnitude of the Hessian matrix, component-wise averaged over directions in which the noise preferentially fluctuates. Meanwhile, nonlinearity at higher learning rates $\eta$ measures the degree of anharmonicity experienced over the distribution $p_{\mathrm{ss}}(\theta)$. We see that anharmonic effects are pronounced especially for the CNN on the CIFAR-10 data, even at moderately small learning rates. This invalidates the use of the quadratic harmonic approximation for the loss-function landscape and/or the assumption of a constant noise matrix for this model, except at very small learning rates.

Figure 3: The stationary-state average of the full-batch observable $\mathcal{O}_{\mathrm{FB}}$ as a function of the learning rate $\eta$, estimated through half-running averages. Dots and error bars denote mean values and 95% confidence intervals over several distinct runs, respectively. The straight red line connects the origin and the point with the smallest $\eta$ explored. (a) For the MLP on the MNIST data, linear dependence on $\eta$ for $\eta \lesssim 0.01$ supports the validity of the harmonic approximation there. (b) For the CNN on the CIFAR-10 data, anharmonicity is pronounced even down to $\eta \sim 0.001$.

3.3 FIRST FLUCTUATION-DISSIPATION RELATION AND LEARNING-RATE SCHEDULES

Saturation of the relation (FDR1) suggests stationarity of learning, at which point it might be wise to decrease the learning rate $\eta$. Such scheduling is often carried out in an ad hoc manner, but we can now algorithmize this procedure as follows (a minimal code sketch appears below):

1. Evaluate the half-running averages $\bar{\mathcal{O}}_{\mathrm{L}}(t)$ and $\bar{\mathcal{O}}_{\mathrm{R}}(t)$ at the end of each epoch.
2. If $\left| \bar{\mathcal{O}}_{\mathrm{L}}(t) / \bar{\mathcal{O}}_{\mathrm{R}}(t) - 1 \right| < X$, then decrease the learning rate as $\eta \to (1 - Y)\eta$ and also set $t = 0$ for the purpose of evaluating half-running averages.

Here, two scheduling hyperparameters $X$ and $Y$ are introduced, which control the threshold for saturation of the relation (FDR1) and the amount of decrease in the learning rate, respectively.

Plotted in the figure 4 are results for SGD without momentum, with the Xavier initialization (Glorot & Bengio, 2010) and training through (i) a preset training schedule with decrease of the learning rate by a factor of 10 for each 100 epochs, (ii) an adaptive scheduler with $X = 0.01$ (1% threshold) and $Y = 0.1$ (10% decrease), and (iii) the AMSGrad algorithm (J. Reddi et al., 2018) with the default hyperparameters. The adaptive scheduler attains accuracies comparable to those of the preset scheduling at long time and outperforms AMSGrad (see Appendix C for additional simulations).
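As referenced above, here is a minimal sketch of the adaptive scheduler (our illustration of steps 1-2; the class, method names, and training-loop callbacks are assumptions, not the paper's code):

```python
import numpy as np

class FDR1Scheduler:
    """Decrease eta by a factor (1 - Y) whenever |O_L_bar / O_R_bar - 1| < X."""

    def __init__(self, eta, X=0.01, Y=0.1):
        self.eta, self.X, self.Y = eta, X, Y
        self.o_l, self.o_r = [], []        # per-step histories of O_L and O_R

    def record(self, theta, grad_b, v, mu=0.0, nu=0.0):
        """Call at every SGD step; for momentum-less SGD pass v = -grad_b."""
        self.o_l.append(theta @ grad_b)
        self.o_r.append((1 + mu) / (2 * (1 - nu)) * self.eta * (v @ v))

    def end_of_epoch(self):
        """Call once per epoch; returns the (possibly decreased) learning rate."""
        if not self.o_l:
            return self.eta
        half = len(self.o_l) // 2          # half-running averages, equation (15)
        ol_bar = np.mean(self.o_l[half:])
        or_bar = np.mean(self.o_r[half:])
        if abs(ol_bar / or_bar - 1.0) < self.X:   # relation (FDR1') saturated
            self.eta *= (1.0 - self.Y)            # decrease the learning rate
            self.o_l, self.o_r = [], []           # reset t = 0 for the averages
        return self.eta
```

In a training loop, one would call `record(...)` after each mini-batch gradient evaluation and `end_of_epoch()` once per epoch, feeding the returned learning rate into subsequent updates.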
Figure 4: Comparison of the preset training schedule (black) and the adaptive training schedule (blue), employing SGD without momentum both for the MLP on the MNIST data (a) and the CNN on the CIFAR-10 data (b), along with the AMSGrad algorithm (green). From top to bottom, plotted are the learning rate $\eta$, the full-batch training loss $f$, and prediction accuracies on the training-set images (solid) and the 10000 test-set images (dashed).

These two scheduling methods span different subspaces of all possible schedules. The adaptive scheduling method proposed herein has a theoretical grounding and, in practice, much lower dimensionality for the tuning of scheduling hyperparameters than the presetting method, thus ameliorating the optimization of scheduling hyperparameters. A systematic comparison between the two scheduling methods for state-of-the-art architectures, and also a comparison with the AMSGrad algorithm for natural language processing tasks, could be a worthwhile avenue to pursue in the future.

4 CONCLUSION

In this paper, we have derived the fluctuation-dissipation relations with no assumptions other than stationarity of the probability distribution. These relations hold exactly even when the noise is non-Gaussian and the loss function is nonconvex. The relations have been empirically verified and used to probe the properties of the loss-function landscapes for the simple models. The relations have further resulted in an algorithm that adaptively sets the learning-rate schedule on the fly rather than presetting it in an ad hoc manner. In addition to systematically testing the performance of this adaptive scheduling algorithm, it would be interesting to investigate non-Gaussianity and nonconvexity in more detail through higher-point observables, both analytically and numerically. It would also be interesting to further elucidate the physics of machine learning by extending our formalism to incorporate nonstationary dynamics, linearly away from stationarity (Onsager, 1931; Green, 1954; Kubo, 1957) and beyond (Jarzynski, 1997; Crooks, 1999), so that it can in particular properly treat overfitting cascading dynamics and time-dependent sample distributions.

ACKNOWLEDGMENTS

The author thanks Ludovic Berthier, Léon Bottou, Guy Gur-Ari, Kunihiko Kaneko, Ari Morcos, Dheevatsa Mudigere, Yann Ollivier, Yuandong Tian, and Mark Tygert for discussions. Special thanks go to Daniel Adam Roberts, who prompted the practical application of the fluctuation-dissipation relations, leading to the adaptive method in Section 3.3.

REFERENCES

Jing An, Jianfeng Lu, and Lexing Ying. Stochastic modified equations for the asynchronous stochastic gradient descent. arXiv preprint arXiv:1805.08244, 2018.

Léon Bottou. Stochastic gradient learning in neural networks. Proceedings of Neuro-Nîmes, 91(8):12, 1991.

Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223-311, 2018.

Robert Brown. XXVII. A brief account of microscopical observations made in the months of June, July and August 1827, on the particles contained in the pollen of plants; and on the general existence of active molecules in organic and inorganic bodies. The Philosophical Magazine, 4(21):161-173, 1828.

Pratik Chaudhari and Stefano Soatto. Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks.
arXiv preprint arXiv:1710.11029, 2017.

Gavin E Crooks. Entropy production fluctuation theorem and the nonequilibrium work relation for free energy differences. Physical Review E, 60(3):2721-2726, 1999.

Albert Einstein. Über die von der molekularkinetischen Theorie der Wärme geforderte Bewegung von in ruhenden Flüssigkeiten suspendierten Teilchen. Annalen der Physik, 322(8):549-560, 1905.

Crispin Gardiner. Stochastic Methods, volume 4. Springer Berlin, 2009.

Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249-256, 2010.

Melville S Green. Markoff random processes and the statistical mechanics of time-dependent phenomena. II. Irreversible processes in fluids. The Journal of Chemical Physics, 22(3):398-413, 1954.

Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of Adam and beyond. In International Conference on Learning Representations, 2018.

Christopher Jarzynski. Nonequilibrium equality for free energy differences. Physical Review Letters, 78(14):2690-2693, 1997.

Stanislaw Jastrzebski, Zachary Kenton, Devansh Arpit, Nicolas Ballas, Asja Fischer, Yoshua Bengio, and Amos Storkey. Three factors influencing minima in SGD. arXiv preprint arXiv:1711.04623, 2017.

Andrej Karpathy. ConvNetJS CIFAR-10 demo. https://cs.stanford.edu/people/karpathy/convnetjs/demo/cifar10.html, 2014. Last accessed on 2018-09-25.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

Ryogo Kubo. Statistical-mechanical theory of irreversible processes. I. General theory and simple applications to magnetic and conduction problems. Journal of the Physical Society of Japan, 12(6):570-586, 1957.

Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.

Todd K Leen and John E Moody. Weight space probability densities in stochastic learning: I. Dynamics and equilibria. In Advances in Neural Information Processing Systems, pp. 451-458, 1993.

Chris Junchi Li, Lei Li, Junyang Qian, and Jian-Guo Liu. Batch size matters: A diffusion approximation framework on nonconvex stochastic gradient descent. arXiv preprint arXiv:1705.07562v1, 2017.

Qianxiao Li, Cheng Tai, and Weinan E. Stochastic modified equations and adaptive stochastic gradient algorithms. arXiv preprint arXiv:1511.06251, 2015.

Stephan Mandt, Matthew D Hoffman, and David M Blei. Continuous-time limit of stochastic gradient descent revisited. In 8th NIPS Workshop on Optimization for Machine Learning, 2015.

Stephan Mandt, Matthew D Hoffman, and David M Blei. Stochastic gradient descent as approximate Bayesian inference. The Journal of Machine Learning Research, 18(1):4873-4907, 2017.

Gaëtan Marceau-Caron and Yann Ollivier. Natural Langevin dynamics for neural networks. In International Conference on Geometric Science of Information, pp. 451-459. Springer, 2017.

Zachary Nado, Jasper Snoek, Roger Grosse, David Duvenaud, Bowen Xu, and James Martens. Stochastic gradient Langevin dynamics that exploit neural network structure, 2018. URL https://openreview.net/forum?id=ry-Se9kvG.

Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro.
In search of the real inductive bias: On the role of implicit regularization in deep learning. arXiv preprint arXiv:1412.6614, 2014.

Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nati Srebro. Exploring generalization in deep learning. In Advances in Neural Information Processing Systems, pp. 5947-5956, 2017.

Gerardo Odriozola and Ludovic Berthier. Equilibrium equation of state of a hard sphere binary mixture at very large densities using replica exchange Monte Carlo simulations. The Journal of Chemical Physics, 134(5):054504, 2011.

Lars Onsager. Reciprocal relations in irreversible processes. I. Physical Review, 37(4):405-426, 1931.

Günter Radons, Heinz Georg Schuster, and D Werner. Fokker-Planck description of learning in backpropagation networks. In International Neural Network Conference, pp. 993-996. Springer, 1990.

Hannes Risken. The Fokker-Planck Equation: Methods of Solution and Applications. Springer, 1984.

Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400-407, 1951.

Ludger Santen and Werner Krauth. Absence of thermodynamic phase transition in a model glass former. Nature, 405(6786):550-551, 2000.

Samuel L Smith and Quoc V Le. A Bayesian perspective on generalization and stochastic gradient descent. arXiv preprint arXiv:1710.06451, 2018.

Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, and Nathan Srebro. The implicit bias of gradient descent on separable data. In International Conference on Learning Representations, 2018.

Yee Whye Teh, Alexandre H Thiery, and Sebastian J Vollmer. Consistency and fluctuations for stochastic gradient Langevin dynamics. The Journal of Machine Learning Research, 17(1):193-225, 2016.

Nicolaas Godfried Van Kampen. Stochastic Processes in Physics and Chemistry, volume 1. Elsevier, 1992.

Marian Von Smoluchowski. Zur kinetischen Theorie der Brownschen Molekularbewegung und der Suspensionen. Annalen der Physik, 326(14):756-780, 1906.

Max Welling and Yee W Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning, pp. 681-688, 2011.

Zhanxing Zhu, Jingfeng Wu, Bing Yu, Lei Wu, and Jinwen Ma. The anisotropic noise in stochastic gradient descent: Its behavior of escaping from minima and regularization effects. arXiv preprint arXiv:1803.00195, 2018.

A SGD WITH MOMENTUM AND DAMPENING

For SGD with momentum $\mu$ and dampening $\nu$, the update equation is given by
\[
v(t+1) = \mu v(t) - (1 - \nu) \nabla f^{\mathcal{B}}[\theta(t)] , \tag{19}
\]
\[
\theta(t+1) = \theta(t) + \eta v(t+1) . \tag{20}
\]
Here $v = \{v_i\}_{i=1,\ldots,P}$ is the velocity and $\eta > 0$ the learning rate; SGD without momentum is the special case with $\mu = 0$. Again hypothesizing the existence of a stationary-state distribution $p_{\mathrm{ss}}(\theta, v)$, the stationary-state average of an observable $\mathcal{O}(\theta, v)$ is defined as
\[
\langle \mathcal{O}(\theta, v) \rangle \equiv \int \mathrm{d}\theta \, \mathrm{d}v \, p_{\mathrm{ss}}(\theta, v) \, \mathcal{O}(\theta, v) . \tag{21}
\]
Just as in the main text, from the assumed stationarity follows the master equation for SGD with momentum and dampening,
\[
\langle \mathcal{O}(\theta, v) \rangle = \left\langle \langle\langle \mathcal{O}\left( \theta + \eta \left[ \mu v - (1 - \nu) \nabla f^{\mathcal{B}}(\theta) \right] , \; \mu v - (1 - \nu) \nabla f^{\mathcal{B}}(\theta) \right) \rangle\rangle_{\mathrm{m.b.}} \right\rangle . \tag{22}
\]
For the linear observables,
\[
\langle v \rangle = \mu \langle v \rangle - (1 - \nu) \langle \nabla f(\theta) \rangle \tag{23}
\]
and
\[
\langle \theta \rangle = \langle \theta \rangle + \eta \left[ \mu \langle v \rangle - (1 - \nu) \langle \nabla f(\theta) \rangle \right] = \langle \theta \rangle + \eta \langle v \rangle , \tag{24}
\]
thus
\[
\langle v \rangle = 0 \quad \text{and} \quad \langle \nabla f \rangle = 0 . \tag{25}
\]
For the quadratic observables,
\[
\langle v_i v_j \rangle = \mu^2 \langle v_i v_j \rangle + (1 - \nu)^2 \langle \widetilde{C}_{i,j} \rangle - (1 - \nu) \mu \left[ \langle v_i (\partial_j f) \rangle + \langle (\partial_i f) v_j \rangle \right] , \tag{26}
\]
\[
\langle v_i \theta_j \rangle - \eta \langle v_i v_j \rangle = \mu \langle v_i \theta_j \rangle - (1 - \nu) \langle (\partial_i f) \theta_j \rangle , \tag{27}
\]
and
\[
(1 - \nu) \left[ \langle \theta_i (\partial_j f) \rangle + \langle (\partial_i f) \theta_j \rangle \right] - \mu \left( \langle \theta_i v_j \rangle + \langle v_i \theta_j \rangle \right) = \eta \langle v_i v_j \rangle . \tag{28}
\]
Note that the relations (26) and (27) are trivially satisfied at each time step if the left-hand-side observables are evaluated one step ahead, and thus their being satisfied for running averages has nothing to do with equilibration [the same can be said about the relation (23)]; the only nontrivial relation is the equation (28), which is a consequence of setting $\langle \theta_i \theta_j \rangle$ constant in time. After taking traces and some rearrangement, we obtain the relation (FDR1') in the main text.

For the full-batch loss function, algebra similar to the one in the main text yields
\[
(1 - \nu) \left\langle (\nabla f)^2 \right\rangle - \mu \langle v \cdot \nabla f \rangle = \frac{\eta}{2} \sum_{i,j=1}^{P} \left\langle H_{i,j} \left\{ (1 - \nu)^2 \widetilde{C}_{i,j} - \mu (1 - \nu) \left[ v_i (\partial_j f) + (\partial_i f) v_j \right] + \mu^2 v_i v_j \right\} \right\rangle + O(\eta^2) .
\]

B MODELS AND SIMULATION PROTOCOLS

B.1 MLP ON MNIST THROUGH SGD WITHOUT MOMENTUM

The MNIST training data consist of $N_s = 60000$ black-white images of hand-written digits with 28-by-28 pixels (LeCun et al., 1998). We preprocess the data through an affine transformation such that their mean and variance (over both the training data and pixels) are zero and one, respectively.

Our multilayer perceptron (MLP) consists of a 784-dimensional input layer followed by a hidden layer of 200 neurons with ReLU activations, another hidden layer of 200 neurons with ReLU activations, and a 10-dimensional output layer with the softmax activation. The model performance is evaluated by the cross-entropy loss supplemented by the L2-regularization term $\frac{1}{2} \lambda \theta^2$ with the weight decay $\lambda = 0.01$. (A code sketch of this architecture appears at the end of this subsection.)

Throughout the paper, the MLP is trained on the MNIST data through SGD without momentum. The data are shuffled at each epoch with the mini-batch size $|\mathcal{B}| = 100$. The MLP is initialized through the Xavier method (Glorot & Bengio, 2010) and trained for $\hat{t}^{\mathrm{total}}_{\mathrm{epoch}} = 100$ epochs with the learning rate $\eta = 0.1$. We then sequentially train it with $(\eta, \hat{t}^{\mathrm{total}}_{\mathrm{epoch}}) = (0.05, 500) \to (0.02, 500) \to (0.01, 500) \to (0.005, 1000) \to (0.003, 1000)$. This sequential-run protocol is carried out with 4 distinct seeds for the random-number generator used in data shuffling, all starting from the common model parameters attained at the end of the initial 100-epoch run. The figure 2 depicts trajectories for one particular seed, while the figure 3 plots means and error bars over these distinct seeds.
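As referenced above, a PyTorch sketch of the MLP just described (ours, for concreteness: the architecture, mini-batch size, weight decay, and Xavier initialization follow the text, while the remaining training-loop details are assumptions; note that SGD's `weight_decay=0.01` adds $\lambda\theta$ to the gradient, which matches including $\frac{1}{2}\lambda\theta^2$ in the loss with $\lambda = 0.01$):

```python
import torch
import torch.nn as nn

# MLP from Appendix B.1: 784 -> 200 -> 200 -> 10 with ReLU hidden activations.
# The softmax is folded into nn.CrossEntropyLoss, which expects raw logits.
mlp = nn.Sequential(
    nn.Flatten(),
    nn.Linear(784, 200), nn.ReLU(),
    nn.Linear(200, 200), nn.ReLU(),
    nn.Linear(200, 10),
)

# Xavier initialization for the weight matrices.
for m in mlp.modules():
    if isinstance(m, nn.Linear):
        nn.init.xavier_uniform_(m.weight)
        nn.init.zeros_(m.bias)

loss_fn = nn.CrossEntropyLoss()
# weight_decay = 0.01 implements the (1/2) * lambda * theta^2 term with lambda = 0.01.
optimizer = torch.optim.SGD(mlp.parameters(), lr=0.1, weight_decay=0.01)

# One SGD step on a dummy mini-batch of |B| = 100 images.
x = torch.randn(100, 1, 28, 28)             # stand-in for preprocessed MNIST images
labels = torch.randint(0, 10, (100,))
optimizer.zero_grad()
loss = loss_fn(mlp(x), labels)
loss.backward()
optimizer.step()
```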
B.2 CNN ON CIFAR-10 THROUGH SGD WITH MOMENTUM

The CIFAR-10 training data consist of $N_s = 50000$ color images of objects, divided into ten categories, with 32-by-32 pixels in each of 3 color channels, each pixel ranging in $[0, 1]$ (Krizhevsky & Hinton, 2009). We preprocess the data by uniformly subtracting 0.5 and multiplying by 2, so that each pixel ranges in $[-1, 1]$.

In order to describe the architecture of our convolutional neural network (CNN) in detail, let us associate a tuple $[F, C, S, P; M]$ to a convolutional layer with filter width $F$, a number of channels $C$, stride $S$, and padding $P$, followed by ReLU activations and a max-pooling layer of width $M$. Then, as in the demo at Karpathy (2014), our CNN consists of a $(32, 32, 3)$ input layer followed by a convolutional layer with $[5, 16, 1, 2; 2]$, another convolutional layer with $[5, 20, 1, 2; 2]$, yet another convolutional layer with $[5, 20, 1, 2; 2]$, and finally a fully-connected 10-dimensional output layer with the softmax activation. The model performance is evaluated by the cross-entropy loss supplemented by the L2-regularization term $\frac{1}{2} \lambda \theta^2$ with the weight decay $\lambda = 0.01$.

Throughout the paper (except in Section 3.3, where the adaptive scheduling method is tested for SGD without momentum), the CNN is trained on the CIFAR-10 data through SGD with momentum $\mu = 0.9$ and dampening $\nu = 0$. The data are shuffled at each epoch with the mini-batch size $|\mathcal{B}| = 100$. The CNN is initialized through the Xavier method (Glorot & Bengio, 2010) and trained for $\hat{t}^{\mathrm{total}}_{\mathrm{epoch}} = 100$ epochs with the learning rate $\eta = 0.1$. We then sequentially train it with $(\eta, \hat{t}^{\mathrm{total}}_{\mathrm{epoch}}) = (0.05, 200) \to (0.02, 200) \to (0.01, 200) \to (0.005, 400) \to (0.003, 400) \to (0.002, 400) \to (0.0015, 400) \to (0.001, 400) \to (0.0005, 800) \to (0.00025, 800) \to (0.0001, 800)$. At each junction of the sequence, the velocity $v$ is zeroed. This sequential-run protocol is carried out with 16 distinct seeds for the random-number generator used in data shuffling, all starting from the common model parameters attained at the end of the initial 100-epoch run. The figure 2 depicts trajectories for one particular seed, while the figure 3 plots means and error bars over these distinct seeds.

C ADDITIONAL SIMULATIONS

C.1 ADAM VERSUS AMSGRAD

Plotted in the figure S1 are comparisons between the Adam (Kingma & Ba, 2014) and AMSGrad (J. Reddi et al., 2018) algorithms with the default hyperparameters $\alpha = 10^{-3}$, $(\beta_1, \beta_2) = (0.9, 0.999)$, and $\epsilon = 10^{-8}$. The AMSGrad algorithm marginally outperforms the Adam algorithm for the tasks at hand, and thus the results with AMSGrad are presented in the main text.

Figure S1: Comparison of the AMSGrad (green) and Adam (orange) algorithms for the MLP on the MNIST data (a) and the CNN on the CIFAR-10 data (b). Top rows plot the full-batch training loss $f$, while bottom rows plot prediction accuracies on the training-set images (solid) and the 10000 test-set images (dashed).

C.2 INITIAL ACCURACY GAIN WITH DIFFERENT SCHEDULING HYPERPARAMETERS

In the figure 4(a) for the MNIST classification task with the MLP, the proposed adaptive method with the scheduling hyperparameters $X = 0.01$ and $Y = 0.1$ outperforms the AMSGrad algorithm in terms of accuracy attained at long time and also exhibits quick initial convergence. In the figure 4(b) for the CIFAR-10 classification task with the CNN, however, while the proposed adaptive method attains better accuracy at long time, its initial accuracy gain is visibly slower than that of the AMSGrad algorithm. This lag in initial accuracy gain can be ameliorated by choosing another combination of the scheduling hyperparameters, e.g., $X = 0.1$ and $Y = 0.3$, at the expense of a degradation in generalization accuracy with respect to the original choice $X = 0.01$ and $Y = 0.1$. See the figure S2.

Figure S2: Comparison of the preset training schedule (black) and the adaptive training schedule (purple), now with the scheduling hyperparameters $X = 0.1$ and $Y = 0.3$, employing SGD without momentum, and the AMSGrad algorithm (green), for the CNN on the CIFAR-10 data with the same initial seed as in the main text (a) and three different initial seeds (b-d). From top to bottom, plotted are the learning rate $\eta$, the full-batch training loss $f$, and prediction accuracies on the training-set images (solid) and the 10000 test-set images (dashed).