Published as a conference paper at ICLR 2017

ENTROPY-SGD: BIASING GRADIENT DESCENT INTO WIDE VALLEYS

Pratik Chaudhari1, Anna Choromanska2, Stefano Soatto1, Yann LeCun3,4, Carlo Baldassi5, Christian Borgs6, Jennifer Chayes6, Levent Sagun3, Riccardo Zecchina5
1 Computer Science Department, University of California, Los Angeles
2 Department of Electrical and Computer Engineering, New York University
3 Courant Institute of Mathematical Sciences, New York University
4 Facebook AI Research, New York
5 Dipartimento di Scienza Applicata e Tecnologia, Politecnico di Torino
6 Microsoft Research New England, Cambridge
Email: pratikac@ucla.edu, ac5455@nyu.edu, soatto@ucla.edu, yann@cs.nyu.edu, carlo.baldassi@polito.it, borgs@microsoft.com, jchayes@microsoft.com, sagun@cims.nyu.edu, riccardo.zecchina@polito.it

ABSTRACT

This paper proposes a new optimization algorithm called Entropy-SGD for training deep neural networks that is motivated by the local geometry of the energy landscape. Local extrema with low generalization error have a large proportion of almost-zero eigenvalues in the Hessian with very few positive or negative eigenvalues. We leverage upon this observation to construct a local-entropy-based objective function that favors well-generalizable solutions lying in large flat regions of the energy landscape, while avoiding poorly-generalizable solutions located in the sharp valleys. Conceptually, our algorithm resembles two nested loops of SGD where we use Langevin dynamics in the inner loop to compute the gradient of the local entropy before each update of the weights. We show that the new objective has a smoother energy landscape and show improved generalization over SGD using uniform stability, under certain assumptions. Our experiments on convolutional and recurrent neural networks demonstrate that Entropy-SGD compares favorably to state-of-the-art techniques in terms of generalization error and training time.

1 INTRODUCTION

This paper presents a new optimization tool for deep learning designed to exploit the local geometric properties of the objective function. Consider the histogram in Fig. 1 showing the spectrum of the Hessian at an extremum discovered by Adam (Kingma & Ba, 2014) for a convolutional neural network on MNIST (LeCun et al., 1998) (≈ 47,000 weights, cf. Sec. 5.1). It is evident that: (i) a large number of directions (≈ 94%) have near-zero eigenvalues (magnitude less than 10⁻⁴), (ii) positive eigenvalues (right inset) have a long tail with the largest one being almost 40, (iii) negative eigenvalues (left inset), which are directions of descent that the optimizer missed, have a much faster decay (the largest negative eigenvalue is only −0.46). Interestingly, this trend is not unique to this particular network. Rather, its qualitative properties are shared across a variety of network architectures, network sizes, datasets or optimization algorithms (refer to Sec. 5 for more experiments). Local minima that generalize well and are discovered by gradient descent lie in wide valleys of the energy landscape, rather than in sharp, isolated minima. For an intuitive understanding of this phenomenon, imagine a Bayesian prior concentrated at the minimizer of the expected loss: the marginal likelihood of wide valleys under such a prior is much higher than that of narrow, sharp valleys, even if the latter are close to the global minimum of the training loss.
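For concreteness, the following is a small self-contained sketch (ours, not the paper's code) of the kind of analysis behind Fig. 1: fit a tiny fully-connected network on synthetic data, compute the Hessian of the training loss at the resulting (rough) minimum by finite differences of the gradient, and summarize its eigenspectrum. The network, data, thresholds and training loop are illustrative stand-ins for the small-LeNet setup of Sec. 5.1 and make no claim to reproduce its numbers.

```python
# Illustrative eigenspectrum analysis on a tiny network (not the paper's code).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 8))                      # synthetic inputs
y = (X[:, 0] * X[:, 1] > 0).astype(float)          # synthetic binary labels
D, H = 8, 8                                        # input and hidden width (tiny on purpose)
n_params = D * H + H + H + 1

def unpack(w):
    W1 = w[:D * H].reshape(D, H)
    b1 = w[D * H:D * H + H]
    W2 = w[D * H + H:D * H + 2 * H]
    b2 = w[-1]
    return W1, b1, W2, b2

def loss(w):
    W1, b1, W2, b2 = unpack(w)
    h = np.tanh(X @ W1 + b1)
    p = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))       # sigmoid output
    return -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))

def grad(w, eps=1e-5):                             # finite-difference gradient (slow but simple)
    g = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros_like(w); e[i] = eps
        g[i] = (loss(w + e) - loss(w - e)) / (2 * eps)
    return g

w = 0.1 * rng.normal(size=n_params)
for _ in range(500):                               # crude gradient descent to a rough minimum
    w -= 0.5 * grad(w)

H_mat = np.zeros((n_params, n_params))             # Hessian by finite differences of the gradient
for i in range(n_params):
    e = np.zeros(n_params); e[i] = 1e-4
    H_mat[:, i] = (grad(w + e) - grad(w - e)) / 2e-4
H_mat = 0.5 * (H_mat + H_mat.T)                    # symmetrize numerical noise
eig = np.linalg.eigvalsh(H_mat)
print("fraction with |eigenvalue| < 1e-4:", float(np.mean(np.abs(eig) < 1e-4)))
print("smallest:", eig.min(), "  largest:", eig.max())
```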
Almost-flat regions of the energy landscape are robust to data perturbations, noise in the activations, as well as perturbations of the parameters, all of which are widely-used techniques to achieve good generalization. This suggests that wide valleys should result in better generalization and, indeed, standard optimization algorithms in deep learning seem to discover exactly that without being explicitly tailored to do so. For another recent analysis of the Hessian, see the parallel work of Sagun et al. (2016).

Figure 1: Eigenspectrum of the Hessian at a local minimum of a CNN (small-LeNet) on MNIST (two independent runs). Remark: the central plot shows the eigenvalues in a small neighborhood of zero (scale ×10⁻⁴), whereas the left and right insets show the entire tails of the eigenspectrum (short negative tail and long positive tail, respectively).

Based on this understanding of how the local geometry looks at the end of optimization, can we modify SGD to actively seek such regions? Motivated by the work of Baldassi et al. (2015) on shallow networks, instead of minimizing the original loss f(x), we propose to maximize

F(x, γ) = log ∫_{x′ ∈ ℝⁿ} exp( −f(x′) − (γ/2) ‖x − x′‖₂² ) dx′.

The above is a log-partition function that measures both the depth of a valley at a location x ∈ ℝⁿ and its flatness through the entropy of f(x′); we call it "local entropy" in analogy to the free entropy used in statistical physics. The Entropy-SGD algorithm presented in this paper employs stochastic gradient Langevin dynamics (SGLD) to approximate the gradient of local entropy. Our algorithm resembles two nested loops of SGD: the inner loop consists of SGLD iterations while the outer loop updates the parameters. We show that the above modified loss function results in a smoother energy landscape defined by the hyper-parameter γ, which we can think of as a "scope" that seeks out valleys of specific widths. Actively biasing the optimization towards wide valleys in the energy landscape results in better generalization error. We present experimental results on fully-connected and convolutional neural networks (CNNs) on the MNIST and CIFAR-10 (Krizhevsky, 2009) datasets, and recurrent neural networks (RNNs) on the Penn Tree Bank dataset (PTB) (Marcus et al., 1993) and character-level text prediction. Our experiments show that Entropy-SGD scales to deep networks used in practice, obtains generalization error comparable to competitive baselines, and also trains much more quickly than SGD (we get a 2x speed-up over SGD on RNNs).

2 RELATED WORK

Our above observation about the spectrum of the Hessian (further discussed in Sec. 5) is similar to results on a perceptron model in Dauphin et al. (2014) where the authors connect the loss function of a deep network to a high-dimensional Gaussian random field. They also relate to earlier studies such as Baldi & Hornik (1989); Fyodorov & Williams (2007); Bray & Dean (2007) which show that critical points with high training error are exponentially likely to be saddle points with many negative directions and all local minima are likely to have error that is very close to that of the global minimum. The authors also argue that convergence of gradient descent is affected by the proliferation of saddle points surrounded by high error plateaus, as opposed to multiple local minima.
One can also see this via an application of Kramer s law: the time spent by diffusion is Published as a conference paper at ICLR 2017 inversely proportional to the smallest negative eigenvalue of the Hessian at a saddle point (Bovier & den Hollander, 2006). The existence of multiple almost equivalent local minima in deep networks has been predicted using a wide variety of theoretical analyses and empirical observations, e.g., papers such as Choromanska et al. (2015a;b); Chaudhari & Soatto (2015) that build upon results from statistical physics as also others such as Haeffele & Vidal (2015) and Janzamin et al. (2015) that obtain similar results for matrix and tensor factorization problems. Although assumptions in these works are somewhat unrealistic in the context of deep networks used in practice, similar results are also true for linear networks which afford a more thorough analytical treatment (Saxe et al., 2014). For instance, Soudry & Carmon (2016) show that with mild over-parameterization and dropout-like noise, training error for a neural network with one hidden layer and piece-wise linear activation is zero at every local minimum. All these results suggest that the energy landscape of deep neural networks should be easy to optimize and they more or less hold in practice it is easy to optimize a prototypical deep network to near-zero loss on the training set (Hardt et al., 2015; Goodfellow & Vinyals, 2015). Obtaining good generalization error, however, is challenging: complex architectures are sensitive to initial conditions and learning rates (Sutskever et al., 2013) and even linear networks (Kawaguchi, 2016) may have degenerate and hard to escape saddle points (Ge et al., 2015; Anandkumar & Ge, 2016). Techniques such as adaptive (Duchi et al., 2011) and annealed learning rates, momentum (Tieleman & Hinton, 2012), as well as architectural modifications like dropout (Srivastava et al., 2014), batch-normalization (Ioffe & Szegedy, 2015; Cooijmans et al., 2016), weight scaling (Salimans & Kingma, 2016) etc. are different ways of tackling this issue by making the underlying landscape more amenable to first-order algorithms. However, the training process often requires a combination of such techniques and it is unclear beforehand to what extent each one of them helps. Closer to the subject of this paper are results by Baldassi et al. (2015; 2016a;b) who show that the energy landscape of shallow networks with discrete weights is characterized by an exponential number of isolated minima and few very dense regions with lots of local minima close to each other. These dense local minima can be shown to generalize well for random input data; more importantly, they are also accessible by efficient algorithms using a novel measure called robust ensemble that amplifies the weight of such dense regions. The authors use belief propagation to estimate local entropy for simpler models such as committee machines considered there. A related work in this context is EASGD (Zhang et al., 2015) which trains multiple deep networks in parallel and modulates the distance of each worker from the ensemble average. Such an ensemble training procedure enables improved generalization by ensuring that different workers land in the same wide valley and indeed, it turns out to be closely related to the replica theoretic analysis of Baldassi et al. (2016a). Our work generalizes the local entropy approach above to modern deep networks with continuous weights. 
It exploits the same phenomenon of wide valleys in the energy landscape but does so without incurring the hardware and communication complexity of replicated training or being limited to models where one can estimate local entropy using belief propagation. The enabling technique in our case is using Langevin dynamics for estimating the gradient of local entropy, which can be done efficiently even for large deep networks using mini-batch updates. Motivated by the same final goal, viz. flat local minima, the authors in Hochreiter & Schmidhuber (1997b) introduce hard constraints on the training loss and the width of local minima and show using the Gibbs formalism (Haussler & Opper, 1997) that this leads to improved generalization. As the authors discuss, the effect of hyper-parameters for the constraints is intricately tied together and they are difficult to choose even for small problems. Our local entropy based objective instead naturally balances the energetic term (training loss) and the entropic term (width of the valley). The role of γ is clear as a focusing parameter (cf. Sec. 4.3) and effectively exploiting this provides significant computational advantages. Among other conceptual similarities with our work, let us note that local entropy in a flat valley is a direct measure of the width of the valley which is similar to their usage of Hessian, while the Gibbs variant to averaging in weight space (Eqn. 33 of Hochreiter & Schmidhuber (1997b)) is similar to our Eqn. (7). Indeed, Gibbs formalism used in their analysis is a very promising direction to understanding generalization in deep networks (Zhang et al., 2016). Published as a conference paper at ICLR 2017 Homotopy continuation methods convolve the loss function to solve sequentially refined optimization problems (Allgower & Georg, 2012; Mobahi & Fisher III, 2015), similarly, methods that perturb the weights or activations to average the gradient (Gulcehre et al., 2016) do so with an aim to smooth the rugged energy landscape. Such smoothing is however very different from local entropy. For instance, the latter places more weight on wide local minima even if they are much shallower than the global minimum (cf. Fig. 2); this effect cannot be obtained by smoothing. In fact, smoothing can introduce an artificial minimum between two nearby sharp valleys which is detrimental to generalization. In order to be effective, continuation techniques also require that minimizers of successively smaller convolutions of the loss function lie close to each other (Hazan et al., 2016); it is not clear whether this is true for deep networks. Local entropy, on the other hand, exploits wide minima which have been shown to exist in a variety of learning problems (Monasson & Zecchina, 1995; Cocco et al., 1996). Please refer to Appendix C for a more elaborate discussion as well as possible connections to stochastic variational inference (Blei et al., 2016). 3 LOCAL ENTROPY We first provide a simple intuition for the concept of local entropy of an energy landscape. The discussion in this section builds upon the results of Baldassi et al. (2016a) and extends it for the case of continuous variables. Consider a cartoon energy landscape in Fig. 2 where the x-axis denotes the configuration space of the parameters. We have constructed two local minima: a shallower although wider one at xrobust and a very sharp global minimum at xnon robust. 
Under a Bayesian prior on the parameters, say a Gaussian of a fixed variance centered at x_robust and x_non-robust respectively, the wider local minimum has a higher marginalized likelihood than the sharp valley on the right.

Figure 2: Local entropy concentrates on wide valleys in the energy landscape. The plot shows a cartoon landscape (with minima at x_robust and x_non-robust) together with the negative local entropy −F(x, γ) for γ = 0.001 and γ = 0.00005.

The above discussion suggests that parameters that lie in wider local minima like x_robust, which may possibly have a higher loss than the global minimum, should generalize better than the ones that are simply at the global minimum. In a neighborhood of x_robust, local entropy as introduced in Sec. 1 is large because it includes the contributions from a large region of good parameters; conversely, near x_non-robust, there are fewer such contributions and the resulting local entropy is low. The local entropy thus provides a way of picking large, approximately flat, regions of the landscape over sharp, narrow valleys, in spite of the latter possibly having a lower loss. Quite conveniently, the local entropy is also computed from the partition function with a local re-weighting term.

Formally, for a parameter vector x ∈ ℝⁿ, consider a Gibbs distribution corresponding to a given energy landscape f(x):

P(x; β) = Z_β⁻¹ exp( −β f(x) ); (1)

where β is known as the inverse temperature and Z_β is a normalizing constant, also known as the partition function. As β → ∞, the probability distribution above concentrates on the global minimum of f(x) (assuming it is unique), given as

x* = argmin_x f(x), (2)

which establishes the link between the Gibbs distribution and a generic optimization problem (2). We would instead like the probability distribution, and therefore the underlying optimization problem, to focus on flat regions such as x_robust in Fig. 2. With this in mind, let us construct a modified Gibbs distribution:

P(x′; x, β, γ) = Z_{x,β,γ}⁻¹ exp( −β f(x′) − β (γ/2) ‖x − x′‖₂² ). (3)

The distribution in (3) is a function of a dummy variable x′ and is parameterized by the original location x. The parameter γ biases the modified distribution (3) towards x; a large γ results in a P(x′; x, β, γ) with all its mass near x, irrespective of the energy term f(x′). For small values of γ, the term f(x′) in the exponent dominates and the modified distribution is similar to the original Gibbs distribution in (1). We will set the inverse temperature β to 1 because γ affords us similar control on the Gibbs distribution.

Definition 1 (Local entropy). The local free entropy of the Gibbs distribution in (1), colloquially called "local entropy" in the sequel and denoted by F(x, γ), is defined as the log-partition function of the modified Gibbs distribution (3), i.e.,

F(x, γ) = log Z_{x,1,γ} = log ∫_{x′} exp( −f(x′) − (γ/2) ‖x − x′‖₂² ) dx′. (4)

The parameter γ is used to focus the modified Gibbs distribution upon a local neighborhood of x and we call it a "scope" henceforth.

Effect on the energy landscape: Fig. 2 shows the negative local entropy −F(x, γ) for two different values of γ. We expect x_robust to be more robust than x_non-robust to perturbations of data or parameters and thus generalize well, and indeed, the negative local entropy in Fig. 2 has a global minimum near x_robust. For low values of γ, the energy landscape is significantly smoother than the original landscape and still maintains our desired characteristic, viz. a global minimum at a wide valley.
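To make Def. 1 concrete, here is a minimal numerical sketch (ours, not the authors' code) of the negative local entropy on a one-dimensional toy landscape with a wide, shallower valley and a narrow, deeper one; the loss, grid and γ values are arbitrary illustrative choices that only qualitatively mimic Fig. 2.

```python
# Minimal numerical illustration of Def. 1 on a 1-D toy landscape (illustrative choices
# throughout, not the paper's code): a wide shallower valley near x = -2 and a narrow deeper
# valley near x = +2. The negative local entropy -F(x, gamma) is computed by direct
# numerical integration of Eqn. (4) over x'.
import numpy as np

def f(x):
    # narrow deep valley at +2, wide shallow valley at -2 (arbitrary toy landscape)
    return -1.5 * np.exp(-(x - 2.0) ** 2 / 0.01) - 1.0 * np.exp(-(x + 2.0) ** 2 / 1.0)

grid = np.linspace(-8.0, 8.0, 4001)
dx = grid[1] - grid[0]

def neg_local_entropy(x, gamma):
    # -F(x, gamma) = -log  integral  exp(-f(x') - gamma/2 * (x - x')^2) dx'
    w = np.exp(-f(grid) - 0.5 * gamma * (x - grid) ** 2)
    return -np.log(np.sum(w) * dx)

xs = np.linspace(-4.0, 4.0, 9)
for gamma in (1000.0, 0.5):
    vals = [neg_local_entropy(x, gamma) for x in xs]
    x_best = xs[int(np.argmin(vals))]
    print(f"gamma = {gamma:g}: argmin of -F over the grid is near x = {x_best:+.1f}")
# Large gamma: -F tracks the original landscape and its minimum sits at the narrow, deeper
# valley (x ~ +2). Small gamma: the minimum of -F moves to the wide valley (x ~ -2) even
# though that valley is shallower -- the behavior sketched in Fig. 2.
```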
As γ increases, the local entropy energy landscape gets closer to the original energy landscape and the two become equivalent as γ → ∞. On the other hand, for very small values of γ, the local entropy energy landscape is almost uniform.

Connection to classical entropy: The quantity we have defined as local entropy in Def. 1 is different from classical entropy, which counts the number of likely configurations under a given distribution. For a continuous parameter space, this is given by

S(x, β, γ) = −∫_{x′} log P(x′; x, β, γ) dP(x′; x, β, γ)

for the Gibbs distribution in (3). Maximizing classical entropy, however, does not differentiate between flat regions that have very high loss and flat regions that lie deeper in the energy landscape. For instance in Fig. 2, classical entropy is largest in the neighborhood of x_candidate, which is a large region with very high loss on the training dataset and is unlikely to generalize well.

4 ENTROPY-GUIDED SGD

Simply speaking, our Entropy-SGD algorithm minimizes the negative local entropy from Sec. 3. This section discusses how the gradient of local entropy can be computed via Langevin dynamics. The reader will see that the resulting algorithm has a strong flavor of "SGD-inside-SGD": the outer SGD updates the parameters, while an inner SGD estimates the gradient of local entropy.

Consider a typical classification setting: let x ∈ ℝⁿ be the weights of a deep neural network and ξ_k ∈ Ξ be samples from a dataset Ξ of size N. Let f(x; ξ_k) be the loss function, e.g., the cross-entropy of the classifier on a sample ξ_k. The original optimization problem is:

x* = argmin_x (1/N) Σ_{k=1}^{N} f(x; ξ_k); (5)

where the objective f(x; ξ_k) is typically a non-convex function in both the weights x and the samples ξ_k. The Entropy-SGD algorithm instead solves the problem

x*_Entropy-SGD = argmin_x −F(x, γ; Ξ); (6)

where we have made the dependence of local entropy F(x, γ) on the dataset Ξ explicit.

4.1 GRADIENT OF LOCAL ENTROPY

The gradient of local entropy over a randomly sampled mini-batch of m samples, denoted by ξ_i^ℓ ∈ Ξ^ℓ for i ≤ m, is easy to derive and is given by

∇_x F(x, γ; Ξ^ℓ) = −γ ( x − ⟨ x′; Ξ^ℓ ⟩ ); (7)

where the notation ⟨·⟩ denotes an expectation of its arguments (we have again made the dependence on the data explicit) over a Gibbs distribution of the original optimization problem modified to focus on the neighborhood of x; this is given by

P(x′; x, γ) ∝ exp( −Σ_{i=1}^{m} f(x′; ξ_i^ℓ) − (γ/2) ‖x − x′‖₂² ).

Computationally, the gradient in (7) involves estimating ⟨x′; Ξ^ℓ⟩ with the current weights fixed to x. This is an expectation over a Gibbs distribution and is hard to compute. We can however approximate it using Markov chain Monte-Carlo (MCMC) techniques. In this paper, we use stochastic gradient Langevin dynamics (SGLD) (Welling & Teh, 2011), an MCMC algorithm for drawing samples from a Bayesian posterior that scales to large datasets using mini-batch updates. Please see Appendix A for a brief overview of SGLD. For our application, as lines 3-6 of Alg. 1 show, SGLD resembles a few iterations of SGD with a forcing term γ(x − x′) and additive gradient noise.

We can obtain some intuition on how Entropy-SGD works using the expression for the gradient: the term ⟨x′; ·⟩ is the average over a locally focused Gibbs distribution and, for two local minima in the neighborhood of x roughly equivalent in loss, this term points towards the wider one because ⟨x′; ·⟩ is closer to it. This results in a net gradient that takes SGD towards wider valleys. Moreover, if we unroll the SGLD steps used to compute (x − ⟨x′; ·⟩) (cf.
line 5 in Alg. 1), it resembles one large step in the direction of the (noisy) average gradient around the current weights x, and Entropy-SGD thus looks similar to averaged SGD in the literature (Polyak & Juditsky, 1992; Bottou, 2012). These two phenomena intuitively explain the improved generalization performance of Entropy-SGD.

4.2 ALGORITHM AND IMPLEMENTATION DETAILS

Alg. 1 provides the pseudo-code for one iteration of the Entropy-SGD algorithm. At each iteration, lines 3-6 perform L iterations of Langevin dynamics to estimate µ = ⟨x′; Ξ^ℓ⟩. The weights x are updated with the modified gradient on line 7.

Algorithm 1: Entropy-SGD algorithm
Input: current weights x, Langevin iterations L
Hyper-parameters: scope γ, learning rate η, SGLD step size η′
1:  x′ ← x, µ ← x    // SGLD iterations
2:  for ℓ ≤ L do
3:      Ξ^ℓ ← sample mini-batch
4:      dx′ ← (1/m) Σ_{i=1}^{m} ∇_{x′} f(x′; ξ_i^ℓ) − γ (x − x′)
5:      x′ ← x′ − η′ dx′ + √η′ ε N(0, I)
6:      µ ← (1 − α) µ + α x′
    // update the weights
7:  x ← x − η γ (x − µ)

Let us now discuss a few implementation details. Although we have written Alg. 1 in the classical SGD setup, we can easily modify it to include techniques such as momentum and gradient pre-conditioning (Duchi et al., 2011) by changing lines 5 and 7. In our experiments, we have used SGD with Nesterov's momentum (Sutskever et al., 2013) and Adam for the outer and inner loops with similar qualitative results. We use exponential averaging to estimate µ in the SGLD loop (line 6) with α = 0.75, so as to put more weight on the later samples; this is akin to a burn-in in standard SGLD. We set the number of SGLD iterations to L ∈ [5, 20] depending upon the complexity of the dataset. The SGLD step size η′ is fixed for all our experiments to values in η′ ∈ [0.1, 1]. We found that annealing η′ (for instance, setting it to be the same as the outer learning rate η) is detrimental; indeed a small learning rate leads to poor estimates of local entropy towards the end of training where they are most needed. The parameter ε in SGLD on line 5 is the thermal noise and we fix this to ε ∈ [10⁻⁴, 10⁻³]. Having thus fixed L, ε and η′, an effective heuristic to tune the remaining parameter γ is to match the magnitude of the gradient of the local entropy term, viz. γ(x − µ), to the gradient for vanilla SGD, viz. (1/m) Σ_{i=1}^{m} ∇_x f(x; ξ_i^ℓ).

4.3 SCOPING OF γ

The scope γ is fixed in Alg. 1. For large values of γ, the SGLD updates happen in a small neighborhood of the current parameters x, while low values of γ allow the inner SGLD to explore further away from x. In the context of the discussion in Sec. 3, a reverse-annealing of the scope γ, i.e., increasing γ as training progresses, has the effect of exploring the energy landscape on progressively finer scales. We call this process "scoping"; it is similar to that of Baldassi et al. (2016a) and we use a simple exponential schedule given by

γ(t) = γ₀ (1 + γ₁)^t

for the t-th parameter update. We have experimented with linear, quadratic and bounded exponential (γ₀ (1 − e^{−τ t})) scoping schedules and obtained qualitatively similar results.

Scoping of γ unfortunately interferes with the learning rate annealing that is popular in deep learning; this is a direct consequence of the update step on line 7 of Alg. 1. In practice, we therefore scale down the local entropy gradient by γ before the weight update and modify line 7 to read x ← x − η (x − µ).
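For concreteness, the following is a compact NumPy sketch of Alg. 1 together with the γ-scoping and the 1/γ-scaled update just described. It is our illustrative reading of the pseudo-code, not the authors' released implementation; `grad_f` and `sample_batch` are placeholder callables and the default hyper-parameter values simply sit inside the ranges reported in the text.

```python
# A compact NumPy sketch of Alg. 1 with the gamma-scoping of Sec. 4.3 (illustrative, not the
# authors' released code). grad_f(x, batch) stands in for the mini-batch gradient of the
# training loss; the defaults below are placeholders within the reported ranges
# (L in [5, 20], eta' in [0.1, 1], eps in [1e-4, 1e-3]).
import numpy as np

def entropy_sgd(x0, grad_f, sample_batch, n_updates,
                L=20, eta=1.0, eta_prime=0.1, eps=1e-4,
                alpha=0.75, gamma0=1e-4, gamma1=1e-3, rng=None):
    rng = rng or np.random.default_rng(0)
    x = x0.copy()
    for t in range(n_updates):
        gamma = gamma0 * (1.0 + gamma1) ** t          # scoping: gamma grows every update
        x_prime, mu = x.copy(), x.copy()              # start the inner chain at x
        # lines 3-6 of Alg. 1: SGLD iterations around the current weights x
        for _ in range(L):
            batch = sample_batch()
            dx = grad_f(x_prime, batch) - gamma * (x - x_prime)
            x_prime = x_prime - eta_prime * dx \
                      + np.sqrt(eta_prime) * eps * rng.standard_normal(x.shape)
            mu = (1.0 - alpha) * mu + alpha * x_prime  # exponential average ("burn-in")
        # line 7, with the gradient scaled down by gamma as described in Sec. 4.3,
        # so that learning-rate annealing does not interfere with gamma-scoping
        x = x - eta * (x - mu)
    return x

# Toy usage on a quadratic "loss" with synthetic data (purely illustrative):
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A = rng.normal(size=(100, 10))
    b = rng.normal(size=100)
    def grad_f(x, idx):                               # gradient of 0.5*||A x - b||^2 on a batch
        return A[idx].T @ (A[idx] @ x - b[idx]) / len(idx)
    sample_batch = lambda: rng.integers(0, 100, size=16)
    x_star = entropy_sgd(np.zeros(10), grad_f, sample_batch, n_updates=500)
    print("final loss:", 0.5 * np.mean((A @ x_star - b) ** 2))
```

In a practical deep-learning setting one would replace `grad_f` by a back-propagated mini-batch gradient and wrap the outer update with Nesterov momentum or Adam, as discussed above.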
Our general strategy during hyper-parameter tuning is to set the initial scope γ₀ to be very small, pick a large value of η, and set γ₁ such that the magnitude of the local entropy gradient is comparable to that of SGD. We can use much larger learning rates than SGD in our experiments because the local entropy gradient is less noisy than the original back-propagated gradient. This also enables very fast progress in the beginning with the smooth landscape corresponding to a small γ.

4.4 THEORETICAL PROPERTIES

We can show that Entropy-SGD results in a smoother loss function and obtains better generalization error than the original objective (5). With some overload of notation, we assume that the original loss f(x) is β-smooth, i.e., for all x, y ∈ ℝⁿ, we have ‖∇f(x) − ∇f(y)‖ ≤ β ‖x − y‖. We additionally assume, for the purpose of analysis, that no eigenvalue of the Hessian ∇²f(x) lies in the set [−2γ − c, c] for some small c > 0.

Lemma 2. The objective in (6) is α/(1 + γ⁻¹c)-Lipschitz and β/(1 + γ⁻¹c)-smooth.

Please see Appendix B for the proof. The local entropy objective is thus smoother than the original objective. Let us now obtain a bound on the improvement in generalization error. We denote an optimization algorithm, viz., SGD or Entropy-SGD, by A(Ξ); it is a function of the dataset Ξ and outputs the parameters x* upon termination. Stability of the algorithm (Bousquet & Elisseeff, 2002) is then a notion of how much its output differs in loss upon being presented with two datasets, Ξ and Ξ′, that differ in at most one sample:

| f(A(Ξ), ξ) − f(A(Ξ′), ξ) | ≤ ε.

Hardt et al. (2015) connect uniform stability to generalization error and show that an ε-stable algorithm A(Ξ) has generalization error bounded by ε, i.e., if A(Ξ) terminates with parameters x*,

| E_Ξ [ R_Ξ(x*) − R(x*) ] | ≤ ε;

where the left hand side is the generalization error: it is the difference between the empirical loss R_Ξ(x) := (1/N) Σ_{k=1}^{N} f(x, ξ_k) and the population loss R(x) := E_ξ f(x, ξ). We now employ the following theorem that bounds the stability of an optimization algorithm through the smoothness of its loss function and the number of iterations on the training set.

Theorem 3 (Hardt et al. (2015)). For an α-Lipschitz and β-smooth loss function, if SGD converges in T iterations on N samples with decreasing learning rate η_t ≤ 1/t, the stability is bounded by

ε ≲ (1/N) α^{1/(1+β)} T^{1 − 1/(1+β)}.

Using Lemma 2 and Theorem 3 we have

ε_Entropy-SGD ≲ (α / T)^{1 − 1/(1 + (1 + γ⁻¹c)⁻¹ β)} ε_SGD, (9)

which shows that Entropy-SGD generalizes better than SGD for all T > α if they both converge after T passes over the samples. Let us note that while the number of passes over the dataset for Entropy-SGD and SGD are similar for our experiments on CNNs, Entropy-SGD makes only half as many passes as SGD for our experiments on RNNs. As an aside, it is easy to see from the proof of Lemma 2 that for a convex loss function f(x), the local entropy objective does not change the minimizer of the original problem.

Remark 4. The above analysis hinges upon the assumption that the Hessian ∇²f(x) does not have eigenvalues in the set [−2γ − c, c] for a constant c > 0. This is admittedly unrealistic; for instance, the eigenspectrum of the Hessian at a local minimum in Fig. 1 has a large fraction of its eigenvalues almost zero. Let us however remark that the result in Thm. 3 by Hardt et al. (2015) assumes global conditions on the smoothness of the loss function; one imagines that Eqn.
9 remains qualitatively the same (with respect to T in particular) even if this assumption is violated to an extent before convergence happens. Obtaining a rigorous generalization bound without this assumption would require a dynamical analysis of SGD and seems out of reach currently. 5 EXPERIMENTS In Sec. 5.1, we discuss experiments that suggest that the characteristics of the energy landscape around local minimal accessible by SGD are universal to deep architectures. We then present experimental results on two standard image classification datasets, viz. MNIST and CIFAR-10 and two datasets for text prediction, viz. PTB and the text of War and Peace. Table 1 summarizes the results of these experiments on deep networks. 5.1 UNIVERSALITY OF THE HESSIAN AT LOCAL MINIMA We use automatic differentiation1 to compute the Hessian at a local minimum obtained at the end of training for the following networks: (i) small-Le Net on MNIST: This network has 47,658 parameters and is similar to Le Net but with 10 and 20 channels respectively in the first two convolutional layers and 128 hidden units in the fully-connected layer. We train this with Adam to obtain a test error of 2.4%. (ii) small-mnistfc on MNIST: A fully-connected network (50,890 parameters) with one layer of 32 hidden units, Re LU non-linearities and cross-entropy loss; it converges to a test error of 2.5% with momentum-based SGD. (iii) char-lstm for text generation: This is a recurrent network with 48 hidden units and Long Short-Term Memory (LSTM) architecture (Hochreiter & Schmidhuber, 1997a). It has 32,640 parameters and we train it with Adam to re-generate a small piece of text consisting of 256 lines of length 32 each and 96-bit one-hot encoded characters. 1https://github.com/HIPS/autograd Published as a conference paper at ICLR 2017 (iv) All-CNN-BN on CIFAR-10: This is similar to the All-CNN-C network (Springenberg et al., 2014) with 1.6 million weights (cf. Sec. 5.3) which we train using Adam to obtain an error of 11.2%. Exact Hessian computation is in this case expensive and thus we instead compute the diagonal of the Fisher information matrix (Wasserman, 2013) using the element-wise first and second moments of the gradients that Adam maintains, i.e., diag(I) = E(g2) (E g)2 where g is the back-propagated gradient. Fisher information measures the sensitivity of the log-likelihood of data given parameters in a neighborhood of a local minimum and thus is exactly equal to the Hessian of the negative log-likelihood. We will consider the diagonal of the empirical Fisher information matrix as a proxy for the eigenvalues of the Hessian, as is common in the literature. We choose to compute the exact Hessian and to keep the computational and memory requirements manageable, the first three networks considered above are smaller than standard deep networks used in practice. For the last network, we sacrifice the exact computation and instead approximate the Hessian of a large deep network. We note that recovering an approximate Hessian from Hessianvector products (Pearlmutter, 1994) could be a viable strategy for large networks. Fig. 1 in the introductory Sec. 1 shows the eigenspectrum of the Hessian for small-Le Net while Fig. 3 shows the eigenspectra for the other three networks. A large proportion of eigenvalues of the Hessian are very close to zero or positive with a very small (relative) magnitude. This suggests that the local geometry of the energy landscape is almost flat at local minima discovered by gradient descent. 
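As a concrete aside, the diagonal-Fisher proxy used above for All-CNN-BN, diag(I) = E[g²] − (E[g])², can be estimated from running first and second moments of the mini-batch gradients, the same element-wise moments an Adam-style optimizer already maintains. A minimal sketch with synthetic gradients (ours, not the paper's code; the synthetic scales are arbitrary):

```python
# Minimal sketch (not the paper's code) of the diagonal-Fisher proxy diag(I) = E[g^2] - (E g)^2,
# estimated from running first and second moments of per-mini-batch gradients.
import numpy as np

def diag_fisher(batch_grads):
    """batch_grads: iterable of 1-D gradient vectors, one per mini-batch."""
    g1, g2, n = 0.0, 0.0, 0
    for g in batch_grads:
        g1, g2, n = g1 + g, g2 + g ** 2, n + 1
    m1, m2 = g1 / n, g2 / n
    return m2 - m1 ** 2                      # element-wise gradient variance

# Toy usage with synthetic gradients: most coordinates get tiny gradient noise, a few get
# large noise, only qualitatively mimicking the shape of the spectra discussed above.
rng = np.random.default_rng(0)
scales = np.where(np.arange(1000) < 950, 1e-4, 0.5)
grads = [rng.normal(scale=scales) for _ in range(100)]
d = diag_fisher(grads)
print("fraction of near-zero entries (< 1e-5):", float(np.mean(d < 1e-5)))
```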
This flatness agrees with theoretical results such as Baldassi et al. (2016c) where the authors predict that flat regions of the landscape generalize better. Standard regularizers in deep learning such as convolutions, max-pooling and dropout seem to bias SGD towards flatter regions in the energy landscape. Away from the origin, the right tails of the eigenspectra are much longer than the left tails. Indeed, as discussed in numerous places in the literature (Bray & Dean, 2007; Dauphin et al., 2014; Choromanska et al., 2015a), SGD finds low-index critical points, i.e., optimizers with few negative eigenvalues of the Hessian. What is interesting and novel is that the directions of descent that SGD misses do not have a large curvature.

5.2 MNIST

We consider two prototypical networks: the first, called mnistfc, is a fully-connected network with two hidden layers of 1024 units each, and the second is a convolutional neural network with the same size as LeNet but with batch-normalization (Ioffe & Szegedy, 2015); both use dropout with probability 0.5 after each layer. As a baseline, we train for 100 epochs with Adam and a learning rate of 10⁻³ that drops by a factor of 5 after every 30 epochs, to obtain an average error of 1.39 ± 0.03% and 0.51 ± 0.01% for mnistfc and LeNet respectively, over 5 independent runs.

For both these networks, we train Entropy-SGD for 5 epochs with L = 20 and reduce the dropout probability (0.15 for mnistfc and 0.25 for LeNet). The learning rate of the SGLD updates is fixed to η′ = 0.1 while the outer loop's learning rate is set to η = 1 and drops by a factor of 10 after the second epoch; we use Nesterov's momentum for both loops. The thermal noise in the SGLD updates (line 5 of Alg. 1) is set to 10⁻³. We use an exponentially increasing value of γ for scoping: the initial value of the scope is set to γ = 10⁻⁴ and this increases by a factor of 1.001 after each parameter update. The results in Fig. 4a and Fig. 4b show that Entropy-SGD obtains a comparable generalization error: 1.37 ± 0.03% and 0.50 ± 0.01% for mnistfc and LeNet respectively. While Entropy-SGD trains slightly faster in wall-clock time for LeNet, it is marginally slower for mnistfc, which is a small network and trains in about two minutes anyway.

Remark on the computational complexity: Since Entropy-SGD runs L steps of SGLD before each parameter update, the effective number of passes over the dataset is L times that of SGD or Adam for the same number of parameter updates. We therefore plot the error curves of Entropy-SGD in Figs. 4, 5, and 6 against the "effective number of epochs", i.e., by multiplying the abscissae by a factor of L (we define L = 1 for SGD or Adam). Modulo the time required for the actual parameter updates (which are fewer for Entropy-SGD), this is a direct measure of wall-clock time, agnostic to the underlying hardware and implementation.
Figure 3: Universality of the Hessian: for a wide variety of network architectures, sizes and datasets, optima obtained by SGD are mostly flat (large peak near zero), they always have a few directions with large positive curvature (long positive tails), and a very small fraction of directions have negative curvature, with the magnitude of this curvature being extremely small (short negative tails). Panels: (a) small-mnistfc (2 runs): the peak at zero (|λ| ≤ 10⁻²), clipped in the plot, accounts for 90% of the entries; (b) char-LSTM (5 runs): almost 95% of the eigenvalues have absolute value below 10⁻⁵; (c) negative and positive tails of the eigenvalues of the Fisher information matrix of All-CNN-BN at a local minimum (4 independent runs); the origin has a large peak with 95% near-zero (|λ| ≤ 10⁻⁵) eigenvalues, clipped in the plot.

Figure 4: Comparison of Entropy-SGD vs. Adam on MNIST. (a) mnistfc: validation error; (b) LeNet: validation error. The abscissa is the effective number of epochs (epochs × L).

5.3 CIFAR-10

We train on CIFAR-10 without data augmentation after performing global contrast normalization and ZCA whitening (Goodfellow et al., 2013). We consider the All-CNN-C network of Springenberg et al. (2014) with the only difference that a batch normalization layer is added after each convolutional layer; all other architectural choices and hyper-parameters are kept the same. We train for 200 epochs with SGD and Nesterov's momentum, during which the initial learning rate of 0.1 decreases by a factor of 5 after every 60 epochs. We obtain an average error of 7.71 ± 0.19% in 200 epochs vs. the 9.08% error in 350 epochs that the authors of Springenberg et al. (2014) report, and this is thus a very competitive baseline for this network. Let us note that the best result in the literature on non-augmented CIFAR-10 is the ELU-network by Clevert et al. (2015) with 6.55% test error.

We train Entropy-SGD with L = 20 for 10 epochs with the original dropout of 0.5. The initial learning rate of the outer loop is set to η = 1 and drops by a factor of 5 every 4 epochs, while the learning rate of the SGLD updates is fixed to η′ = 0.1 with thermal noise ε = 10⁻⁴. As the scoping scheme, we set the initial value of the scope to γ₀ = 0.03, which increases by a factor of 1.001 after each parameter update. Fig. 5 shows the training and validation error curves for Entropy-SGD compared with SGD. It shows that local entropy performs as well as SGD on a large CNN; we obtain a validation error of 7.81 ± 0.09% in about 160 effective epochs. We see almost no plateauing of the training loss or validation error for Entropy-SGD in Fig. 5a; this trait is shared across different networks and datasets in our experiments and is an indicator of the additional smoothness of the local entropy landscape coupled with a good scoping schedule for γ.

Figure 5: Comparison of Entropy-SGD vs. SGD on CIFAR-10. (a) All-CNN-BN: training cross-entropy loss (final values 0.0353 for SGD and 0.0336 for Entropy-SGD); (b) All-CNN-BN: validation error. The abscissa is the effective number of epochs (epochs × L).

5.4 RECURRENT NEURAL NETWORKS

5.4.1 PENN TREE BANK

We train an LSTM network on the Penn Tree Bank (PTB) dataset for word-level text prediction. This dataset contains about one million words divided into a training set of about 930,000 words, a validation set of 74,000 words and a test set of 82,000 words, with a vocabulary of size 10,000. Our network, called PTB-LSTM, consists of two layers with 1500 hidden units, each unrolled for 35 time steps; note that this is a large network with about 66 million weights. We recreated the training pipeline of Zaremba et al. (2014) for this network (SGD without momentum) and obtained a word perplexity of 81.43 ± 0.2 on the validation set and 78.6 ± 0.26 on the test set with this setup; these numbers closely match the results of the original authors.
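For reference, the word-level perplexities quoted here follow the usual convention: the exponential of the average per-word negative log-likelihood on the held-out set. A tiny helper (ours, not the paper's code) makes the bookkeeping explicit:

```python
# Minimal helper (not from the paper) showing the conventional word-level perplexity:
# the exponential of the average per-word negative log-likelihood under the model.
import numpy as np

def perplexity(neg_log_likelihoods):
    """neg_log_likelihoods: per-word NLL values in nats, e.g. cross-entropy of each target word."""
    return float(np.exp(np.mean(neg_log_likelihoods)))

# Toy usage: a model that assigns each target word probability ~1/80 has perplexity ~80.
nll = -np.log(np.full(10000, 1.0 / 80.0))
print(perplexity(nll))   # ~80.0
```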
We run Entropy-SGD on PTB-LSTM for 5 epochs with L = 5; note that this results in only 25 effective epochs. We do not use scoping for this network and instead fix γ = 10⁻³. The initial learning rate of the outer loop is η = 1, which reduces by a factor of 10 at each epoch after the third epoch. The SGLD learning rate is fixed to η′ = 1 with ε = 10⁻⁴. We obtain a word perplexity of 80.116 ± 0.069 on the validation set and 77.656 ± 0.171 on the test set. As Fig. 6a shows, Entropy-SGD trains significantly faster than SGD (25 effective epochs vs. 55 epochs of SGD) and also achieves a slightly better generalization perplexity.

5.4.2 CHAR-LSTM ON WAR AND PEACE

Next, we train an LSTM to perform character-level text-prediction. As a dataset, following the experiments of Karpathy et al. (2015), we use the text of War and Peace by Leo Tolstoy, which contains about 3 million characters divided into training (80%), validation (10%) and test (10%) sets. We use an LSTM consisting of two layers of 128 hidden units unrolled for 50 time steps and a vocabulary of size 87. We train the baseline with Adam for 50 epochs with an initial learning rate of 0.002 that decays by a factor of 2 after every 5 epochs, to obtain a validation perplexity of 1.224 ± 0.008 and a test perplexity of 1.226 ± 0.01.

As noted in Sec. 4.2, we can easily wrap Alg. 1 inside other variants of SGD such as Adam; this involves simply substituting the local entropy gradient in place of the usual back-propagated gradient. For this experiment, we constructed "Entropy-Adam", which is equivalent to Adam with the local entropy gradient (computed via SGLD). We run Entropy-Adam for 5 epochs with L = 5 and a fixed γ = 0.01, with an initial learning rate of 0.01 that decreases by a factor of 2 at each epoch. Note that this again results in only 25 effective epochs, i.e., half as much wall-clock time as SGD or Adam. We obtain a validation perplexity of 1.213 ± 0.007 and a test perplexity of 1.217 ± 0.005 over 4 independent runs, which is better than the baseline. Fig. 6b shows the error curves for this experiment.

Figure 6: Comparison of Entropy-SGD vs. SGD / Adam on RNNs. (a) PTB-LSTM: validation and test perplexity (SGD: 81.43 validation, 78.6 test; Entropy-SGD: 77.656 test); (b) char-LSTM: validation and test perplexity (Adam: 1.224 validation, 1.226 test; Entropy-Adam: 1.217 test). The abscissa is the effective number of epochs (epochs × L).

| Model      | Entropy-SGD: Error (%) / Perplexity | Epochs | SGD / Adam: Error (%) / Perplexity | Epochs |
|------------|-------------------------------------|--------|------------------------------------|--------|
| mnistfc    | 1.37 ± 0.03                         | 120    | 1.39 ± 0.03                        | 66     |
| LeNet      | 0.5 ± 0.01                          | 80     | 0.51 ± 0.01                        | 100    |
| All-CNN-BN | 7.81 ± 0.09                         | 160    | 7.71 ± 0.19                        | 180    |
| PTB-LSTM   | 77.656 ± 0.171                      | 25     | 78.6 ± 0.26                        | 55     |
| char-LSTM  | 1.217 ± 0.005                       | 25     | 1.226 ± 0.01                       | 40     |

Table 1: Experimental results: Entropy-SGD vs. SGD / Adam.

Tuning the momentum in Entropy-SGD was crucial to getting good results on RNNs. While the SGD baseline on PTB-LSTM does not use momentum (and, in fact, does not train well with momentum), we used a momentum of 0.5 for Entropy-SGD. On the other hand, the baseline for char-LSTM was trained with Adam with β₁ = 0.9 (β₁ in Adam controls the moving average of the gradient), while we set β₁ = 0.5 for Entropy-Adam. In contrast to this observation about RNNs, all our experiments on CNNs used a momentum of 0.9. In order to investigate this difficulty, we monitored the angle between the local entropy gradient and the vanilla SGD gradient during training.
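A minimal sketch (ours, not the paper's code) of this diagnostic, i.e., the angle between the local-entropy gradient γ(x − µ) and the vanilla mini-batch gradient, tracked via their cosine similarity:

```python
# Minimal sketch (not the paper's code) of the gradient-angle diagnostic described above.
import numpy as np

def grad_angle_deg(local_entropy_grad, sgd_grad, eps=1e-12):
    a, b = local_entropy_grad.ravel(), sgd_grad.ravel()
    cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

# Toy usage (illustrative): nearly aligned vs. nearly uncorrelated high-dimensional gradients.
rng = np.random.default_rng(0)
g = rng.normal(size=10000)
print(grad_angle_deg(g + 0.1 * rng.normal(size=10000), g))   # small angle
print(grad_angle_deg(rng.normal(size=10000), g))             # close to 90 degrees
```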
This angle changes much more rapidly for RNNs than for CNNs which suggests a more rugged energy landscape for the former; local entropy gradient is highly uncorrelated with the SGD gradient in this case. 6 DISCUSSION In our experiments, Entropy-SGD results in generalization error comparable to SGD, but always with lower cross-entropy loss on the training set. This suggests the following in the context of energy landscapes of deep networks. Roughly, wide valleys favored by Entropy-SGD are located deeper in the landscape with a lower empirical loss than local minima discovered by SGD where it presumably gets stuck. Such an interpretation is in contrast to theoretical models of deep networks (cf. Sec. 2) which predict multiple equivalent local minima with the same loss. Our work suggests that geometric properties of the energy landscape are crucial to generalize well and provides algorithmic approaches to exploit them. However, the literature lacks general results about the geometry of the loss functions of deep networks convolutional neural networks in particular and this is a promising direction for future work. If we focus on the inner loop of the algorithm, SGLD updates compute the average gradient (with Langevin noise) in a neighborhood of the parameters while maintaining the Polyak average of the new parameters. Such an interpretation is very close to averaged SGD of Polyak & Juditsky (1992) and Bottou (2012) and worth further study. Our experiments show that while Entropy-SGD trains significantly faster than SGD for recurrent networks, it gets relatively minor gains in terms of wallclock time for CNNs. Estimating the gradient of local entropy cheaply with few SGLD iterations, or by using a smaller network to estimate it in a teacher-student framework (Balan et al., 2015) is another avenue for extensions to this work. 7 CONCLUSIONS We introduced an algorithm named Entropy-SGD for optimization of deep networks. This was motivated from the observation that the energy landscape near a local minimum discovered by SGD is almost flat for a wide variety of deep networks irrespective of their architecture, input data or training methods. We connected this observation to the concept of local entropy which we used to bias the optimization towards flat regions that have low generalization error. Our experiments showed that Entropy-SGD is applicable to large convolutional and recurrent deep networks used in practice. 8 ACKNOWLEDGMENTS This work was supported by ONR N00014-13-1-034, AFOSR F9550-15-1-0229 and ARO W911NF-15-1-0564/66731-CS. E. L. Allgower and K. Georg. Numerical continuation methods: an introduction, volume 13. Springer, 2012. A. Anandkumar and R. Ge. Efficient approaches for escaping higher order saddle points in non-convex optimization. ar Xiv:1602.05908, 2016. A. Balan, V. Rathod, K. Murphy, and M. Welling. Bayesian dark knowledge. In NIPS, 2015. Published as a conference paper at ICLR 2017 C. Baldassi, C. Borgs, J. T. Chayes, A. Ingrosso, C. Lucibello, L. Saglietti, and R. Zecchina. Unreasonable effectiveness of learning neural networks: From accessible states and robust ensembles to basic algorithmic schemes. PNAS, 113(48):E7655 E7662, 2016a. C. Baldassi, F. Gerace, C. Lucibello, L. Saglietti, and R. Zecchina. Learning may need only a few bits of synaptic precision. Physical Review E, 93(5):052313, 2016b. C. Baldassi, A. Ingrosso, C. Lucibello, L. Saglietti, and R. Zecchina. Local entropy as a measure for sampling solutions in constraint satisfaction problems. 
Journal of Statistical Mechanics: Theory and Experiment, 2016(2):023301, 2016c. C. Baldassi, A. Ingrosso, C. Lucibello, L. Saglietti, and R. Zecchina. Subdominant dense clusters allow for simple learning and high computational performance in neural networks with discrete synapses. Physical review letters, 115(12):128101, 2015. P. Baldi and K. Hornik. Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2:53 58, 1989. D. Blei, A. Kucukelbir, and J. Mc Auliffe. Variational inference: A review for statisticians. ar Xiv:1601.00670, 2016. L. Bottou. Stochastic gradient descent tricks. In Neural Networks: Tricks of the Trade, pp. 421 436. Springer, 2012. O. Bousquet and A. Elisseeff. Stability and generalization. JMLR, 2(Mar):499 526, 2002. A. Bovier and F. den Hollander. Metastability: A potential theoretic approach. In International Congress of Mathematicians, volume 3, pp. 499 518, 2006. A. Bray and D. Dean. The statistics of critical points of Gaussian fields on large-dimensional spaces. Physics Review Letter, 2007. P. Chaudhari and S. Soatto. On the energy landscape of deep networks. ar Xiv:1511.06485, 2015. C. Chen, D. Carlson, Z. Gan, C. Li, and L. Carin. Bridging the gap between stochastic gradient MCMC and stochastic optimization. ar Xiv:1512.07962, 2015. T. Chen, E. Fox, and C. Guestrin. Stochastic Gradient Hamiltonian Monte Carlo. In ICML, 2014. A. Choromanska, M. Henaff, M. Mathieu, G. Ben Arous, and Y. Le Cun. The loss surfaces of multilayer networks. In AISTATS, 2015a. A. Choromanska, Y. Le Cun, and G. Ben Arous. Open problem: The landscape of the loss surfaces of multilayer networks. In COLT, 2015b. D.-A. Clevert, T. Unterthiner, and S. Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). ar Xiv:1511.07289, 2015. S. Cocco, R. Monasson, and R. Zecchina. Analytical and numerical study of internal representations in multilayer neural networks with binary weights. Physical Review E, 54(1):717, 1996. T. Cooijmans, N. Ballas, C. Laurent, and A. Courville. Recurrent batch normalization. ar Xiv:1603.09025, 2016. Y. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y. Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In NIPS, 2014. N. Ding, Y. Fang, R. Babbush, C. Chen, R. D. Skeel, and H. Neven. Bayesian sampling using stochastic gradient thermostats. In NIPS, 2014. J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. JMLR, 12:2121 2159, 2011. Y. Fyodorov and I. Williams. Replica symmetry breaking condition exposed by random matrix calculation of landscape complexity. Journal of Statistical Physics, 129(5-6),1081-1116, 2007. Z. Gan, C. Li, C. Chen, Y. Pu, Q. Su, and L. Carin. Scalable Bayesian Learning of Recurrent Neural Networks for Language Modeling. ar Xiv:1611.08034, 2016. Published as a conference paper at ICLR 2017 R. Ge, F. Huang, C. Jin, and Y. Yuan. Escaping from saddle points online stochastic gradient for tensor decomposition. In COLT, 2015. I. Goodfellow and O. Vinyals. Qualitatively characterizing neural network optimization problems. In ICLR, 2015. I. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. ICML, 2013. C. Gulcehre, M. Moczulski, F. Visin, and Y. Bengio. Mollifying networks. ar Xiv:1608.04980, 2016. B. Haeffele and R. Vidal. Global optimality in tensor factorization, deep learning, and beyond. ar Xiv:1506.07540, 2015. 
M. Hardt, B. Recht, and Y. Singer. Train faster, generalize better: Stability of stochastic gradient descent. ar Xiv:1509.01240, 2015. D. Haussler and M. Opper. Mutual information, metric entropy and cumulative relative entropy risk. The Annals of Statistics, 25(6):2451 2492, 1997. E. Hazan, K. Levy, and S. Shalev-Shwartz. On graduated optimization for stochastic non-convex problems. In ICML, 2016. S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735 1780, 1997a. S. Hochreiter and J. Schmidhuber. Flat minima. Neural Computation, 9(1):1 42, 1997b. S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. ar Xiv:1502.03167, 2015. M. Janzamin, H. Sedghi, and A. Anandkumar. Beating the Perils of Non-Convexity: Guaranteed Training of Neural Networks using Tensor Methods. ar Xiv:1506.08473, 2015. A. Karpathy, J. Johnson, and L. Fei-Fei. Visualizing and understanding recurrent networks. ar Xiv:1506.02078, 2015. K. Kawaguchi. Deep learning without poor local minima. In NIPS, 2016. D. Kingma and J. Ba. Adam: A method for stochastic optimization. ar Xiv:1412.6980, 2014. A. Krizhevsky. Learning multiple layers of features from tiny images. Master s thesis, Computer Science, University of Toronto, 2009. Y. Le Cun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278 2324, 1998. Y.-A. Ma, T. Chen, and E. Fox. A complete recipe for stochastic gradient MCMC. In NIPS, 2015. S. Mandt, M. Hoffman, and D. Blei. A variational analysis of stochastic gradient algorithms. ar Xiv:1602.02666, 2016. M. Marcus, M.-A. Marcinkiewicz, and B. Santorini. Building a large annotated corpus of English: The Penn Treebank. Computational linguistics, 19(2):313 330, 1993. H. Mobahi. Training Recurrent Neural Networks by Diffusion. ar Xiv:1601.04114, 2016. H. Mobahi and J. Fisher III. On the link between Gaussian homotopy continuation and convex envelopes. In Workshop on Energy Minimization Methods in CVPR, pp. 43 56. Springer, 2015. R. Monasson and R. Zecchina. Weight space structure and internal representations: A direct approach to learning and generalization in multilayer neural networks. Physical review letters, 75(12):2432, 1995. R. Neal. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2:113 162, 2011. B. A. Pearlmutter. Fast exact multiplication by the hessian. Neural computation, 6(1):147 160, 1994. B. Polyak and A. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838 855, 1992. Published as a conference paper at ICLR 2017 G. Roberts and O. Stramer. Langevin diffusions and Metropolis-Hastings algorithms. Methodology and computing in applied probability, 4(4):337 357, 2002. L. Sagun, L. Bottou, and Y. Le Cun. Singularity of the Hessian in Deep Learning. ar Xiv:1611:07476, 2016. T. Salimans and D. Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. ar Xiv:1602.07868, 2016. A. Saxe, J. Mc Clelland, and S. Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In ICLR, 2014. D. Soudry and Y. Carmon. No bad local minima: Data independent training error guarantees for multilayer neural networks. ar Xiv:1605.08361, 2016. J. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller. Striving for simplicity: The all convolutional net. ar Xiv:1412.6806, 2014. N. Srivastava, G. 
Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. JMLR, 15(1):1929 1958, 2014. I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In ICML, 2013. T. Tieleman and G. Hinton. Lecture 6.5: Rms Prop, Coursera: Neural networks for machine learning. Technical report, 2012. L. Wasserman. All of statistics: A concise course in statistical inference. Springer, 2013. M. Welling and Y. W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In ICML, 2011. W. Zaremba, I. Sutskever, and O. Vinyals. Recurrent neural network regularization. ar Xiv:1409.2329, 2014. C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization. ar Xiv:1611.03530, 2016. S. Zhang, A. E. Choromanska, and Y. Le Cun. Deep learning with elastic averaging SGD. In NIPS, 2015. Published as a conference paper at ICLR 2017 A STOCHASTIC GRADIENT LANGEVIN DYNAMICS (SGLD) Local entropy in Def. (1) is an expectation over the entire configuration space x Rn and is hard to compute; we can however approximate its gradient using Markov chain Monte-Carlo (MCMC) techniques. In this section, we briefly review stochastic gradient Langevin dynamics (Welling & Teh, 2011) that is an MCMC algorithm designed to draw samples from a Bayesian posterior and scales to large datasets using mini-batch updates. For a parameter vector x Rn with a prior distribution p(x) and if the probability of generating a data item ξk given a model parameterized by x is p(ξk |x), the posterior distribution of the parameters based on N data items can be written as p(x|ξ k N) p(x) N k=1 p(ξk |x). (10) Langevin dynamics (Neal, 2011) injects Gaussian noise into maximum-a-posteriori (MAP) updates to prevent over-fitting the solution x of the above equation. The updates can be written as log p(xt)+ N k=1 p(ξk |xt) + η εt; (11) where εt N(0,ε2) is Gaussian noise and η is the learning rate. In this form, Langevin dynamics faces two major hurdles for applications to large datasets. First, computing the gradient N k=1 p(ξk |xt) over all samples for each update xt becomes prohibitive. However, as Welling & Teh (2011) show, one can instead simply use the average gradient over m data samples (mini-batch) as follows: log p(xt)+ N m k=1 p(ξk |xt) + ηt εt. (12) Secondly, Langevin dynamics in (11) is the discrete-time approximation of a continuous-time stochastic differential equation (Mandt et al., 2016) thereby necessitating a Metropolis-Hastings (MH) rejection step (Roberts & Stramer, 2002) which again requires computing p(ξk |x) over the entire dataset. However, if the learning rate ηt 0, we can also forgo the MH step (Chen et al., 2014). Welling & Teh (2011) also argue that the sequence of samples xt generated by updating (12) converges to the correct posterior (10) and one can hence compute the statistics of any function g(x) of the parameters using these samples. Concretely, the posterior expectation E[g(x)] is given by E[g(x)] t s=1 ηt g(xt) t s=1 ηt ; which is the average computed by weighing each sample by the corresponding learning rate in (12). In this paper, we will consider a uniform prior on the parameters x and hence the first term in (12), viz., log p(xt) vanishes. Let us note that there is a variety of increasingly sophisticated MCMC algorithms applicable to our problem, e.g., Stochastic Gradient Hamiltonian Monte Carlo (SGHMC) by Chen et al. 
(2014) based on volume preserving flows in the parameter-momentum space, stochastic annealing thermostats (Santa) by Chen et al. (2015) etc. We can also employ these techniques, although we use SGLD for ease of implementation; the authors in Ma et al. (2015) provide an elaborate overview. Proof of Lemma 2. The gradient F(x) is computed in Sec. 4.1 to be γ x x ; Ξℓ . Consider the term x x ; x = x Z 1 x,γ x x e f(x ) γ s (x+s) e f(x) f(x) s 1 2 s (γ+ 2 f(x))s ds = x 1 Z 1 x,γ s e f(x) f(x) s 1 2 s (γ+ 2 f(x))s ds Z 1 x,γ s s e f(x) f(x) s 1 2 s (γ+ 2 f(x))s ds = Z 1 x,γ e f(x) Z s s e f(x) s 1 2 s (γ+ 2 f(x))s ds. Published as a conference paper at ICLR 2017 The above expression is the mean of a distribution e f(x) s 1 2 s (γ+ 2 f(x))s. We can approximate it using the saddle point method as the value of s that minimizes the exponent to get x x ; x 2 f(x)+γ I 1 f(x). Let us denote A(x) := I +γ 1 2 f(x) 1. Plugging this into the condition for smoothness, we have F(x,γ) F(y,γ) = A(x) f(x) A(y) f(y) sup x A(x) β x y . Unfortunately, we can only get a uniform bound if we assume that for a small constant c > 0, no eigenvalue of 2 f(x) lies in the set [ 2γ c,c]. This gives sup x A(x) 1 1+γ 1 c. This shows that a smaller value of γ results in a smoother energy landscape, except at places with very flat directions. The Lipschitz constant also decreases by the same factor. C CONNECTION TO VARIATIONAL INFERENCE The fundamental motivations of (stochastic) variational inference (SVI) and local entropy are similar: they both aim to generalize well by constructing a distribution on the weight space. In this section, we explore whether they are related and how one might reconcile the theoretical and algorithmic implications of the local entropy objective with that of SVI. Let Ξ denote the entire dataset, z denote the weights of a deep neural network and x be the parameters of a variational distribution qx(z). The Evidence Lower Bound (ELBO) can be then be written as log p(Ξ) Ez qx(z) [log p(Ξ | z)] KL(qx(z) || p(z)); (13) where p(z) denotes a parameter-free prior on the weights and controls, through their KL-divergence, how well the posited posterior qx(z) fits the data. Stochastic variational inference involves maximizing the right hand side of the above equation with respect to x after choosing a suitable prior p(z) and a family of distributions qx(z). These choices are typically dictated by the ease of sampling z qx(z), e.g. a mean-field model where qx(z) factorizes over z, and being able to compute the KL-divergence term, e.g. a mixture of Gaussians. On the other hand, if we define the loss as the log-likelihood of data, viz. f(z) := log p(Ξ|z), we can write the logarithm of the local entropy in Eqn. (4) as log F(x,γ) = log Z z Rn exp h f(z; Ξ) γ 2 x z 2i dz, h log p(Ξ | z) γ 2 x z 2i dz; (14) by an application of Jensen s inequality. It is thus clear that Eqn. (13) and (14) are very different in general and one cannot choose a prior, or a variational family, that makes them equivalent and interpret local entropy as ELBO. Eschewing rigor, formally, if we modify Eqn. (13) to allow the prior p(z) to depend upon x, we can see that the two lower bounds above are equivalent iff qx(z) belongs to a flat variational family , i.e. uniform distributions with x as the mean and px(z) exp γ 2 x z 2 . We emphasize that the distribution px(z) depends on the parameters x themselves and is thus, not really a prior, or one that can be derived using the ELBO. 
This moving prior is absent in variational inference and indeed, a crucial feature of the local entropy objective. The gradient of local entropy in Eqn. (7) clarifies this point: F(x,γ) = γ (x z; Ξ ) = γ Ez r(z;x) [z]; Published as a conference paper at ICLR 2017 where the distribution r(z;x) is given by r(z; x) p(Ξ | z) exp γ it thus contains a data likelihood term along with a prior that moves along with the current iterate x. Let us remark that methods in the deep learning literature that average the gradient through perturbations in the neighborhood of x (Mobahi, 2016) or noisy activation functions (Gulcehre et al., 2016) can be interpreted as computing the data likelihood in ELBO (without the KL-term); such an averaging is thus different from local entropy. C.1 COMPARISON WITH SGLD We use stochastic gradient Langevin dynamics (cf. Appendix A) to estimate the gradient of local entropy in Alg. 1. It is natural then, to ask the question whether vanilla SGLD performs as well as local entropy. To this end, we compare the performance of SGLD on two prototypical networks: Le Net on MNIST and All-CNN-BN on CIFAR-10. We follow the experiments in Welling & Teh (2011) and Chen et al. (2015) and set the learning rate schedule to be η/(1 +t)b where the initial learning rate η and b are hyper-parameters. We make sure that other architectural aspects (dropout, batch-normalization) and regularization (weight decay) are consistent with the experiments in Sec. 5. After a hyper-parameter search, we obtained a test error on Le Net of 0.63 0.1% after 300 epochs and 9.89 0.11% on All-CNN-BN after 500 epochs. Even if one were to disregard the slow convergence of SGLD, its generalization error is much worse than our experimental results; we get 0.50 0.01% on Le Net and 7.81 0.09% on All-CNN-BN with Entropy-SGD. For comparison, the authors in Chen et al. (2015) report 0.71% error on MNIST on a slightly larger network. Our results with local entropy on RNNs are much better than those reported in Gan et al. (2016) for SGLD. On the PTB dataset, we obtain a test perplexity of 77.656 0.171 vs. 94.03 for the same model whereas we obtain a test perplexity of 1.213 0.007 vs. 1.3375 for char-LSTM on the War and Peace dataset. Training deep networks with SGLD, or other more sophisticated MCMC algorithms such as SGHMC, SGNHT etc. (Chen et al., 2014; Neal, 2011) to errors similar to those of SGD is difficult, and the lack of such results in the literature corroborates our experimental experience. Roughly speaking, local entropy is so effective because it operates on a transformation of the energy landscape that exploits entropic effects. Conventional MCMC techniques such as SGLD or Nose -Hoover thermostats (Ding et al., 2014) can only trade energy for entropy via the temperature parameter which does not allow the direct use of the geometric information of the energy landscape and does not help with narrow minima.
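For completeness, here is a minimal sketch of the mini-batch SGLD baseline used in the comparison of Appendix C.1: our illustrative reading of the update of Welling & Teh (2011) with a uniform prior (so the log-prior term vanishes, as in Appendix A) and the step-size schedule η/(1 + t)^b mentioned in the text; the loss gradient and all hyper-parameter values are placeholders.

```python
# Minimal sketch (not the paper's code) of the mini-batch SGLD baseline of Appendix C.1.
import numpy as np

def sgld(x0, grad_f, sample_batch, n_steps, N, m, eta=0.1, b=0.55, rng=None):
    rng = rng or np.random.default_rng(0)
    x = x0.copy()
    for t in range(n_steps):
        step = eta / (1.0 + t) ** b                     # decaying step-size schedule
        batch = sample_batch()
        # gradient of the negative log-likelihood, rescaled from a mini-batch of size m to N
        g = (N / m) * grad_f(x, batch)
        x = x - 0.5 * step * g + np.sqrt(step) * rng.standard_normal(x.shape)
    return x
```

The N/m rescaling makes the mini-batch gradient an unbiased estimate of the full-data gradient, which is what lets SGLD scale to large datasets with small batches, as discussed in Appendix A.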