# Practical Gauss-Newton Optimisation for Deep Learning

Aleksandar Botev¹, Hippolyt Ritter¹, David Barber¹ ²

¹University College London, London, United Kingdom. ²Alan Turing Institute, London, United Kingdom. Correspondence to: Aleksandar Botev.

*Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).*

## Abstract

We present an efficient block-diagonal approximation to the Gauss-Newton matrix for feedforward neural networks. Our resulting algorithm is competitive against state-of-the-art first-order optimisation methods, with sometimes significant improvement in optimisation performance. Unlike first-order methods, for which hyperparameter tuning of the optimisation parameters is often a laborious process, our approach can provide good performance even when used with default settings. A side result of our work is that for piecewise linear transfer functions, the network objective function can have no differentiable local maxima, which may partially explain why such transfer functions facilitate effective optimisation.

## 1. Introduction

First-order optimisation methods are the current workhorse for training neural networks. They are easy to implement with modern automatic differentiation frameworks, scale to large models and datasets and can handle noisy gradients such as encountered in the typical mini-batch setting (Polyak, 1964; Nesterov, 1983; Kingma & Ba, 2014; Duchi et al., 2011; Zeiler, 2012). However, a suitable initial learning rate and decay schedule need to be selected in order for them to converge both rapidly and towards a good local minimum. In practice, this usually means many separate runs of training with different settings of those hyperparameters, requiring access to either ample compute resources or plenty of time. Furthermore, pure stochastic gradient descent often struggles to escape valleys in the error surface with largely varying magnitudes of curvature, as the first derivative does not capture this information (Dauphin et al., 2014; Martens & Sutskever, 2011). Modern alternatives, such as ADAM (Kingma & Ba, 2014), combine the gradients at the current setting of the parameters with various heuristic estimates of the curvature from previous gradients.

Second-order methods, on the other hand, perform updates of the form $\delta = H^{-1} g$, where $H$ is the Hessian or some approximation thereof and $g$ is the gradient of the error function. Using curvature information enables such methods to make more progress per step than techniques relying solely on the gradient. Unfortunately, for modern neural networks, explicit calculation and storage of the Hessian matrix is infeasible. Nevertheless, it is possible to efficiently calculate Hessian-vector products $Hv$ by use of extended Automatic Differentiation (Schraudolph, 2002; Pearlmutter, 1994); the linear system $Hv = g$ can then be solved for $v$, e.g. by using conjugate gradients (Martens, 2010; Martens & Sutskever, 2011). Whilst this can be effective, the number of iterations required makes this process uncompetitive against simpler first-order methods (Sutskever et al., 2013).

In this work, we make the following contributions:

- We develop a recursive block-diagonal approximation of the Hessian, where each block corresponds to the weights in a single feedforward layer. These blocks are Kronecker factored and can be efficiently computed and inverted in a single backward pass.
- As a corollary of our recursive calculation of the Hessian, we note that for networks with piecewise linear transfer functions the error surface has no differentiable strict local maxima.
- We discuss the relation of our method to KFAC (Martens & Grosse, 2015), a block-diagonal approximation to the Fisher matrix. KFAC is less generally applicable since it requires the network to define a probabilistic model on its output. Furthermore, for non-exponential family models, the Gauss-Newton and Fisher approaches are in general different.
- On three standard benchmarks we demonstrate that (without tuning) second-order methods perform competitively, even against well-tuned state-of-the-art first-order methods.

## 2. Properties of the Hessian

As a basis for our approximations to the Gauss-Newton matrix, we first describe how the diagonal Hessian blocks of feedforward networks can be recursively calculated. Full derivations are given in the supplementary material.

### 2.1. Feedforward Neural Networks

A feedforward neural network takes an input vector $a_0 = x$ and produces an output vector $h_L$ on the final ($L$th) layer of the network:

$$h_\lambda = W_\lambda a_{\lambda-1}; \qquad a_\lambda = f_\lambda(h_\lambda); \qquad 1 \le \lambda \le L \quad (1)$$

where $h_\lambda$ is the pre-activation in layer $\lambda$ and $a_\lambda$ are the activation values; $W_\lambda$ is the matrix of weights and $f_\lambda$ the elementwise transfer function. (The usual bias $b_\lambda$ in the equation for $h_\lambda$ is absorbed into $W_\lambda$ by appending a unit term to every $a_{\lambda-1}$.) We define a loss $E(h_L, y)$ between the output $h_L$ and a desired training output $y$ (for example the squared loss $(h_L - y)^2$), which is a function of all parameters of the network, $\theta = \left[\operatorname{vec}(W_1)^T, \operatorname{vec}(W_2)^T, \dots, \operatorname{vec}(W_L)^T\right]^T$. For a training dataset with empirical distribution $p(x, y)$, the total error function is then defined as the expected loss $\bar{E}(\theta) = \mathbb{E}\left[E\right]_{p(x,y)}$. For simplicity we denote by $E(\theta)$ the loss for a generic single datapoint $(x, y)$.

### 2.2. The Hessian

A central quantity of interest in this work is the parameter Hessian, $H$, which has elements:

$$[H]_{ij} = \frac{\partial^2}{\partial \theta_i\, \partial \theta_j} E(\theta) \quad (2)$$

The expected parameter Hessian is similarly given by the expectation of this equation. To emphasise the distinction between the expected Hessian and the Hessian for a single datapoint $(x, y)$, we also refer to the single datapoint Hessian as the sample Hessian.

#### 2.2.1. Block Diagonal Hessian

The full Hessian, even of a moderately sized neural network, is computationally intractable due to the large number of parameters. Nevertheless, as we will show, blocks of the sample Hessian can be computed efficiently. Each block corresponds to the second derivative with respect to the parameters $W_\lambda$ of a single layer $\lambda$. We focus on these blocks since the Hessian is in practice typically block-diagonal dominant (Martens & Grosse, 2015).

The gradient of the error function with respect to the weights of layer $\lambda$ can be computed by recursively applying the chain rule (generally we use a Greek letter to indicate a layer and a Roman letter to denote an element within a layer, and use either sub- or super-scripts wherever most notationally convenient and compact):

$$\frac{\partial E}{\partial W^\lambda_{a,b}} = \sum_i \frac{\partial h^\lambda_i}{\partial W^\lambda_{a,b}} \frac{\partial E}{\partial h^\lambda_i} = a^{\lambda-1}_b \frac{\partial E}{\partial h^\lambda_a} \quad (3)$$

Differentiating again, we find that the sample Hessian for layer $\lambda$ is:

$$[H_\lambda]_{(a,b),(c,d)} \equiv \frac{\partial^2 E}{\partial W^\lambda_{a,b}\, \partial W^\lambda_{c,d}} \quad (4)$$

$$= a^{\lambda-1}_b\, a^{\lambda-1}_d\, [\mathcal{H}_\lambda]_{a,c} \quad (5)$$

where we define the pre-activation Hessian for layer $\lambda$ as:

$$[\mathcal{H}_\lambda]_{a,b} = \frac{\partial^2 E}{\partial h^\lambda_a\, \partial h^\lambda_b} \quad (6)$$

We can re-express (5) in matrix form for the sample Hessian of $W_\lambda$:

$$H_\lambda = \frac{\partial^2 E}{\partial \operatorname{vec}(W_\lambda)\, \partial \operatorname{vec}(W_\lambda)} = \left(a_{\lambda-1} a_{\lambda-1}^T\right) \otimes \mathcal{H}_\lambda \quad (7)$$

where $\otimes$ denotes the Kronecker product (using the notation $\{\cdot\}_{i,j}$ for the $i,j$ matrix block, the Kronecker product is defined as $\{A \otimes B\}_{i,j} = a_{ij} B$).

#### 2.2.2. Block Hessian Recursion

In order to calculate the sample Hessian, we need to evaluate the pre-activation Hessian first.
This can be computed recursively as (see Appendix A):

$$\mathcal{H}_\lambda = B_\lambda W_{\lambda+1}^T \mathcal{H}_{\lambda+1} W_{\lambda+1} B_\lambda + D_\lambda \quad (8)$$

where we define the diagonal matrices:

$$B_\lambda = \operatorname{diag}\left(f'_\lambda(h_\lambda)\right) \quad (9)$$

$$D_\lambda = \operatorname{diag}\left(f''_\lambda(h_\lambda)\, \frac{\partial E}{\partial a_\lambda}\right) \quad (10)$$

and $f'_\lambda$ and $f''_\lambda$ are the first and second derivatives of $f_\lambda$ respectively.

The recursion is initialised with $\mathcal{H}_L$, which depends on the objective function $E(\theta)$ and is easily calculated analytically for the usual objectives (for example, for the squared loss $(y - h_L)^2/2$, the pre-activation Hessian is simply the identity matrix, $\mathcal{H}_L = I$). Then we can simply apply the recursion (8) and compute the pre-activation Hessian for each layer using a single backward pass through the network. A similar observation is given in (Schaul et al., 2013), but restricted to the diagonal entries of the Hessian rather than the more general block-diagonal case. Given the pre-activation Hessian, the Hessian of the parameters for a layer is given by (7). For more than a single datapoint, the recursion is applied per datapoint and the parameter Hessian is given by the average of the individual sample Hessians. (A small numerical check of (7)-(10) is sketched below.)

*Figure 1. Two-layer network with ReLU and square loss. (a) The objective function $E$ as we vary $W_1(x, y) = xU + yV$ along two randomly chosen direction matrices $U$ and $V$, with $(x, y) \in \mathbb{R}^2$. (b) $E$ as a function of two randomly chosen directions within $W_2$. (c) $E$ for varying jointly $W_1 = xU$, $W_2 = yV$. The surfaces contain no smooth local maxima.*

#### 2.2.3. No Differentiable Local Maxima

In recent years piecewise linear transfer functions, such as the ReLU function $f(x) = \max(x, 0)$, have become popular (note that, for piecewise linear $f$, $E$ is not necessarily piecewise linear in $\theta$). Since the second derivative $f''$ of a piecewise linear function is zero everywhere, the matrices $D_\lambda$ in (8) will be zero (away from non-differentiable points). It follows that if $\mathcal{H}_L$ is Positive Semi-Definite (PSD), which is the case for the most commonly used loss functions, the pre-activation matrices are PSD for every layer. A corollary is that if we fix all of the parameters of the network except for $W_\lambda$, the objective function is locally convex with respect to $W_\lambda$ wherever it is twice differentiable. Hence, there can be no local maxima or saddle points of the objective with respect to the parameters within a layer (this excludes any pathological regions where the objective function has zero curvature). Note that this does not imply that the objective is convex everywhere with respect to $W_\lambda$, as the surface will contain ridges along which it is not differentiable, corresponding to boundary points where the transfer function changes regimes, see Figure 1(c).

As the trace of the full Hessian $H$ is the sum of the traces of the diagonal blocks, it must be non-negative and thus it is not possible for all eigenvalues to be simultaneously negative. This implies that for feedforward neural networks with piecewise linear transfer functions there can be no differentiable local maxima - that is, outside of pathological constant regions, all maxima (with respect to the full parameter set $\theta$) must lie at the boundary points of the nonlinear activations and be "sharp", see Figure 1.
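
The following self-contained NumPy sketch (our own illustration, not the authors' code) implements one step of the recursion (8)-(10) for a small two-layer tanh network with squared loss, forms the layer weight block via (7), and checks it against a finite-difference Hessian. The layer sizes, the column-major vec convention and all variable names are illustrative assumptions.

```python
# Minimal numerical check of (7)-(10) for x -> W1 -> tanh -> W2 with squared loss 0.5*||h2 - y||^2,
# so the recursion starts from H_L = I. Biases are omitted; sizes and names are illustrative.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid, d_out = 4, 3, 2
W1 = rng.normal(size=(d_hid, d_in))
W2 = rng.normal(size=(d_out, d_hid))
x = rng.normal(size=d_in)
y = rng.normal(size=d_out)

# Forward pass
a0 = x
h1 = W1 @ a0
a1 = np.tanh(h1)
h2 = W2 @ a1                                          # linear output layer, h_L

# Recursion (8)-(10): one backward step from the output pre-activation Hessian H_2 = I
H2 = np.eye(d_out)
f1p = 1.0 - np.tanh(h1) ** 2                          # f'(h1) for tanh
f1pp = -2.0 * np.tanh(h1) * (1.0 - np.tanh(h1) ** 2)  # f''(h1) for tanh
dE_da1 = W2.T @ (h2 - y)                              # dE/da_1
B1 = np.diag(f1p)
D1 = np.diag(f1pp * dE_da1)
H1 = B1 @ W2.T @ H2 @ W2 @ B1 + D1                    # pre-activation Hessian of layer 1, eq. (8)

# Equation (7): parameter Hessian block for W1 (column-major vec(W1) matches this Kronecker order)
H_W1 = np.kron(np.outer(a0, a0), H1)

# Finite-difference Hessian of the loss w.r.t. vec(W1), column-major to match the ordering above
def loss_of_w1(w1_vec):
    W1p = w1_vec.reshape((d_hid, d_in), order='F')
    return 0.5 * np.sum((W2 @ np.tanh(W1p @ x) - y) ** 2)

w1_vec = W1.flatten(order='F')
eps = 1e-4
P = w1_vec.size
H_fd = np.zeros((P, P))
for i in range(P):
    for j in range(P):
        ei = np.zeros(P); ei[i] = eps
        ej = np.zeros(P); ej[j] = eps
        H_fd[i, j] = (loss_of_w1(w1_vec + ei + ej) - loss_of_w1(w1_vec + ei - ej)
                      - loss_of_w1(w1_vec - ei + ej) + loss_of_w1(w1_vec - ei - ej)) / (4 * eps ** 2)

print(np.max(np.abs(H_W1 - H_fd)))  # should agree up to finite-difference error (roughly 1e-6)
```

Replacing tanh with a ReLU makes `D1` vanish away from the kinks, in which case `H1` is PSD, consistent with the discussion in this subsection.
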
Additionally, for transfer functions with regions of zero gradient, $f' = 0$, $\mathcal{H}_\lambda$ will have lower rank than $\mathcal{H}_{\lambda+1}$, reducing the curvature information propagating from the output layer back up the network. This suggests that it is advantageous to use piecewise linear transfer functions with non-zero gradients, such as $\max(0.1x, x)$. We state and prove these results more formally in Appendix E.

## 3. Approximate Gauss-Newton Method

Besides being intractable for large neural networks, the Hessian is not guaranteed to be PSD. A Newton update $H^{-1}g$ could therefore lead to an increase in the error. A common PSD approximation to the Hessian is the Gauss-Newton (GN) matrix. For an error $E(h_L(\theta))$, the sample Hessian is given by:

$$\frac{\partial^2 E}{\partial \theta_i\, \partial \theta_j} = \sum_k \frac{\partial E}{\partial h^L_k} \frac{\partial^2 h^L_k}{\partial \theta_i\, \partial \theta_j} + \sum_{k,l} \frac{\partial h^L_k}{\partial \theta_i} \frac{\partial^2 E}{\partial h^L_k\, \partial h^L_l} \frac{\partial h^L_l}{\partial \theta_j} \quad (11)$$

Assuming that $\mathcal{H}_L$ is PSD, the GN method forms a PSD approximation by neglecting the first term in (11). This can be written in matrix notation as:

$$G \equiv \left(J^{h_L}_\theta\right)^T \mathcal{H}_L\, J^{h_L}_\theta \quad (12)$$

where $J^{h_L}_\theta$ is the Jacobian of the network outputs with respect to the parameters. The expected GN matrix is the average of (12) over the datapoints:

$$\bar{G} \equiv \mathbb{E}\left[\left(J^{h_L}_\theta\right)^T \mathcal{H}_L\, J^{h_L}_\theta\right]_{p(x,y)} \quad (13)$$

Whilst (13) shows how to calculate the GN matrix exactly, in practice we cannot feasibly store the matrix in this raw form. To proceed, similar to the Hessian, we will make a block-diagonal approximation. As we will show, as for the Hessian itself, even a block-diagonal approximation is computationally infeasible, and additional approximations are required. Before embarking on this sequence of approximations, we first show that the GN matrix can be expressed as the expectation of a Khatri-Rao product, i.e. blocks of Kronecker products, corresponding to the weights of each layer. We will subsequently approximate the expectation of the Kronecker products as the product of the expectations of the factors, making the blocks efficiently invertible.

### 3.1. The GN Matrix as a Khatri-Rao Product

Using the definition of $\bar{G}$ in (13) and the chain rule, we can write the block of the matrix corresponding to the parameters in layers $\lambda$ and $\beta$ as:

$$\bar{G}_{\lambda,\beta} = \mathbb{E}\left[\left(J^{h_\lambda}_{W_\lambda}\right)^T \left(J^{h_L}_{h_\lambda}\right)^T \mathcal{H}_L\, J^{h_L}_{h_\beta}\, J^{h_\beta}_{W_\beta}\right] \quad (14)$$

where $\left[J^{h_L}_{h_\lambda}\right]_{i,k} \equiv \partial h^L_i / \partial h^\lambda_k$. Defining $\mathcal{G}_{\lambda,\beta}$ as the pre-activation GN matrix between the $\lambda$ and $\beta$ layers' pre-activation vectors:

$$\mathcal{G}_{\lambda,\beta} = \left(J^{h_L}_{h_\lambda}\right)^T \mathcal{H}_L\, J^{h_L}_{h_\beta} \quad (15)$$

and using the fact that $J^{h_\lambda}_{W_\lambda} = a_{\lambda-1}^T \otimes I$, we obtain

$$\bar{G}_{\lambda,\beta} = \mathbb{E}\left[\left(a_{\lambda-1} a_{\beta-1}^T\right) \otimes \mathcal{G}_{\lambda,\beta}\right] \quad (16)$$

We can therefore write the GN matrix as the expectation of the Khatri-Rao product:

$$\bar{G} = \mathbb{E}\left[Q \ast \mathcal{G}\right] \quad (17)$$

where the blocks of $\mathcal{G}$ are the pre-activation GN matrices $\mathcal{G}_{\lambda,\beta}$ as defined in (15), and the blocks of $Q$ are:

$$Q_{\lambda,\beta} \equiv a_{\lambda-1} a_{\beta-1}^T \quad (18)$$

### 3.2. Approximating the GN Diagonal Blocks

For simplicity, from here on we denote by $G_\lambda$ the diagonal blocks of the sample GN matrix with respect to the weights of layer $\lambda$ (dropping the duplicate index). Similarly, we drop the index for the diagonal blocks $Q_\lambda$ and $\mathcal{G}_\lambda$ of the corresponding matrices in (17), giving more compactly:

$$G_\lambda = Q_\lambda \otimes \mathcal{G}_\lambda \quad (19)$$

The diagonal blocks of the expected GN, $\bar{G}_\lambda$, are then given by $\mathbb{E}[G_\lambda]$. Computing this requires evaluating a block-diagonal matrix for each datapoint and accumulating the result. However, since the expectation of a Kronecker product is not necessarily Kronecker factored, one would need to explicitly store the whole matrix $\bar{G}_\lambda$ to perform this accumulation. With $D$ being the dimensionality of a layer, this matrix would have $O(D^4)$ elements. For $D$ of the order of 1000, it would require several terabytes of memory to store $\bar{G}_\lambda$.
As this is prohibitively large, we seek an approximation for the diagonal blocks that is both efficient to compute and store. The approach we take is the factorised approximation:

$$\mathbb{E}\left[G_\lambda\right] \approx \mathbb{E}\left[Q_\lambda\right] \otimes \mathbb{E}\left[\mathcal{G}_\lambda\right] \quad (20)$$

Under this factorisation, the updates for each layer can be computed efficiently by solving a Kronecker product form linear system; see the supplementary material. The first factor $\mathbb{E}[Q_\lambda]$ is simply the uncentered covariance of the activations:

$$\mathbb{E}\left[Q_\lambda\right] = \frac{1}{N} A_{\lambda-1} A_{\lambda-1}^T \quad (21)$$

where the $n$th column of the $D \times N$ matrix $A_{\lambda-1}$ is the set of activations of layer $\lambda - 1$ for datapoint $n$. The second factor, $\mathbb{E}[\mathcal{G}_\lambda]$, can be computed efficiently, as described below.

### 3.3. The Pre-Activation Recursion

Analogously to the block-diagonal pre-activation Hessian recursion (8), a similar recursion can be derived for the pre-activation GN matrix diagonal blocks:

$$\mathcal{G}_\lambda = B_\lambda W_{\lambda+1}^T \mathcal{G}_{\lambda+1} W_{\lambda+1} B_\lambda \quad (22)$$

where the recursion is initialised with the Hessian of the output, $\mathcal{H}_L$.

This highlights the close relationship between the pre-activation Hessian recursion and the pre-activation GN recursion. Inspecting (8) and (22), we notice that the only difference in the recursion stems from the terms containing the diagonal matrices $D_\lambda$. From (7) and (16) it follows that in the case of piecewise linear transfer functions, the diagonal blocks of the Hessian are equal to the diagonal blocks of the GN matrix (this holds only at points where the derivative exists).

Whilst this shows how to calculate the sample pre-activation GN blocks efficiently, from (20) we require the calculation of the expected blocks $\mathbb{E}[\mathcal{G}_\lambda]$. In principle, the recursion could be applied for every datapoint. However, this is impractical in terms of the computation time, and a vectorised implementation would impose infeasible memory requirements. Below, we show that when the number of outputs is small, it is in fact possible to efficiently compute the exact expected pre-activation GN matrix diagonals. For the case of a large number of outputs, we describe a further approximation to $\mathbb{E}[\mathcal{G}_\lambda]$ in Section 3.5.

### 3.4. Exact Low Rank Calculation of $\mathbb{E}[\mathcal{G}_\lambda]$

Many problems in classification and regression deal with a relatively small number of outputs. This implies that the rank $K$ of the output layer GN matrix $\mathcal{G}_L$ is low. We use the square root representation:

$$\mathcal{G}_\lambda = \sum_{k=1}^{K} C^k_\lambda \left(C^k_\lambda\right)^T \quad (23)$$

From (22) we then obtain the recursion:

$$C^k_\lambda = B_\lambda W_{\lambda+1}^T C^k_{\lambda+1} \quad (24)$$

This allows us to calculate the expectation as:

$$\mathbb{E}\left[\mathcal{G}_\lambda\right] = \mathbb{E}\left[\sum_k C^k_\lambda \left(C^k_\lambda\right)^T\right] = \frac{1}{N} \sum_k \tilde{C}^k_\lambda \left(\tilde{C}^k_\lambda\right)^T \quad (25)$$

where we stack the column vectors $C^k_\lambda$ for each datapoint into a matrix $\tilde{C}^k_\lambda$, analogous to (21). Since we need to store only the vectors $C^k_\lambda$ per datapoint, this reduces the memory requirement to $K \times D \times N$; for small $K$ this is a computationally viable option. We call this method Kronecker Factored Low Rank (KFLR).

### 3.5. Recursive Approximation of $\mathbb{E}[\mathcal{G}_\lambda]$

For higher dimensional outputs, e.g. in autoencoders, rather than backpropagating a sample pre-activation GN matrix for every datapoint, we propose to simply pass the expected matrix through the network. This yields the nested expectation approximation of (22):

$$\mathbb{E}\left[\mathcal{G}_\lambda\right] \approx \mathbb{E}\left[B_\lambda W_{\lambda+1}^T\, \mathbb{E}\left[\mathcal{G}_{\lambda+1}\right] W_{\lambda+1} B_\lambda\right] \quad (26)$$

The recursion is initialised with the exact value $\mathbb{E}[\mathcal{G}_L]$. The method will be referred to as Kronecker Factored Recursive Approximation (KFRA).
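
To make the factorised update concrete, the sketch below (our own illustration under stated assumptions, not the authors' implementation) propagates an averaged pre-activation GN matrix backwards in the style of (26), builds the two Kronecker factors of (20) from a mini-batch, and solves the damped Kronecker linear system for a single layer by eigendecomposition of the factors. The damping constant `eta`, the layer sizes and all variable names are illustrative.

```python
# KFRA-style update for one layer of a small ReLU network with squared loss (sketch only).
import numpy as np

rng = np.random.default_rng(1)
N, d0, d1, d2 = 64, 8, 6, 4                     # batch size and layer widths
W1 = 0.1 * rng.normal(size=(d1, d0))
W2 = 0.1 * rng.normal(size=(d2, d1))
X = rng.normal(size=(d0, N))                    # inputs stored as columns
Y = rng.normal(size=(d2, N))

# Forward pass with a ReLU hidden layer; loss E = (1/N) * sum_n 0.5*||z2_n - y_n||^2
A0 = X
Z1 = W1 @ A0
A1 = np.maximum(Z1, 0.0)
Z2 = W2 @ A1

# Standard backprop for the gradient of the batch-averaged loss w.r.t. W1
dE_dZ2 = (Z2 - Y) / N
dE_dA1 = W2.T @ dE_dZ2
dE_dZ1 = dE_dA1 * (Z1 > 0)
grad_W1 = dE_dZ1 @ A0.T

# Backward GN pass: E[G_2] = I for squared loss; one step of the nested-expectation recursion (26)
G2_bar = np.eye(d2)
B1 = (Z1 > 0).astype(float)                     # f'(z1) per datapoint, shape (d1, N)
M = W2.T @ G2_bar @ W2                          # shared inner matrix W_2^T E[G_2] W_2
G1_bar = (B1[:, None, :] * M[:, :, None] * B1[None, :, :]).mean(axis=2)  # average of B_n M B_n

# Kronecker factors of (20) for layer 1 and the damped update
Q1_bar = A0 @ A0.T / N                          # uncentered activation covariance, eq. (21)
eta = 1e-2                                      # damping constant (illustrative value)
q, UQ = np.linalg.eigh(Q1_bar)
g, UG = np.linalg.eigh(G1_bar)
scaled = UG.T @ grad_W1 @ UQ
# Solves (Q1_bar kron G1_bar + eta*I) vec(delta) = vec(grad_W1), with column-stacking vec
delta_W1 = UG @ (scaled / (np.outer(g, q) + eta)) @ UQ.T
W1 = W1 - delta_W1
```

This is only a per-layer sketch: the full method maintains a pair of factors for every layer, offers the KFLR alternative for low-rank outputs, and adapts the damping as described in Appendix B; eigendecomposition of the two factors is one standard way to invert a Kronecker-factored curvature block.
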
## 4. Related Work

Despite the prevalence of first-order methods for neural network optimisation, there has been considerable recent interest in developing practical second-order methods, which we briefly outline below.

Martens (2010) and Martens & Sutskever (2011) exploited the fact that full Gauss-Newton matrix-vector products can be computed efficiently using a form of automatic differentiation. This was used to approximately solve the linear system $\bar{G}\delta = -\nabla\bar{E}$ using conjugate gradients to find the parameter update $\delta$. Despite making good progress on a per-iteration basis, having to run a conjugate gradient optimisation at every iteration proved too slow to compete with well-tuned first-order methods.

The closest related work to that presented here is the KFAC method (Martens & Grosse, 2015), in which the Fisher matrix is used as the curvature matrix. This is based on the output $y$ of the network defining a conditional distribution $p_\theta(y|x)$ on the observation $y$, with a loss defined as the KL-divergence between the empirical distribution $p(y|x)$ and the network output distribution. The network weights are chosen to minimise the KL-divergence between the conditional output distribution and the data distribution. For example, defining the network output as the mean of a fixed-variance Gaussian or a Bernoulli/Categorical distribution yields the common squared error and cross-entropy objectives respectively.

Analogously to our work, Martens & Grosse (2015) develop a block-diagonal approximation to the Fisher matrix. The Fisher matrix is another PSD approximation to the Hessian that is used in natural gradient descent (Amari, 1998). In general, the Fisher and GN matrices are different. However, for the case of $p_\theta(y|x)$ defining an exponential family distribution, the Fisher and GN matrices are equivalent, see Appendix C.3. As in our work, Martens & Grosse (2015) use a factorised approximation of the form (20). However, they subsequently approximate the expected Fisher blocks by drawing Monte Carlo samples of the gradients from the conditional distribution defined by the neural network. As a result, KFAC is always an approximation to the GN pre-activation matrix, whereas our method can provide an exact calculation of $\mathbb{E}[\mathcal{G}]$ in the low rank setting. See also Appendix C.4 for differences between our KFRA approximation and KFAC.

More generally, our method does not require any probabilistic model interpretation and is therefore more widely applicable than KFAC.

## 5. Experiments

We performed experiments training deep autoencoders on three standard grey-scale image datasets and classifying hand-written digits as odd or even. (Experiments were run on a workstation with a Titan Xp GPU and an Intel Xeon CPU E5-2620 v4 @ 2.10GHz.) The datasets are:

MNIST consists of 60,000 28×28 images of hand-written digits. We used only the first 50,000 images for training (since the remaining 10,000 are usually used for validation).

CURVES contains 20,000 training images of size 28×28 pixels of simulated hand-drawn curves, created by choosing three random points in the 28×28 pixel plane (see the supplementary material of (Hinton & Salakhutdinov, 2006) for details).
FACES is an augmented version of the Olivetti faces dataset (Samaria & Harter, 1994) with 10 different images of 40 people. We follow (Hinton & Salakhutdinov, 2006) in creating a training set of 103,500 images by choosing 414 random pairs of rotation angles (-90 to 90 degrees) and scaling factors (1.4 to 1.8) for each of the 250 images of the first 25 people, and then subsampling to 25×25 pixels.

We tested the performance of second-order against first-order methods and compared the quality of the different GN approximations. In all experiments we report only the training error, as we are interested in the performance of the optimiser rather than how the models generalise. When using second-order methods, it is important in practice to adjust the unmodified update $\delta$ in order to dampen potentially over-confident updates. One of our central interests is to compare our approach against KFAC. We therefore followed (Martens & Grosse, 2015) as closely as possible, introducing damping in an analogous way. Details on the implementation are in Appendix B. We emphasise that throughout all experiments we used the default damping parameter settings, with no tweaking required to obtain acceptable performance. (Our damping parameters could be compared to the exponential decay parameters $\beta_1$ and $\beta_2$ in ADAM, which are typically left at their recommended default values.)

Additionally, as a form of momentum for the second-order methods, we compared the use of a moving average with a factor of 0.9 on the curvature matrices $\mathcal{G}_\lambda$ and $Q_\lambda$ to only estimating them from the current mini-batch. We did not find any benefit in using momentum on the updates themselves; on the contrary, this made the optimisation unstable and required clipping the updates. We therefore do not include momentum on the updates in our results.

All of the autoencoder architectures are inspired by (Hinton & Salakhutdinov, 2006). The layer sizes are D-1000-500-250-30-250-500-1000-D, where D is the dimensionality of the input. The grey-scale values are interpreted as the mean parameter of a Bernoulli distribution, and the loss is the binary cross-entropy on CURVES and MNIST, and the square error on FACES.

### 5.1. Comparison to First-Order Methods

We investigated the performance of both KFRA and KFAC compared to popular first-order methods. Four of the most prevalent gradient-based optimisers were considered: Stochastic Gradient Descent, Nesterov Accelerated Gradient, Momentum and ADAM (Kingma & Ba, 2014). A common practice when using first-order methods is to decrease the learning rate throughout the training procedure. For this reason we included an extra parameter T, the decay period, for each of the methods, halving the learning rate every T iterations. To find the best first-order method, we ran a grid search over these two hyperparameters. (We varied the learning rate from $2^{-6}$ to $2^{-13}$ at every power of 2 and chose the decay period as one of 100%, 50%, 25%, 12.5% or 6.25% of the number of updates.)

Each first-order method was run for 40,000 parameter updates for MNIST and CURVES and 160,000 updates for FACES. This resulted in a total of 35 experiments and 1.4/5.6 million updates for each dataset per method. In contrast, the second-order methods did not require adjustment of any hyperparameters and were run for only 5,000/20,000 updates, as they converged much faster. (For fair comparison, all of the methods were implemented using Theano (Theano Development Team, 2016) and Lasagne (Dieleman et al., 2015).) For the first-order methods we found ADAM to outperform the others across the board, and we consequently compared the second-order methods against ADAM only.

Figure 2 shows the performance of the different optimisers on all three datasets. We present progress both per parameter update, to demonstrate that the second-order optimisers effectively use the available curvature information, and per GPU wall clock time, as this is relevant when training a network in practice. For ADAM, we display the performance using the default learning rate $10^{-3}$ as well as the top performing combination of learning rate and decay period.
To illustrate the sensitivity of ADAM to these hyperparameter settings (and how much can therefore be gained by parameter tuning) we also plot the average performance resulting from using the top 10 and top 20 settings.

Even after significantly tuning the ADAM learning rate and decay period, the second-order optimisers outperformed ADAM out-of-the-box across all three datasets. In particular on the challenging FACES dataset, the optimisation was not only much faster when using second-order methods, but also more stable. On this dataset, ADAM appears to be highly sensitive to the learning rate and in fact diverged when run with the default learning rate of $10^{-3}$. In contrast to ADAM, the second-order optimisers did not get trapped in plateaus in which the error does not change significantly.

In comparison to KFAC, KFRA showed a noticeable speedup in the optimisation both per iteration and when measuring the wall clock time. Their computational cost for each update is equivalent in practice, which we discuss in detail in Appendix C.4. Thus, to validate that the advantage of KFRA over KFAC stems from the quality of its updates, we compared the alignment of the updates of each method with the exact Gauss-Newton update (using the slower Hessian-free approach; see Appendix F.2 for the figures). We found that KFRA tends to be more closely aligned with the exact Gauss-Newton update, which provides a possible explanation for its better performance.

*Figure 2. Comparison of the objective function being optimised by KFRA, KFAC and ADAM on CURVES, FACES and MNIST. GPU benchmarks are in the first row, progress per update in the second. The dashed line indicates the use of momentum on the curvature matrix for the second-order methods. Errors are averaged using a sliding window of ten.*

### 5.2. Non-Exponential Family Model

To compare our approximate Gauss-Newton method and KFAC in a setting where the Fisher and Gauss-Newton matrix are not equivalent, we designed an experiment in which the model distribution over $y$ is not in the exponential family. The model is a mixture of two binary classifiers (in this context $\sigma(x) = (1 + \exp(-x))^{-1}$):

$$p(y|h_L) = \sigma(h^L_1)\,\sigma(h^L_2)^y\,\sigma(-h^L_2)^{1-y} + \left(1 - \sigma(h^L_1)\right)\sigma(h^L_3)^y\,\sigma(-h^L_3)^{1-y} \quad (27)$$

We used the same architecture as for the encoding layers of the autoencoders, D-1000-500-250-30-1, where D = 784 is the size of the input. The task of the experiment was to classify MNIST digits as even or odd. Our choice was motivated by recent interest in neural network mixture models (Eigen et al., 2013; Zen & Senior, 2014; van den Oord & Schrauwen, 2014; Shazeer et al., 2017); our mixture model is also appropriate for testing the performance of KFLR. A short sketch of the corresponding negative log-likelihood is given below.
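
For concreteness, the following minimal NumPy sketch (our own illustration, not the authors' code) evaluates the mixture likelihood of (27) and its per-batch negative log-likelihood, which is the loss minimised in this experiment. The function and variable names are assumptions, and the network producing the three outputs $h^L_1, h^L_2, h^L_3$ is not shown.

```python
# Negative log-likelihood of the two-component binary mixture in (27), sketch only.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mixture_nll(h_L, y):
    """h_L: array of shape (N, 3) holding (h^L_1, h^L_2, h^L_3); y: array of shape (N,) with 0/1 labels."""
    pi = sigmoid(h_L[:, 0])                    # mixing weight sigma(h^L_1)
    p1 = sigmoid(h_L[:, 1])                    # Bernoulli parameter of the first component
    p2 = sigmoid(h_L[:, 2])                    # Bernoulli parameter of the second component
    # p(y|h_L) = pi * p1^y (1-p1)^(1-y) + (1-pi) * p2^y (1-p2)^(1-y), using sigma(-x) = 1 - sigma(x)
    lik = pi * np.where(y == 1, p1, 1.0 - p1) + (1.0 - pi) * np.where(y == 1, p2, 1.0 - p2)
    return -np.mean(np.log(lik))

# Tiny usage example with random outputs and labels
rng = np.random.default_rng(0)
h_L = rng.normal(size=(5, 3))
y = rng.integers(0, 2, size=5)
print(mixture_nll(h_L, y))
```

In practice one would evaluate this in log-space (e.g. via a log-sum-exp over the two components) for numerical stability; the direct form above is kept for readability.
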
Training was run for 40,000 updates for ADAM with a grid search as in Section 5.1, and for 5,000 updates for the second-order methods. The results are shown in Figure 3. On the CPU, both per iteration and per wall clock time, the second-order methods were faster than ADAM; on the GPU, however, ADAM was faster per wall clock time. The value of the objective function at the final parameter values was higher for second-order methods than for ADAM. However, it is important to keep in mind that all methods achieved a nearly perfect cross-entropy loss of around $10^{-8}$. When so close to the minimum we expect the gradients and curvature to be very small and potentially dominated by noise introduced from the mini-batch sampling. Additionally, since the second-order methods invert the curvature, they are more prone to accumulating numerical errors than first-order methods, which may explain this behaviour close to a minimum.

Interestingly, KFAC performed almost identically to KFLR, despite the fact that KFLR computes the exact pre-activation Gauss-Newton matrix. This suggests that in the low-dimensional output setting, the benefits from using the exact low-rank calculation are diminished by the noise and the rather coarse factorised Kronecker approximation.

*Figure 3. Comparative optimisation performance on an MNIST binary mixture-classification model. We used momentum on the curvature matrix for all methods, as it stabilises convergence.*

## 6. Rank of the Empirical Curvature

The empirical success of second-order methods raises questions about the curvature of the error function of a neural network. As we show in Appendix D, the rank of the Monte Carlo Gauss-Newton matrix is upper bounded by the rank of the last layer Hessian times the size of the mini-batch. More generally, the rank is upper bounded by the rank of $\mathcal{H}_L$ times the size of the dataset. As modern neural networks commonly have millions of parameters, the exact Gauss-Newton matrix is usually severely under-determined. This implies that the curvature will be zero in many directions. This phenomenon is particularly pronounced for the binary classifier in Section 5.2, where the rank of the output layer Hessian is one.

We can draw a parallel between the curvature being zero and standard techniques where the maximum likelihood problem is under-determined for small datasets. This explains why damping is so important in such situations, and its role goes beyond simply improving the numerical stability of the algorithm. Our results suggest that, whilst in practice the Gauss-Newton matrix provides curvature only in a limited parameter subspace, this still provides enough information to allow for relatively large parameter updates compared to gradient descent, see Figure 2.

## 7. Conclusion

We presented a derivation of the block-diagonal structure of the Hessian matrix arising in feedforward neural networks. This leads directly to the interesting conclusion that for networks with piecewise linear transfer functions and convex loss the objective has no differentiable local maxima. Furthermore, with respect to the parameters of a single layer, the objective has no differentiable saddle points. This may provide some partial insight into the success of such transfer functions in practice.

Since the Hessian is not guaranteed to be positive semi-definite, two common alternative curvature measures are the Fisher matrix and the Gauss-Newton matrix. Unfortunately, both are computationally infeasible and, similar to Martens & Grosse (2015), we therefore used a block-diagonal approximation, followed by a factorised Kronecker approximation. Despite parallels with the Fisher approach, formally the two methods are different. Only in the special case of exponential family models are the Fisher and Gauss-Newton matrices equivalent; however, even for this case, the subsequent approximations used in the Fisher approach (Martens & Grosse, 2015) differ from ours.
Indeed, we showed that for problems in which the network has a small number of outputs no additional approximations are required. Even on models where the Fisher and Gauss-Newton matrices are equivalent, our experimental results suggest that our KFRA approximation performs marginally better than KFAC. As we demonstrated, this is possibly due to the updates of KFRA being more closely aligned with the exact Gauss-Newton updates than those of KFAC.

Over the past decade first-order methods have been predominant for Deep Learning. Second-order methods, such as Gauss-Newton, have largely been dismissed because of their seemingly prohibitive computational cost and potential instability introduced by using mini-batches. Our results on comparing both the Fisher and Gauss-Newton approximate methods, in line with (Martens & Grosse, 2015), confirm that second-order methods can perform admirably against even well-tuned state-of-the-art first-order approaches, while not requiring any hyperparameter tuning.

In terms of wall clock time on a CPU, in our experiments, the second-order approaches converged to the minimum significantly more quickly than state-of-the-art first-order methods. When training on a GPU (as is common in practice), we also found that second-order methods can perform well, although the improvement over first-order methods was more marginal. However, since second-order methods are much faster per update, there is the potential to further improve their practical utility by speeding up the most expensive computations, specifically solving linear systems on parallel compute devices.

## Acknowledgements

We thank the reviewers for their valuable feedback and suggestions. We also thank Raza Habib, Harshil Shah and James Townsend for their feedback on earlier drafts of this paper. Finally, we are grateful to James Martens for helpful discussions on the implementation of KFAC.

## References

- Amari, S.-I. Natural Gradient Works Efficiently in Learning. Neural Computation, 10(2):251–276, 1998.
- Dauphin, Y. N., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., and Bengio, Y. Identifying and Attacking the Saddle Point Problem in High-Dimensional Non-Convex Optimization. In Advances in Neural Information Processing Systems, pp. 2933–2941, 2014.
- Dieleman, S., Schlüter, J., Raffel, C., Olson, E., Sønderby, S. K., Nouri, D., et al. Lasagne: First Release, August 2015.
- Duchi, J., Hazan, E., and Singer, Y. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. The Journal of Machine Learning Research, 12:2121–2159, 2011.
- Eigen, D., Ranzato, M., and Sutskever, I. Learning Factored Representations in a Deep Mixture of Experts. arXiv preprint arXiv:1312.4314, 2013.
- Gower, R. M. and Gower, A. L. Higher-Order Reverse Automatic Differentiation with Emphasis on the Third Order. Mathematical Programming, 155(1-2):81–103, 2016.
- Hinton, G. E. and Salakhutdinov, R. R. Reducing the Dimensionality of Data with Neural Networks. Science, 313(5786):504–507, 2006.
- Kingma, D. and Ba, J. Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980, 2014.
- Martens, J. Deep Learning via Hessian-Free Optimization. In Proceedings of the 27th International Conference on Machine Learning, pp. 735–742, 2010.
- Martens, J. New Insights and Perspectives on the Natural Gradient Method. arXiv preprint arXiv:1412.1193, 2014.
- Martens, J. and Grosse, R. B. Optimizing Neural Networks with Kronecker-factored Approximate Curvature. In Proceedings of the 32nd International Conference on Machine Learning, pp. 2408–2417, 2015.
- Martens, J. and Sutskever, I. Learning Recurrent Neural Networks with Hessian-Free Optimization. In Proceedings of the 28th International Conference on Machine Learning, pp. 1033–1040, 2011.
- Nesterov, Y. A Method of Solving a Convex Programming Problem with Convergence Rate O(1/k²). Soviet Mathematics Doklady, 27(2):372–376, 1983.
- Pearlmutter, B. A. Fast Exact Multiplication by the Hessian. Neural Computation, 6(1):147–160, 1994.
- Polyak, B. T. Some Methods of Speeding up the Convergence of Iteration Methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.
- Samaria, F. S. and Harter, A. C. Parameterisation of a Stochastic Model for Human Face Identification. In Proceedings of the Second IEEE Workshop on Applications of Computer Vision, pp. 138–142. IEEE, 1994.
- Schaul, T., Zhang, S., and LeCun, Y. No More Pesky Learning Rates. In Proceedings of the 30th International Conference on Machine Learning, pp. 343–351, 2013.
- Schraudolph, N. N. Fast Curvature Matrix-Vector Products for Second-Order Gradient Descent. Neural Computation, 14(7):1723–1738, 2002.
- Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., and Dean, J. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. arXiv preprint arXiv:1701.06538, 2017.
- Sutskever, I., Martens, J., Dahl, G., and Hinton, G. On the Importance of Initialization and Momentum in Deep Learning. In Proceedings of the 30th International Conference on Machine Learning, pp. 1139–1147, 2013.
- Theano Development Team. Theano: A Python Framework for Fast Computation of Mathematical Expressions. arXiv preprint arXiv:1605.02688, 2016.
- van den Oord, A. and Schrauwen, B. Factoring Variations in Natural Images with Deep Gaussian Mixture Models. In Advances in Neural Information Processing Systems, pp. 3518–3526, 2014.
- Zeiler, M. D. Adadelta: An Adaptive Learning Rate Method. arXiv preprint arXiv:1212.5701, 2012.
- Zen, H. and Senior, A. Deep Mixture Density Networks for Acoustic Modeling in Statistical Parametric Speech Synthesis. In IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 3844–3848. IEEE, 2014.