# BACKPACK: PACKING MORE INTO BACKPROP

Published as a conference paper at ICLR 2020

Felix Dangel* (University of Tuebingen, fdangel@tue.mpg.de), Frederik Kunstner* (University of Tuebingen, kunstner@cs.ubc.ca), Philipp Hennig (University of Tuebingen and MPI for Intelligent Systems, Tuebingen, ph@tue.mpg.de). *Equal contributions.

ABSTRACT

Automatic differentiation frameworks are optimized for exactly one thing: computing the average mini-batch gradient. Yet, other quantities such as the variance of the mini-batch gradients or many approximations to the Hessian can, in theory, be computed efficiently, and at the same time as the gradient. While these quantities are of great interest to researchers and practitioners, current deep-learning software does not support their automatic calculation. Manually implementing them is burdensome, inefficient if done naively, and the resulting code is rarely shared. This hampers progress in deep learning, and unnecessarily narrows research to focus on gradient descent and its variants; it also complicates replication studies and comparisons between newly developed methods that require those quantities, to the point of impossibility. To address this problem, we introduce BACKPACK (https://f-dangel.github.io/backpack/), an efficient framework built on top of PYTORCH that extends the backpropagation algorithm to extract additional information from first- and second-order derivatives. Its capabilities are illustrated by benchmark reports for computing additional quantities on deep neural networks, and an example application by testing several recent curvature approximations for optimization.

1 INTRODUCTION

The success of deep learning and the applications it fuels can be traced to the popularization of automatic differentiation frameworks. Packages like TENSORFLOW (Abadi et al., 2016), CHAINER (Tokui et al., 2015), MXNET (Chen et al., 2015), and PYTORCH (Paszke et al., 2019) provide efficient implementations of parallel, GPU-based gradient computations to a wide range of users, with elegant syntactic sugar. However, this specialization also has its shortcomings: it assumes the user only wants to compute gradients or, more precisely, the average of gradients across a mini-batch of examples. Other quantities can also be computed with automatic differentiation at a comparable cost or minimal overhead to the gradient backpropagation pass; for example, approximate second-order information or the variance of gradients within the batch. These quantities are valuable to understand the geometry of deep neural networks, for the identification of free parameters, and to push the development of more efficient optimization algorithms. But researchers who want to investigate their use face a chicken-and-egg problem: automatic differentiation tools required to go beyond standard gradient methods are not available, but there is no incentive for their implementation in existing deep-learning software as long as no large portion of the users need it.

Second-order methods for deep learning have been continuously investigated for decades (e.g., Becker & Le Cun, 1989; Amari, 1998; Bordes et al., 2009; Martens & Grosse, 2015). But still, the standard optimizers used in deep learning remain some variant of stochastic gradient descent (SGD); more complex methods have not found widespread, practical use.
This is in stark contrast to domains like convex optimization and generalized linear models, where second-order methods are the default. There may of course be good scientific reasons for this difference; maybe second-order methods do not work well in the (non-convex, stochastic) setting of deep learning. And the computational cost associated with the high dimensionality of deep models may offset their benefits. Whether these are the case remains somewhat unclear though, because a much more direct roadblock is that these methods are so complex to implement that few practitioners ever try them out.

Recent approximate second-order methods such as KFAC (Martens & Grosse, 2015) show promising results, even on hard deep learning problems (Tsuji et al., 2019). Their approach, based on the earlier work of Schraudolph (2002), uses the structure of the network to compute approximate second-order information in a way that is similar to gradient backpropagation. This work sparked a new line of research to improve the second-order approximation (Grosse & Martens, 2016; Botev et al., 2017; Martens et al., 2018; George et al., 2018). However, all of these methods require low-level applications of automatic differentiation to compute quantities other than the averaged gradient. It is a daunting task to implement them from scratch. Unless users spend significant time familiarizing themselves with the internals of their software tools, the resulting implementation is often inefficient, which also puts the original usability advantage of those packages into question. Even motivated researchers trying to develop new methods, who need not be expert software developers, face this problem. They often end up with methods that cannot compete in runtime, not necessarily because the method is inherently bad, but because the implementation is not efficient. New methods are also frequently not compared to their predecessors and competitors because they are so hard to reproduce. Authors do not want to represent the competition in an unfair light caused by a bad implementation.

Another example is offered by a recent string of research to adapt to the stochasticity induced by mini-batch sampling. An empirical estimate of the (marginal) variance of the gradients within the batch has been found to be theoretically and practically useful for adapting hyperparameters like learning rates (Mahsereci & Hennig, 2017) and batch sizes (Balles et al., 2017), or to regularize first-order optimization (Le Roux et al., 2007; Balles & Hennig, 2018; Katharopoulos & Fleuret, 2018). To get such a variance estimate, one simply has to square, then sum, the individual gradients after the backpropagation, but before they are aggregated to form the average gradient. Doing so should have negligible cost in principle, but is programmatically challenging in the standard packages. Members of the community have repeatedly asked for such features,[2] but the established automatic differentiation frameworks have yet to address such requests, as their focus has rightly been on improving their technical backbone. Features like those outlined above are not generally defined for arbitrary functions, but rather emerge from the specific structure of machine learning applications. General automatic differentiation frameworks cannot be expected to serve such specialist needs.
This does not mean, however, that it is impossible to efficiently realize such features within these frameworks: In essence, backpropagation is a technique to compute multiplications with Jacobians. Methods to extract second-order information (Mizutani & Dreyfus, 2008) or individual gradients from a mini-batch (Goodfellow, 2015) have been known to a small group of specialists; they are just rarely discussed or implemented.

1.1 OUR CONTRIBUTION

To address this need for a specialized framework focused on machine learning, we propose a framework for the implementation of generalized backpropagation to compute additional quantities. The structure is based on the conceptual work of Dangel et al. (2019) for modular backpropagation. This framework can be built on top of existing graph-based backpropagation modules; we provide an implementation on top of PYTORCH, coined BACKPACK, available at https://f-dangel.github.io/backpack/. The initial release supports efficient computation of individual gradients from a mini-batch, their ℓ2 norm, an estimate of the variance, as well as diagonal and Kronecker factorizations of the generalized Gauss-Newton (GGN) matrix (see Tab. 1 for a feature overview). The library was designed to be minimally verbose to the user, easy to use (see Fig. 1), and to have low overhead (see Section 3). While other researchers are aiming to improve the flexibility of automatic differentiation systems (Innes, 2018a;b; Bradbury et al., 2018), our goal with this package is to provide access to quantities that are only byproducts of the backpropagation pass, rather than gradients themselves.

[2] See, e.g., the GitHub issues github.com/pytorch/pytorch/issues/1407, 7786, 8897 and forum discussions discuss.pytorch.org/t/1433, 8405, 15270, 17204, 19350, 24955.

Computing the gradient with PYTORCH ...

```python
X, y = load_mnist_data()
model = Linear(784, 10)
lossfunc = CrossEntropyLoss()

loss = lossfunc(model(X), y)
loss.backward()

for param in model.parameters():
    print(param.grad)
```

... and the variance with BACKPACK:

```python
X, y = load_mnist_data()
model = extend(Linear(784, 10))
lossfunc = extend(CrossEntropyLoss())

loss = lossfunc(model(X), y)
with backpack(Variance()):
    loss.backward()

for param in model.parameters():
    print(param.grad)
    print(param.var)
```

Figure 1: BACKPACK integrates with PYTORCH to seamlessly extract more information from the backward pass. Instead of the variance (or alongside it, in the same pass), BACKPACK can compute individual gradients in the mini-batch, their ℓ2 norm and 2nd moment. It can also compute curvature approximations like diagonal or Kronecker factorizations of the GGN such as KFAC, KFLR & KFRA.

To illustrate the capabilities of BACKPACK, we use it to implement preconditioned gradient descent optimizers with diagonal approximations of the GGN and recent Kronecker factorizations KFAC (Martens & Grosse, 2015), KFLR, and KFRA (Botev et al., 2017). Our results show that the curvature approximations based on Monte-Carlo (MC) estimates of the GGN, the approach used by KFAC, give similar progress per iteration to their more accurate counterparts while being much cheaper to compute. While the naive update rule we implement does not surpass first-order baselines such as SGD with momentum and Adam (Kingma & Ba, 2015), its implementation with various curvature approximations is made straightforward.
2 THEORY AND IMPLEMENTATION

We will distinguish between quantities that can be computed from information already present during a traditional backward pass (which we suggestively call first-order extensions), and quantities that need additional information (termed second-order extensions). The former group contains additional statistics such as the variance of the gradients within the mini-batch or the ℓ2 norm of the gradient for each sample. Those can be computed with minimal overhead during the backprop pass. The latter class contains approximations of second-order information, like the diagonal or Kronecker factorization of the generalized Gauss-Newton (GGN) matrix, which require the propagation of additional information through the graph. We will present those two classes separately:

First-order extensions (extract more from the standard backward pass):
- Individual gradients from a mini-batch
- ℓ2 norm of the individual gradients
- Diagonal covariance and 2nd moment

Second-order extensions (propagate new information along the graph):
- Diagonal of the GGN and the Hessian
- KFAC (Martens & Grosse, 2015)
- KFRA and KFLR (Botev et al., 2017)

These quantities are only defined, or reasonable to compute, for a subset of models: The concept of individual gradients for each sample in a mini-batch or the estimate of the variance requires the loss for each sample to be independent. While such functions are common in machine learning, not all neural networks fit into this category. For example, if the network uses Batch Normalization (Ioffe & Szegedy, 2015), the individual gradients in a mini-batch are correlated. Then, the variance is not meaningful anymore, and computing the individual contribution of a sample to the mini-batch gradient or the GGN becomes prohibitive. For those reasons, and to limit the scope of the project for version 1.0, BACKPACK currently restricts the type of models it accepts. The supported models are traditional feed-forward networks that can be expressed as a sequence of modules, for example a sequence of convolutional, pooling, linear and activation layers. Recurrent networks like LSTMs (Hochreiter & Schmidhuber, 1997) or residual networks (He et al., 2016) are not yet supported, but the framework can be extended to cover them.

We assume a sequential model $f : \Theta \times \mathcal{X} \to \mathcal{Y}$ and a dataset of $N$ samples $(x_n, y_n) \in \mathcal{X} \times \mathcal{Y}$ with $n = 1, \dots, N$. The model maps each sample $x_n$ to a prediction $\hat{y}_n$ using some parameters $\theta \in \Theta$. The predictions are evaluated with a loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$, for example the cross-entropy, which compares them to the ground truth $y_n$. This leads to the objective function $\mathcal{L} : \Theta \to \mathbb{R}$,

$$\mathcal{L}(\theta) = \tfrac{1}{N} \textstyle\sum_{n=1}^{N} \ell(f(\theta, x_n), y_n) \,. \qquad (1)$$

As a shorthand, we will use $\ell_n(\theta) = \ell(f(\theta, x_n), y_n)$ for the loss and $f_n(\theta) = f(\theta, x_n)$ for the model output of individual samples. Our goal is to provide more information about the derivatives of $\{\ell_n\}_{n=1}^{N}$ with respect to the parameters $\theta$ of the model $f$.

Figure 2: Schematic representation of the standard backpropagation pass for module $i$ with $N$ samples.

2.1 PRIMER ON BACKPROPAGATION

Machine learning libraries with integrated automatic differentiation use the modular structure of $f_n(\theta)$ to compute derivatives (see Baydin et al. (2018) for an overview). If $f_n$ is a sequence of $L$ transformations, it can be expressed as
$$f_n(\theta) = T^{(L)}_{\theta^{(L)}} \circ \dots \circ T^{(1)}_{\theta^{(1)}}(x_n) \,, \qquad (2)$$

where $T^{(i)}_{\theta^{(i)}}$ is the $i$th transformation with parameters $\theta^{(i)}$, such that $\theta = [\theta^{(1)}, \dots, \theta^{(L)}]$. The loss function can also be seen as another transformation, appended to the network. Let $z^{(i-1)}_n, z^{(i)}_n$ denote the input and output of the operation $T^{(i)}_{\theta^{(i)}}$ for sample $n$, such that $z^{(0)}_n$ is the original data and $z^{(1)}_n, \dots, z^{(L)}_n$ represent the transformed output of each layer, leading to the computation graph

$$z^{(0)}_n \xrightarrow{T^{(1)}_{\theta^{(1)}}} z^{(1)}_n \xrightarrow{T^{(2)}_{\theta^{(2)}}} \dots \xrightarrow{T^{(L)}_{\theta^{(L)}}} z^{(L)}_n \xrightarrow{\ell(z^{(L)}_n, y_n)} \ell_n(\theta) \,.$$

To compute the gradient of $\ell_n$ with respect to the $\theta^{(i)}$, one can repeatedly apply the chain rule,

$$\nabla_{\theta^{(i)}} \ell_n(\theta) = (J_{\theta^{(i)}} z^{(i)}_n)^\top (J_{z^{(i)}_n} z^{(i+1)}_n)^\top \dots (J_{z^{(L-1)}_n} z^{(L)}_n)^\top (\nabla_{z^{(L)}_n} \ell_n(\theta)) = (J_{\theta^{(i)}} z^{(i)}_n)^\top (\nabla_{z^{(i)}_n} \ell_n(\theta)) \,, \qquad (3)$$

where $J_a b$ is the Jacobian of $b$ with respect to $a$, $[J_a b]_{ij} = \partial [b]_i / \partial [a]_j$. A similar expression exists for the module inputs $z^{(i-1)}_n$: $\nabla_{z^{(i-1)}_n} \ell_n(\theta) = (J_{z^{(i-1)}_n} z^{(i)}_n)^\top (\nabla_{z^{(i)}_n} \ell_n(\theta))$. This recursive structure makes it possible to extract the gradient by propagating the gradient of the loss. In the backpropagation algorithm, a module $i$ receives the loss gradient with respect to its output, $\nabla_{z^{(i)}_n} \ell_n(\theta)$. It then extracts the gradient with respect to its parameters and inputs, $\nabla_{\theta^{(i)}} \ell_n(\theta)$ and $\nabla_{z^{(i-1)}_n} \ell_n(\theta)$, according to Eq. 3. The gradient with respect to its input is sent further down the graph. This process, illustrated in Fig. 2, is repeated for each transformation until all gradients are computed. To implement backpropagation, each module only needs to know how to multiply with its Jacobians.

For second-order quantities, we rely on the work of Mizutani & Dreyfus (2008) and Dangel et al. (2019), who showed that a scheme similar to Eq. 3 exists for the block-diagonal of the Hessian. A block with respect to the parameters of a module, $\nabla^2_{\theta^{(i)}} \ell_n(\theta)$, can be obtained by the recursion

$$\nabla^2_{\theta^{(i)}} \ell_n(\theta) = (J_{\theta^{(i)}} z^{(i)}_n)^\top (\nabla^2_{z^{(i)}_n} \ell_n(\theta)) (J_{\theta^{(i)}} z^{(i)}_n) + \textstyle\sum_j \big( \nabla^2_{\theta^{(i)}} [z^{(i)}_n]_j \big) \big[ \nabla_{z^{(i)}_n} \ell_n(\theta) \big]_j \,, \qquad (4)$$

and a similar relation holds for the Hessian with respect to each module's output, $\nabla^2_{z^{(i)}_n} \ell_n(\theta)$. Both backpropagation schemes of Eq. 3 and Eq. 4 hinge on multiplying Jacobians with both vectors and matrices. However, the design of automatic differentiation limits the application of Jacobians to vectors only. This prohibits the exploitation of vectorization in the matrix case, which is needed for second-order information. The lacking flexibility of Jacobians is one motivation for our work. Since all quantities needed to compute statistics of the derivatives are already computed during the backward pass, another motivation is to provide access to them at minor overhead.

Figure 3: Computing individual gradients in a batch using a for-loop (i.e. one individual forward and backward pass per sample) or using vectorized operations with BACKPACK. The plot shows computation time, compared to a traditional gradient computation, on the 3C3D network (see Section 4) for the CIFAR-10 dataset (Schneider et al., 2019), for batch sizes 64, 128 and 256.

Figure 4: Schematic representation of the individual gradients extraction in addition to the standard backward pass at the $i$th module for $N$ samples.
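To make the recursion of Eq. 3 concrete, the following self-contained NumPy sketch (our illustration, not BACKPACK code) performs one backpropagation step for a linear module $z = Wx + b$: given the gradient of the loss with respect to the module output, it applies the two transposed Jacobians to obtain the parameter gradients and the gradient that is passed on to the parent module.

```python
import numpy as np

def linear_backward_step(W, x, grad_z):
    """One application of Eq. 3 for a linear module z = W @ x + b.

    Given grad_z, the gradient of the loss w.r.t. the module output,
    return the gradients w.r.t. the parameters (W, b) and the input x.
    The Jacobian of z w.r.t. x is W, so the input gradient is W.T @ grad_z;
    the Jacobian w.r.t. W contracts grad_z with the input, giving an outer
    product, and the bias Jacobian is the identity.
    """
    grad_W = np.outer(grad_z, x)   # (J_W z)^T applied to grad_z
    grad_b = grad_z                # (J_b z)^T applied to grad_z
    grad_x = W.T @ grad_z          # sent further down the graph
    return grad_W, grad_b, grad_x

# Toy usage: a two-layer linear chain with a squared-error loss.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=3)
W2, b2 = rng.normal(size=(2, 3)), rng.normal(size=2)
x, y = rng.normal(size=4), rng.normal(size=2)

z1 = W1 @ x + b1
z2 = W2 @ z1 + b2
grad_z2 = z2 - y                                  # gradient of 0.5 * ||z2 - y||^2
grad_W2, grad_b2, grad_z1 = linear_backward_step(W2, z1, grad_z2)
grad_W1, grad_b1, _ = linear_backward_step(W1, x, grad_z1)
```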
2.2 FIRST-ORDER EXTENSIONS

As the principal first-order extension, consider the computation of the individual gradients in a batch of size $N$. These individual gradients are implicitly computed during a traditional backward pass because the batch gradient is their sum, but they are not directly accessible. The naive way to compute $N$ individual gradients is to do $N$ separate forward and backward passes. This (inefficiently) replaces every matrix-matrix multiplication by $N$ matrix-vector multiplications. BACKPACK's approach batches computations to obtain large efficiency gains, as illustrated by Fig. 3. As the quantities necessary to compute the individual gradients are already propagated through the computation graph, we can reuse them by inserting code in the standard backward pass. With access to this information, before it is cleared for memory efficiency, BACKPACK computes the Jacobian multiplications for each sample,

$$\{\nabla_{\theta^{(i)}} \ell_n(\theta)\}_{n=1}^{N} = \{[J_{\theta^{(i)}} z^{(i)}_n]^\top \nabla_{z^{(i)}_n} \ell_n(\theta)\}_{n=1}^{N} \,, \qquad (5)$$

without summing the result; see Fig. 4 for a schematic representation. This duplicates some of the computation performed by the backpropagation, as the Jacobian is applied twice (once by PYTORCH with summation over the samples, and once by BACKPACK without it). However, the associated overhead is small compared to the for-loop approach: The major computational cost arises from the propagation of information required for each layer, rather than the formation of the gradient within each layer.

This scheme for individual gradient computation is the basis for all first-order extensions. In this direct form, however, it is expensive in memory: if the model is $D$-dimensional, storing $O(ND)$ elements is prohibitive for large batches. For the variance, 2nd moment and ℓ2 norm, BACKPACK takes advantage of the Jacobian's structure to compute them directly without forming the individual gradients, reducing memory overhead. See Appendix A.1 for details.
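For a linear layer, the Jacobian application in Eq. 5 reduces to an outer product of the layer input with the gradient of the loss with respect to the layer output, so all $N$ individual gradients, and their squared ℓ2 norms without ever materializing them, can be obtained with batched operations. The following standalone PyTorch sketch, written in the spirit of Goodfellow (2015) and Appendix A.1 rather than taken from BACKPACK's internals, illustrates both computations:

```python
import torch

torch.manual_seed(0)
N, D_in, D_out = 8, 5, 3
X = torch.randn(N, D_in)
y = torch.randint(0, D_out, (N,))
W = torch.randn(D_out, D_in, requires_grad=True)

# Forward pass; keep the layer output so its gradient can be queried.
Z = X @ W.T                                      # [N, D_out]
loss = torch.nn.functional.cross_entropy(Z, y)   # mean over the batch

# Gradient of the averaged loss w.r.t. the layer output, one row per sample.
(grad_Z,) = torch.autograd.grad(loss, Z, retain_graph=True)   # [N, D_out]

# Eq. 5 for a linear layer: sample n's gradient is the outer product
# grad_Z[n] x_n^T; a single einsum forms all N of them at once.
indiv_grads = torch.einsum("no,ni->noi", grad_Z, X)           # [N, D_out, D_in]

# Summing them recovers the batch gradient computed by loss.backward().
loss.backward()
assert torch.allclose(indiv_grads.sum(0), W.grad, atol=1e-6)

# Squared l2 norms of the individual gradients, without materializing them:
# ||g_n||^2 = (sum_i x_{n,i}^2) * (sum_o grad_Z_{n,o}^2) for this layer.
sq_norms = (X ** 2).sum(1) * (grad_Z ** 2).sum(1)             # [N]
assert torch.allclose(sq_norms, (indiv_grads ** 2).sum((1, 2)), atol=1e-6)
```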
2.3 SECOND-ORDER EXTENSIONS

Second-order extensions require the propagation of more information through the graph. As an example, we will focus on the generalized Gauss-Newton (GGN) matrix (Schraudolph, 2002). It is guaranteed to be positive semi-definite and is a reasonable approximation of the Hessian near the minimum, which motivates its use in approximate second-order methods. For popular loss functions, it coincides with the Fisher information matrix used in natural gradient methods (Amari, 1998); for a more in-depth discussion of the equivalence, see the reviews of Martens (2014) and Kunstner et al. (2019). For an objective function that can be written as the composition of a loss function $\ell$ and a model $f$, such as Eq. 1, the GGN of $\tfrac{1}{N}\sum_n \ell(f(\theta, x_n), y_n)$ is

$$\tfrac{1}{N} \textstyle\sum_n [J_\theta f(\theta, x_n)]^\top \, \nabla^2_f \ell(f(\theta, x_n), y_n) \, [J_\theta f(\theta, x_n)] \,. \qquad (6)$$

The full matrix is too large to compute and store. Current approaches focus on its diagonal blocks, where each block corresponds to a layer in the network. Every block itself is further approximated, for example using a Kronecker factorization.

Figure 5: Schematic of the additional backward pass to compute a symmetric factorization of the GGN, $G(\theta) = \sum_n [J_\theta f_n]^\top S_n S_n^\top [J_\theta f_n]$, alongside the gradient at the $i$th module, for $N$ samples.

The approach used by BACKPACK for their computation is a refinement of the Hessian Backpropagation equations of Dangel et al. (2019). It relies on two insights: Firstly, the computational bottleneck in the computation of the GGN is the multiplication with the Jacobian of the network, $J_\theta f_n$, while the Hessian of the loss with respect to the output of the network is easy to compute for most popular loss functions. Secondly, it is not necessary to compute and store each of the $N$ $[D \times D]$ matrices for a network with $D$ parameters, as Eq. 6 is a quadratic expression. Given a symmetric factorization $S_n$ of the loss Hessian, $S_n S_n^\top = \nabla^2_f \ell(f(\theta, x_n), y_n)$, it is sufficient to compute $[J_\theta f_n]^\top S_n$ and square the result. A network output is typically small compared to its inner layers; networks on CIFAR-100 need $C = 100$ class outputs but could use convolutional layers with more than 100,000 parameters. The factorization leads to a $[D \times C]$ matrix, which makes it possible to efficiently compute GGN block diagonals. Also, the computation is very similar to that of a gradient, which computes $[J_\theta f_n]^\top \nabla_{f_n} \ell_n$. A module $T^{(i)}_{\theta^{(i)}}$ receives the symmetric factorization of the GGN with respect to its output, $z^{(i)}_n$, and multiplies it with the Jacobians with respect to the parameters $\theta^{(i)}$ and inputs $z^{(i-1)}_n$ to produce a symmetric factorization of the GGN with respect to the parameters and inputs, as shown in Fig. 5. This propagation serves as the basis of the second-order extensions.

If the full symmetric factorization is not wanted, for memory reasons, it is possible to extract more specific information such as the diagonal. If $B$ is the symmetric factorization for a GGN block, the diagonal can be computed as $[BB^\top]_{ii} = \sum_j [B]_{ij}^2$, where $[\cdot]_{ij}$ denotes the element in the $i$th row and $j$th column. This framework can be used to extract the main Kronecker factorizations of the GGN, KFAC and KFLR, which we extend to convolution using the approach of Grosse & Martens (2016). The important difference between the two methods is the initial matrix factorization $S_n$. Using a full symmetric factorization of the initial Hessian, $S_n S_n^\top = \nabla^2_{f_n} \ell_n$, yields the KFLR approximation. KFAC uses an MC approximation by sampling a vector $s_n$ such that $\mathbb{E}_{s_n}[s_n s_n^\top] = \nabla^2_{f_n} \ell_n$. KFLR is therefore more precise but more expensive than KFAC, especially for networks with high-dimensional outputs, which is reflected in our benchmark on CIFAR-100 in Section 3. The technical details on how Kronecker factors are extracted and information is propagated for second-order BACKPACK extensions are documented in Appendix A.2.
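As a toy illustration of the symmetric factorization trick, the sketch below builds the GGN block of a single linear "network" with a softmax cross-entropy loss, once from the factor $[J_\theta f]^\top S$ and once from the full expression of Eq. 6. The factorization $S = \mathrm{diag}(\sqrt{p}) - p\sqrt{p}^\top$ of the loss Hessian used here is one valid choice (easily verified by expanding $SS^\top$), not necessarily the one used inside BACKPACK, and the explicit Jacobian is only affordable at toy sizes.

```python
import torch

torch.manual_seed(0)
C, D_in = 3, 4
x = torch.randn(D_in)
W = torch.randn(C, D_in)             # one linear "network", f(W) = W x

f = W @ x                            # network output (logits), shape [C]
p = torch.softmax(f, dim=0)

# Hessian of softmax cross-entropy w.r.t. the network output:
#   H = diag(p) - p p^T, with symmetric factorization S S^T = H
#   for the choice S = diag(sqrt(p)) - p sqrt(p)^T.
sqrt_p = p.sqrt()
S = torch.diag(sqrt_p) - torch.outer(p, sqrt_p)
H = torch.diag(p) - torch.outer(p, p)
assert torch.allclose(S @ S.T, H, atol=1e-6)

# Jacobian of the output w.r.t. the D = C * D_in parameters (explicitly
# forming J is exactly what is avoided for real networks).
J = torch.autograd.functional.jacobian(lambda W_: W_ @ x, W).reshape(C, -1)  # [C, D]

# "Square" the [D x C] factor J^T S: this reproduces the GGN block J^T H J.
JS = J.T @ S                         # [D, C]
assert torch.allclose(JS @ JS.T, J.T @ H @ J, atol=1e-5)
```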
3 EVALUATION AND BENCHMARKS

We benchmark the overhead of BACKPACK on the CIFAR-10 and CIFAR-100 datasets, using the 3C3D network[3] provided by DEEPOBS (Schneider et al., 2019) and the ALL-CNN-C[4] network of Springenberg et al. (2015). The results are shown in Fig. 6. For first-order extensions, the computation of individual gradients from a mini-batch adds noticeable overhead due to the additional memory requirements of storing them. But more specific quantities such as the ℓ2 norm, 2nd moment and variance can be extracted efficiently. Regarding second-order extensions, the computation of the GGN can be expensive for networks with large outputs like CIFAR-100, regardless of whether the approximation is diagonal or Kronecker-factored. Thankfully, the MC approximation used by KFAC, which we also implement for a diagonal approximation, can be computed at minimal overhead, much less than two backward passes. This last point is encouraging, as our optimization experiment in Section 4 suggests that this approximation is reasonably accurate.

[3] 3C3D is a sequence of 3 convolutions and 3 dense linear layers with 895,210 parameters.
[4] ALL-CNN-C is a sequence of 9 convolutions with 1,387,108 parameters.

Figure 6: Overhead benchmark for computing the gradient and first- or second-order extensions on real networks (3C3D on CIFAR-10 and ALL-CNN-C on CIFAR-100, batch sizes 32, 48 and 64), compared to just the gradient. Most quantities add little overhead. KFLR and Diag GGN propagate 100× more information than KFAC and Diag GGN-MC on CIFAR-100 and are two orders of magnitude slower. We report benchmarks on those, and the Hessian's diagonal, in Appendix B.

4 EXPERIMENTS

To illustrate the utility of BACKPACK, we implement preconditioned gradient descent optimizers using diagonal and Kronecker approximations of the GGN. To our knowledge, and despite their apparent simplicity, results using diagonal approximations or the naive damping update rule we chose have not been reported in publications so far. However, this section is not meant to introduce a bona-fide new optimizer. Our goal is to show that BACKPACK can enable research of this kind.

The update rule we implement uses a curvature matrix $G(\theta^{(i)}_t)$, which could be a diagonal or Kronecker factorization of the GGN blocks, and a damping parameter $\lambda$ to precondition the gradient:

$$\theta^{(i)}_{t+1} = \theta^{(i)}_t - \alpha \big( G(\theta^{(i)}_t) + \lambda I \big)^{-1} \nabla \mathcal{L}(\theta^{(i)}_t) \,, \quad i = 1, \dots, L \,. \qquad (7)$$

We run the update rule with the following approximations of the generalized Gauss-Newton: the exact diagonal (Diag GGN) and an MC estimate (Diag GGN-MC), and the Kronecker factorizations KFAC (Martens & Grosse, 2015), KFLR and KFRA[5] (Botev et al., 2017). The inversion required by the update rule is straightforward for the diagonal curvature. For the Kronecker-factored quantities, we use the approximation introduced by Martens & Grosse (2015) (see Appendix C.3). These curvature estimates are tested for the training of deep neural networks by running the corresponding optimizers on the main test problems of the benchmarking suite DEEPOBS (Schneider et al., 2019).[6] We use the setup (batch size, number of training epochs) of the DEEPOBS baselines, and tune the learning rate $\alpha$ and damping parameter $\lambda$ with a grid search for each optimizer (details in Appendix C.2). The best hyperparameter setting is chosen according to the final accuracy on a validation set. We report the median and quartiles of the performance for ten random seeds.

Fig. 7a shows the results for the 3C3D network trained on CIFAR-10. The optimizers that leverage Kronecker-factored curvature approximations beat the baseline performance in terms of per-iteration progress on the training loss, training and test accuracy. Using the same hyperparameters, there is little difference between KFAC and KFLR, or Diag GGN and Diag GGN-MC. Given that the quantities based on MC sampling are considerably cheaper, this experiment suggests that MC sampling is an important technique for reducing the computational burden of curvature approximations. Fig. 7b shows benchmarks for the ALL-CNN-C network trained on CIFAR-100. Due to the high-dimensional output, the curvatures using a full matrix propagation rather than an MC sample cannot be run on this problem due to memory issues. Both Diag GGN-MC and KFAC can compete with the baselines in terms of progress per iteration. As the update rule we implemented is simplistic on purpose, this is promising for future applications of second-order methods that can more efficiently use the additional information given by curvature approximations.
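For concreteness, here is a minimal sketch of the damped update of Eq. 7 for the diagonal case, where the matrix solve reduces to an element-wise division. The curvature accessor is hypothetical (in practice it would be filled from a BACKPACK second-order extension after the backward pass), and this is an illustration rather than the exact optimizer used in the experiments.

```python
import torch

def diag_preconditioned_step(params, diag_curvature, lr, damping):
    """One step of the damped update of Eq. 7 with a diagonal curvature estimate.

    params:         iterable of tensors whose .grad is already populated
    diag_curvature: dict mapping each parameter to a tensor of the same shape
                    holding an estimate of diag(G) for that parameter block
                    (hypothetical accessor, e.g. filled from a second-order
                    BACKPACK extension after the backward pass)
    """
    with torch.no_grad():
        for p in params:
            if p.grad is None:
                continue
            # (G + lambda I)^{-1} grad is an element-wise division for diagonal G.
            p.add_(p.grad / (diag_curvature[p] + damping), alpha=-lr)

# Usage sketch on a toy quadratic 0.5 * ||A x - b||^2, whose GGN is A^T A.
A = torch.randn(10, 5)
b = torch.randn(10)
x = torch.zeros(5, requires_grad=True)
curvature = {x: (A * A).sum(0)}      # exact diagonal of A^T A in this toy case
for _ in range(50):
    loss = 0.5 * ((A @ x - b) ** 2).sum()
    loss.backward()
    diag_preconditioned_step([x], curvature, lr=0.1, damping=1e-2)
    x.grad = None
```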
[5] KFRA was not originally designed for convolutions; we extend it using the Kronecker factorization of Grosse & Martens (2016). While it can be computed for small networks on MNIST, which we report in Appendix C.4, the approximate backward pass of KFRA does not seem to scale to large convolution layers.
[6] https://deepobs.github.io/. We cannot run BACKPACK on all test problems in this benchmark due to the limitations outlined in Section 2. Despite this limitation, we still run on models spanning a representative range of image classification problems.

Figure 7: Median performance with shaded quartiles of the DEEPOBS benchmark for (a) the 3C3D network (895,210 parameters) on CIFAR-10 and (b) the ALL-CNN-C network (1,387,108 parameters) on CIFAR-100. Solid lines show baselines of momentum SGD and Adam provided by DEEPOBS.

5 CONCLUSION

Machine learning's coming-of-age has been accompanied, and in part driven, by a maturing of the software ecosystem. This has drastically simplified the lives of developers and researchers alike, but has also crystallized parts of the algorithmic landscape. This has dampened research in cutting-edge areas that are far from mature, like second-order optimization for deep neural networks. To ensure that good ideas can bear fruit, researchers must be able to compute new quantities without an overwhelming software development burden. To support research and development in optimization for deep learning, we have introduced BACKPACK, an efficient implementation in PYTORCH of recent conceptual advances and extensions to backpropagation (Tab. 1 lists all features). BACKPACK enriches the syntax of automatic differentiation packages to offer additional observables to optimizers beyond the batch-averaged gradient. Our experiments demonstrate that BACKPACK's implementation offers drastic efficiency gains over the kind of naive implementation within reach of the typical researcher. As a demonstrative example, we invented a few optimization routines that, without BACKPACK, would require demanding implementation work and can now be tested with ease. We hope that studies like this allow BACKPACK to help mature the ML software ecosystem further.

Table 1: Overview of the features supported in the first release of BACKPACK.

| Feature | Details |
| --- | --- |
| Individual gradients | $\frac{1}{N} \nabla_{\theta^{(i)}} \ell_n(\theta)$, $n = 1, \dots, N$ |
| Batch variance | $\frac{1}{N} \sum_{n=1}^{N} [\nabla_{\theta^{(i)}} \ell_n(\theta)]_j^2 - [\nabla_{\theta^{(i)}} \mathcal{L}(\theta)]_j^2$ |
| 2nd moment | $\frac{1}{N} \sum_{n=1}^{N} [\nabla_{\theta^{(i)}} \ell_n(\theta)]_j^2$, $j = 1, \dots, d^{(i)}$ |
| Indiv. gradient ℓ2 norm | $\lVert \frac{1}{N} \nabla_{\theta^{(i)}} \ell_n(\theta) \rVert_2^2$, $n = 1, \dots, N$ |
| Diag GGN | $\mathrm{diag}(G(\theta^{(i)}))$ |
| Diag GGN-MC | $\mathrm{diag}(G(\theta^{(i)}))$, MC estimate |
| Hessian diagonal | $\mathrm{diag}(\nabla^2_{\theta^{(i)}} \mathcal{L}(\theta))$ |
| KFAC | $G(\theta^{(i)}) \approx A^{(i)} \otimes B^{(i)}_{\mathrm{KFAC}}$ |
| KFLR | $G(\theta^{(i)}) \approx A^{(i)} \otimes B^{(i)}_{\mathrm{KFLR}}$ |
| KFRA | $G(\theta^{(i)}) \approx A^{(i)} \otimes B^{(i)}_{\mathrm{KFRA}}$ |

ACKNOWLEDGMENTS

The authors would like to thank Aaron Bahde, Ludwig Bald, and Frank Schneider for their help with DEEPOBS and Lukas Balles, Simon Bartels, Filip de Roos, Tim Fischer, Nicolas Krämer, Agustinus Kristiadi, Frank Schneider, Jonathan Wenger, and Matthias Werner for constructive feedback. The authors gratefully acknowledge financial support by the European Research Council through ERC StG Action 757275 / PANAMA; the DFG Cluster of Excellence "Machine Learning - New Perspectives for Science", EXC 2064/1, project number 390727645; the German Federal Ministry of Education and Research (BMBF) through the Tübingen AI Center (FKZ: 01IS18039A); and funds from the Ministry of Science, Research and Arts of the State of Baden-Württemberg. F. D. is grateful to the International Max Planck Research School for Intelligent Systems (IMPRS-IS) for support.

REFERENCES

Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek Gordon Murray, Benoit Steiner, Paul A. Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation, 2016.

Shun-ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2), 1998.

Lukas Balles and Philipp Hennig. Dissecting Adam: The sign, magnitude and variance of stochastic gradients. In Proceedings of the 35th International Conference on Machine Learning, 2018.

Lukas Balles, Javier Romero, and Philipp Hennig. Coupling adaptive batch sizes with learning rates. In Proceedings of the 33rd Conference on Uncertainty in Artificial Intelligence, 2017.

Atilim Gunes Baydin, Barak A. Pearlmutter, Alexey Andreyevich Radul, and Jeffrey Mark Siskind. Automatic differentiation in machine learning: A survey. Journal of Machine Learning Research, 18(153), 2018.

Sue Becker and Yann Le Cun. Improving the convergence of back-propagation learning with second order methods. In Proceedings of the 1988 Connectionist Models Summer School, 1989.

Antoine Bordes, Léon Bottou, and Patrick Gallinari. SGD-QN: Careful quasi-Newton stochastic gradient descent. Journal of Machine Learning Research, 10, 2009.

Aleksandar Botev, Hippolyt Ritter, and David Barber. Practical Gauss-Newton optimisation for deep learning. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, 2017.

James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, and Skye Wanderman-Milne. JAX: Composable transformations of Python+NumPy programs, 2018.

Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. In 31st Conference on Neural Information Processing Systems, Workshop on Machine Learning Systems, 2015.

Felix Dangel, Stefan Harmeling, and Philipp Hennig. A modular approach to block-diagonal Hessian approximations for second-order optimization methods. CoRR, abs/1902.01813, 2019.

Thomas George, César Laurent, Xavier Bouthillier, Nicolas Ballas, and Pascal Vincent. Fast approximate natural gradient descent in a Kronecker-factored eigenbasis. 2018.

Ian J. Goodfellow. Efficient per-example gradient computations. CoRR, abs/1510.01799, 2015.

Roger B. Grosse and James Martens. A Kronecker-factored approximate Fisher matrix for convolution layers.
In Proceedings of the 33rd International Conference on Machine Learning, volume 48 of JMLR Workshop and Conference Proceedings, 2016. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, 2016. Sepp Hochreiter and J urgen Schmidhuber. Long short-term memory. Neural Computation, 9(8), 1997. Michael Innes. Flux: Elegant machine learning with Julia. Journal of Open Source Software, 3(25), 2018a. Michael Innes. Don t unroll adjoint: Differentiating SSA-form programs. Co RR, abs/1810.07951, 2018b. Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, volume 37 of JMLR Workshop and Conference Proceedings, 2015. Angelos Katharopoulos and Franc ois Fleuret. Not all samples are created equal: Deep learning with importance sampling. In Proceedings of the 35th International Conference on Machine Learning, 2018. Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, 2015. Frederik Kunstner, Lukas Balles, and Philipp Hennig. Limitations of the empirical Fisher approximatiom. In Advances in Neural Information Processing Systems 32, 2019. Nicolas Le Roux, Pierre-Antoine Manzagol, and Yoshua Bengio. Topmoumoute online natural gradient algorithm. In Advances in Neural Information Processing Systems 20, 2007. Maren Mahsereci and Philipp Hennig. Probabilistic line searches for stochastic optimization. Journal of Machine Learning Research, 18, 2017. James Martens. New perspectives on the natural gradient method. Co RR, abs/1412.1193, 2014. James Martens and Roger B. Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. In Proceedings of the 32nd International Conference on Machine Learning, volume 37 of JMLR Workshop and Conference Proceedings, 2015. James Martens, Jimmy Ba, and Matt Johnson. Kronecker-factored curvature approximations for recurrent neural networks. In 6th International Conference on Learning Representations, 2018. Eiji Mizutani and Stuart E. Dreyfus. Second-order stagewise backpropagation for Hessian-matrix analyses and investigation of negative curvature. Neural Networks, 21(2-3), 2008. Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary De Vito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Py Torch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32. 2019. Frank Schneider, Lukas Balles, and Philipp Hennig. Deep OBS: A deep learning optimizer benchmark suite. In 7th International Conference on Learning Representations, 2019. Nicol N. Schraudolph. Fast curvature matrix-vector products for second-order gradient descent. Neural Computation, 14(7), 2002. Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin A. Riedmiller. Striving for simplicity: The all convolutional net. In 3rd International Conference on Learning Representations, 2015. Seiya Tokui, Kenta Oono, Shohei Hido, and Justin Clayton. Chainer: A next-generation open source framework for deep learning. 
In 29th Conference on Neural Information Processing Systems, Workshop on Machine Learning Systems, 2015. Yohei Tsuji, Kazuki Osawa, Yuichiro Ueno, Akira Naruse, Rio Yokota, and Satoshi Matsuoka. Performance optimizations and analysis of distributed deep learning with approximated second-order optimization method. In 48th International Conference on Parallel Processing, Workshop Proceedings, 2019. Published as a conference paper at ICLR 2020 BACKPACK: PACKING MORE INTO BACKPROP SUPPLEMENTARY MATERIAL Table of Content A: BACKPACK extensions A.1: First-order quantities A.2: Second-order quantities based on the generalized Gauss-Newton A.3: The exact Hessian diagonal B: Additional details on benchmarks C: Additional details on experiments D: BACKPACK cheat sheet A BACKPACK EXTENSIONS This section provides more technical details on the additional quantities extracted by BACKPACK. Notation: Consider an arbitrary module T (i) θ(i) of a network i = 1, . . . , L, parameterized by θ(i). It transforms the output of its parent layer for sample n, z(i 1) n , to its output z(i) n , i.e. z(i) n = T (i) θ(i)(z(i 1) n ) , n = 1, . . . , N , (8) where N is the number of samples. In particular, z(0) n = xn and z(L) n (θ) = f(xn, θ), where f is the transformation of the whole network. The dimension of the hidden layer i s output z(i) n is written h(i) and θ(i) is of dimension d(i). The dimension of the network output, the prediction z(L), is h(L) = C. For an image classification task, C corresponds to the number of classes. All quantities are assumed to be vector-shaped. For image-processing transformations that usually act on tensor-shaped inputs, we can reduce to the vector scenario by vectorizing all quantities; this discussion does not rely on a specific flattening scheme. However, for an efficient implementation, vectorization should match the layout of the memory of the underlying arrays. Jacobian: The Jacobian matrix Jab of an arbitrary vector b RB with respect to another vector a RA is an [A B] matrix of partial derivatives, [Jab]ij = [b]i / [a]j. A.1 FIRST-ORDER QUANTITIES The basis for the extraction of additional information about first-order derivatives is given by Eq. 3, which we state again for multiple samples, θ(i)L(θ) = 1 n=1 θ(i)ℓn(θ) = 1 n=1 (Jθ(i)z(i) n ) ( z(i) n ℓn(θ)) . During the backpropagation step of module i, we have access to z(i) n ℓ(θ), i = 1, . . . , N. To extract more quantities involving the gradient, we use additional information about the transformation T (i) θ(i) within our custom implementation of the Jacobian Jθ(i)z(i) n and transposed Jacobian (Jθ(i)z(i) n ) . Individual gradients: The contribution of each sample to the overall gradient, 1 N θ(i)ℓn(θ), is computed by application of the transposed Jacobian, 1 N θ(i)ℓn(θ) = 1 N (Jθ(i)z(i) n ) ( z(i) n ℓn(θ)) , n = 1, . . . , N . (9) For each parameter θ(i) the individual gradients are of size [N d(i)]. Published as a conference paper at ICLR 2020 Individual gradient ℓ2 norm: The quantity 1 N θ(i)ℓn(θ) 2 2, for n = 1, ..., N, could be extracted from the individual gradients (Eq. 9) as 1 N θ(i)ℓn(θ) N (Jθ(i)z(i) n ) ( z(i) n ℓn(θ)) 1 N (Jθ(i)z(i) n ) ( z(i) n ℓn(θ)) , which is an N-dimensional object for each parameter θ(i). However, this is not memory efficient as the individual gradients are an [N d(i)] tensor. To circumvent this problem, BACKPACK uses the structure of the Jacobian whenever possible. For a specific example, take a linear layer with parameters θ as an [A B] matrix. 
The layer transforms the inputs z(i 1) n , an [N A] matrix which we will now refer to as A. During the backward pass, it receives the gradient of the individual losses with respect to its output, { 1 N z(i) n ℓn}N n=1, as an [N B] matrix which we will refer to as B. The overall gradient, an [A B] matrix, can be computed as A B, and the individual gradients are a set of N [A B] matrices, {A[n, :]B[n, :] }N n=1. We want to avoid storing that information. To reduce the memory requirement, note that the individual gradient norm can be written as 1 N θℓn j (A[n, i]B[n, j])2 , and that the summation can be done independently for each matrix, as P j(A[n, i]B[n, j])2 = (P i A[n, i])2(P j B[n, j]2). Therefore, we can square each matrix (element-wise) and sum over non-batch dimensions. This yields vectors a, b of N elements, where a[n] = P i A[n, i]2. The individual gradients ℓ2 norm is then given by a b where is element-wise multiplication. Second moment: The gradient second moment (or more specifically, the diagonal of the second moment) is the sum of the squared elements of the individual gradients in a mini-batch, i.e. n=1 [ θ(i)ℓn(θ)]2 j , j = 1, . . . , d(i) . (10) It can be used to evaluate the variance of individual elements of the gradient (see below). The second moment is of dimension d(i), the same dimension as the layer parameter θ(i). Similarly to the ℓ2 norm, it can be computed from individual gradients, but is more efficiently computed implicitly. Revisiting the example of the linear layer from the individual ℓ2 norm computation, the second moment of the parameters θ[i, j] is given by P n(A[n, i]B[n, j])2, which can be directly computed by taking the element-wise square of A and B element-wise, A2, B2, and computing A2 B2. Variance: Gradient variances over a mini-batch (or more precisely, the diagonal of the covariance) can be computed using the second moment and the gradient itself, n=1 [ θ(i)ℓn(θ)]2 j [ θ(i)L(θ)]2 j , j = 1, . . . , d(i) . (11) The element-wise gradient variance of same dimension as the layer parameter θ(i), i.e. d(i). A.2 SECOND-ORDER QUANTITIES BASED ON THE GENERALIZED GAUSS-NEWTON The computation of quantities that originate from the approximations of the Hessian require an additional backward pass (see Dangel et al. (2019)). Most curvature approximations supported by BACKPACK rely on the generalized Gauss-Newton (GGN) matrix (Schraudolph, 2002) n=1 (Jθf(xn, θ)) 2 fℓ(f(xn, θ), yn)(Jθf(xn, θ)) . (12) One interpretation of the GGN is that it corresponds to the empirical risk Hessian when the model f is approximated with its first-order Taylor expansion, i.e. by linearizing the network and ignoring Published as a conference paper at ICLR 2020 second-order effects. Hence, the effect of module curvature in the recursive scheme of Eq. 4 can be ignored to obtain the simpler expression G(θ(i)) = 1 n=1 (Jθ(i)f) 2 fℓ(f(xn, θ), yn)(Jθ(i)f) n=1 (Jθ(i)z(i) n ) G(z(i) n )(Jθ(i)z(i) n ) for the exact block diagonal of the full GGN. In analogy to G(θ(i)) we have introduced the [d(i) d(i)]-dimensional quantity G(z(i) n ) = (Jz(i) n f) 2 fℓ(f(xn, θ), yn)(Jz(i) n f) that needs to be backpropagated. The curvature backpropagation also follows from Eq. 4 as G(z(i 1) n ) = (Jz(i 1) n z(i) n ) G(z(i) n )(Jz(i 1) n z(i) n ) , i = 1, . . . , L , (14a) and is initialized with the Hessian of the loss function with respect to the network prediction, i.e. G(z(L) n ) = 2 fℓ(f(xn, θ), yn) . 
(14b) Although this scheme is exact, it is computationally infeasible as it requires the backpropagation of N [h(i) h(i)] matrices between module i + 1 and i. Even for small N, this is not possible for networks containing large convolutions. As an example, the first layer of the ALL-CNN-C network outputs 29 29 images with 96 channels, which already gives h(i) = 80,736, which leads to half a Gigabyte per sample. Moreover, storing all the [d(i) d(i)]-dimensional blocks G(θ(i)) is not possible. BACKPACK implements different approximation strategies, developed by Martens & Grosse (2015) and Botev et al. (2017) that address both of these complexity issues from different perspectives. Symmetric factorization scheme: One way to improve the memory footprint of the backpropagated matrices in the case where the model prediction s dimension C (the number of classes in an image classification task) is small compared to all hidden features h(i) is to propagate a symmetric factorization of the GGN instead. It relies on the observation that if the loss function itself is convex, even though its composition with the network might not be, its Hessian with respect to the network output can be decomposed as 2 fℓ(f(xn, θ), yn) = S(z(L) n )S(z(L) n ) (15) with the [C C]-dimensional matrix factorization of the loss Hessian, S(z(L) n ), for sample n. Consequently, the GGN in Eq. 12 reduces to an outer product, h (Jθf) S(z(L) n ) i h (Jθf) S(z(L) n ) i . (16) The analogue for diagonal blocks follows from Eq. 13 and reads G(θ(i)) = 1 h (Jθ(i)z(i) n ) S(z(i) n ) i h (Jθ(i)z(i) n ) S(z(i) n ) i , (17) where we defined the [h(i) C]-dimensional matrix square root S(z(i) n ) = (Jz(i) n f) S(z(L) n ). Instead of having layer i backpropagate N objects of shape [h(i) h(i)] according to Eq. 14, we instead backpropagate the matrix square root via S(z(i 1) n ) = (Jz(i 1) n z(i) n ) S(z(i) n )(Jz(i 1) n z(i) n ) , i = 1, . . . , L , (18) starting with Eq. 15. This reduces the backpropagated matrix of layer i to [h(i) C] for each sample. Published as a conference paper at ICLR 2020 A.2.1 DIAGONAL CURVATURE APPROXIMATIONS Diagonal of the GGN (Diag GGN): The factorization trick for the loss Hessian reduces the size of the backpropagated quantities, but does not address the intractable size of the GGN diagonal blocks G(θ(i)). In BACKPACK, we can extract diag G(θ(i)) given the backpropagated quantities S(z(i) n ), i = 1, . . . , N, without building up the matrix representation of Eq. 17. In particular, we compute diag G(θ(i)) = 1 n=1 diag h (Jθ(i)z(i) n ) S(z(i) n ) i h (Jθ(i)z(i) n ) S(z(i) n ) i . (19) Diagonal of the GGN with MC sampled loss Hessian (Diag GGN-MC): We use the same backpropagation strategy of Eq. 18, replacing the symmetric factorization of Eq. 15 with an approximation by a smaller matrix S(z(L) n ) of size [C C] and C < C, 2 fℓ(f(xn, θ), yn) S(z(L) n ) S(z(L) n ) . (20) This further reduces the size of backpropagated curvature quantities. Martens & Grosse (2015) introduced such a sampling scheme with KFAC based on the connection between the GGN and the Fisher. Most loss functions used in machine learning have a probabilistic interpretation as negative log-likelihood of a probabilistic model. The squarred error of regression is equivalent to a Gaussian noise assumption and the cross-entropy is linked to the categorical distribution. 
In this case, the loss Hessian with respect to the network output is equal, in expectation, to the outer products of gradients if the output of the network is sampled according to a particular distribution, pf(x), defined by the network output f(x). Sampling outputs ˆy p, we have that Eˆy pf(x) θℓ(f(x, θ), ˆy) θℓ(f(x, θ), ˆy) = 2 θℓ(f(x, θ), y) . (21) Sampling one such gradient leads to a rank-1 MC approximation of the loss Hessian. With the substitution S S, we compute an MC approximation of the GGN diagonal in BACKPACK as diag G(θ(i)) 1 n=1 diag h (Jθ(i)z(i) n ) S(z(i) n ) i h (Jθ(i)z(i) n ) S(z(i) n ) i . (22) A.2.2 KRONECKER-FACTORED CURVATURE APPROXIMATIONS A different approach to reduce memory complexity of the GGN blocks G(θ(i)), apart from diagonal curvature approximations, is representing them as Kronecker products (KFAC for linear and convolution layers by Martens & Grosse (2015); Grosse & Martens (2016) KFLR and KFRA for linear layers by Botev et al. (2017)), G(θ(i)) = A(i) B(i) . (23) For both linear and convolution layers, the first Kronecker factor A(i) is obtained from the inputs z(i 1) n to layer i. Instead of repeating the technical details of the aforementioned references, we will focus on how they differ in (i) the backpropagated quantities and (ii) the backpropagation strategy. As a result, we will be able to extend KFLR and KFRA to convolutional neural networks7. KFAC and KFLR: KFAC uses an MC-sampled estimate of the loss Hessian with a square root factorization S(z(L) n ) like in Eq. 20. The backpropagation is equivalent to the computation of the GGN diagonal. For the GGN of the weights of a linear layer i, the second Kronecker term is given by B(i) KFAC = 1 n=1 S(z(i) n ) S(z(i) n ) , 7We keep the PYTORCH convention that weights and bias are treated as separate parameters. For the bias terms, we can store the full matrix representation of the GGN. This factor reappears in the Kronecker factorization of the GGN with respect to the weights. Published as a conference paper at ICLR 2020 which at the same time corresponds to the GGN of the layer s bias8. In contrast to KFAC, the KFLR approximation backpropagates the exact square root factorization S(z(L) n ), i.e. for the weights of a linear layer8 (see Botev et al. (2017) for more details) B(i) KFLR = 1 n=1 S(z(i) n ) S(z(i) n ) . KFRA: The backpropagation strategy for KFRA eliminates the scaling of the backpropagated curvature quantities with the batch size N in Eq. 14. Instead of having layer i receive the N exact [h(i) h(i)] matrices G(z(i) n ), n = 1, . . . , N, only a single averaged object, denoted G (i), is used as an approximation. In particular, the recursion changes to G (i 1) = 1 n=1 (Jz(i 1) n z(i) n ) G (i)(Jz(i 1) n z(i) n ) , i = 1, . . . , L , (24a) and is initialized with the batch-averaged loss Hessian n=1 2 fℓ(f(xn, θ), yn) . (24b) For a linear layer, KFRA uses8 (see Botev et al. (2017) for more details) B(i) KFRA = G (i). A.3 THE EXACT HESSIAN DIAGONAL For neural networks consisting only of piecewise linear activation functions, computing the diagonal of the Hessian is equivalent to computing the GGN diagonal. This is because for these activations the second term in the Hessian backpropagation recursion (Eq. 4) vanishes. However, for activation functions with non-vanishing second derivative, these residual terms have to be accounted for in the backpropagation. 
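The MC approximation behind KFAC and Diag GGN-MC rests on this identity: for losses with a likelihood interpretation, the expected outer product of output-space gradients, evaluated at labels sampled from the model's own predictive distribution, equals the Hessian of the loss with respect to the network output. The following standalone sketch (our illustration, not BACKPACK code) checks this numerically for softmax cross-entropy, whose Hessian with respect to the logits is $\mathrm{diag}(p) - pp^\top$:

```python
import torch

torch.manual_seed(0)
C = 5
f = torch.randn(C)                       # network output (logits) for one sample
p = torch.softmax(f, dim=0)

# Exact Hessian of the cross-entropy loss w.r.t. the network output f.
hessian = torch.diag(p) - torch.outer(p, p)

# MC estimate: sample labels from the model's own distribution p_f(x),
# take the gradient of the loss w.r.t. f, and average the outer products.
num_samples = 100_000
labels = torch.multinomial(p, num_samples, replacement=True)
onehot = torch.nn.functional.one_hot(labels, C).float()
grads = p.unsqueeze(0) - onehot          # d/df CE(f, y_hat) = softmax(f) - onehot(y_hat)
mc_hessian = torch.einsum("ni,nj->ij", grads, grads) / num_samples

# The maximal deviation shrinks as num_samples grows (Monte-Carlo rate).
print((mc_hessian - hessian).abs().max())
```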
The Hessian backpropagation for module i reads 2 θ(i)ℓ(θ) = (Jθ(i)z(i) n ) ( 2 z(i) n ℓ(θ))(Jθ(i)z(i) n ) + R(i) n (θ(i)) , (25a) 2 z(i 1) n ℓ(θ) = (Jz(i 1) n z(i) n ) ( 2 z(i) n ℓ(θ))(Jz(i 1) n z(i) n ) + R(i) n (z(i 1) n ) , (25b) for n = 1, . . . , N. Those [h(i) h(i)]-dimensional residual terms are defined as R(i) n (θ(i)) = X 2 θ(i)[z(i) n ]j h z(i) n ℓ(θ) i R(i) n (z(i 1) n ) = X 2 z(i 1) n [z(i) n ]j h z(i) n ℓ(θ) i For common parameterized layers, such as linear and convolution transformations, R(i) n (θ(i)) = 0. If the activation function is applied element-wise, R(i) n (z(i 1) n ) are diagonal matrices. Storing these quantities becomes very memory-intensive for high-dimensional nonlinear activation layers. In BACKPACK, this complexity is reduced by application of the aforementioned matrix square root factorization trick. To do so, we express the symmetric factorization of R(i) n (z(i 1) n ) as R(i) n (z(i 1) n ) = P (i) n (z(i 1) n ) P (i) n (z(i 1) n ) N (i) n (z(i 1) n ) N (i) n (z(i 1) n ) , (26) where P (i) n (z(i 1) n ), N (i) n (z(i 1) n ) represent the matrix square root of R(i) n (z(i 1) n ) projected on its positive and negative eigenspace, respectively. 8In the case of convolutions, one has to sum over the spatial indices of a single channel of z(i) n as the bias is added to an entire channel, see Grosse & Martens (2016) for details. Published as a conference paper at ICLR 2020 This composition allows for the extension of the GGN backpropagation: In addition to S(z(i) n ), the decompositions P (i) n (z(i 1) n ), N (i) n (z(i 1) n ) for the residual parts also have to be backpropagated according to Eq. 18. All diagonals are extracted from the backpropagated matrix square roots (see Eq. 19). All diagonals stemming from decompositions in the negative residual eigenspace have to be weighted by a factor of 1 before summation. In terms of complexity, one backpropagation for R(i) n (z(i 1)) changes the dimensionality as follows R(i) n (z(i 1)) : [h(i) h(i)] [h(i 1) h(i 1)] [h(i 2) h(i 2)] . . . . With the square root factorization, one instead obtains P (i) n (z(i 1) n ) : [h(i) h(i)] [h(i 1) h(i)] [h(i 2) h(i)] . . . , N (i) n (z(i 1) n ) : [h(i) h(i)] [h(i 1) h(i)] [h(i 2) h(i)] . . . . Roughly speaking, this scheme is more efficient whenever the hidden dimension of a nonlinear activation layer deceeds the largest hidden dimension of the network. Example: Consider one backpropagation step of module i. Assume R(i) n (θ(i)) = 0, i.e. a linear, convolution, or non-parameterized layer. Then the following computations are performed in the protocol for the diagonal Hessian: Receive the following quantities from the child module i + 1 (for n = 1, . . . , N) Φ = n S(z(i) n ) , P (i+1) n (z(i) n ) , N (i+1) n (z(i) n ) , (Jz(i) n z(i+1) n ) P (i+2) n (z(i+1) n ) , (Jz(i) n z(i+1) n ) N (i+2) n (z(i+1) n ) , (Jz(i) n z(i+1) n ) (Jz(i+1) n z(i+2) n ) . . . (Jz(L 3) n z(L 2) n ) P (L 1) n (z(L 2) n ) , (Jz(i) n z(i+1) n ) (Jz(i+1) n z(i+2) n ) . . . (Jz(L 3) n z(L 2) n ) N (L 1) n (z(L 2) n ) o Extract the module parameter Hessian diagonal, diag 2 θ(i)L(θ) For each quantity A Φ extract the diagonal from the square root factorization and sum over the samples, i.e. compute n=1 diag h (Jθ(i)z(i) n ) An i h (Jθ(i)z(i) n ) An i . Multiply the expression by 1 if A stems from backpropagation of a residual s negative eigenspace s factorization. 
Sum all expressions to obtain the block Hessian s diagonal diag 2 θ(i)L(θ) Backpropagate the received quantities to the parent module i 1 For each quantity An Φ, apply (Jz(i 1) n z(i) n ) An Append P (i+1) n (z(i) n ) and N (i+1) n (z(i) n ) to Φ B ADDITIONAL DETAILS ON BENCHMARKS KFAC vs. KFLR: As the KFLR of Botev et al. (2017) is orders of magnitude more expensive to compute than the KFAC of Martens & Grosse (2015) on CIFAR-100, it was not included in the Published as a conference paper at ICLR 2020 main plot. This is not an implementation error; it follows from the definition of those methods. To approximate the GGN, G(θ) = P n[Jθfn] 2 fn ℓn [Jθfn], KFAC uses a rank-1 approximation for each of the inner Hessian 2 fn ℓn = sns n , and needs to propagate a vector through the computation graph for each sample. KFLR uses the complete inner Hessian instead. For CIFAR-100, the network has 100 output nodes one for each class and the inner Hessians are [100 100] matrices. KFLR needs to propagate a matrix through the computation graph for each sample, which is 100 more expensive as shown in Fig. 8. Grad (ref.) Diag GGN-MC Indiv. Grad CIFAR100 on All-CNN-C, B=16 Figure 8: KFLR and Diag GGN are more expensive to run on large networks. The gradient takes less than 20 ms to compute, but KFLR and Diag GGN are approximately 100 more expensive. Diagonal of the GGN vs. Diagonal of the Hessian: Most networks used in deep learning use Re LU activation functions. Re LU functions have no curvature as they are piecewise linear. Because of this, the diagonal of the GGN is equivalent to the diagonal of the Hessian (Martens, 2014). However, for networks that use non piecewise linear activation functions like sigmoids or tanh, computing the Hessian diagonal can be much more expensive than the GGN diagonal. To illustrate this point, we modify the smaller network used in our benchmarks to include a single sigmoid activation function before the last classification layer. The results in Fig. 9 show that the computation of the diagonal of the Hessian is already an order of magnitude more expensive than for the GGN. Grad (ref.) Diag GGN Diag Hessian CIFAR10 on 3C3D with one sigmoid Batch = 4 8 Figure 9: Diagonal of the Hessian vs. the GGN. If the network contains a single sigmoid activation function, the diagonal of the Hessian is an order of magnitude more computationally intensive than the diagonal of the GGN. C ADDITIONAL DETAILS ON EXPERIMENTS C.1 PROTOCOL The optimizer experiments are performed according to the protocol suggested by DEEPOBS: Train the neural network with the investigated optimizer and vary its hyperparameters on a specified grid. This training is performed for a single random seed only. DEEPOBS evaluates metrics during the training procedure. From all runs of the grid search, it selects the best run automatically. The results shown in this work were obtained with the default strategy, favoring highest final accuracy on the validation set. For a better understanding of the optimizer performance with respect to randomized routines in the training process, DEEPOBS reruns the best hyperparameter setting for ten different random seeds. The results show mean values over these repeated runs, with standard deviations as uncertainty indicators. Along with the benchmarked optimizers, we show the DEEPOBS base line performances for Adam and momentum SGD (Momentum). They are provided by DEEPOBS. 
Published as a conference paper at ICLR 2020 The optimizers built upon BACKPACK s curvature estimates were benchmarked on the DEEPOBS image classification problems summarized in Table 2. Table 2: Test problems considered from the DEEPOBS library (Schneider et al., 2019). Codename Description Dataset # Parameters LOGREG Linear model MNIST 7,850 2C2D 2 convolutional and 2 dense linear layers FASHION-MNIST 3,274,634 3C3D 3 convolutional and 3 dense linear layers CIFAR-10 895,210 ALL-CNN-C 9 convolutional layers (Springenberg et al., 2015) CIFAR-100 1,387,108 C.2 GRID SEARCH AND BEST HYPERPARAMETER SETTING Both the learning rate α and damping λ are tuned over the grid α 10 4, 10 3, 10 2, 10 1, 1 , λ 10 4, 10 3, 10 2, 10 1, 1, 10 . We use the same batch size (N = 128 for all problems, except N = 256 for ALL-CNN-C on CIFAR-100) as the base lines and the optimizers run for the identical number of epochs. The best hyperparameter settings are summarized in Table 3. C.3 UPDATE RULE We use a simple update rule with a constant damping parameter λ. Consider the parameters θ of a single module in a neural network with ℓ2-regularization of strength η. Let G(θt) denote the curvature matrix and θL(θt) the gradient at step t. One iteration of the optimizer applies θt+1 θt + [G(θt) + (λ + η)I)] 1 [ θL(θt) + ηθt] . (27) The inverse cannot be computed exactly (in reasonable time) for the Kronecker-factored curvatures KFAC, KFLR, and KFRA. We use the scheme of Martens & Grosse (2015) to approximately invert G(θt)+(λ+η)I if G(θt) is Kronecker-factored; G(θt) = A(θt) B(θt). It replaces the expression (λ + θ)I by diagonal terms added to each Kronecker factor. In summary, this replaces [A(θt) B(θt) + (λ + η)I] 1 by h A(θt) + πt p λ + ηI i 1 B(θt) + 1 λ + ηI 1 . (28) A principled choice for the parameter πt is given by πt = q A(θt) IB IA B(θt) for an arbitrary matrix norm . We follow Martens & Grosse (2015) and choose the trace norm, tr(A(θt)) dim(B) dim(A) tr(B(θt)) . (29) Table 3: Best hyperparameter settings for optimizers and baselines shown in this work. In the Momentum baselines, the momentum was fixed to 0.9. Parameters for computation of the running averages in Adam use the default values (β1, β2) = (0.9, 0.999). The symbols and denote whether the hyperparameter setting is an interior point of the grid or not, respectively. Curvature mnist logreg fmnist 2c2d cifar10 3c3d cifar100 allcnnc α λ int α λ int α λ int α λ int Diag GGN 10 3 10 3 10 4 10 4 10 3 10 2 - - - Diag GGN-MC 10 3 10 3 10 4 10 4 10 3 10 2 10 3 10 3 KFAC 10 2 10 2 10 3 10 3 1 10 1 1 KFLR 10 2 10 2 10 2 10 3 1 10 - - - KFRA 10 2 10 2 - - - - - - - - - Baseline α α α α Momentum 2.07 10 2 2.07 10 2 3.79 10 3 4.83 10 1 Adam 2.98 10 4 1.27 10 4 2.98 10 4 6.95 10 4 Published as a conference paper at ICLR 2020 C.4 ADDITIONAL RESULTS This section presents the results for MNIST using a logistic regression in Fig. 10 and FASHIONMNIST using the 2C2D network, composed of two convolution and two linear layers, in Fig. 11. MNIST: LOGREG 0 10 20 30 40 50 0.25 Diag GGN Diag GGN-MC 0 10 20 30 40 50 Test accuracy 0 10 20 30 40 50 0 10 20 30 40 50 Train accuracy Figure 10: Median performance with shaded quartiles of the best hyperparameter settings chosen by DEEPOBS for logistic regression (7,850 parameters) on MNIST. Solid lines show well-tuned baselines of momentum SGD and Adam that are provided by DEEPOBS. 
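Returning to the Kronecker-factored inversion of Appendix C.3: the sketch below applies the approximation of Eqs. 28 and 29 to a matrix-shaped gradient. It assumes the common convention that, for $G \approx A \otimes B$ with $A$ built from layer inputs and $B$ from backpropagated quantities, applying $G$ to a vectorized weight matrix $V$ corresponds to $B V A$; the exact layout depends on the implementation, so treat this as an illustration under that assumption rather than the code used for the experiments.

```python
import numpy as np

def kfac_damped_precondition(grad_W, A, B, damping):
    """Approximate (A kron B + damping * I)^{-1} vec(grad_W), as in Eqs. 28-29.

    grad_W: [out, in] gradient of a linear layer's weight matrix
    A:      [in, in]  Kronecker factor built from the layer inputs
    B:      [out, out] Kronecker factor built from backpropagated quantities
    Returns the preconditioned gradient in the same [out, in] shape.
    """
    dim_A, dim_B = A.shape[0], B.shape[0]
    # pi-correction of Eq. 29 (trace norm), splitting the damping between factors.
    pi = np.sqrt((np.trace(A) * dim_B) / (dim_A * np.trace(B)))
    A_damped = A + pi * np.sqrt(damping) * np.eye(dim_A)
    B_damped = B + (np.sqrt(damping) / pi) * np.eye(dim_B)
    # (A_d kron B_d)^{-1} vec(V) corresponds to B_d^{-1} V A_d^{-1}.
    return np.linalg.solve(B_damped, grad_W) @ np.linalg.inv(A_damped)

# Toy usage with random symmetric positive definite factors.
rng = np.random.default_rng(0)
M1, M2 = rng.normal(size=(4, 4)), rng.normal(size=(3, 3))
A, B = M1 @ M1.T + np.eye(4), M2 @ M2.T + np.eye(3)
grad_W = rng.normal(size=(3, 4))
step = kfac_damped_precondition(grad_W, A, B, damping=1e-2)

# Check the Kronecker solve against the explicit damped matrix, using the
# column-major identity (A kron B) vec(V) = vec(B V A^T) with symmetric A.
vec = lambda V: V.reshape(-1, order="F")
pi = np.sqrt((np.trace(A) * 3) / (4 * np.trace(B)))
A_d = A + pi * np.sqrt(1e-2) * np.eye(4)
B_d = B + (np.sqrt(1e-2) / pi) * np.eye(3)
assert np.allclose(np.kron(A_d, B_d) @ vec(step), vec(grad_W))
```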
FASHION-MNIST: 2C2D 0 20 40 60 80 100 0 Diag GGN Diag GGN-MC Adam Momentum 0 20 40 60 80 100 Test accuracy 0 20 40 60 80 100 0 20 40 60 80 100 Train accuracy Figure 11: Median performance with shaded quartiles of the best hyperparameter settings chosen by DEEPOBS for the 2C2D network (3,274,634 parameters) on FASHION-MNIST. Solid lines show well-tuned baselines of momentum SGD and Adam that are provided by DEEPOBS. Published as a conference paper at ICLR 2020 D BACKPACK CHEAT SHEET Assumptions Feedforward network θ(1)(z(0) n ) z(1) n θ(2)(z(1) n ) . . . θ(L)(z(L 1) n ) z(L) ℓ(z(L) n ,y) ℓ(θ) d(i) : Dimension of parameter θ(i) Empirical risk L(θ) = 1 N PN n=1 ℓ(f(θ, xn), yn) ℓn(θ) = ℓ(f(θ, xn), yn) , n = 1, . . . , N , fn(θ) = f(θ, xn) = z(L) n (θ) , n = 1, . . . , N Generalized Gauss-Newton matrix n=1 (Jθfn) 2 fnℓn(θ)(Jθfn) Approximative GGN via MC sampling n=1 (Jθfn) θℓ(fn(θ), ˆy) θℓ(fn(θ), ˆyn) ˆyn pfn(xn) (Jθfn) Table 4: Overview of the features supported in the first release of BACKPACK. The quantities are computed separately for all module parameters, i.e. i = 1, . . . , L. Feature Details Individual gradients 1 N θ(i)ℓn(θ), n = 1, . . . , N Batch variance 1 N PN n=1 [ θ(i)ℓn(θ)]2 j [ θ(i)L(θ)]2 j , j = 1, . . . , d(i) 2nd moment 1 N PN n=1 [ θ(i)ℓn(θ)]2 j , j = 1, . . . , d(i). Indiv. gradient ℓ2 norm 1 N θ(i)ℓn(θ) 2 2 , n = 1, . . . , N Diag GGN diag G(θ(i)) Diag GGN-MC diag G(θ(i)) Hessian diagonal diag 2 θ(i)L(θ) KFAC G(θ(i)) A(i) B(i) KFAC KFLR G(θ(i)) A(i) B(i) KFLR KFRA G(θ(i)) A(i) B(i) KFRA