A NEW PERSPECTIVE ON SHAMPOO'S PRECONDITIONER

Published as a conference paper at ICLR 2025

Depen Morwani (Kempner Institute, Harvard University, dmorwani@g.harvard.edu) · Itai Shapira (SEAS, Harvard University, itaishapira@g.harvard.edu) · Nikhil Vyas (SEAS, Harvard University, nikhil@g.harvard.edu) · Eran Malach (Kempner Institute, Harvard University, emalach@g.harvard.edu) · Sham Kakade (Kempner Institute, Harvard University, sham@seas.harvard.edu) · Lucas Janson (Department of Statistics, Harvard University, ljanson@g.harvard.edu)

ABSTRACT

Shampoo, a second-order optimization algorithm that uses a Kronecker product preconditioner, has recently received increasing attention from the machine learning community. Despite its growing popularity, the theoretical foundations of Shampoo's effectiveness are not well understood. The preconditioner used by Shampoo can be viewed either as an approximation of the Gauss-Newton component of the Hessian or of the covariance matrix of the gradients maintained by Adagrad. Our key contribution is an explicit and novel connection between the optimal Kronecker product approximation of these matrices and the approximation made by Shampoo. This connection highlights a subtle but common misconception about Shampoo's approximation: the square of the approximation used by the Shampoo optimizer is equivalent to a single step of the power iteration algorithm for computing the aforementioned optimal Kronecker product approximation. Across a variety of datasets and architectures, we empirically demonstrate that this is close to the optimal Kronecker product approximation. We also study the impact of batch gradients and the empirical Fisher on the quality of the Hessian approximation. Our findings not only advance the theoretical understanding of Shampoo but also illuminate potential pathways to enhancing its practical performance.
1 INTRODUCTION

Second-order optimization methods offer significant theoretical advantages over first-order approaches, promising faster convergence rates by incorporating curvature information. Recently, these methods have seen success in the practical large-scale training of neural networks such as Gemini 1.5 Flash (Gemini Team, 2024) and in the AlgoPerf benchmark (Dahl et al., 2023; MLCommons, 2024).

One of the primary challenges in this field arises from the substantial memory and computational demands of traditional second-order methods, such as full-matrix Adagrad (Duchi et al., 2011) and Newton's method. Applied classically, these techniques require storing and inverting a $|P| \times |P|$ matrix $H$ (the covariance of the gradients for Adagrad, or the Gauss-Newton component of the Hessian for Newton's method), where $|P|$ denotes the number of model parameters. With modern architectures often comprising billions of parameters, this leads to quadratic memory and cubic computational requirements, rendering direct application practically infeasible.

*Equal contribution. Randomized author ordering.

Figure 1: Comparison of Kronecker product approximations for the Gauss-Newton Hessian (top) and the Adagrad preconditioner (bottom) across datasets and architectures (MNIST-2, CIFAR-5M, ImageNet; cosine similarity vs. training steps). Plots show the cosine similarity between the true matrix and three approximations: optimal Kronecker, Shampoo², and Shampoo. Shampoo² closely tracks the optimal Kronecker approximation, outperforming the original Shampoo approximation, consistent with Proposition 1.
For MNIST-2 (binary subsampled), Shampoo² correlates perfectly with the optimal Kronecker approximation, as proved in Corollary 2 for binomial logistic regression. See Appendix C for dataset and architecture details.

To address this issue, one line of work focuses on efficient approximations of the matrix $H$ (Gupta et al., 2018; Martens & Grosse, 2015). These methods typically employ either a diagonal approximation (e.g., Adam (Kingma, 2014)) or a layer-wise Kronecker product approximation of $H$. Such approaches are motivated by the significant memory and computational efficiency gains they offer compared to maintaining and inverting the full matrix $H$. Among the most prominent methods utilizing layer-wise Kronecker product approximations are K-FAC (Martens & Grosse, 2015) and Shampoo (Gupta et al., 2018).

In this work, we primarily focus on the Shampoo optimizer (Gupta et al., 2018), which has recently gained increasing attention from the research community. Notably, in the AlgoPerf benchmark of optimization algorithms for practical neural network training workloads (Dahl et al., 2023), Shampoo appears to outperform all other existing methods. Another recent study, describing the Google Ads recommendation pipeline, revealed that the Google Ads CTR model is trained with the Shampoo optimizer (Anil et al., 2022). Additionally, Shi et al. (2023) implemented a distributed data-parallel version of Shampoo, demonstrating its superior speed in training ImageNet compared to other methods.

Previous research introduced the concept of the optimal Kronecker product approximation (in Frobenius norm) of a matrix $M$ (Van Loan & Pitsianis, 1993): the projection of $M$ onto the set of matrices expressible as a Kronecker product of two smaller matrices. Van Loan & Pitsianis (1993) demonstrated that this optimal approximation can be computed numerically using a power iteration scheme.
Our work presents a novel result (Proposition 1) that establishes a precise connection between Shampoo's approximation and this optimal Kronecker-factored approximation. Specifically, we show that the square of Shampoo's approximation is equivalent to a single step of the power iteration algorithm used to compute the optimal approximation. Despite its empirical success, the theoretical foundations of Shampoo were not fully understood. Our results bridge this gap by connecting Shampoo to the optimal Kronecker approximation of the Newton and Adagrad preconditioners. Moreover, our theoretical insights were utilized in the design of SOAP (Vyas et al., 2024), a recently proposed optimizer that improves upon AdamW and Shampoo on language modeling tasks.

The main contributions of the work are summarized below:

- We theoretically show (Proposition 1) that the square of Shampoo's approximation of $H$ is precisely equal to one round of the power iteration scheme for obtaining the optimal Kronecker-factored approximation of the matrix $H$. Informally, for any covariance matrix $H = \mathbb{E}[gg^\top]$,¹ where $g \in \mathbb{R}^{mn}$, we argue that the right Kronecker product approximation of $H$ is $\mathbb{E}[GG^\top] \otimes \mathbb{E}[G^\top G]$, while the original Shampoo work (Gupta et al., 2018) proposes $\mathbb{E}[GG^\top]^{1/2} \otimes \mathbb{E}[G^\top G]^{1/2}$, with $G \in \mathbb{R}^{m \times n}$ representing $g$ reshaped into an $m \times n$ matrix.
- We empirically verify that the result of one round of power iteration (i.e., the square of Shampoo's approximation) is very close to the optimal Kronecker-factored approximation (Figure 1), and provide theoretical justification for this (Section 3.2.1). Our experimental results also show that the approximation proposed by the original Shampoo work (Gupta et al., 2018), which does not use the square, is significantly worse.
- For the Hessian-based viewpoint of Shampoo (Section 2.2.2), we empirically demonstrate the impact on the Hessian approximation of various practical tricks implemented to make Shampoo more computationally efficient, such as averaging gradients over a batch (Section 4.1) and using the empirical Fisher instead of the actual Fisher (Section 4.2).

Remark. It is worth noting that previous works (Balles et al., 2020; Lin et al., 2024) have investigated why Adagrad-based methods such as Adam and Shampoo incorporate an additional square root in their updates compared to the Hessian inverse. While this is an important question, it lies outside the scope of our work. For more details on this topic, we refer readers to Appendix G.

2 BACKGROUND

In this section, we provide the technical background, including basic definitions (Section 2.1), two perspectives on the Shampoo optimizer (Adagrad in Section 2.2.1, Hessian in Section 2.2.2), and the theory of the optimal Kronecker product approximation (Section 2.3).

2.1 NOTATION AND BASIC DEFINITIONS

We use lowercase letters to denote scalars and vectors, and uppercase letters to denote matrices. For a symmetric matrix $A$, $A \succeq 0$ (resp. $A \succ 0$) denotes that $A$ is positive semi-definite (PSD) (resp. positive definite). Similarly, for symmetric matrices $A$ and $B$, $A \succeq B$ (resp. $A \succ B$) means $A - B \succeq 0$ (resp. $A - B \succ 0$). The identity matrix of size $n$ is denoted by $I_n$. We use $M[i, j]$ to refer to the $(i, j)$ entry of the matrix $M$.

The Kronecker product of two matrices $A \in \mathbb{R}^{p \times q}$ and $B \in \mathbb{R}^{r \times s}$ is denoted by $A \otimes B \in \mathbb{R}^{pr \times qs}$. It is defined by $(A \otimes B)[ri + i', sj + j'] = A[i, j] \, B[i', j']$, where $0 \le i < p$, $0 \le j < q$, $0 \le i' < r$, and $0 \le j' < s$. The vectorization of a matrix $A \in \mathbb{R}^{m \times n}$, denoted by $\mathrm{vec}(A)$, is the $mn$-dimensional column vector obtained by stacking the columns of $A$ on top of one another. We will usually denote $\mathrm{vec}(A)$ by $a$. The Frobenius norm of a matrix $A \in \mathbb{R}^{m \times n}$ is defined as $\|A\|_F = \sqrt{\sum_{i=1}^m \sum_{j=1}^n A[i, j]^2}$.
The following basic lemma about Kronecker products will be used later:

Lemma 1 (Henderson & Searle (1981)). $(A \otimes B)\, \mathrm{vec}(G) = \mathrm{vec}(B G A^\top)$.

2.2 SHAMPOO

The Shampoo optimizer, introduced by Gupta et al. (2018), builds on the principles of the Adagrad algorithm (Duchi et al., 2011). Shampoo can be understood through two primary perspectives: as an extension of Adagrad, and as an approximation of the Gauss-Newton component of the Hessian. We explore these perspectives in Section 2.2.1 and Section 2.2.2, respectively.

¹The Gauss-Newton component of the Hessian can also be expressed as a covariance matrix. For details, refer to Section 2.2.2.

2.2.1 ADAGRAD-BASED PERSPECTIVE OF SHAMPOO

Adagrad is a preconditioned online learning algorithm that uses the accumulated covariance of the gradients as a preconditioner. Given the model parameters $\theta_t \in \mathbb{R}^p$ at iteration $t$, and the gradient $g_t \in \mathbb{R}^p$ of the loss function with respect to $\theta_t$, Adagrad maintains the preconditioner $H_{\mathrm{Ada}} = \sum_{t=1}^T g_t g_t^\top$. The parameter update, for a learning rate $\eta$, is given by
$$\theta_{T+1} = \theta_T - \eta \, H_{\mathrm{Ada}}^{-1/2} \, g_T.$$

Shampoo extends Adagrad by maintaining a layer-wise Kronecker product approximation of the full Adagrad preconditioner $H_{\mathrm{Ada}}$. Let $G_t \in \mathbb{R}^{m \times n}$ be the gradient for a weight matrix $W_t \in \mathbb{R}^{m \times n}$ at iteration $t$. The following lemma forms the basis for Shampoo's approximation:

Lemma 2 (Gupta et al. (2018)). Assume that $G_1, \ldots, G_T$ are matrices of rank at most $r$. Let $g_t = \mathrm{vec}(G_t)$ denote the vectorization of $G_t$ for all $t$. Then, for any $\epsilon > 0$, we have:
$$\epsilon I_{mn} + \sum_{t=1}^T g_t g_t^\top \preceq r \left( \epsilon I_m + \sum_{t=1}^T G_t G_t^\top \right)^{1/2} \otimes \left( \epsilon I_n + \sum_{t=1}^T G_t^\top G_t \right)^{1/2}.$$

Building on the above lemma, Shampoo maintains two preconditioners, $L_t \in \mathbb{R}^{m \times m}$ and $R_t \in \mathbb{R}^{n \times n}$, initialized as $\epsilon I_m$ and $\epsilon I_n$, respectively. The preconditioner updates and the Shampoo parameter update, with learning rate $\eta$, are given by
$$L_T = L_{T-1} + G_T G_T^\top; \quad R_T = R_{T-1} + G_T^\top G_T; \quad W_{T+1} = W_T - \eta \, L_T^{-1/4} \, G_T \, R_T^{-1/4}.$$
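The update rules above can be sketched in a few lines of NumPy. This is only a minimal illustration of the Adagrad-perspective Shampoo step, not a practical implementation (which adds grafting, delayed root computation, and blocking); the helper name `shampoo_step`, the learning rate, and the toy shapes are our own illustrative choices.

```python
import numpy as np

def shampoo_step(W, G, L, R, lr=0.1):
    """One illustrative Shampoo update for a weight matrix W with gradient G.

    L and R accumulate G G^T and G^T G; the step preconditions G by
    L^{-1/4} on the left and R^{-1/4} on the right.
    """
    L += G @ G.T
    R += G.T @ G

    def inv_quarter(M):
        # M is symmetric positive definite by construction (eps*I + PSD sums),
        # so the inverse fourth root exists via an eigendecomposition.
        w, V = np.linalg.eigh(M)
        return (V * w**-0.25) @ V.T

    W -= lr * inv_quarter(L) @ G @ inv_quarter(R)
    return W, L, R

m, n, eps = 4, 3, 1e-4
rng = np.random.default_rng(0)
W = rng.standard_normal((m, n))
L, R = eps * np.eye(m), eps * np.eye(n)
for _ in range(5):
    G = rng.standard_normal((m, n))   # stand-in for a loss gradient
    W, L, R = shampoo_step(W, G, L, R)
```

Initializing the accumulators at $\epsilon I$ keeps their eigenvalues strictly positive, so the inverse fourth roots are always well defined.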
In Lemma 2, Shampoo approximates the Adagrad preconditioner $H_{\mathrm{Ada}} = \sum_{t=1}^T g_t g_t^\top$ using the Kronecker product $\left(\sum_{t=1}^T G_t G_t^\top\right)^{1/2} \otimes \left(\sum_{t=1}^T G_t^\top G_t\right)^{1/2}$. Our main focus is to study the optimal Kronecker product approximation of $H_{\mathrm{Ada}}$ and how it relates to Shampoo's approximation.

2.2.2 HESSIAN-BASED PERSPECTIVE OF SHAMPOO

In this section, we describe the Hessian-approximation viewpoint of Shampoo, explored by previous works (Anil et al., 2021; Osawa et al., 2023), as an alternative to the Adagrad-based perspective. Our theoretical and empirical results apply to both viewpoints.

Gauss-Newton (GN) component of the Hessian. For a datapoint $(x, y)$, let $f(x)$ denote the output of the neural network and $L(f(x), y)$ the training loss. Let $W \in \mathbb{R}^{m \times n}$ be a weight matrix in the network and let $D$ denote the training distribution. For cross-entropy (CE) loss, the Gauss-Newton component of the Hessian of the loss with respect to $W$ is given by (see Appendix E for details):
$$H_{\mathrm{GN}} = \mathbb{E}_{(x,y) \sim D}\left[ \frac{\partial f}{\partial W}^\top \frac{\partial^2 L}{\partial f^2} \frac{\partial f}{\partial W} \right] = \mathbb{E}_{x \sim D_x,\, s \sim f(x)}\left[ g_{x,s} \, g_{x,s}^\top \right],$$
where $D_x$ represents the marginal training distribution of $x$ and $s \sim f(x)$ denotes a label sampled from the network's output distribution (Pascanu & Bengio, 2014). The right-hand side is commonly referred to as the Fisher matrix, while its counterpart using real labels, $\mathbb{E}_{(x,y) \sim D}[g_{x,y} g_{x,y}^\top]$, is called the empirical Fisher. Going forward, for simplicity, we will write the Fisher matrix as $\mathbb{E}_{x, s \sim f(x)}[g_{x,s} g_{x,s}^\top]$, assuming $x$ is drawn from $D_x$, and similarly for both $x$ and $y$ drawn from $D$.

Optimizers such as K-FAC and Shampoo, when viewed from the Hessian perspective, perform a layer-wise Kronecker product approximation of the Fisher matrix $H_{\mathrm{GN}}$. The following lemma establishes Shampoo's approximation:

Lemma 3 (Adapted from Gupta et al. (2018); Anil et al. (2021)). Assume that the $G_{x,s}$ are matrices of rank at most $r$, and let $g_{x,s} = \mathrm{vec}(G_{x,s})$. Then
$$\mathbb{E}_{x, s \sim f(x)}\left[ g_{x,s} g_{x,s}^\top \right] \preceq r \, \mathbb{E}_{x, s \sim f(x)}\left[ G_{x,s} G_{x,s}^\top \right]^{1/2} \otimes \mathbb{E}_{x, s \sim f(x)}\left[ G_{x,s}^\top G_{x,s} \right]^{1/2}. \tag{1}$$
In Lemma 3, the matrix on the left-hand side is equal to $H_{\mathrm{GN}}$, and the right-hand side represents the Kronecker product approximation made by Shampoo. However, directly computing this approximation at each step is computationally expensive. In practice, Shampoo applies two additional approximations to make the process more efficient. First, it replaces per-input gradients with batch gradients, i.e., it replaces $\mathbb{E}_{x, s \sim f(x)}[G_{x,s} G_{x,s}^\top]$ with $\mathbb{E}_{B,s}[G_{B,s} G_{B,s}^\top]$, where $B$ denotes a batch of data points, $s$ is the concatenation of $s \sim f(x)$ for all $(x, y) \in B$, and $G_{B,s} = \frac{1}{|B|} \sum_{(x,y) \in B} G_{x, s[x]}$ is the sampled batch gradient, with $s[x]$ representing the sampled label corresponding to $x \in B$. Second, Shampoo replaces sampled labels with real labels, i.e., it replaces $\mathbb{E}_{B,s}[G_{B,s} G_{B,s}^\top]$ with $\mathbb{E}_B[G_B G_B^\top]$, where $G_B = \frac{1}{|B|} \sum_{(x,y) \in B} G_{x,y}$ is the batch gradient. Thus, if $G_t$ and $W_t$ represent the batch gradient and weight matrix at iteration $t$, and $\lambda$ is an exponential weighting parameter, the Shampoo update is given by
$$L_t = \lambda L_{t-1} + (1 - \lambda)\, G_t G_t^\top; \quad R_t = \lambda R_{t-1} + (1 - \lambda)\, G_t^\top G_t; \quad W_{t+1} = W_t - \eta \, L_t^{-1/4} \, G_t \, R_t^{-1/4},$$
where $L_t$ and $R_t$ are the left and right preconditioners maintained by Shampoo.

When viewed from the Hessian perspective, our focus is on studying:

- The optimal Kronecker product approximation of the matrix $H_{\mathrm{GN}}$ and its connection to Shampoo's approximation (detailed in Section 3).
- The effect of the two aforementioned approximations (batch gradients and real labels) on the quality of the approximation (detailed in Section 4).

2.3 OPTIMAL KRONECKER PRODUCT APPROXIMATION

In this subsection, we describe how to find the optimal Kronecker product approximation of a matrix $H \in \mathbb{R}^{mn \times mn}$ under the Frobenius norm. This problem can be reduced to finding the best rank-one approximation of a rearranged version of $H$.
We define the rearrangement operator $\mathrm{reshape}(\cdot)$, applied to a matrix $H$, by
$$\mathrm{reshape}(H)[mi + i',\, nj + j'] = H[mj + i,\, mj' + i'],$$
where $i, i' \in \{0, 1, \ldots, m-1\}$ and $j, j' \in \{0, 1, \ldots, n-1\}$, so that $\mathrm{reshape}(H) \in \mathbb{R}^{m^2 \times n^2}$. One useful property of the rearrangement operator is:
$$H = A \otimes B \iff \mathrm{reshape}(H) = a b^\top, \tag{2}$$
where $A \in \mathbb{R}^{m \times m}$, $a = \mathrm{vec}(A) \in \mathbb{R}^{m^2}$, $B \in \mathbb{R}^{n \times n}$, and $b = \mathrm{vec}(B) \in \mathbb{R}^{n^2}$. This property can be used to prove the following result on the optimal Kronecker product approximation:

Lemma 4 (Van Loan & Pitsianis (1993)). Let $H \in \mathbb{R}^{mn \times mn}$, $L \in \mathbb{R}^{m \times m}$, and $R \in \mathbb{R}^{n \times n}$. Then Kronecker product approximation of $H$ is equivalent to rank-one approximation of $\mathrm{reshape}(H)$ under the Frobenius norm:
$$\|H - L \otimes R\|_F = \|\mathrm{reshape}(H) - \mathrm{vec}(L)\,\mathrm{vec}(R)^\top\|_F.$$

Since the best rank-one approximation of a matrix is given by its singular value decomposition (SVD), we conclude the following:

Corollary 1. Let $H \in \mathbb{R}^{mn \times mn}$. If the top singular vectors and singular value of $\mathrm{reshape}(H)$ are $u_1$, $v_1$, and $\sigma_1$, respectively, then the matrices $L \in \mathbb{R}^{m \times m}$ and $R \in \mathbb{R}^{n \times n}$ defined by $\mathrm{vec}(L) = \sigma_1 u_1$ and $\mathrm{vec}(R) = v_1$ minimize the Frobenius norm $\|H - L \otimes R\|_F$.

Obtaining the SVD by power iteration. Power iteration is a well-known method for estimating the top eigenvalue of a matrix $M$; it can also be adapted to compute the top singular vectors of a matrix. The iterative procedure for the left singular vector $\ell$ and the right singular vector $r$ is given by
$$\ell_k \leftarrow M r_{k-1}; \quad r_k \leftarrow M^\top \ell_{k-1}, \tag{3}$$
where $k$ denotes the iteration number.

Cosine similarity. We will use cosine similarity between matrices as a measure of approximation quality. For two matrices $M_1$ and $M_2$, it is defined as $\mathrm{Tr}(M_1 M_2^\top) / (\|M_1\|_F \|M_2\|_F)$. A value of 1 indicates perfect alignment, while a value of 0 indicates orthogonality.

3 OPTIMAL KRONECKER PRODUCT APPROXIMATION AND SHAMPOO

Van Loan & Pitsianis (1993) describe an approach to find the optimal Kronecker product approximation of a matrix. Koroko et al.
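To make the rearrangement concrete, here is a small NumPy sketch. The helper name `rearrange` and the row-major flattening convention are ours (the paper's operator differs only by a fixed index permutation). It checks both Equation (2), that a Kronecker product rearranges to a rank-one matrix, and the norm identity of Lemma 4, which holds because the rearrangement merely permutes matrix entries.

```python
import numpy as np

def rearrange(H, m, n):
    """View H (mn x mn) as an m x m grid of n x n blocks; each block,
    flattened, becomes one row of the m^2 x n^2 rearranged matrix."""
    return H.reshape(m, n, m, n).transpose(0, 2, 1, 3).reshape(m * m, n * n)

rng = np.random.default_rng(0)
m, n = 3, 4
A = rng.standard_normal((m, m))
B = rng.standard_normal((n, n))

# Equation (2): H = A kron B  <=>  rearrange(H) = vec(A) vec(B)^T, rank one.
R = rearrange(np.kron(A, B), m, n)
assert np.linalg.matrix_rank(R) == 1
assert np.allclose(R, np.outer(A.reshape(-1), B.reshape(-1)))

# Lemma 4: the rearrangement permutes entries, so Frobenius distances to
# Kronecker products and to rank-one matrices coincide.
H = rng.standard_normal((m * n, m * n))
lhs = np.linalg.norm(H - np.kron(A, B))
rhs = np.linalg.norm(rearrange(H, m, n) - np.outer(A.reshape(-1), B.reshape(-1)))
assert np.isclose(lhs, rhs)
```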
(2023) extend this work to derive layer-wise Kronecker product approximations of the Hessian matrix for networks without weight sharing. In particular, their Proposition 3.1 relies on the rank-1 structure of single-sample gradients to efficiently compute $\hat{H}_{\mathrm{GN}}$ in non-weight-sharing networks. Our analysis does not rely on any assumptions on the gradients and hence applies to both weight-sharing and non-weight-sharing networks. While our analysis builds on these works, that material is restricted to Section 3.1. Our primary contribution (Section 3.2) lies in establishing a novel connection between the square of the Shampoo estimate and the optimal Kronecker approximation.

3.1 OPTIMAL KRONECKER PRODUCT APPROXIMATION

This section applies the theory from Section 2.3 to find the optimal Kronecker product approximation of a covariance matrix $H = \mathbb{E}_{g \sim D_g}[g g^\top]$ for $g \in \mathbb{R}^{mn}$. Both perspectives on Shampoo described in Section 2.2 involve Kronecker product approximations of $H$ of the form $L \otimes R$, where $L \in \mathbb{R}^{m \times m}$ and $R \in \mathbb{R}^{n \times n}$, but for different distributions $D_g$. For the Adagrad viewpoint, $D_g$ is the uniform distribution over the gradients $g_t$ across iterations $t$, giving $H = H_{\mathrm{Ada}}$. For the Hessian viewpoint, $D_g$ is the distribution over gradients with batch size 1 and sampled labels, giving $H = H_{\mathrm{GN}}$. To simplify notation, we write $\mathbb{E}[g g^\top]$ for $\mathbb{E}_{g \sim D_g}[g g^\top]$, as our results apply to any distribution $D_g$. This section explores the optimal Kronecker product approximation of such a generic matrix $H$, examines its connection to Shampoo, and presents experimental validation for $H = H_{\mathrm{Ada}}$ and $H = H_{\mathrm{GN}}$.

Since $g \in \mathbb{R}^{mn}$, each entry of $g$ can be indexed by a tuple $(i, j) \in [m] \times [n]$, and consequently each entry of $H$ by a tuple $((i, j), (i', j'))$. We now introduce the matrix $\hat{H} := \mathrm{reshape}(H) \in \mathbb{R}^{m^2 \times n^2}$, the rearrangement of the entries of $H$ defined in Section 2.3. Using Equation (2), we have:
$$\hat{H} = \mathbb{E}[G \otimes G].$$
Furthermore, by Lemma 4, if $L \otimes R$ is the optimal Kronecker product approximation of $H$, then $\ell r^\top$ is the optimal rank-1 approximation of $\hat{H}$, where $\ell = \mathrm{vec}(L)$ and $r = \mathrm{vec}(R)$. Thus, the problem reduces to finding the optimal rank-1 approximation of $\hat{H}$. Applying the power iteration scheme from Equation (3) to estimate the top singular vectors of $\hat{H}$, and using Lemma 1, gives the following updates for the $k$-th step of power iteration:
$$\ell_k \leftarrow \hat{H} r_{k-1} = \mathbb{E}[G \otimes G]\, r_{k-1} = \mathrm{vec}\!\left(\mathbb{E}[G R_{k-1} G^\top]\right); \quad r_k \leftarrow \hat{H}^\top \ell_{k-1} = \mathbb{E}[G \otimes G]^\top \ell_{k-1} = \mathrm{vec}\!\left(\mathbb{E}[G^\top L_{k-1} G]\right).$$
Reshaping the vectors into matrices gives the updates:
$$L_k \leftarrow \mathbb{E}[G R_{k-1} G^\top]; \quad R_k \leftarrow \mathbb{E}[G^\top L_{k-1} G]. \tag{4}$$

3.2 ONE ROUND OF POWER ITERATION

Our primary approximation replaces the full power iteration scheme (Equation (4)) with a single iteration. This leads to the main contribution of our work:

Proposition 1. A single step of power iteration, starting from the identity matrix, for obtaining the optimal Kronecker product approximation of $H$ is exactly equivalent to the square of Shampoo's approximation of $H$.

Proof. Initializing the single iteration at the identity, i.e., $L_0 = I_m$ and $R_0 = I_n$, reduces the iterative updates
$$L_k \leftarrow \mathbb{E}[G R_{k-1} G^\top]; \quad R_k \leftarrow \mathbb{E}[G^\top L_{k-1} G]$$
to the simplified one-step updates
$$L \leftarrow \mathbb{E}[G G^\top]; \quad R \leftarrow \mathbb{E}[G^\top G].$$
With these expressions for $L$ and $R$, $L \otimes R$ corresponds exactly to the square of Shampoo's approximation of $H$ given by the right-hand side of Equation (1).

As shown in Figure 1, this single step of power iteration closely approximates the optimal Kronecker product approximation for both $H = H_{\mathrm{GN}}$ (top) and $H = H_{\mathrm{Ada}}$ (bottom). In contrast, the upper bound proposed by the original Shampoo work (Gupta et al., 2018) performs significantly worse.

3.2.1 WHY INITIALIZE WITH THE IDENTITY MATRIX?
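Proposition 1 is easy to verify numerically. The sketch below uses our own toy setup (random Gaussian "gradients" and a row-major vectorization convention): it builds $H = \mathbb{E}[gg^\top]$, checks that one power-iteration step from the identity reproduces the Shampoo² factor $\mathbb{E}[GG^\top]$, and confirms that the resulting Kronecker product is at most as well aligned with $H$ as the SVD-optimal one.

```python
import numpy as np

def rearrange(H, m, n):
    # Block rearrangement of Section 2.3 (row-major flattening convention).
    return H.reshape(m, n, m, n).transpose(0, 2, 1, 3).reshape(m * m, n * n)

def cosine(M1, M2):
    return np.trace(M1 @ M2.T) / (np.linalg.norm(M1) * np.linalg.norm(M2))

rng = np.random.default_rng(0)
m, n, T = 4, 5, 200
Gs = rng.standard_normal((T, m, n))          # toy "gradients"
gs = Gs.reshape(T, m * n)                    # g = vec(G), row-major
H = gs.T @ gs / T                            # H = E[g g^T]
Hhat = rearrange(H, m, n)

# Shampoo^2 factors.
L_sh = np.einsum('tab,tcb->ac', Gs, Gs) / T  # E[G G^T]
R_sh = np.einsum('tba,tbc->ac', Gs, Gs) / T  # E[G^T G]

# Proposition 1: one power step from the identity gives vec(E[G G^T]).
assert np.allclose(Hhat @ np.eye(n).reshape(-1), L_sh.reshape(-1))

# Optimal Kronecker approximation via the top singular pair of Hhat.
U, S, Vt = np.linalg.svd(Hhat)
L_opt = (S[0] * U[:, 0]).reshape(m, m)
R_opt = Vt[0].reshape(n, n)
cos_opt = cosine(np.kron(L_opt, R_opt), H)
cos_sh2 = cosine(np.kron(L_sh, R_sh), H)
assert cos_sh2 <= cos_opt + 1e-9             # the SVD solution is optimal
```

Because the rearrangement preserves inner products, cosine similarities computed between Kronecker products and $H$ equal those between the corresponding rank-one matrices and $\hat{H}$.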
Figure 2: Effectiveness of identity-matrix initialization in power iteration for Kronecker product approximation, across datasets and architectures (MNIST-2, CIFAR-5M, ImageNet; cosine similarity vs. training steps). The plots compare $\sigma_1 / \sqrt{\sum_i \sigma_i^2}$ with $\alpha_1 \sigma_1 / \sqrt{\sum_i \alpha_i^2 \sigma_i^2}$ for the left ($L$) and for the right ($R$) singular vectors. Top row: Gauss-Newton component of the Hessian ($H_{\mathrm{GN}}$). Bottom row: Adagrad preconditioner matrix ($H_{\mathrm{Ada}}$). Note that $\alpha_1 \sigma_1 / \sqrt{\sum_i \alpha_i^2 \sigma_i^2}$ consistently approaches 1 more closely than $\sigma_1 / \sqrt{\sum_i \sigma_i^2}$, supporting our theoretical argument for the effectiveness of one-step power iteration with identity initialization. See Appendix C.1 for experimental details.

Suppose the SVD of $\hat{H}$ is given by $\hat{H} = \sum_i \sigma_i u_i v_i^\top$, or equivalently, $H = \sum_i \sigma_i \, U_i \otimes V_i$. Whether power iteration converges in a single step depends on the inner product of the initialization vector with the top singular vector. Let us focus on the left side,² i.e., the update $L \leftarrow \mathbb{E}[G G^\top]$, which, as described earlier, is equivalent to starting from the initialization $I_n$. Let $\mathrm{vec}(I_n) = \sum_i \alpha_i v_i$, i.e., $I_n = \sum_i \alpha_i V_i$. After one iteration, we obtain $\ell = \sum_i \alpha_i \sigma_i u_i$ and, correspondingly, $L = \sum_i \alpha_i \sigma_i U_i$. We are interested in how closely $\ell$ approximates the top left singular vector $u_1$: the cosine similarity between $\ell$ and $u_1$ is $\alpha_1 \sigma_1 / \sqrt{\sum_i \alpha_i^2 \sigma_i^2}$.

One reason this cosine similarity might be large is that $\hat{H}$ is nearly rank-1 (i.e., $\sigma_1$ dominates), meaning $H$ is closely approximated by a single Kronecker product. However, as shown in Figure 1, this assumption does not universally hold.
Instead, we propose an alternative explanation for why a single step of power iteration is typically sufficient: the coefficient $\alpha_1$ is usually larger than $\alpha_i$ for all $i \ge 2$. We provide both a theoretical justification and empirical evidence for this. We start by noting that $\alpha_i = \mathrm{vec}(I_n)^\top v_i = \mathrm{Tr}(V_i)$. The identity matrix is a good initialization because it maximizes the worst-case dot product with the possible top components, which are PSD matrices (Proposition 2), and is expected to have a smaller dot product with the later components.

Lemma 5 (Van Loan & Pitsianis (1993)). $V_1$ is a positive semi-definite (PSD) matrix.

²The discussion for the right side is analogous.

Since $V_1$ is PSD, we want to initialize the power iteration with a matrix that is close to all PSD matrices. We now show that the identity matrix achieves this, specifically by maximizing the minimum dot product over the set of PSD matrices of unit Frobenius norm.

Proposition 2. Let $S_m$ denote the set of $m \times m$ PSD matrices of unit Frobenius norm. Then
$$\frac{1}{\sqrt{m}} I_m = \arg\max_{M \in S_m} \; \min_{M' \in S_m} \; \langle \mathrm{vec}(M), \mathrm{vec}(M') \rangle.$$

This proposition shows that $I_m$ maximizes the worst-case dot product with the possible top singular vectors. We now argue that its dot product with the other singular vectors should be smaller:

Lemma 6. If $V_1$ is positive definite, then the $V_i$ for $i \ge 2$ are not PSD.

Therefore, the diagonal elements of $V_i$ for $i \ge 2$ need not all be positive, potentially leading to cancellations in $\mathrm{Tr}(V_i) = \alpha_i$. Hence, we expect $\alpha_i$ for $i \ge 2$ to be smaller than $\alpha_1$. To quantify the benefit of $\alpha_1$ being larger than the other $\alpha_i$, we compare $\alpha_1 \sigma_1 / \sqrt{\sum_i \alpha_i^2 \sigma_i^2}$ (for both left and right singular vectors) with $\sigma_1 / \sqrt{\sum_i \sigma_i^2}$. The latter can be interpreted as the cosine similarity if all the $\alpha_i$ were equal, or as a measure of how close $\hat{H}$ is to being rank-1, since it equals the cosine similarity between $u_1 v_1^\top$ and $\hat{H}$.
Thus, $\sigma_1 / \sqrt{\sum_i \sigma_i^2}$ corresponds to the optimal Kronecker cosine similarity shown in Figure 1. In Figure 2, we track both quantities during training and observe that $\alpha_1 \sigma_1 / \sqrt{\sum_i \alpha_i^2 \sigma_i^2}$ is consistently closer to 1 than $\sigma_1 / \sqrt{\sum_i \sigma_i^2}$ for both $H = H_{\mathrm{GN}}$ (top) and $H = H_{\mathrm{Ada}}$ (bottom).

3.2.2 EXACT KRONECKER PRODUCT STRUCTURE IN H

Our analysis shows that $\mathbb{E}[G G^\top] \otimes \mathbb{E}[G^\top G]$ closely approximates the optimal Kronecker product approximation of $H$. We now show that this approximation becomes exact when $H$ itself is a Kronecker product. In this case, $\hat{H}$ is rank-1, and a single round of power iteration recovers $\hat{H}$ perfectly. While our earlier discussion focused on the direction of the top singular vectors of $\hat{H}$, the rank-1 assumption lets us derive an explicit expression for $\hat{H}$, and consequently for $H$.

Corollary 2. Under the assumption that $\hat{H}$ is rank-1,
$$H = \mathbb{E}[G G^\top] \otimes \mathbb{E}[G^\top G] \,/\, \mathrm{Tr}\!\left(\mathbb{E}[G G^\top]\right).$$

Proof. Let $\hat{H} = \sigma u v^\top$, i.e., $H = \sigma \, U \otimes V$. Write $I_m = \mathrm{Tr}(U)\, U + R_m$ and $I_n = \mathrm{Tr}(V)\, V + R_n$, where $R_m$ and $R_n$ are the residual matrices. After one round of power iteration, the left and right estimates maintained by Shampoo are $\mathbb{E}[G G^\top] = \sigma \mathrm{Tr}(V)\, U$ and $\mathbb{E}[G^\top G] = \sigma \mathrm{Tr}(U)\, V$. Hence $\mathrm{Tr}(\mathbb{E}[G G^\top]) = \sigma \, \mathrm{Tr}(U)\, \mathrm{Tr}(V)$, and thus
$$H = \sigma \, U \otimes V = \mathbb{E}[G G^\top] \otimes \mathbb{E}[G^\top G] \,/\, \mathrm{Tr}\!\left(\mathbb{E}[G G^\top]\right).$$

Since $\hat{H}_{\mathrm{GN}}$ is an $m^2 \times 1$ matrix for binomial logistic regression, it is rank-1, so the equality in the corollary holds. In other words, the square of Shampoo's $H_{\mathrm{GN}}$ estimate correlates perfectly with $H_{\mathrm{GN}}$ for binomial logistic regression, as demonstrated in the first plot of Figure 1.

We note that $\mathbb{E}[G G^\top] \otimes \mathbb{E}[G^\top G] / \mathrm{Tr}(\mathbb{E}[G G^\top])$ as an estimate of the Hessian (not Adagrad's covariance matrix) was also derived by Ren & Goldfarb (2021). However, their assumptions were much stronger than ours: they assume that the gradients follow a tensor-normal distribution, which implies that $\hat{H}$ is rank-1 and that $g$ is mean-zero (and is thus not applicable to the Adagrad viewpoint).
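For the binomial-logistic-regression case ($n = 1$, so $\hat{H}$ has a single column and is automatically rank-1), the trace-corrected estimate of Corollary 2 can be checked to be exact on synthetic gradients. The setup below is a toy check of ours, not the paper's experiment.

```python
import numpy as np

rng = np.random.default_rng(1)
m, T = 6, 500
G = rng.standard_normal((T, m, 1))        # n = 1: each gradient is a column vector
g = G[:, :, 0]                            # vec(G) coincides with G itself

H = g.T @ g / T                           # H = E[g g^T]  (here mn = m)
L = np.einsum('tab,tcb->ac', G, G) / T    # E[G G^T], equals H when n = 1
R = np.einsum('tba,tbc->ac', G, G) / T    # E[G^T G], the 1x1 matrix E[|g|^2]

# Corollary 2: H = E[G G^T] kron E[G^T G] / Tr(E[G G^T]), exactly,
# since Tr(E[G G^T]) = E[|g|^2] cancels the scalar factor R.
approx = np.kron(L, R) / np.trace(L)
assert np.allclose(approx, H)
```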
In contrast, our approach only requires a second-moment assumption on the gradients: that $\hat{H}$ is rank-1. This weaker assumption makes our results applicable to a broader range of scenarios, including binomial logistic regression. More importantly, our derivation and experiments show that the direction $\mathbb{E}[G G^\top] \otimes \mathbb{E}[G^\top G]$ closely approximates the optimal Kronecker product even when $\hat{H}$ is not rank-1.

3.2.3 DISCUSSION ABOUT OPTIMIZATION

Our primary goal in this work was to provide a theoretical foundation for understanding Shampoo's effectiveness, rather than to propose a new algorithm. However, the connection we uncovered between Shampoo² and the optimal Kronecker product approximation has implications for optimization, which we discuss below. Let $H_1$ denote $\mathbb{E}[G G^\top] \otimes \mathbb{E}[G^\top G]$. As mentioned in Equation (1), Gupta et al. (2018) used the approximation $H_{1/2} := H_1^{1/2} = \mathbb{E}[G G^\top]^{1/2} \otimes \mathbb{E}[G^\top G]^{1/2}$. In practice, the gradient step in Shampoo is taken in the direction of $-H_{1/2}^{-p} \nabla L$, where $p$ is tuned as a hyperparameter (Anil et al., 2021; Shi et al., 2023). Since $H_{1/2}^{-p} = H_1^{-p/2}$, searching over $p$ in $H_{1/2}^{-p}$ yields the same search space as in $H_1^{-p}$; this distinction therefore does not affect optimization speed, but it deepens our understanding of Shampoo's mechanism. In fact, empirical work has demonstrated the practical benefits of Shampoo², which has shown improved optimization performance compared to Shampoo (Anil et al., 2021; Shi et al., 2023). Moreover, Vyas et al. (2024) introduced the SOAP algorithm, incorporating our theoretical insights, including Shampoo² and the trace correction of Section 3.2.2, and showed improvements over AdamW and Shampoo.

4 HESSIAN APPROXIMATION OF SHAMPOO

In this section, we investigate the impact of different practical considerations on the Hessian approximation in Shampoo, building on the insights from Section 2.2.2.
Specifically, we examine the effects of averaging gradients across a batch and of using real labels instead of sampled labels. These factors influence how well the Shampoo optimizer approximates the Gauss-Newton matrix $H_{\mathrm{GN}}$, which we previously evaluated with batch size 1 and sampled labels.

4.1 AVERAGING GRADIENTS ACROSS THE BATCH

The first factor we analyze is averaging gradients across the batch. We transition from computing the preconditioners on a per-sample basis,
$$L \leftarrow \mathbb{E}_{x, s \sim f(x)}[G_{x,s} G_{x,s}^\top]; \quad R \leftarrow \mathbb{E}_{x, s \sim f(x)}[G_{x,s}^\top G_{x,s}],$$
to averaging across a batch $B$:
$$L \leftarrow |B| \, \mathbb{E}_{B,s}[G_{B,s} G_{B,s}^\top]; \quad R \leftarrow |B| \, \mathbb{E}_{B,s}[G_{B,s}^\top G_{B,s}],$$
where $s$ denotes the concatenation of $s \sim f(x)$ for all $x \in B$ and $G_{B,s} = \frac{1}{|B|} \sum_{x \in B} G_{x, s[x]}$ is the batch gradient, with $s[x]$ representing the sampled label for $x$. As demonstrated in prior works, this change has no effect in expectation, because $G_{x,s}$ is mean-zero for every $x$ when the expectation is taken over $s \sim f(x)$ (Bartlett, 1953), i.e., $\mathbb{E}_s[G_{x,s}] = 0$.

Lemma 7 (Implicit in Liu et al. (2024); Osawa et al. (2023)). $|B| \, \mathbb{E}_{B,s}[G_{B,s} G_{B,s}^\top] = \mathbb{E}_{x, s \sim f(x)}[G_{x,s} G_{x,s}^\top]$.

This averaging significantly improves running time, providing a multiplicative speed-up proportional to the batch size.

4.2 USING REAL LABELS INSTEAD OF SAMPLED LABELS

Next, we consider the impact of replacing sampled labels $s \sim f(x)$ with real labels $y$, as is often done in practice. At batch size 1, this substitution yields the empirical Fisher approximation. Prior work has discussed this approximation extensively and shown that, under certain conditions and in the presence of label noise, the two quantities converge as the parameters approach an optimum (Grosse, 2021; Osawa et al., 2023; Kunstner et al., 2019). In Figure 3 (top), we evaluate the approximation of $H_{\mathrm{GN}}$ with batch size 1, finding that its quality remains high throughout training. As the batch size increases, however, the quality degrades, because gradients with real labels are not mean-zero.
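Lemma 7 can be sanity-checked with synthetic mean-zero gradients. This is a statistical check with a setup of our own choosing (Gaussian stand-ins for sampled-label gradients), so the tolerances are deliberately loose.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, N, B = 3, 4, 20000, 8
G = rng.standard_normal((N, m, n))   # mean-zero stand-ins for sampled-label gradients

def second_moment(X):
    # E[X X^T] averaged over the leading axis.
    return np.einsum('tab,tcb->ac', X, X) / len(X)

per_sample = second_moment(G)                    # E[G G^T]
GB = G.reshape(N // B, B, m, n).mean(axis=1)     # batch-averaged gradients
batched = B * second_moment(GB)                  # |B| E[G_B G_B^T]

# Lemma 7: the two estimators agree in expectation (here, up to sampling noise).
assert np.allclose(per_sample, batched, atol=0.5)
# For i.i.d. standard normal entries the true value is n * I_m.
assert np.allclose(per_sample, n * np.eye(m), atol=0.5)
```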
The following lemma shows how the estimator changes with batch size.

Figure 3: Cosine similarity between approximations of $H_{\mathrm{GN}}$ and the true Hessian across datasets (MNIST-2, CIFAR-5M, ImageNet; cosine similarity vs. training steps), comparing the optimal Kronecker approximation with Shampoo² using sampled labels and Shampoo² using real labels. The top row shows results for batch size 1, where the empirical Fisher closely tracks the optimal Kronecker product approximation throughout training. In the bottom row (batch size 256), the approximation quality degrades as the batch size increases (see Lemma 8). The batch size refers to that used in the Hessian approximation, not for optimization.

Lemma 8 (Grosse (2021)). Let $B$ denote the batch and $G_B = \frac{1}{|B|} \sum_{(x,y) \in B} G_{x,y}$ the batch gradient. Then
$$\mathbb{E}_B[G_B G_B^\top] = \frac{1}{|B|} \, \mathbb{E}_{x,y}[G_{x,y} G_{x,y}^\top] + \left(1 - \frac{1}{|B|}\right) \mathbb{E}_{x,y}[G_{x,y}] \, \mathbb{E}_{x,y}[G_{x,y}]^\top.$$

This lemma shows that, depending on the batch size, the estimator interpolates between $\mathbb{E}_{x,y}[G_{x,y} G_{x,y}^\top]$ (the empirical Fisher) and $\mathbb{E}_{x,y}[G_{x,y}] \, \mathbb{E}_{x,y}[G_{x,y}]^\top$. As shown in Figure 3 (top), at batch size 1, when $\mathbb{E}_B[G_B G_B^\top]$ equals $\mathbb{E}_{x,y}[G_{x,y} G_{x,y}^\top]$, it closely tracks the optimal Kronecker product approximation. With increasing batch size (Figure 3, bottom row), the approximation quality begins to degrade. We note that this adjustment has the computational advantage of not requiring an additional backpropagation with sampled labels; these computations can instead be performed alongside standard training.

5 CONCLUSION

Our primary contribution is establishing a precise connection between Shampoo's approximation and the optimal Kronecker-factored approximation of the matrix $H$.
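The interpolation in Lemma 8 can likewise be verified numerically. The setup is again synthetic and ours alone: Gaussian gradients with a nonzero mean play the role of real-label gradients, for which $\mathbb{E}[G] \ne 0$.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, N = 3, 4, 160000
mu = rng.standard_normal((m, n))     # nonzero mean: E[G] != 0 with real labels
G = mu + rng.standard_normal((N, m, n))

def second_moment(X):
    return np.einsum('tab,tcb->ac', X, X) / len(X)

emp_fisher = second_moment(G)        # E[G G^T], the empirical Fisher
mean_term = mu @ mu.T                # E[G] E[G]^T

# Lemma 8: E[G_B G_B^T] = (1/|B|) E[G G^T] + (1 - 1/|B|) E[G] E[G]^T.
for B in (1, 4, 16):
    GB = G.reshape(N // B, B, m, n).mean(axis=1)
    lhs = second_moment(GB)
    rhs = emp_fisher / B + (1 - 1 / B) * mean_term
    assert np.allclose(lhs, rhs, atol=0.1)
```

As $|B|$ grows, the estimator drifts from the empirical Fisher toward the rank-deficient mean outer product, matching the degradation seen in Figure 3 (bottom).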
We prove that the square of Shampoo's approximation is equivalent to one round of the power iteration scheme for obtaining this optimal approximation. Empirically, we verify that this single round closely tracks the optimal Kronecker-factored approximation, significantly outperforming the original Shampoo method. Finally, insights from our work have implications for optimization, with recent research showing improvements over AdamW and Shampoo by incorporating our theoretical findings.

ACKNOWLEDGEMENTS

NV and DM are supported by a Simons Investigator Fellowship, NSF grant DMS-2134157, DARPA grant W911NF2010021, and DOE grant DE-SC0022199. This work has been made possible in part by a gift from the Chan Zuckerberg Initiative Foundation to establish the Kempner Institute for the Study of Natural and Artificial Intelligence. SK and DM acknowledge funding from the Office of Naval Research under award N00014-22-1-2377 and the National Science Foundation under award #IIS 2229881. LJ acknowledges funding from the National Science Foundation grant DMS-2134157.

REFERENCES

Rohan Anil, Vineet Gupta, Tomer Koren, Kevin Regan, and Yoram Singer. Towards practical second order optimization for deep learning, 2021. URL https://openreview.net/forum?id=Sc8cY4Jpi3s.

Rohan Anil, Sandra Gadanho, Da Huang, Nijith Jacob, Zhuoshu Li, Dong Lin, Todd Phillips, Cristina Pop, Kevin Regan, Gil I Shamir, et al. On the factory floor: ML engineering for industrial-scale ads recommendation models. arXiv preprint arXiv:2209.05310, 2022.

Lukas Balles, Fabian Pedregosa, and Nicolas Le Roux. The geometry of sign gradient descent, 2020.

M. S. Bartlett. Approximate confidence intervals. Biometrika, 40(1/2):12–19, 1953. ISSN 00063444. URL http://www.jstor.org/stable/2333091.

Minhyung Cho, Chandra Dhir, and Jaehyung Lee. Hessian-free optimization for learning deep multidimensional recurrent neural networks. In C. Cortes, N.
Lawrence, D. Lee, M. Sugiyama, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015. URL https://proceedings.neurips.cc/paper_files/paper/2015/file/a86c450b76fb8c371afead6410d55534-Paper.pdf.

George E. Dahl, Frank Schneider, Zachary Nado, Naman Agarwal, Chandramouli Shama Sastry, Philipp Hennig, Sourabh Medapati, Runa Eschenhagen, Priya Kasimbeg, Daniel Suo, Juhan Bae, Justin Gilmer, Abel L. Peirson, Bilal Khan, Rohan Anil, Mike Rabbat, Shankar Krishnan, Daniel Snider, Ehsan Amid, Kongtao Chen, Chris J. Maddison, Rakshith Vasudev, Michal Badura, Ankush Garg, and Peter Mattson. Benchmarking neural network training algorithms, 2023.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE, 2009.

John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(61):2121–2159, 2011. URL http://jmlr.org/papers/v12/duchi11a.html.

Sai Surya Duvvuri, Fnu Devvrit, Rohan Anil, Cho-Jui Hsieh, and Inderjit S Dhillon. Combining axes preconditioners through Kronecker approximation for deep learning. In The Twelfth International Conference on Learning Representations, 2024.

Runa Eschenhagen, Alexander Immer, Richard E Turner, Frank Schneider, and Philipp Hennig. Kronecker-factored approximate curvature for modern neural network architectures. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=Ex3oJEKS53.

Kai-Xin Gao, Xiao-Lei Liu, Zheng-Hai Huang, Min Wang, Shuangling Wang, Zidong Wang, Dachuan Xu, and Fan Yu. Eigenvalue-corrected natural gradient based on a new approximation, 2020.

Kaixin Gao, Xiaolei Liu, Zhenghai Huang, Min Wang, Zidong Wang, Dachuan Xu, and Fan Yu.
A trace-restricted Kronecker-factored approximation to natural gradient. Proceedings of the AAAI Conference on Artificial Intelligence, 35(9):7519–7527, May 2021. doi: 10.1609/aaai.v35i9.16921. URL https://ojs.aaai.org/index.php/AAAI/article/view/16921.

Jezabel R Garcia, Federica Freddi, Stathi Fotiadis, Maolin Li, Sattar Vakili, Alberto Bernacchia, and Guillaume Hennequin. Fisher-Legendre (FishLeg) optimization of deep neural networks. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=c9lAOPvQHS.

Google Gemini Team. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. https://storage.googleapis.com/deepmind-media/gemini/gemini_v1_5_report.pdf, 2024. [Online; accessed 19-May-2024].

Thomas George, César Laurent, Xavier Bouthillier, Nicolas Ballas, and Pascal Vincent. Fast approximate natural gradient descent in a Kronecker factored eigenbasis. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper_files/paper/2018/file/48000647b315f6f00f913caa757a70b3-Paper.pdf.

Roger Grosse. Adaptive gradient methods, normalization, and weight decay. https://www.cs.toronto.edu/~rgrosse/courses/csc2541_2021/readings/L05_normalization.pdf, 2021.

Vineet Gupta, Tomer Koren, and Yoram Singer. Shampoo: Preconditioned stochastic tensor optimization. In International Conference on Machine Learning, pp. 1842–1850. PMLR, 2018.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

Harold V Henderson and Shayle R Searle. The vec-permutation matrix, the vec operator and Kronecker products: A review.
Linear and Multilinear Algebra, 9(4):271–288, 1981.

Diederik P Kingma. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Abdoulaye Koroko, Ani Anciaux-Sedrakian, Ibtihel Ben Gharbia, Valérie Garès, Mounir Haddou, and Quang Huy Tran. Efficient approximations of the Fisher matrix in neural networks using Kronecker product singular value decomposition. ESAIM: Proceedings and Surveys, 73:218–237, 2023.

Frederik Kunstner, Philipp Hennig, and Lukas Balles. Limitations of the empirical Fisher approximation for natural gradient descent. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper_files/paper/2019/file/46a558d97954d0692411c861cf78ef79-Paper.pdf.

Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Xi-Lin Li. Preconditioned stochastic gradient descent. IEEE Transactions on Neural Networks and Learning Systems, 29(5):1454–1466, 2018. doi: 10.1109/TNNLS.2017.2672978.

Xi-Lin Li. Stochastic Hessian fittings with Lie groups, 2024. URL https://arxiv.org/abs/2402.11858.

Wu Lin, Felix Dangel, Runa Eschenhagen, Juhan Bae, Richard E. Turner, and Alireza Makhzani. Can we remove the square-root in adaptive gradient methods? A second-order perspective. arXiv preprint arXiv:2402.03496, 2024.

Hong Liu, Zhiyuan Li, David Leo Wright Hall, Percy Liang, and Tengyu Ma. Sophia: A scalable stochastic second-order optimizer for language model pre-training. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=3xHDeA8Noi.

Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A ConvNet for the 2020s.
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11976–11986, June 2022.

C. F. Van Loan and N. Pitsianis. Approximation with Kronecker products. In Marc S. Moonen, Gene H. Golub, and Bart L. R. De Moor (eds.), Linear Algebra for Large Scale and Real-Time Applications, pp. 293–314. Springer, 1993. URL https://link.springer.com/chapter/10.1007/978-94-015-8196-7_17.

James Martens. Deep learning via Hessian-free optimization. In Johannes Fürnkranz and Thorsten Joachims (eds.), Proceedings of the 27th International Conference on Machine Learning (ICML-10), June 21-24, 2010, Haifa, Israel, pp. 735–742. Omnipress, 2010. URL https://icml.cc/Conferences/2010/papers/458.pdf.

James Martens and Roger Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. In International Conference on Machine Learning, pp. 2408–2417. PMLR, 2015.

James Martens and Ilya Sutskever. Learning recurrent neural networks with Hessian-free optimization. In Proceedings of the 28th International Conference on Machine Learning, ICML '11, pp. 1033–1040, Madison, WI, USA, 2011. Omnipress. ISBN 9781450306195.

James Martens, Jimmy Ba, and Matt Johnson. Kronecker-factored curvature approximations for recurrent neural networks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=HyMTkQZAb.

MLCommons. MLC AlgoPerf benchmark competition. https://mlcommons.org/2024/08/mlc-algoperf-benchmark-competition/, 2024. Accessed: 2024-10-01.

Preetum Nakkiran, Behnam Neyshabur, and Hanie Sedghi. The deep bootstrap framework: Good online learners are good offline generalizers. arXiv preprint arXiv:2010.08127, 2020.

Kazuki Osawa, Yohei Tsuji, Yuichiro Ueno, Akira Naruse, Rio Yokota, and Satoshi Matsuoka. Large-scale distributed second-order optimization using Kronecker-factored approximate curvature for deep convolutional neural networks.
In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12351–12359, 2019. doi: 10.1109/CVPR.2019.01264.

Kazuki Osawa, Satoki Ishikawa, Rio Yokota, Shigang Li, and Torsten Hoefler. ASDL: A unified interface for gradient preconditioning in PyTorch. CoRR, abs/2305.04684, 2023. doi: 10.48550/ARXIV.2305.04684. URL https://doi.org/10.48550/arXiv.2305.04684.

Vardan Papyan. Measurements of three-level hierarchical structure in the outliers in the spectrum of deepnet Hessians. In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 5012–5021. PMLR, 09–15 Jun 2019. URL https://proceedings.mlr.press/v97/papyan19a.html.

Razvan Pascanu and Yoshua Bengio. Revisiting natural gradient for deep networks. In Yoshua Bengio and Yann LeCun (eds.), 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014. URL http://arxiv.org/abs/1301.3584.

Omead Pooladzandi and Xi-Lin Li. Curvature-informed SGD via general purpose Lie-group preconditioners, 2024. URL https://openreview.net/forum?id=sawjxRnVpF.

Yi Ren and Donald Goldfarb. Tensor normal training for deep learning models. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 26040–26052. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/dae3312c4c6c7000a37ecfb7b0aeb0e4-Paper.pdf.

Adepu Ravi Sankar, Yash Khasbage, Rahul Vigneswaran, and Vineeth N Balasubramanian. A deeper look at the Hessian eigenspectrum of deep neural networks and its applications to regularization. Proceedings of the AAAI Conference on Artificial Intelligence, 35(11):9481–9488, May 2021. doi: 10.1609/aaai.v35i11.17142. URL https://ojs.aaai.org/index.php/AAAI/article/view/17142.
Hao-Jun Michael Shi, Tsung-Hsien Lee, Shintaro Iwasaki, Jose Gallego-Posada, Zhijing Li, Kaushik Rangadurai, Dheevatsa Mudigere, and Michael Rabbat. A distributed data-parallel PyTorch implementation of the distributed Shampoo optimizer for training neural networks at-scale, 2023.

Charles F Van Loan and Nikos Pitsianis. Approximation with Kronecker products. Springer, 1993.

Nikhil Vyas, Depen Morwani, Rosie Zhao, Itai Shapira, David Brandfonbrener, Lucas Janson, and Sham Kakade. SOAP: Improving and stabilizing Shampoo using Adam. arXiv preprint arXiv:2409.11321, 2024.

Zhewei Yao, Amir Gholami, Sheng Shen, Mustafa Mustafa, Kurt Keutzer, and Michael Mahoney. AdaHessian: An adaptive second order optimizer for machine learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 10665–10673, 2021.

A LIMITATIONS

The main contribution of our work is to demonstrate that the square of Shampoo's approximation of H (whether H refers to H_Ada or H_GN) is nearly equivalent to the optimal Kronecker approximation of H. While we empirically verify this across various datasets and provide theoretical arguments, the gap between the two depends on the problem structure. In some experiments with the ViT architecture (see Appendix B), we observe that the gap is relatively larger compared to other architectures. Furthermore, it remains an open question to understand the conditions, beyond those described in K-FAC (Martens & Grosse, 2015), under which H is expected to be close to a Kronecker product. Again, in some of the ViT experiments (Appendix B), we find that the optimal Kronecker product approximation to H performs much worse compared to other architectures.
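For reference, the optimal Kronecker product approximation discussed throughout can be computed by power iteration on the Van Loan–Pitsianis rearrangement of H: the best Frobenius-norm approximation of H by L ⊗ R corresponds to the best rank-1 approximation of the rearranged matrix. The following is our own minimal numpy sketch (function names are ours), starting the iteration from the identity as described in the figure details of Appendix C.1:

```python
import numpy as np

def rearrange(H, m, n):
    # Van Loan-Pitsianis rearrangement: the best Kronecker approximation
    # L (m x m) kron R (n x n) of H (mn x mn) in Frobenius norm becomes
    # the best rank-1 approximation of this (m^2 x n^2) matrix
    return H.reshape(m, n, m, n).transpose(0, 2, 1, 3).reshape(m * m, n * n)

def optimal_kron(H, m, n, iters=5):
    # power iteration on the rearranged matrix, starting from vec(I_m)
    RH = rearrange(H, m, n)
    l = np.eye(m).reshape(-1)
    l /= np.linalg.norm(l)
    for _ in range(iters):
        r = RH.T @ l
        r /= np.linalg.norm(r)
        l = RH @ r
        sigma = np.linalg.norm(l)
        l /= sigma
    return sigma * l.reshape(m, m), r.reshape(n, n)

# sanity check: exactly recover a matrix that is a Kronecker product
rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3)); A = A @ A.T   # PSD factors
B = rng.normal(size=(4, 4)); B = B @ B.T
H = np.kron(A, B)
L, R = optimal_kron(H, 3, 4)
print(np.linalg.norm(np.kron(L, R) - H))   # ~0
```

On a general PSD matrix H the iteration converges to the leading singular pair of the rearranged matrix, i.e., to the optimal Kronecker factors up to scaling between L and R.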
B ADDITIONAL EXPERIMENTAL RESULTS

Figure 4: Comparison of different approximations to the Gauss-Newton component of the Hessian on MNIST-2, CIFAR-5M (ResNet), and ImageNet (legend: Optimal Kronecker, Shampoo², Shampoo, K-FAC). The results show that Shampoo² achieves higher accuracy in capturing the structure of the true Hessian compared to both Shampoo and K-FAC. K-FAC shows competitive performance in some cases, but Shampoo² generally offers the better approximation.

B.1 VIT ARCHITECTURE

In this subsection, we present the results for a Vision Transformer (ViT) architecture trained on the CIFAR-5M dataset. This architecture features a patch size of 4, a hidden dimension of 512, an MLP dimension of 512, 6 layers, and 8 attention heads. For these experiments, we utilize three layers from the fourth transformer block: two layers from the MLP (referred to as "FFN Linear Layer 1" and "FFN Linear Layer 2") and the QK layer³ (referred to as "Q-K Projection Layer").

C EXPERIMENTS

Datasets and Architectures. We conducted experiments on three datasets: MNIST (LeCun et al., 1998), CIFAR-5M (Nakkiran et al., 2020), and ImageNet (Deng et al., 2009), using logistic regression, ResNet18 (He et al., 2016), and ConvNeXt-T (Liu et al., 2022) architectures, respectively. For MNIST, we subsampled two digits ({0, 1}) and trained a binary classifier, using its only layer, i.e., the weight matrix of the linear classifier, for computing the cosine similarities. For ResNet18 and ImageNet, we picked arbitrary layers. In particular, for ResNet18, we used one of the convolution layers within the first block (layer1.1.conv1 in ResNet18⁴). For ImageNet, we used the 1x1 convolutional layer within the second block of ConvNeXt-T (stages.2.1.pwconv1 in ConvNeXt-T⁵).

³The QK layer is separated from the V part of the layer, following a similar decomposition method to that described by Duvvuri et al. (2024).
⁴https://pytorch.org/vision/master/_modules/torchvision/models/resnet.html#resnet18

Figure 5: Analogue of Figure 1 for the ViT architecture and the CIFAR-5M dataset, for three layers of the network (FFN Linear Layer 1, FFN Linear Layer 2, Q-K Projection Layer; legend: Optimal Kronecker, Shampoo², Shampoo, K-FAC). For some of the figures we observe relatively larger gaps between Shampoo² and the optimal Kronecker approximation.

Table 1: Summary of Experimental Configurations. λ denotes weight decay and β₁ indicates momentum.

Dataset   | Architecture      | Optimizer | Batch Size | Steps | lr   | λ    | β₁
MNIST     | Linear Classifier | GD        | Full Batch | 25    | 0.01 | None | 0
CIFAR-5M  | ResNet18          | SGD       | 128        | 10000 | 0.02 | None | 0.9
ImageNet  | ConvNeXt-T        | AdamW     | 2048       | 50000 | 3e-3 | 5e-3 | 0.9

Cosine similarity estimation for H_GN. For estimating the Frobenius norm of H_GN, we used the identity
$\mathbb{E}_{v \sim N(0, I_d)}[v^\top H_{GN}^2 v] = \mathbb{E}_{v \sim N(0, I_d)}[\|H_{GN} v\|_2^2] = \|H_{GN}\|_F^2.$
Hessian-vector products with the Gauss-Newton component were performed using the DeepNet Hessian library provided by Papyan (2019). For estimating the cosine similarity between H_GN and its estimator $\tilde{H}_{GN}$, we used the following procedure:
1. Estimate $\|H_{GN}\|_F$ and calculate $\|\tilde{H}_{GN}\|_F$.
2. Define the scaled estimate $\tilde{S}_{GN} = \frac{\|H_{GN}\|_F}{\|\tilde{H}_{GN}\|_F} \tilde{H}_{GN}$.
3. $\text{Cos-sim}(H_{GN}, \tilde{H}_{GN}) = 1 - \frac{\|H_{GN} - \tilde{S}_{GN}\|_F^2}{2 \|H_{GN}\|_F^2}$, where the numerator is again estimated via Hessian-vector products.

⁵https://pytorch.org/vision/main/models/generated/torchvision.models.
convnext_tiny.html#torchvision.models.convnext_tiny

Note that in the above procedure, we can exactly calculate $\|\tilde{H}_{GN}\|_F$, as $\tilde{H}_{GN}$ is generally of Kronecker product form, with the two factors of size m × m and n × n, where m × n is the size of the weight matrix.

Cosine similarity estimation for H_Ada. We follow a similar recipe as before, but use a different method for computing the product $H_{Ada} v$. For a given time T, $H_{Ada} = \sum_{t=1}^{T} g_t g_t^\top$, and thus $H_{Ada} v = \sum_{t=1}^{T} (g_t^\top v)\, g_t$. We maintain a running estimate of this quantity for multiple random vectors v during a training run, and use it for estimating the product $H_{Ada} v$.

C.1 FIGURE DETAILS

The Optimal Kronecker method, wherever used, was computed with five rounds of power iteration, starting from the identity. For H = H_GN, the Hessian approximations Shampoo², Shampoo, and K-FAC were computed using sampled labels and a batch size of 1. For H = H_Ada and step t, we used the gradients encountered during the training run in steps ≤ t. K-FAC was computed with the reduce variant from Eschenhagen et al. (2023).

In Figure 2, the Optimal Kronecker legend represents the cosine similarity between the optimal Kronecker approximation of H_GN and H_GN. This is precisely equal to $\sigma_1 / \sqrt{\sum_i \sigma_i^2}$. Similarly, the label L (resp. R) represents the cosine similarity between the top left (resp. right) singular vector of $\hat{H}_{GN}$ and the estimate obtained after one round of power iteration starting from $I_n$ (resp. $I_m$). This is precisely equal to $\alpha_1 \sigma_1 / \sqrt{\sum_i \alpha_i^2 \sigma_i^2}$.

In Figure 3 (top), the Hessian approximation is calculated with batch size 1, i.e., |B| = 1 in Section 4.2. Similarly, in Figure 3 (bottom), |B| = 256.

D DEFERRED PROOFS

Lemma 6. If $V_1$ is positive-definite, then the $V_i$ for $i \ge 2$ are not PSD.

Proof. Consider two PSD matrices $M_1$ and $M_2$ with eigenvalue decompositions $M_1 = \sum_i \lambda_{1i} q_{1i} q_{1i}^\top$ and $M_2 = \sum_i \lambda_{2i} q_{2i} q_{2i}^\top$.
Then
$\operatorname{Tr}(M_1 M_2) = \sum_{i,j} \lambda_{1i} \lambda_{2j} \left( q_{1i}^\top q_{2j} \right)^2.$
Thus, if $M_1$ and $M_2$ have unit Frobenius norm and $M_1$ is positive definite, then $\operatorname{Tr}(M_1 M_2) > 0$. Hence, if $V_1$ is positive definite, then by orthogonality of successive singular vectors, the $V_i$ for $i \ge 2$ cannot be positive semi-definite.

Proposition 2. Consider the set $S_m$ of PSD matrices of unit Frobenius norm and dimension m. Then
$\frac{1}{\sqrt{m}} I_m = \arg\max_{M \in S_m} \min_{M' \in S_m} \langle \operatorname{vec}(M), \operatorname{vec}(M') \rangle.$

Proof. Consider the eigendecomposition of any $M \in S_q$, given by $\sum_{i=1}^{q} \lambda_i v_i v_i^\top$. Denote $A = \{ i : \lambda_i \le \frac{1}{\sqrt{q}} \}$. As $\sum_i \lambda_i^2 = 1$, we have $|A| \ge 1$. Consider any $j \in A$. Then
$\langle \operatorname{vec}(M), \operatorname{vec}(v_j v_j^\top) \rangle \le \frac{1}{\sqrt{q}},$
as $v_j$ is orthogonal to the other eigenvectors. Thus, we can see
$\max_{M \in S_q} \min_{M' \in S_q} \langle \operatorname{vec}(M), \operatorname{vec}(M') \rangle \le \frac{1}{\sqrt{q}}.$
Moreover, for the matrix $\frac{1}{\sqrt{q}} I_q$ and any matrix $M'$,
$\frac{1}{\sqrt{q}} \langle I_q, M' \rangle = \frac{\operatorname{tr}(M')}{\sqrt{q}},$
where $\operatorname{tr}(M')$ denotes the trace of the matrix $M'$. However, we know $\operatorname{tr}(M') = \sum_i \lambda_i \ge 1$ as $\sum_i \lambda_i^2 = 1$. Thus
$\frac{1}{\sqrt{q}} \langle I_q, M' \rangle = \frac{\operatorname{tr}(M')}{\sqrt{q}} \ge \frac{1}{\sqrt{q}}.$
Note that $\frac{1}{\sqrt{q}} I_q$ is the only matrix with this property, as any other matrix has at least one eigenvalue less than $\frac{1}{\sqrt{q}}$. Thus
$\frac{1}{\sqrt{q}} I_q = \arg\max_{M \in S_q} \min_{M' \in S_q} \langle \operatorname{vec}(M), \operatorname{vec}(M') \rangle.$

Lemma 7 (Implicitly in Liu et al. (2024); Osawa et al. (2023)). $|B|\, \mathbb{E}_{B,s}[G_{B,s} G_{B,s}^\top] = \mathbb{E}_{x,\, s \sim f(x)}[G_{x,s} G_{x,s}^\top]$.

Proof. Evaluating $G_{B,s} G_{B,s}^\top$, we get
$G_{B,s} G_{B,s}^\top = \frac{1}{|B|^2} \sum_{\substack{x, x' \in B \\ s = s[x],\, s' = s[x']}} G_{x,s} G_{x',s'}^\top.$
Taking the expectation over s for a given B, and using $\mathbb{E}_s[G_{x,s}] = 0$ to eliminate the cross terms, we get
$\mathbb{E}_s[G_{B,s} G_{B,s}^\top] = \frac{1}{|B|^2} \sum_{x \in B} \mathbb{E}_{s \sim f(x)}[G_{x,s} G_{x,s}^\top] = \frac{1}{|B|}\, \mathbb{E}_{x \sim B,\, s \sim f(x)}[G_{x,s} G_{x,s}^\top].$
Now taking an expectation over batches, we get
$|B|\, \mathbb{E}_{B,s}[G_{B,s} G_{B,s}^\top] = \mathbb{E}_{x,\, s \sim f(x)}[G_{x,s} G_{x,s}^\top].$

Lemma 8 (Grosse (2021)). Let B denote the batch and $G_B = \frac{1}{|B|} \sum_{(x,y) \in B} G_{x,y}$ denote the batch gradient. Then
$\mathbb{E}_B[G_B G_B^\top] = \frac{1}{|B|}\, \mathbb{E}_{x,y}[G_{x,y} G_{x,y}^\top] + \left(1 - \frac{1}{|B|}\right) \mathbb{E}_{x,y}[G_{x,y}]\, \mathbb{E}_{x,y}[G_{x,y}]^\top.$

Proof.
Evaluating $G_B G_B^\top$, we get
$G_B G_B^\top = \frac{1}{|B|^2} \sum_{(x,y), (x',y') \in B} G_{x,y} G_{x',y'}^\top.$
Taking the expectation over B on both sides, we get
$\mathbb{E}_B[G_B G_B^\top] = \frac{1}{|B|^2} \left( |B|\, \mathbb{E}_{x,y}[G_{x,y} G_{x,y}^\top] + (|B|^2 - |B|)\, \mathbb{E}_{x,y}[G_{x,y}]\, \mathbb{E}_{x,y}[G_{x,y}]^\top \right)$
$\implies \mathbb{E}_B[G_B G_B^\top] = \frac{1}{|B|}\, \mathbb{E}_{x,y}[G_{x,y} G_{x,y}^\top] + \left(1 - \frac{1}{|B|}\right) \mathbb{E}_{x,y}[G_{x,y}]\, \mathbb{E}_{x,y}[G_{x,y}]^\top.$

E TECHNICAL BACKGROUND ON THE HESSIAN

Gauss-Newton (GN) component of the Hessian. For a datapoint (x, y), let f(x) denote the output of a neural network and L(f(x), y) represent the training loss. Let $W \in \mathbb{R}^{m \times n}$ represent a weight matrix in the neural network and D denote the training distribution. Then, the Hessian of the loss with respect to W is given by
$\frac{\partial^2 L}{\partial W^2} = \left( \frac{\partial f}{\partial W} \right)^\top \frac{\partial^2 L}{\partial f^2} \frac{\partial f}{\partial W} + \frac{\partial L}{\partial f} \frac{\partial^2 f}{\partial W^2}.$
The first component, for standard losses like cross-entropy (CE) and mean squared error (MSE), is positive semi-definite and is generally known as the Gauss-Newton (GN) component $H_{GN}$. Previous works have shown that this part closely tracks the overall Hessian during neural network training (Sankar et al., 2021), and thus most second-order methods approximate the GN component. Denoting $\frac{\partial L(f(x), y)}{\partial W}$ by $G_{x,y} \in \mathbb{R}^{m \times n}$ and $g_{x,y} = \operatorname{vec}(G_{x,y})$, for CE loss it can also be shown that
$H_{GN} = \mathbb{E}_{(x,y) \sim D} \left[ \left( \frac{\partial f}{\partial W} \right)^\top \frac{\partial^2 L}{\partial f^2} \frac{\partial f}{\partial W} \right] = \mathbb{E}_{x \sim D_x,\, s \sim f(x)} \left[ g_{x,s} g_{x,s}^\top \right].$

F RELATED WORK

We have already discussed two closely related works, Koroko et al. (2023) and Ren & Goldfarb (2021), in Sections 3.1 and 3.2.2 respectively. We discuss them again below for completeness. Ren & Goldfarb (2021) study the Hessian perspective of Shampoo and show that, under the assumption that sampled gradients follow a tensor-normal distribution, the square of the Hessian estimate of Shampoo is perfectly correlated with H_GN. We show the same result under much weaker conditions in Corollary 2. Moreover, in Proposition 1 we show that, in general, the square of the Hessian estimate of Shampoo is closely related to the optimal Kronecker product approximation of H_GN.
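The Gauss-Newton identity from Appendix E, $H_{GN} = \mathbb{E}_{x,\, s \sim f(x)}[g_{x,s} g_{x,s}^\top]$ with labels sampled from the model's own output distribution, can be checked in closed form for a linear softmax classifier, where the per-input Fisher block is $(\operatorname{diag}(p) - p p^\top) \otimes x x^\top$. A toy numpy sketch (all names ours; row-major vec convention; a single input x for clarity):

```python
import numpy as np

rng = np.random.default_rng(0)
k, d, draws = 3, 2, 200_000

W = rng.normal(size=(k, d))
x = rng.normal(size=d)
logits = W @ x
p = np.exp(logits - logits.max())
p /= p.sum()

# closed form for a linear softmax model and one input x:
# E_{s~p}[g g^T] = (diag(p) - p p^T) kron (x x^T), with g = vec(G) row-major
exact = np.kron(np.diag(p) - np.outer(p, p), np.outer(x, x))

# Monte Carlo: sample labels s from the model's predictive distribution
# and average the gradient outer products; the CE gradient wrt W for
# sampled label s is (p - e_s) x^T
s = rng.choice(k, size=draws, p=p)
E = np.eye(k)[s]                              # one-hot sampled labels
G = (p - E)[:, :, None] * x[None, None, :]
g = G.reshape(draws, -1)
mc = g.T @ g / draws
print(np.abs(mc - exact).max())               # ~0 up to Monte Carlo error
```

Replacing the sampled labels s with a fixed real label y in this sketch yields the empirical-Fisher variant discussed in Section 4.2.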
We additionally study the approximations used by Shampoo to make it computationally efficient (Section 4), as well as the Adagrad perspective on Shampoo's preconditioner. Loan & Pitsianis (1993) develop the theory of the optimal Kronecker product approximation of a matrix (in Frobenius norm). Koroko et al. (2023) use it to find layer-wise optimal Kronecker product approximations of H_GN for a network without weight sharing. We extend their technique to networks with weight sharing, and show that the square of the Hessian estimate of Shampoo is nearly equivalent to the optimal Kronecker product approximation of H_GN. Another relevant work is Yao et al. (2021), which introduces AdaHessian, an adaptive second-order optimizer that combines stochastic Hessian diagonal approximations with Adam-style momentum and weighted averaging.

F.1 OTHER RELATED WORKS

The literature on second-order optimization within deep learning is very rich, with methods that can be broadly classified into Hessian-free methods and methods based on estimating the preconditioner H (which could refer to either H_Ada or H_GN). Hessian-free methods (Martens, 2010) generally approximate the preconditioned step (for Newton's method) using Hessian-vector products, but do not maintain an explicit form of the Hessian. Preconditioner-estimation methods (Martens & Grosse, 2015; Gupta et al., 2018) maintain an explicit form of the preconditioner that can be efficiently stored as well as estimated.

F.2 HESSIAN-FREE

One of the seminal works on second-order optimization within deep learning was the introduction of Hessian-free optimization (Martens, 2010). The work demonstrated the effectiveness of using conjugate gradient (CG) to approximately solve the Newton step on multiple auto-encoder and classification tasks.
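In its simplest form, this approach solves the Newton system H p = g with CG while touching the curvature matrix only through Hessian-vector products. A minimal sketch (ours; in practice the hvp callback would be computed by autodiff, e.g. a double backward pass, rather than from an explicit matrix):

```python
import numpy as np

def conjugate_gradient(hvp, g, iters=50, tol=1e-16):
    # approximately solve H p = g for the Newton step, using only
    # Hessian-vector products hvp(v); H is never formed explicitly
    p = np.zeros_like(g)
    r = g.copy()          # residual g - H p (p starts at 0)
    d = r.copy()
    rs = r @ r
    for _ in range(iters):
        if rs < tol:
            break
        Hd = hvp(d)
        alpha = rs / (d @ Hd)
        p += alpha * d
        r -= alpha * Hd
        rs_new = r @ r
        d = r + (rs_new / rs) * d
        rs = rs_new
    return p

# toy check with an explicit SPD matrix standing in for the curvature
rng = np.random.default_rng(0)
M = rng.normal(size=(20, 20))
H = M @ M.T + np.eye(20)     # positive definite
g = rng.normal(size=20)
p = conjugate_gradient(lambda v: H @ v, g)
print(np.linalg.norm(H @ p - g))   # ~0
```

Each CG iteration costs one Hessian-vector product, which is why this family of methods scales to models where H itself could never be stored.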
Multiple works (Martens & Sutskever, 2011; Cho et al., 2015) have extended this algorithm to other architectures such as recurrent and multidimensional recurrent networks. A recent work (Garcia et al., 2023) also takes motivation from this line of work, approximately using a single CG step for every update, while maintaining a closed form for the inverse of the Hessian so that the single step remains effective. Other recent works (Li, 2018; 2024; Pooladzandi & Li, 2024) have focused on designing iterative preconditioners to improve convergence, specifically for stochastic optimization algorithms.

F.3 ESTIMATING THE PRECONDITIONER

Given that it is costly to store the entire matrix H, various works have tried to estimate H layer-wise. K-FAC (Martens & Grosse, 2015) was one of the first works that went beyond a diagonal approximation, making a Kronecker product approximation to the layer-wise H_GN. It showed that this structure approximately captures the per-layer Hessian for MLPs. This approximation was extended to convolutional (Osawa et al., 2019) and recurrent (Martens et al., 2018) architectures. Subsequent works further improved the Hessian approximation by fixing the trace (Gao et al., 2021) as well as the diagonal estimates (George et al., 2018; Gao et al., 2020) of the approximation. A recent work (Eschenhagen et al., 2023) also demonstrated that K-FAC can be extended to large-scale training. From the viewpoint of approximating Adagrad (Duchi et al., 2011), Gupta et al. (2018) introduced Shampoo, which also makes a Kronecker product approximation, to H_Ada. A subsequent work (Ren & Goldfarb, 2021) introduced a modification of Shampoo that precisely estimates the layer-wise H_GN under certain distributional assumptions. Other works (Anil et al., 2021) introduced a distributed implementation of Shampoo, which has recently shown impressive performance for training large-scale networks (Shi et al., 2023).
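For concreteness, a single (heavily simplified) Shampoo update for a matrix-shaped parameter can be sketched as follows. This is our own minimal numpy rendering of the Gupta et al. (2018) scheme on a toy quadratic loss; practical implementations add grafting, epsilon-regularization schedules, and infrequent inverse-root recomputation:

```python
import numpy as np

def inv_fourth_root(M, eps=1e-8):
    # (M + eps I)^{-1/4} for a symmetric PSD matrix, via eigendecomposition
    w, V = np.linalg.eigh(M)
    return V @ np.diag((np.clip(w, 0.0, None) + eps) ** -0.25) @ V.T

def shampoo_step(W, G, L, R, lr=0.3):
    # accumulate the two Kronecker-factor statistics, then precondition:
    # W <- W - lr * L^{-1/4} G R^{-1/4}
    L = L + G @ G.T               # m x m left statistic
    R = R + G.T @ G               # n x n right statistic
    W = W - lr * inv_fourth_root(L) @ G @ inv_fourth_root(R)
    return W, L, R

# usage on a toy quadratic loss f(W) = 0.5 * ||W - W_star||_F^2
rng = np.random.default_rng(0)
m, n = 3, 4
W_star = rng.normal(size=(m, n))
W = np.zeros((m, n))
L, R = np.zeros((m, m)), np.zeros((n, n))
for _ in range(300):
    G = W - W_star                # gradient of the toy loss
    W, L, R = shampoo_step(W, G, L, R)
print(np.linalg.norm(W - W_star))  # should fall well below ||W_star||
```

The combined preconditioner is $L^{-1/4} \otimes R^{-1/4} = (L \otimes R)^{-1/4}$, which is precisely the Kronecker-structured approximation whose square this paper relates to the optimal Kronecker approximation of H.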
Recently, another paper (Duvvuri et al., 2024) proposed a modification of Shampoo, demonstrating empirically and theoretically that the new estimator approximates H_Ada better than Shampoo's approximation. Our work shows that the square of Shampoo's approximation of H_Ada is nearly equivalent to the optimal Kronecker approximation.

G COMPARISON WITH THE EXTRA SQUARE ROOT IN ADAGRAD-BASED APPROACHES

Multiple previous works (Balles et al., 2020; Lin et al., 2024) have tried to address the question of why Adagrad-based approaches like Adam and Shampoo have an extra square root in their updates compared to the Hessian inverse. This question concerns the final update applied to the weights during the optimization procedure, once the Hessian has been approximated. The primary contribution of this work is completely orthogonal to that question: we address the optimal Kronecker approximation of the Hessian and its connection to Shampoo's Hessian approximation, which is independent of the Hessian power used in the final update.