# A Computationally Efficient Sparsified Online Newton Method

Devvrit (Department of Computer Science, The University of Texas at Austin, devvrit.03@gmail.com); Sai Surya Duvvuri (Department of Computer Science, The University of Texas at Austin, subramanyamdvss@gmail.com); Rohan Anil (Google DeepMind, rohananil@google.com); Vineet Gupta (Google, vineet@google.com); Cho-Jui Hsieh (CS Department, UCLA & Google, chohsieh@cs.ucla.edu); Inderjit Dhillon (Google, isd@google.com)

(Devvrit and Sai Surya Duvvuri contributed equally; work done while at Google. 37th Conference on Neural Information Processing Systems (NeurIPS 2023).)

Second-order methods hold significant promise for enhancing the convergence of deep neural network training; however, their large memory and computational demands have limited their practicality. Thus there is a need for scalable second-order methods that can efficiently train large models. In this paper, we introduce the Sparsified Online Newton (SONew) method, a memory-efficient second-order algorithm that yields a sparsified yet effective preconditioner. The algorithm emerges from a novel use of the LogDet matrix divergence measure; we combine it with sparsity constraints to minimize regret in the online convex optimization framework. Empirically, we test our method on large-scale benchmarks of up to 1B parameters. We achieve up to 30% faster convergence, 3.4% relative improvement in validation performance, and 80% relative improvement in training loss, in comparison to memory-efficient optimizers including first-order methods. Powering the method is a surprising fact: imposing structured sparsity patterns, such as tridiagonal and banded structure, requires little to no overhead, making the method as efficient and parallelizable as first-order methods. In wall-clock time, tridiagonal SONew is only about 3% slower per step than first-order methods but gives overall gains due to much faster convergence. In contrast, one of the state-of-the-art (SOTA) memory-intensive second-order methods, Shampoo, is unable to scale to large benchmarks. Additionally, while Shampoo necessitates significant engineering efforts to scale to large benchmarks, SONew offers a more straightforward implementation, increasing its practical appeal. SONew code is available at: https://github.com/devvrit/SONew

1 Introduction

Stochastic first-order methods, which use the negative gradient direction to update parameters, have become the standard for training deep neural networks (DNNs). Gradient-based preconditioning involves finding an update direction by multiplying the gradient with a preconditioner matrix carefully chosen from gradients observed in previous iterations, in order to improve convergence. (Full-matrix) Adagrad [15], the online Newton method [25], and natural gradient descent [3] use a full-matrix preconditioner, but computing and storing the full matrix is infeasible when there are millions of parameters. Thus, diagonal versions such as diagonal Adagrad, Adam [33], and RMSprop [28] are now widely used to train DNNs due to their scalability. Several higher-order methods have previously been applied to deep learning ([24, 5, 23, 38]). All these methods use Kronecker product factorizations that reduce computational and storage costs to make them feasible for training neural networks. However, to precondition a $d_1 \times d_2$ parameter matrix, these methods require matrix inverse operations, which take $O(d_1^3 + d_2^3)$ time and $O(d_1^2 + d_2^2)$ space.
In comparison, first-order methods use $O(d_1 d_2)$ time and memory, which is linear in the number of parameters. For instance, when $d_1 = k d_2$, the memory used by Shampoo, $d_1^2 + d_2^2$ floating-point numbers, is $O(k)$ times the number of parameters, which can be arbitrarily large depending on $k$. This calls for further research into efficient second-order optimization techniques that train DNNs with memory and time complexity linear in the number of parameters.

In this paper, we present a novel Sparsified Online Newton (SONew) method, which requires only linear time and space complexity, to train large-scale DNNs. We derive the algorithm through two steps: a classical regret analysis followed by a sparsification step. In more detail, regret analysis when using a preconditioner reveals that the error is bounded by two terms, the first depending on the change in the preconditioning matrix and the second on the generalized gradient norm (see Section 3 for more details). We take a novel approach of minimizing the second term while regularizing two successive preconditioners to be close in the LogDet matrix divergence measure [34] (see Section 3 for the intuition behind choosing the LogDet divergence). This analysis naturally yields an online Newton method [25]. To make it computationally efficient, we further sparsify the preconditioner by finding a sparse approximation that is close in LogDet divergence. Thus we are consistent in using the same measure (LogDet divergence) in both the regularization and sparsification steps. This gives us our SONew method, which achieves linear complexity by leveraging structured sparsity patterns, such as tridiagonal and banded patterns, in the preconditioner. This is unlike most existing online Newton methods, which require quadratic space and cubic time complexity. By making each step linear time, SONew can be applied to train modern DNNs as efficiently as first-order methods. Further, our method is embarrassingly parallelizable, making the overhead of computing the preconditioner negligible. We also show that introducing sparsity allows us to reduce the condition number of the problem dynamically to improve numerical stability. We strengthen the relationship between sparse LogDet divergence minimization and online convex optimization by establishing an optimal $O(\sqrt{T})$ regret upper bound for the tridiagonal sparsity pattern.

In our experiments on an MLP autoencoder and a Graph Neural Network (GNN), we found that our method outperformed first-order methods in terms of training loss within the same training time, while Shampoo (a second-order method) takes significantly longer. In our experiments with Vision Transformers on ImageNet and a GNN on OGBG-molpcba, we achieve a target validation performance using 10% and 30% fewer iterations, respectively, compared to Adam, the SOTA optimizer for both benchmarks. Furthermore, using the same number of iterations as Adam, we observe 0.7% and 3.4% relative improvement in validation performance for ViT and GNN, respectively. From an optimization point of view, SONew achieves 9% and 80% better relative training loss for ViT and GNN, respectively. It is worth noting that Shampoo's statistics required 7x #params for ViT, whereas tridiag-SONew uses only 2x #params for its statistics. We also test another recently proposed memory-efficient second-order optimizer, rfdSON [37], but found its performance suboptimal compared to the best-performing first-order method.
Owing to SONew's scalability, we train a Large Language Model (LLM) with 1 billion parameters and compare it with AdaFactor [45], a popular first-order optimizer for training LLMs [11]. SONew achieves the same performance as AdaFactor using 26% fewer steps, resulting in 1.35x faster training. When using the same number of steps, SONew obtains a 1.7% relatively better train loss. In terms of implementation, SONew is just a few lines of code (Equation (13)) without complex engineering challenges, rendering it even more useful and practical.

2 Background

The inner product between matrices is defined as $\langle A, B\rangle = \mathrm{Tr}(A^T B)$, where $\mathrm{Tr}(\cdot)$ denotes the matrix trace. The Frobenius norm of a matrix $A$ is $\|A\|_F = \sqrt{\mathrm{Tr}(A^T A)}$, while its spectral norm is $\|A\|_2 = \max_x \|Ax\|_2 / \|x\|_2$. We use $I_n \in \mathbb{R}^{n \times n}$ to denote the identity matrix. We use $\mathbb{S}_n$ and $\mathbb{S}_n^{++}$ to denote the sets of symmetric and positive definite matrices, respectively. The generalized norm of a vector $x \in \mathbb{R}^n$ with respect to a matrix $A \in \mathbb{S}_n^{++}$ is defined as $\|x\|_A = \sqrt{x^T A x}$. We use $\det(A)$ to denote the determinant of a matrix $A$, and $\mathrm{diag}(A)$ to denote the diagonal matrix with $\mathrm{diag}(A)_{ii} = A_{ii}$. We use $G$ and $G'$ to denote a graph and its subgraph, with vertex set $[n] = \{1, \ldots, n\}$. Let $E_G$ denote the set of edges of graph $G$, and $\mathrm{neig}_G(i)$ the neighbours of vertex $i$ in $G$. A sparse symmetric matrix $A \in \mathbb{R}^{n \times n}$ follows the sparsity structure of graph $G$ if $A_{i,j} = 0\ \forall (i,j) \notin E_G$. Note that the set of all such matrices forms a linear subspace. We use $\mathbb{S}_n(G)^{++}$ to denote the set of positive definite matrices with sparsity structure given by graph $G$, i.e., if $X \in \mathbb{S}_n(G)^{++}$ then $X_{ij} = 0\ \forall (i,j) \notin E_G$; $\mathbb{S}_n(G)^{++}$ is an open convex set. Given an index set $I = \{i_1, i_2, \ldots\} \subseteq [n]$, we use $A_{II}$ to denote the corresponding principal submatrix of $A$.

2.1 LogDet matrix divergence

Let $\phi : \mathbb{S}_n^{++} \to \mathbb{R}$ be a strictly convex, differentiable function. The Bregman matrix divergence between $X, Y \in \mathbb{S}_n^{++}$ is defined as [8, 34]: $D_\phi(X, Y) = \phi(X) - \phi(Y) - \mathrm{Tr}(\nabla\phi(Y)^T (X - Y))$. Since $\phi$ is convex, $D_\phi(X, Y) \ge 0$ for all $X, Y \in \mathbb{S}_n^{++}$. For example, if $\phi(X) = \|X\|_F^2$, the corresponding Bregman divergence $D_\phi(X, Y) = \|X - Y\|_F^2$ is the squared Frobenius distance. In this paper, we extensively use the convex function $\phi(X) = -\log\det(X)$; the corresponding divergence measure $D_{\ell d}(X, Y)$ is called the LogDet matrix divergence:
$$D_{\ell d}(X, Y) = -\log\det(XY^{-1}) + \mathrm{Tr}(XY^{-1}) - n. \quad (1)$$
The LogDet divergence is scale invariant under invertible matrices $A$, i.e., $D_{\ell d}(A^T X A, A^T Y A) = D_{\ell d}(X, Y)$. The LogDet divergence can also be written in terms of the eigendecompositions $X = V\Sigma V^T$ and $Y = U\Theta U^T$ [34]:
$$D_{\ell d}(X, Y) = \sum_{i,j} (v_i^T u_j)^2 \left(\frac{\sigma_i}{\theta_j} - \log\frac{\sigma_i}{\theta_j} - 1\right). \quad (2)$$
These two properties are used later in Section 3 to highlight the significance of the LogDet divergence in our algorithm.

3 SONew: Sparsified Online Newton Method

We now present our proposed algorithm, SONew.

3.1 Regret minimization via LogDet divergence

We set up our problem in the online convex optimization (OCO) framework [43, 26], where at each round the learner makes a prediction $w_t$ and receives a convex loss $f_t(w_t)$ and gradient $g_t = \nabla f_t(w_t)$ as feedback. The goal of the learner is to reduce the regret $R_T$ by predicting $w_t$ so that a low aggregate loss $\sum_{t=1}^T f_t(w_t)$ is achieved compared to the best possible, $w^* = \arg\min_w \sum_{t=1}^T f_t(w)$. Formally, the regret is
$$R_T(w_1, \ldots, w_T) = \sum_{t=1}^T f_t(w_t) - \sum_{t=1}^T f_t(w^*).$$
Using [10], a regret of $R$ in the online setting yields an $R/T$ convergence rate in the stochastic setting. To upper bound this regret, we proceed as in [26] by analyzing the error in the iterates for the update $w_{t+1} := w_t - \eta X_t g_t$, where $X_t \in \mathbb{R}^{n \times n}$.
Then
$$\|w_{t+1} - w^*\|^2_{X_t^{-1}} = \|w_t - \eta X_t g_t - w^*\|^2_{X_t^{-1}} = \|w_t - w^*\|^2_{X_t^{-1}} + \eta^2 g_t^T X_t g_t - 2\eta (w_t - w^*)^T g_t.$$
The convexity of $f_t$ implies that $f_t(w_t) - f_t(w^*) \le (w_t - w^*)^T g_t$, leading to
$$f_t(w_t) - f_t(w^*) \le \frac{1}{2\eta}\left(\|w_t - w^*\|^2_{X_t^{-1}} - \|w_{t+1} - w^*\|^2_{X_t^{-1}} + \eta^2 g_t^T X_t g_t\right).$$
Summing over all $t \in [T]$ and rearranging reveals the following upper bound on the overall regret:
$$R_T \le \frac{1}{2\eta}\|w_1 - w^*\|^2_{X_1^{-1}} + \frac{\eta}{2}\sum_{t=1}^T g_t^T X_t g_t + \frac{1}{2\eta}\sum_{t=2}^T (w_t - w^*)^T \left(X_t^{-1} - X_{t-1}^{-1}\right)(w_t - w^*). \quad (3)$$
Since $w^*$ is unknown, finding an $X_t$ which minimizes (3) is infeasible. So to minimize the regret, we attempt to minimize the second term in (3) while regularizing $X_t^{-1}$ to be close to $X_{t-1}^{-1}$. The nearness measure we choose is the LogDet matrix divergence, leading to the following objective:
$$X_t = \arg\min_{X \in \mathbb{S}_n^{++}} g_t^T X g_t, \quad \text{such that } D_{\ell d}(X, X_{t-1}) \le c_t, \quad (4)$$
where $D_{\ell d}$ is as in (1).

Why do we use the LogDet divergence? From (2), due to the term $\sigma_i/\theta_j$, $D_{\ell d}(X, X_{t-1})$ prioritizes matching the smaller eigenvalues of $X_{t-1}$ with those of $X$, i.e., matching the larger eigenvalues of $X_{t-1}^{-1}$ and $X^{-1}$. As a consequence, the LogDet divergence regularizes $X$ by matching its large eigenvalues with those of $X_{t-1}$. For example, if the smallest and largest eigenvalues of $X_{t-1}$ are $\theta_n$ and $\theta_1$, then for an eigenvalue $\sigma$ of $X$ that is larger than both, the penalty in (2) associated with $\theta_n$ is higher than that associated with $\theta_1$: $(\sigma/\theta_n - \log(\sigma/\theta_n) - 1) > (\sigma/\theta_1 - \log(\sigma/\theta_1) - 1)$. This intuition leads us to formulate (4) as our objective. We recall that there is precedence for using the LogDet divergence in the optimization literature; indeed, the celebrated BFGS algorithm [9, 17, 22, 44] can be shown to be the unique solution obtained when the LogDet divergence between successive preconditioners, subject to a secant constraint, is minimized (as shown in the 4-page paper by [18]).

The optimization problem in (4) is convex in $X$, since the LogDet divergence is convex in its first argument. The Lagrangian is
$$\mathcal{L}(X, \lambda_t) = g_t^T X g_t + \lambda_t (D_{\ell d}(X, X_{t-1}) - c_t) = \mathrm{Tr}(X g_t g_t^T) + \lambda_t\left(-\log\det(X X_{t-1}^{-1}) + \mathrm{Tr}(X X_{t-1}^{-1}) - n\right) - \lambda_t c_t.$$
Setting $\nabla_X \mathcal{L}(X, \lambda_t) = 0$ and using the fact that $\nabla \log\det(X) = X^{-1}$, we get the following update rule:
$$X_t^{-1} = X_{t-1}^{-1} + g_t g_t^T / \lambda_t. \quad (5)$$
We emphasize that the update rule (5) arises naturally from our novel use of the LogDet divergence to minimize the regret. Moreover, Equation (5) can be seen as a general update rule applicable to numerous existing optimizers. For example, setting $c_t = 0$ (equivalently $\lambda_t = \infty$) for all $t$ in (4) results in no change to the preconditioner in any round; in this case, with $X_0 = I_n$, we recover online gradient descent [54]. On the other hand, setting $\lambda_t = 1$ gives the update rule of the online Newton method [25]. Our update rule differs from (full-matrix) Adagrad [15], which has $X_t^{-2} = X_{t-1}^{-2} + g_t g_t^T$. Maintaining and updating $X_t$ as in (5) is possible using the Sherman–Morrison formula but requires $O(n^2)$ storage and time, which becomes impractical when $n$ is in the order of millions, as is typically the case for DNNs.

3.2 Sparsifying the Preconditioner

To minimize the memory needed for maintaining and updating $X_t$ using (5), we adopt the strategy of sparsifying the preconditioner. For existing optimizers such as (full-matrix) Adagrad or the online Newton method, it is unclear how to sparsify a given preconditioner; specifically, there is no intuitive way to assess the quality of a sparse preconditioner compared to the full-matrix preconditioner. However, since our update rule (5) originates from using the LogDet divergence in the regret bound analysis, it gives us a natural metric to measure the quality of a sparse preconditioner.
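Before sparsifying, it helps to make the cost of the dense rule concrete. The following is a minimal JAX sketch, under our own naming and not the paper's released code, of one step of the dense update (5) maintained via the Sherman–Morrison formula; its $O(n^2)$ storage and per-step time is exactly what the sparsification below avoids.

```python
# Dense online Newton step (illustrative sketch, not the SONew implementation):
# X_t^{-1} = X_{t-1}^{-1} + g g^T / lam, maintained directly on X via Sherman-Morrison.
import jax.numpy as jnp

def dense_online_newton_step(X, w, g, lam=1.0, lr=1e-3):
    """One update w <- w - lr * X_t g; X is the full n x n preconditioner."""
    Xg = X @ g                                      # O(n^2)
    X_new = X - jnp.outer(Xg, Xg) / (lam + g @ Xg)  # Sherman-Morrison update, O(n^2)
    w_new = w - lr * (X_new @ g)                    # O(n^2)
    return X_new, w_new

# Storing X alone needs n^2 floats, e.g. roughly 4 * 10^12 bytes for n = 10^6 in float32.
```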
Let us consider the following problem: find a sparse positive definite $X$ with $\|X\|_0 \le \alpha n$, $\alpha > 1$, such that the objective $D_{\ell d}(X, (X_{t-1}^{-1} + g_t g_t^T/\lambda_t)^{-1})$ is minimized. Essentially, this problem imposes a sparsity constraint while requiring the sparse preconditioner to remain close to the full-matrix preconditioner in terms of LogDet divergence. Due to the $\ell_0$-norm constraint this is a non-convex problem, which makes it difficult to solve exactly. Since the $\ell_1$-norm serves as a convex relaxation of the $\ell_0$-norm, we could use it instead, resulting in the following optimization problem, also known as the graphical lasso estimator [19]:
$$\min_{X \in \mathbb{S}_n^{++}} D_{\ell d}\left(X, (X_{t-1}^{-1} + g_t g_t^T/\lambda_t)^{-1}\right) + \gamma \|X\|_1.$$
However, the time taken to solve the above problem, even with the current best methods [7, 29, 16, 53], can still be too large (these methods take several minutes for a matrix of size one million), making it impractical to embed in DNN training. In this paper, we take a different direction and use fixed sparsity pattern constraints, specified by a fixed undirected graph $G$. To sparsify the solution in (5), we formulate the subproblem
$$X_t = \arg\min_{X \in \mathbb{S}_n(G)^{++}} D_{\ell d}\left(X, (X_{t-1}^{-1} + g_t g_t^T/\lambda_t)^{-1}\right), \quad (6)$$
where $\mathbb{S}_n(G)^{++}$ denotes the set of positive definite matrices with the fixed sparsity pattern corresponding to the adjacency matrix of graph $G$. Note that both steps (4) and (6) use the same LogDet measure. Owing to the structure of the LogDet divergence, (6) can surprisingly be solved in $O(n)$ time, in an easily parallelizable manner, for certain sparsity structures $G$. Algorithms 1 and 2 present an instantiation of the proposed SONew method, which solves (6) using $O(n)$ time and memory for banded matrices with band size $b$. In particular, a tridiagonal matrix, corresponding to a chain graph, is a banded matrix with band size 1.

Algorithm 1 Sparsified Online Newton (SONew) Algorithm
Inputs: $\lambda_t$ := coefficient in the update (10); $G$ := sparsity graph (banded/tridiagonal); $\epsilon$ := damping parameter; $T$ := total number of iterations/mini-batches; $\eta_t$ := step size/learning rate. Output: $w_{T+1}$.
1: $H_0 = \epsilon I_n$, $w_1 = 0$
2: for $t \in \{1, \ldots, T\}$ do
3:   compute $g_t = \nabla f_t(w_t)$
4:   $H_t := H_{t-1} + P_G(g_t g_t^T/\lambda_t) \in \mathbb{S}_n(G)$, with $P_G$ as in (8)   ($O(n)$ time and memory)
5:   get $L, D$ = SPARSIFIED_INVERSE($H_t$, $G$), where $X_t = LDL^T$ solves (11)
6:   compute the descent direction $u_t = LDL^T g_t$
7:   $w_{t+1} = w_t - \eta_t u_t$
8: end for
9: return $w_{T+1}$

Algorithm 2 SPARSIFIED_INVERSE($H$, $G$) in $O(n)$ flops
Inputs: $H \in \mathbb{S}_n(G)$ as in (10); $G$ := banded graph of band size $b \ll n$. Outputs: lower triangular banded $L \in \mathbb{R}^{n \times n}$ and diagonal matrix $D \in \mathbb{R}^{n \times n}$.
1: function SPARSIFIED_INVERSE($H$, $G$)
2:   $L := 0$, $D := 0$
3:   $L_{jj} := 1$ for all $j \in [n]$
4:   for $j \in \{1, \ldots, n\}$ do   (parallelizable)
5:     let $H_{I_j j}$ and $H_{I_j I_j}$ be defined as in Section 2, where $I_j = \{j+1, \ldots, j+b\} \cap [n]$
6:     solve for $L_{I_j j}$ in the linear system $H_{I_j I_j} L_{I_j j} = -H_{I_j j}$   ($O(b^3)$ time)
7:     $D_{jj} := 1/(H_{jj} + H_{I_j j}^T L_{I_j j})$
8:   end for
9:   return $L, D$
10: end function

Maintaining $H_t \in \mathbb{S}_n(G)$ in line 4. Solving the subproblem (6) naively is impractical since $X_{t-1}^{-1}$ is a dense matrix. However, the structure of the LogDet divergence comes to the rescue: the optimization problem in (6) can be expanded as
$$\arg\min_{X \in \mathbb{S}_n(G)^{++}} -\log\det(X) + \mathrm{Tr}\left(X (X_{t-1}^{-1} + g_t g_t^T/\lambda_t)\right). \quad (7)$$
Let us define the projection onto $\mathbb{S}_n(G)$, $P_G : \mathbb{R}^{n \times n} \to \mathbb{R}^{n \times n}$, as
$$P_G(M)_{ij} = \begin{cases} M_{ij} & \text{if } (i,j) \in E_G, \\ 0 & \text{otherwise.} \end{cases} \quad (8)$$
Note that the $\mathrm{Tr}(\cdot)$ term in (7) depends only on the non-zero elements of $X \in \mathbb{S}_n(G)^{++}$, since $\mathrm{Tr}(AB) = \langle A, B \rangle$ for symmetric matrices $A$ and $B$.
Hence, (7) can be written as
$$X_t = \arg\min_{X \in \mathbb{S}_n(G)^{++}} -\log\det(X) + \left\langle X,\, P_G(X_{t-1}^{-1} + g_t g_t^T/\lambda_t)\right\rangle. \quad (9)$$
Computing the entire matrix $X_{t-1}^{-1}$ can be avoided by analyzing the optimality condition of (9). Let $g(X) = -\log\det(X) + \langle X, P_G(X_{t-1}^{-1} + g_t g_t^T/\lambda_t)\rangle$ denote the objective function in (9); then the optimality condition of (9) is $P_G(\nabla g(X)) = 0$, since the gradients with respect to the nonzero entries of $X$ should be zero: $\partial g(X)/\partial X_{i,j} = (\nabla_X g(X))_{i,j} = 0$ for all $(i,j) \in E_G$. Using $\nabla \log\det(X) = X^{-1}$, $\nabla_X \langle X, Y\rangle = Y$, and setting $X = X_t$ gives:
$$P_G(X_t^{-1}) - P_G(X_{t-1}^{-1} + g_t g_t^T/\lambda_t) = 0 \;\Longrightarrow\; H_t = H_{t-1} + P_G(g_t g_t^T/\lambda_t), \quad \text{where } H_t = P_G(X_t^{-1}). \quad (10)$$
Thus we only need to maintain $H_t = P_G(X_t^{-1})$, updated as $H_t = H_{t-1} + P_G(g_t g_t^T/\lambda_t)$. Since $H_t \in \mathbb{S}_n(G)$, the update can be done in $O(|E_G|)$ memory and time, whereas computing the full matrix $X_t^{-1}$ would cost $O(n^2)$. In SONew (Algorithm 1), this key observation is used to maintain $H_t$ in line 4.

Computing $X_t$ in line 5. Now that $H_t$ is known at every round $t$, we can replace $P_G(X_{t-1}^{-1} + g_t g_t^T/\lambda_t)$ in (9) with $H_t$:
$$X_t = \arg\min_{X \in \mathbb{S}_n(G)^{++}} -\log\det(X) + \mathrm{Tr}(X H_t). \quad (11)$$
For an arbitrary graph $G$, solving (11) might be difficult. Theorems 3.1 and 3.2 give embarrassingly parallelizable explicit solutions to the subproblem (11) for tridiagonal and banded sparsity patterns.

Theorem 3.1 (Explicit solution of (11) for tridiagonal structures/chain graph). Let the sparsity structure $G$ be a chain with edges $E_G = \{(i,j) : |i-j| \le 1,\ 1 \le i,j \le n\}$. Also, let $H \in \mathbb{S}_n(G)$ be such that any submatrix of $H$ corresponding to a complete subgraph of $G$ is positive definite. Then the solution of (11) is given by $\hat X = LDL^T$, where the unit lower bidiagonal matrix $L$ and the diagonal matrix $D$ have the following non-zero entries:
$$L_{jj} = 1, \quad L_{j+1,j} = -\frac{H_{j+1,j}}{H_{j+1,j+1}}, \quad D_{jj}^{-1} = H_{jj} - \frac{H_{j+1,j}^2}{H_{j+1,j+1}} \ \ (j \le n-1), \quad D_{nn}^{-1} = H_{nn}. \quad (12)$$
Computing this explicit solution involves parallelizable operations on $2 \times 2$ principal submatrices of the tridiagonal matrix $H$ (which accumulates the tridiagonal projection $P_G(g g^T)$ of the gradient outer products, with entries $g_i g_j$). For a $3 \times 3$ example, the factors of $\hat X = LDL^T$ are
$$L = \begin{pmatrix} 1 & 0 & 0 \\ -\frac{H_{21}}{H_{22}} & 1 & 0 \\ 0 & -\frac{H_{32}}{H_{33}} & 1 \end{pmatrix}, \qquad D^{-1} = \begin{pmatrix} H_{11} - \frac{H_{21}^2}{H_{22}} & 0 & 0 \\ 0 & H_{22} - \frac{H_{32}^2}{H_{33}} & 0 \\ 0 & 0 & H_{33} \end{pmatrix},$$
and each nonzero entry depends only on a $2 \times 2$ principal submatrix of $H$. These operations take $O(n)$ time and memory, and the descent direction can be found as $X_t g_t = L(D(L^T g_t))$, which also takes $O(n)$ time due to the unit lower bidiagonal structure of $L$; furthermore, these operations are easily parallelized. We also generalize the explicit solution to banded sparsity structures with band size $b$.

Theorem 3.2 (Explicit solution of (11) for banded structures). Let the sparsity pattern $G$ be a banded matrix of band size $b$, i.e. $E_G = \{(i,j) : |i-j| \le b,\ 1 \le i,j \le n\}$. For every vertex $j$, let $I_j = \{j+1, \ldots, j+b\} \cap [n]$. Then $X_t = LDL^T$ is the solution of (11), with the nonzero entries of $L$ and $D$ defined as follows:
$$L_{jj} = 1, \quad L_{I_j j} = -H_{I_j I_j}^{-1} H_{I_j j}, \quad D_{jj}^{-1} = H_{jj} - H_{I_j j}^T H_{I_j I_j}^{-1} H_{I_j j}, \quad 1 \le j \le n, \quad (14)$$
where $H \in \mathbb{S}_n(G)$ is such that any submatrix of $H$ corresponding to a complete subgraph of $G$ is positive definite.

Note that Theorem 3.1 is a special case of Theorem 3.2 with $b$ set to 1; the proof of Theorem 3.2 is given in Appendix A.1. Computing the above solution requires solving $n$ linear systems of size $b$ (which is small), as shown in Algorithm 2, and takes $O((n-b+1)b^3)$ flops. Since $b \ll n$, the number of flops is $O(n)$.
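To make the per-step computation concrete, below is a minimal JAX sketch of one tridiagonal SONew step for a flattened parameter vector, following lines 4–7 of Algorithm 1 and the explicit solution (12). The function and variable names are ours, and the schedules for $\lambda_t$ and $\eta_t$ are folded into constants, so this is an illustration under those assumptions rather than the released implementation.

```python
# Tridiagonal SONew step (sketch). Only the diagonal h_d (n entries) and the
# first off-diagonal h_o (n-1 entries) of H_t are stored, so time and memory are O(n).
import jax.numpy as jnp

def sonew_tridiag_step(w, g, h_d, h_o, lam=1.0, lr=1e-3):
    # Line 4 of Algorithm 1: H_t = H_{t-1} + P_G(g g^T / lam), restricted to the band.
    h_d = h_d + g * g / lam
    h_o = h_o + g[:-1] * g[1:] / lam
    # SPARSIFIED_INVERSE via (12): X_t = L D L^T with unit lower-bidiagonal L.
    l = -h_o / h_d[1:]                          # subdiagonal of L
    d_inv = h_d.at[:-1].add(-h_o**2 / h_d[1:])  # D^{-1}: 2x2 Schur complements; last entry is H_nn
    # Lines 6-7: descent direction u = L (D (L^T g)), each product O(n).
    y = g.at[:-1].add(l * g[1:])                # y = L^T g
    z = y / d_inv                               # z = D y
    u = z.at[1:].add(l * z[:-1])                # u = L z
    return w - lr * u, h_d, h_o

# Usage on a flattened model: h_d = damping * jnp.ones(n); h_o = jnp.zeros(n - 1).
```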
3.3 Regret bound analysis of SONew

The following theorem establishes an optimal regret guarantee [26] for SONew in the online convex optimization framework of Section 3.1.

Theorem 3.3. When $G$ is the tridiagonal/chain graph defined in Theorem 3.1, setting $\epsilon = \hat\epsilon\, G_\infty \sqrt{T}$ and $\eta_t = D_2\,\hat\epsilon/\sqrt{n}$ in Algorithm 1 (as in Appendix A.2), where $\|w_t - w^*\|_2 \le D_2$ and $\|g_t\|_\infty \le G_\infty$, incurs a regret $R_T = O(\sqrt{nT}\, G_\infty D_2)$, where the hidden constant depends polynomially on $1/\hat\epsilon$.

The proof sketch involves deriving an explicit expression for the entries of $X_t^{-1}$ in Lemma A.2, to upper bound the term $\sum_t (w_t - w^*)^T (X_t^{-1} - X_{t-1}^{-1})(w_t - w^*)$ in the regret upper bound (3). Upper bounding $\frac{\eta}{2}\sum_{t=1}^T g_t^T X_t g_t$ involves using the Loewner order $X_t \preceq \|X_t\|_2 I_n \preceq \|X_t\|_\infty I_n$. A detailed proof sketch and proof are given in Appendix A.2. We note that although the regret bound presented here is for convex losses, there are connections to non-convex convergence guarantees via OCO (online convex optimization) learners, presented in Appendix A.2.5. While our main focus is on deep neural network training, which is typically non-convex, we also conducted convex experiments (Table 9).

| Optimizer | Time complexity | Memory complexity |
|---|---|---|
| Adam | $O(d_1 d_2)$ | $d_1 d_2$ |
| rfdSON(m) | $O(m^2 d_1 d_2)$ | $m\, d_1 d_2$ |
| Shampoo | $O(d_1^3 + d_2^3)$ | $d_1^2 + d_2^2$ |
| tridiag-SONew | $O(d_1 d_2)$ | $2\, d_1 d_2$ |
| band-4-SONew | $O(d_1 d_2)$ | $5\, d_1 d_2$ |

Table 1: Complexity of preconditioning a $d_1 \times d_2$ parameter matrix. The time complexity of the tridiagonal and banded SONew inverses scales linearly with the number of parameters, whereas Shampoo is cubic in the dimensions of the matrix. The memory used to store second moments of gradients by tridiag-SONew can be significantly lower than Shampoo's; for example, if $d_1 = 4 d_2$, Shampoo takes more than 2 times more memory.

3.4 Numerical Stability of SONew

In Theorems 3.1 and 3.2, as mentioned, any submatrix of $H_t$ corresponding to a complete subgraph of $G$ should be positive definite. However, in practice, due to finite precision, each entry of $H$ is inherently perturbed by an error proportional to $O(\epsilon_{\mathrm{mach}})$, where $\epsilon_{\mathrm{mach}}$ is the machine epsilon [27]. We notice in practice that the subtraction in $D_{jj}^{-1} = S_{jj} = H_{jj} - H_{j+1,j}^2/H_{j+1,j+1}$ (line 7 of Algorithm 2), which has condition number $\kappa_{\mathrm{sub}} = |H_{jj}|/|S_{jj}|$, can be ill-conditioned, since $S_{jj}$ can be arbitrarily small due to near-singular submatrices $\begin{pmatrix} H_{ii} & H_{i,i+1} \\ H_{i+1,i} & H_{i+1,i+1} \end{pmatrix}$. Thus a small perturbation in $H$ can lead to large perturbations in the preconditioner $\hat X$. We formalize this notion by deriving an end-to-end componentwise condition number (pg. 135, problem 7.11 in [27]) of SPARSIFIED_INVERSE in Theorem A.10, Appendix A.3. To reduce this condition number upper bound and be robust to perturbations in $H_t$ caused by finite precision, for a tridiagonal graph $G$ we can remove edges $(j, j+1)$ that correspond to small values $S_{jj} < \gamma$, where $\gamma \ge 0$ denotes a tolerance parameter. We show in Theorem A.11, Appendix A.3, that this reduces the condition number upper bound of SPARSIFIED_INVERSE. Furthermore, we generalize this to the banded sparsity pattern in Algorithm 3, Appendix A.3.

4 Related Work

The online Newton method is a second-order method in the online convex optimization framework with properties such as scale invariance [35] and logarithmic regret for exp-concave and strongly convex functions [25, 26]. However, it has a per-step time complexity of $O(n^2)$, making it infeasible for large $n$. The introduction of the LogDet divergence measure in SONew allows us to use different sparsity graphs $G$, such as a banded graph with band size $b$, for which our preconditioning process is more computationally efficient, with a time complexity of $O(b^3(n-b+1))$ compared to the online Newton method's $O(n^2)$.
Shampoo [24, 5] approximates the full gradient statistics matrix using Kronecker-factored preconditioners to reduce the memory and time complexity from $O(n^2)$ to $O(d_1^2 + d_2^2)$ and $O(d_1^3 + d_2^3)$, respectively. Here, $n = d_1 d_2$ denotes the number of parameters of a linear layer of dimensions $d_1 \times d_2$. The time complexity of matrix inversion takes a heavy toll on Shampoo's compute time even with the Kronecker product assumption on the preconditioner, whereas our method has a time complexity of $O(b^3 d_1 d_2)$, i.e., linear in the number of parameters $d_1 d_2$ of the linear layer (note that $b = 1$ for the tridiagonal structure). KFAC [38], similar to Shampoo, uses Kronecker-factored preconditioning, but to approximate the Fisher information matrix. FishLeg [20] instead approximates the inverse Fisher matrix directly by expressing it in terms of the solution to an optimisation problem. Both these methods have memory and time complexity similar to Shampoo's. In this work, we compare against Shampoo among the class of Kronecker-factored optimizers due to its widespread testing and adoption within the community [46]. We also point readers to Eva [52], a concurrent work that devises a memory-efficient optimizer by maintaining rank-one approximations to the Kronecker factors of the KFAC matrices. For completeness, we include comparisons with KFAC, FishLeg, and Eva on the Autoencoder benchmark.

There is prior work [35, 36] on reducing the $O(n^2)$ flops of the Online Newton Step (ONS) to $O(n)$ flops using sketching. These ONS variants maintain a low-rank approximation of $H_t$ (as in Algorithm 1), and updating it with a new gradient $g_t$ at every iteration requires an SVD [36] or orthonormalization [35] of a tall and thin matrix in $\mathbb{R}^{n \times r}$, where $r$ denotes the rank of the approximation of $H_t$. In Section 5, we conduct large-scale experiments and compare SONew against rfdSON [37], as it is more stable than Oja-SON [35].

Table 2: float32 experiments on the Autoencoder benchmark. We observe that diag-SONew performs best among all first-order methods while taking similar time. tridiag- and band-4-SONew perform significantly better than first-order methods while requiring similar (linear) space and time. Shampoo performs best overall but takes $O(d_1^3 + d_2^3)$ time to compute the preconditioner of a linear layer of size $d_1 \times d_2$, whereas our methods take $O(d_1 d_2)$ time, as mentioned in Section 3.3. rfdSON takes similar space as SONew but performs considerably worse. (First-order methods: Adagrad, RMSProp, Adam, diag-SONew; second-order methods: Shampoo(20), rfdSON(1), rfdSON(4), tridiag-SONew, band-4-SONew.)

| Optimizer | Adagrad | RMSProp | Adam | diag-SONew | Shampoo(20) | rfdSON(1) | rfdSON(4) | tridiag-SONew | band-4-SONew |
|---|---|---|---|---|---|---|---|---|---|
| Train CE loss | 54.393 | 53.330 | 53.591 | 53.025 | 50.702 | 56.21 | 55.55 | 51.723 | 51.357 |
| Time (s) | 62 | 62 | 62 | 63 | 371 | 85 | 300 | 70 | 260 |

The LogDet problem in equation (11) is closely related to Maximum Determinant Matrix Completion (MDMC) [4, 49]. The MDMC problem is the dual of the LogDet problem (11) and has explicit solutions for chordal graphs [4]; thus the explicit solutions in (14) are the same as the ones proved in [4]. We also note that the tridiagonal explicit solution has been used previously in KFAC [38] in the context of a Gaussian graphical model interpretation of gradients; specifically, KFAC used a block-tridiagonal preconditioner to incorporate correlations between consecutive layers.

5 Experimental Results

We describe our experiments on the standard Autoencoder benchmark [42] trained on the MNIST dataset [12], a Vision Transformer [13] trained on ImageNet, a Graph Network [6, 21] on the OGBG-molpcba dataset [30], and a Large Language Model [47].
For all second-order optimizers, we use grafting [2], a technique used to transfer step size between optimization algorithms. Specifically, given an update $v_1$ of Optimizer-1 and $v_2$ of Optimizer-2, grafting uses the direction suggested by Optimizer-2 with the step size suggested by Optimizer-1; the final update is $\|v_1\| \cdot \frac{v_2}{\|v_2\|}$. Grafting has been shown to take advantage of a tuned optimizer's step size and improve performance. For SONew and rfdSON, we use Adam grafting: the Adam step-size norm $\|v_1\|$ with the SONew/rfdSON direction $v_2/\|v_2\|$ (a minimal sketch of this scheme is given after Figure 1 below). For Shampoo, we use its default RMSProp grafting. We could not find an official rfdSON implementation, so we use our own implementation, with which we reproduced the numbers on convex losses (Appendix A.4) reported in their paper [37].

5.1 Autoencoder benchmark

Setup: We use three sparsity patterns for SONew: a) diagonal sparsity, resulting in a diagonal preconditioner similar to adaptive first-order methods like Adam and Adagrad; b) tridiagonal sparsity, corresponding to a chain graph; and c) banded sparsity, denoted "band-k" in tables and figures for band size $k$. We compare SONew against widely used first-order methods including SGD [32], SGD with momentum [41], Nesterov [40], Adagrad [14], Adam [33], and RMSprop [48]. We also compare with rfdSON [37], a recently proposed memory-efficient second-order optimizer, and with Shampoo [24], a state-of-the-art second-order optimizer used in practice, albeit with considerable memory and time requirements. Because of space constraints, we report only the best-performing first-order methods and include the entire set in the appendix. As previously mentioned, rfdSON maintains a low-rank approximation of the online Newton method's statistics matrix $\sum_i g_i g_i^T$. We observed that rfdSON with Adam grafting always performed better than without grafting, and hence report those numbers. We evaluate rfdSON with a rank-$m$ approximation, denoted rfdSON(m), which requires $(m+1)\times$ #params space when using grafting. For a fair comparison with tridiag-SONew and band-4-SONew, we test rfdSON with $m = 1$ and $m = 4$, respectively. For Shampoo, computing the preconditioner at every step can be infeasible; instead it is computed every $t$ steps, referred to as Shampoo(t). Section 3.3 compares the time and memory complexities of rfdSON, Shampoo, tridiag-SONew, and band-4-SONew. Note that $d_1^2 + d_2^2 \ge 2 d_1 d_2$ for all $d_1, d_2$; thus the memory used by tridiag-SONew is never more than Shampoo's. We use a 2.72M-parameter Autoencoder, and each experiment is performed on one V100 GPU with 16 GB memory. Further setup details are given in Appendix A.4.

Results: In Table 2 we observe that among first-order methods, diag-SONew performs best while taking the same amount of time. Increasing the number of edges in the sparsity graph to tridiagonal or banded sparsity with band size 4 enhances performance further. tridiag-SONew runs 5x faster than Shampoo at a marginal cost in loss, even when Shampoo updates its preconditioner only once every 20 steps. Using the same space, rfdSON performs considerably worse than SONew.

[Figure 1: Best validation runs for tridiag-SONew vs Momentum, RMSProp, Adam, rfdSON, and Shampoo on (a) the ViT benchmark (validation error) and (b) the Graph Network benchmark (validation average precision). tridiag-SONew achieves the same performance as Adam, the next best baseline using similar space and time, with 10% and 30% fewer steps (and less time) on ViT and Graph Network, respectively. Using the same number of steps, SONew achieves 0.7% and 3.4% relatively better validation performance, respectively. Shampoo does not fit in the 16 GB memory of a TPU v2 for the ViT benchmark, so we could not perform hyperparameter tuning for it there. On Graph Network, compared to Shampoo, tridiag-SONew gives similar performance while being far more memory efficient (refer to Appendix A.4.2 for more details).]
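As referenced above, the grafting rule used for SONew and rfdSON simply rescales the second-order direction to the norm of the Adam update. Below is a minimal sketch, with our own function name and a small epsilon added for safety (an assumption on our part, not part of the paper), of how such a grafted update can be assembled per tensor.

```python
# Learning-rate grafting [2] (sketch): direction from optimizer 2, magnitude from optimizer 1.
import jax.numpy as jnp

def graft(v1, v2, eps=1e-16):
    """Return v2 rescaled (per tensor) to the norm of v1, i.e. ||v1|| * v2 / ||v2||."""
    return (jnp.linalg.norm(v1) / (jnp.linalg.norm(v2) + eps)) * v2

# e.g. update = graft(adam_update, sonew_update); w = w - lr * update
```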
To test the numerical stability and robustness of SONew, we reduce the precision to bfloat16 and conduct similar experiments in Appendix A.4.4 (Table 8). We notice that SONew undergoes the least degradation in performance compared to all other optimizers. We refer the reader to Appendix A.4.4 for a thorough comparison and study of the bfloat16 experiments. In Figure 2 we plot the loss curves of all the baselines and SONew for the float32 experiments. Moreover, in Appendix A.4.1, Table 4, we provide an ablation of SONew's performance with varying batch sizes.

[Figure 2: Comparison of SONew with first-order optimizers, rfdSON, and Shampoo on the Autoencoder benchmark in float32 training. SONew performs better than all first-order methods, and better than second-order methods using the same memory.]

[Figure 3: Comparison of SONew and AdaFactor on LLM training. SONew takes 26% fewer steps to reach the same performance as AdaFactor; using the same number of steps, it achieves 1.7% relatively better log perplexity.]

Comparison with other baselines: We further compare SONew with KFAC [38], FishLeg [20], and Eva [52] for completeness. Since these methods lack a JAX implementation, we adopted the authors' official PyTorch implementations. When we attempted to integrate their code with our Autoencoder benchmark, the results were unsatisfactory; for instance, FishLeg recorded a loss of approximately 60.0. This was notably unexpected, as it underperformed Adam, a baseline the authors themselves compared against. Given these results, and to minimize modifications to the official code, we decided to test our optimizer, SONew, directly on their provided autoencoder architecture. We present the results in Appendix A.4.4 and observe that SONew outperforms these baselines as well by a large margin.

5.2 ViT and Graph Network benchmark

Setup: We compare tridiag-SONew with Momentum, RMSProp, and Adam on the ViT (~22M parameters) and Graph Network (~3.5M parameters) benchmarks. For each experiment, we search over 200 hyperparameter configurations, using 4 16-GB TPUs (v2) for each run. In order to conduct a fair comparison of running times, we executed the optimal hyperparameter configurations on 4 32-GB TPUs (v4) [31]. This is because certain operations, including reshapes and transposes, are not optimized on TPU v2; consequently, methods like rfdSON, Shampoo, or SONew, which utilize these operations, could be disadvantaged on TPU v2, skewing the comparative results. All memory-efficient methods, including rfdSON, first-order methods, and SONew, exhibit similar runtimes, with differences of approximately 5%. For ViT, we evaluate performance over the same number of steps, as this also effectively compares wall-clock time. However, for Graph Network, we train Shampoo for 20% fewer steps to achieve a comparable wall-clock time.

Results: We plot the runs that give the best validation error rate (for ViT) or validation average precision (for Graph Network) in Figure 1.
tridiag-SONew requires 10% fewer steps to reach the same performance as Adam for ViT, and 30% fewer steps for the Graph Network benchmark. Training for the same number of steps, we get 0.7% relatively better validation error for ViT and 3.4% relatively better validation average precision for Graph Network. On the Graph Network benchmark, tridiag-SONew performs 1.3% relatively worse in average precision compared to Shampoo, while being 1.25x faster. On the ViT benchmark, Shampoo does not fit in a 16 GB TPU v2: its statistics require 155M entries (~7x #params) while tridiag-SONew requires only 44M entries (2x #params); hence, we could not tune it. rfdSON takes the same memory but slightly more time because of its SVD computations. We also notice that rfdSON performs worse than Adam on both benchmarks; we leave a thorough investigation of this behavior as future work. We show in Appendix A.4 that, corresponding to the best validation runs, tridiag-SONew's training loss is also lower than Adam's. Furthermore, from an optimization point of view, we also show in Appendix A.4 that among all 200 hyperparameter sweeps, the best training loss of tridiag-SONew is 9% relatively better on ViT and 80% relatively better on Graph NN than that of Adam. We further compare Adam and tridiag-SONew on a 248M-parameter Transformer model in Appendix A.4.4. In the next subsection, we present results on a decoder-only large-scale language model.

5.3 Experiments on Language Models

Setup: Owing to SONew's scalability, we test it on a Large Language Model (LLM) [47] with 1 billion parameters. We compare SONew with AdaFactor (without factoring), a commonly used first-order optimizer for training LLMs [51, 11]. AdaFactor is similar to Adam except that it additionally offers features like "parameter scaling", which has the effect of layerwise damping of the learning rate; we refer the reader to [45] for more details. We trained the LLM for 5B tokens with a batch size of 64k tokens. All experiments were performed on 16 TPU v4s. To support efficient training of large models, we implemented a sharded tridiag-SONew following a model-parallelism approach.

Results: We report the experiment in Figure 3, where we find that SONew beats AdaFactor by a large margin. Specifically, SONew achieves the same log perplexity using 26% fewer steps. Moreover, using the same number of tokens, SONew achieves 1.7% relatively better train loss, leading to a 1.35x speedup. This shows the potential of SONew as a scalable optimizer that can be used to train large models while using second-order information.

6 Conclusions and Future Work

In this paper we have introduced a novel Sparsified Online Newton (SONew) method that yields a computationally efficient sparse preconditioner which can effectively train very large DNNs. The time and memory complexity of SONew is linear in the number of parameters, unlike current Kronecker-factorization-based second-order methods for training deep networks. Our experimental results show that SONew uses similar time as first-order methods and achieves much better validation and training performance on various benchmarks. In the future, we plan to explore different sparsity graphs for which efficient solutions exist for the LogDet subproblem (11) and develop corresponding regret bound analyses.
Some of the limitations of SONew include: 1) explicit solutions akin to Theorem 3.1 & 3.2 need not exist for all sparsity graphs G; 2) Not all graphs allow for efficient optimizer implementation; 3) Among graphs permitting efficient optimizer implementation like tridiagonal sparsity the ordering of parameters remains unexplored. An alternative ordering might position closely related parameters adjacently, potentially enhancing performance; 4) Comprehensive exploration of methods to scale final updates is needed. While we employ grafting [2], other techniques, such as clipping [45, 38], merit investigation. [1] N. Agarwal, B. Bullins, X. Chen, E. Hazan, K. Singh, C. Zhang, and Y. Zhang. Efficient full-matrix adaptive regularization. In International Conference on Machine Learning, pages 102 110. PMLR, 2019. [2] N. Agarwal, R. Anil, E. Hazan, T. Koren, and C. Zhang. Learning rate grafting: Transferability of optimizer tuning, 2022. [3] S.-I. Amari. Natural gradient works efficiently in learning. Neural computation, 10(2):251 276, 1998. [4] M. S. Andersen, J. Dahl, and L. Vandenberghe. Logarithmic barriers for sparse matrix cones. Optimization Methods and Software, 28(3):396 423, 2013. [5] R. Anil, V. Gupta, T. Koren, K. Regan, and Y. Singer. Scalable second order optimization for deep learning. ar Xiv preprint ar Xiv:2002.09018, 2020. [6] P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner, C. Gulcehre, F. Song, A. Ballard, J. Gilmer, G. Dahl, A. Vaswani, K. Allen, C. Nash, V. Langston, C. Dyer, N. Heess, D. Wierstra, P. Kohli, M. Botvinick, O. Vinyals, Y. Li, and R. Pascanu. Relational inductive biases, deep learning, and graph networks, 2018. URL https://arxiv.org/abs/1806.01261. [7] M. Bollhöfer, A. Eftekhari, S. Scheidegger, and O. Schenk. Large-scale sparse inverse covariance matrix estimation. SIAM Journal on Scientific Computing, 41(1):A380 A401, 2019. [8] L. M. Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR computational mathematics and mathematical physics, 7(3):200 217, 1967. [9] C. G. Broyden. Quasi-Newton methods and their application to function minimisation. Mathematics of Computation, 21(99):368 381, 1967. [10] N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning algorithms. Advances in neural information processing systems, 14, 2001. [11] A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann, P. Schuh, K. Shi, S. Tsvyashchenko, J. Maynez, A. Rao, P. Barnes, Y. Tay, N. Shazeer, V. Prabhakaran, E. Reif, N. Du, B. Hutchinson, R. Pope, J. Bradbury, J. Austin, M. Isard, G. Gur-Ari, P. Yin, T. Duke, A. Levskaya, S. Ghemawat, S. Dev, H. Michalewski, X. Garcia, V. Misra, K. Robinson, L. Fedus, D. Zhou, D. Ippolito, D. Luan, H. Lim, B. Zoph, A. Spiridonov, R. Sepassi, D. Dohan, S. Agrawal, M. Omernick, A. M. Dai, T. S. Pillai, M. Pellat, A. Lewkowycz, E. Moreira, R. Child, O. Polozov, K. Lee, Z. Zhou, X. Wang, B. Saeta, M. Diaz, O. Firat, M. Catasta, J. Wei, K. Meier-Hellstern, D. Eck, J. Dean, S. Petrov, and N. Fiedel. Palm: Scaling language modeling with pathways, 2022. [12] L. Deng. The MNIST database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 29(6):141 142, 2012. [13] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. 
Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale, 2020. URL https://arxiv.org/abs/ 2010.11929. [14] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(61):2121 2159, 2011. URL http://jmlr.org/papers/v12/duchi11a.html. [15] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of machine learning research, 12(7), 2011. [16] S. Fattahi and S. Sojoudi. Graphical lasso and thresholding: Equivalence and closed-form solutions. Journal of machine learning research, 2019. [17] R. Fletcher. A new approach to variable metric algorithms. The computer journal, 13(3): 317 322, 1970. [18] R. Fletcher. A new variational result for quasi-Newton formulae. SIAM Journal on Optimization, 1(1):18 21, 1991. [19] J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432 441, 2008. [20] J. R. Garcia, F. Freddi, S. Fotiadis, M. Li, S. Vakili, A. Bernacchia, and G. Hennequin. Fisherlegendre (fishleg) optimization of deep neural networks. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id= c9l AOPv QHS. [21] J. Godwin*, T. Keck*, P. Battaglia, V. Bapst, T. Kipf, Y. Li, K. Stachenfeld, P. Veliˇckovi c, and A. Sanchez-Gonzalez. Jraph: A library for graph neural networks in jax., 2020. URL http://github.com/deepmind/jraph. [22] D. Goldfarb. A family of variable-metric methods derived by variational means. Mathematics of computation, 24(109):23 26, 1970. [23] D. Goldfarb, Y. Ren, and A. Bahamou. Practical quasi-Newton methods for training deep neural networks. Advances in Neural Information Processing Systems, 33:2386 2396, 2020. [24] V. Gupta, T. Koren, and Y. Singer. Shampoo: Preconditioned stochastic tensor optimization. In International Conference on Machine Learning, pages 1842 1850. PMLR, 2018. [25] E. Hazan, A. Agarwal, and S. Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2):169 192, 2007. [26] E. Hazan et al. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157 325, 2016. [27] N. J. Higham. Accuracy and stability of numerical algorithms. SIAM, 2002. [28] G. Hinton, N. Srivastava, and K. Swersky. Neural networks for machine learning lecture 6a overview of mini-batch gradient descent. Cited on, 14(8):2, 2012. [29] C.-J. Hsieh, M. A. Sustik, I. S. Dhillon, P. K. Ravikumar, and R. Poldrack. BIG & QUIC: Sparse inverse covariance estimation for a million variables. Advances in neural information processing systems, 26, 2013. [30] W. Hu, M. Fey, M. Zitnik, Y. Dong, H. Ren, B. Liu, M. Catasta, and J. Leskovec. Open graph benchmark: Datasets for machine learning on graphs, 2020. URL https://arxiv.org/abs/ 2005.00687. [31] N. P. Jouppi, G. Kurian, S. Li, P. Ma, R. Nagarajan, L. Nai, N. Patil, S. Subramanian, A. Swing, B. Towles, C. Young, X. Zhou, Z. Zhou, and D. Patterson. Tpu v4: An optically reconfigurable supercomputer for machine learning with hardware support for embeddings, 2023. [32] J. Kiefer and J. Wolfowitz. Stochastic Estimation of the Maximum of a Regression Function. The Annals of Mathematical Statistics, 23(3):462 466, 1952. doi: 10.1214/aoms/1177729392. URL https://doi.org/10.1214/aoms/1177729392. [33] D. P. 
Kingma and J. Ba. Adam: A method for stochastic optimization. ar Xiv preprint ar Xiv:1412.6980, 2014. [34] B. Kulis, M. A. Sustik, and I. S. Dhillon. Low-rank kernel learning with Bregman matrix divergences. Journal of Machine Learning Research, 10(2), 2009. [35] H. Luo, A. Agarwal, N. Cesa-Bianchi, and J. Langford. Efficient second order online learning by sketching. Advances in Neural Information Processing Systems, 29, 2016. [36] L. Luo, C. Chen, Z. Zhang, W.-J. Li, and T. Zhang. Robust frequent directions with application in online learning. The Journal of Machine Learning Research, 20(1):1697 1737, 2019. [37] L. Luo, C. Chen, Z. Zhang, W.-J. Li, and T. Zhang. Robust frequent directions with application in online learning. Journal of Machine Learning Research, 20(45):1 41, 2019. URL http: //jmlr.org/papers/v20/17-773.html. [38] J. Martens and R. Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. In International conference on machine learning, pages 2408 2417. PMLR, 2015. [39] S. Merity, C. Xiong, J. Bradbury, and R. Socher. Pointer sentinel mixture models, 2016. [40] Y. Nesterov. A method for unconstrained convex minimization problem with the rate of convergence o(1/k2). 1983. [41] N. Qian. On the momentum term in gradient descent learning algorithms. Neural Networks, 12 (1):145 151, 1999. ISSN 0893-6080. doi: https://doi.org/10.1016/S0893-6080(98)00116-6. URL https://www.sciencedirect.com/science/article/pii/S0893608098001166. [42] J. Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85 117, jan 2015. doi: 10.1016/j.neunet.2014.09.003. URL https://doi.org/10.1016%2Fj. neunet.2014.09.003. [43] S. Shalev-Shwartz et al. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107 194, 2012. [44] D. F. Shanno. Conditioning of quasi-Newton methods for function minimization. Mathematics of computation, 24(111):647 656, 1970. [45] N. Shazeer and M. Stern. Adafactor: Adaptive learning rates with sublinear memory cost, 2018. [46] H.-J. M. Shi, T.-H. Lee, S. Iwasaki, J. Gallego-Posada, Z. Li, K. Rangadurai, D. Mudigere, and M. Rabbat. A distributed data-parallel pytorch implementation of the distributed shampoo optimizer for training neural networks at-scale, 2023. [47] D. R. So, W. Ma nke, H. Liu, Z. Dai, N. Shazeer, and Q. V. Le. Primer: Searching for efficient transformers for language modeling, 2022. [48] T. Tieleman and G. Hinton. Lecture 6.5 rmsprop: Divide the gradient by a running average of its recent magnitude. coursera: Neural networks for machine learning. 2012. [49] L. Vandenberghe, M. S. Andersen, et al. Chordal graphs and semidefinite optimization. Foundations and Trends in Optimization, 1(4):241 433, 2015. [50] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need, 2023. [51] T. Wang, A. Roberts, D. Hesslow, T. L. Scao, H. W. Chung, I. Beltagy, J. Launay, and C. Raffel. What language model architecture and pretraining objective work best for zero-shot generalization?, 2022. [52] L. Zhang, S. Shi, and B. Li. Eva: Practical second-order optimization with kronecker-vectorized approximation. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=_Mic8V96Voy. [53] R. Zhang, S. Fattahi, and S. Sojoudi. Large-scale sparse inverse covariance estimation via thresholding and max-det matrix completion. 
In International Conference on Machine Learning, pages 5766 5775. PMLR, 2018. [54] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th international conference on machine learning (icml-03), pages 928 936, 2003. A Supplementary material A.1 Properties of Log Det subproblem . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 A.2 Regret bound analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 A.2.1 Regret bound decomposition . . . . . . . . . . . . . . . . . . . . . . . . . 15 A.2.2 Properties of tridiagonal preconditioner . . . . . . . . . . . . . . . . . . . 16 A.2.3 Upperbounding Regret . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 T) Regret . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 A.2.5 Non-convex guarantees . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 A.3 Numerical stability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 A.3.1 Condition number analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 24 A.3.2 Degenerate Ht . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 A.3.3 Numerically Stable SONew proof . . . . . . . . . . . . . . . . . . . . . . 26 A.4 Additional Experiments, ablations, and details . . . . . . . . . . . . . . . . . . . . 26 A.4.1 Ablations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 A.4.2 Memory Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 A.4.3 Hyperparaeter search space . . . . . . . . . . . . . . . . . . . . . . . . . . 27 A.4.4 Additional Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 A.4.5 Convex experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 A.1 Properties of Log Det subproblem Proof of Theorem 3.2 The optimality condition of (11) is PG(X 1) = PG(H), X S++ n (G). Let Z = L T D 1L 1, then PG(Z) = H ZL = L T D 1 = ZLej = L T D 1ej Let Jj = Ij j, where Ij = {j + 1, . . . , j + b} as defined in the theorem, then select Jj indices of vectors on both sides of the second equality above and selecting the Jj indices : Zjj Zj Ij ZIjj ZJj Jj Note that L T is an upper triangular matrix with ones in the diagonal hence Jth j block of L T ej will be [1, 0, 0, . . .]. Also, since PG(Z) = H Zjj Zj Ij ZIjj ZJj Jj = Hjj Hj Ij HIjj HJj Jj Substituting this in the linear equation 15 Hjj Hj Ij HIjj HJj Jj Hjj Hj Ij HIjj HJj Jj djj djj LIj Hjjdjj + djj HT Ijj LIjj = 1 HIjjdjj + djj HIj Ij LIjj = 0 The lemma follows from solving the above equations. Note that here we used that lower triangular halves of matrices L and H have the same sparsity patterns, which follows from the fact that banded graph is a chordal graph with perfect elimination order {1, 2, . . . , n}. Furthermore, Xt is positive definite, since as (Hjj HT Ijj H 1 Ij Ij HIjj) is a schur complement of submatrix of H formed by Jj = Ij {j}. Proof of Theorem 3.1 The proof follows trivially from Theorem 3.1, when b is set to 1. A.2 Regret bound analysis Proof sketch of Theorem 3.3. We decompose the regret into RT T1 + T2 + T3 in Lemma A.1 and individually bound the terms. Term T2 = 1 2η PT 1 t=1 (wt+1 w )T (X 1 t+1 X 1 t )(wt+1 w ) depends on closeness of consecutive inverses of preconditioners, (X 1 t+1 X 1 t ), to upperbound this we first give explicit expressions of X 1 t for tridiagonal preconditioner in Lemma A.2 in Appendix A.2.2. 
This explicit expression is later used to bound each entry of (X 1 t+1 X 1 t ) with O(1/ t) in Appendix A.2.4, this gives a O( T) upperbound on T2. To show an upperbound on T3 = PT t=1 η 2 g T t Xtgt, we individually bound g T t Xtgt by using a Loewner order Xt Xt 2 In Xt In and show that Xt = O(1/ T) and consequently T3 = O( A.2.1 Regret bound decomposition In this subsection we state Lemma A.1 which upper bound the regret RT using three terms T1, T2, T3. Lemma A.1 ( [26] ). In the OCO problem setup, if a prediction wt Rn is made at round t and is updated as wt+1 := wt ηXtgt using a preconditioner matrix Xt S++ n 2η ( w1 w 2 X 1 1 w T +1 w X 1 T ) (16) t=1 (wt+1 w )T (X 1 t+1 X 1 t )(wt+1 w ) (17) η 2 g T t Xtgt (18) wt+1 w 2 X 1 t = wt ηXtgt w 2 X 1 t = wt w 2 X 1 t + η2g T t Xtgt 2η(wt w )T gt = 2η(wt w )T gt = wt w 2 X 1 t wt+1 w 2 X 1 t + η2g T t Xtgt Using the convexity of ft, ft(wt) ft(w ) (wt w )T gt, where gt = ft(wt) and summing over t [T] 1 2η wt w 2 X 1 t wt+1 w 2 X 1 t 2 g T t Xtgt (20) The first summation can be decomposed as follows wt w 2 X 1 t wt+1 w 2 X 1 t = w1 w 2 X 1 1 w T +1 w 2 X 1 T t=1 (wt+1 w )T (X 1 t+1 X 1 t )(wt+1 w ) Substituting the above identity in the Equation (19) proves the lemma. Let RT T1 + T2 + T3, where T1 = 1 2η ( w1 w 2 X 1 1 w T +1 w X 1 T ) t=1 (wt+1 w )T (X 1 t+1 X 1 t )(wt+1 w ) (21) T3 = PT t=1 η 2 g T t Xtgt A.2.2 Properties of tridiagonal preconditioner In this subsection, we derive properties of the tridigonal preconditioner obtained from solving the Log Det subproblem (11) with G set to a chain graph over ordered set of vertices {1, . . . , n}: Xt = arg min X Sn(G)++ log det (X) + Tr(XHt) (22) = arg min X Sn(G)++ Dℓd (X, H 1 t ) (23) The second equality holds true only when Ht is positive definite. Although in Algorithm 1 we maintain a sparse Ht = Ht 1 + PG(gtg T t /λt), H0 = ϵIn which is further used in (22) to find the preconditioner Xt, our analysis assumes the full update Ht = Ht 1 + gtg T t /λt, H0 = ϵIn followed by preconditioner Xt computation using (23). Note that the preconditioners Xt generated both ways are the same, as shown in Section 3.2. The following lemma shows that the inverse of tridiagonal preconditioners used in Algorithm 1, will restore Hi,j, when (i, j) fall in the tridiagonal graph, else, the expression is related to product of Hi+k,i+k+1 corresponding to the edges in the path from node i to j in chain graph. This lemma will be used later in upperbounding T2. Lemma A.2 (Inverse of tridiagonal preconditioner). If G = chain/tridiagonal graph and ˆX = arg min X Sn(G)++ Dℓd (X, H 1), then the inverse ˆX 1 has the following expression ( ˆX 1)ij = ( Hij |i j| 1 Hii+1Hi+1i+2...Hj 1j Hi+1i+1...Hj 1j 1 (24) ˆX 1 ˆX(j) = ej Where ˆX(j) is the jth column of ˆX. Let ˆY denote the right hand side of Equation (24). ( ˆY ˆX)jj = ˆXjj ˆYjj + ˆXj 1j ˆYj 1j + ˆXjj+1 ˆYjj+1 = ˆXjj Hjj + ˆXj 1j Hj 1j + ˆXjj+1Hjj+1 = 1 The third equality is by using the following alternative form of Equation (12): ( ˆX(1))i,j = 0, if j i > 1 Hi,i+1 (Hii Hi+1,i+1 H2 i+1,i+1), if j = i + 1 j neig G(i) H2 ij Hii Hjj H2 ij , if i = j , (25) where i < j. Similarly, the offdiagonals of ˆY ˆX can be evaluated to be zero as follows. ( ˆY ˆX)ij = ˆYij ˆ Xjj + ˆYij 1 ˆXj 1j + ˆYij+1 ˆXj+1j = ˆYij ˆXjj + ˆYij Hj 1j 1 Hj 1j + ˆYij Hjj+1 Lemma A.3. Let y Rn, β = maxt maxi [n 1] |(Ht)ii+1| / p (Ht)ii(Ht)i+1i+1 < 1, then y T X 1 t y y 2 2 diag(Ht) 2 where Xt is defined as in Lemma A.2. Proof. 
Let X 1 t = diag(Ht) 1/2 ˆXt diag(Ht) 1/2 y T X 1 t y diag(Ht)1/2y 2 X 1 t 2 (26) Using the identity of spectral radius ρ(X) X and since X is positive definite, X 1 t 2 X 1 t 2 max i ( X 1 t )ij 1 + 2(β + β2 + . . .) The second inequality is using Lemma A.2. Substituting this in Equation (26) will give the lemma. A.2.3 Upperbounding Regret The following Lemma is used in upperbounding both T1 and T3. In next subsection, we ll upper bound T2 as well. Lemma A.4. Let β = maxt [T ] maxi [n 1] |(Ht)ii+1| / p (Ht)ii(Ht)i+1i+1, then 1/(1 β) 8/ˆϵ2, where, ˆϵ is a constant in parameter ϵ = ˆϵG T and consequently used in initializing H0 = ϵIn in line 1 in Algorithm 1, 1/(1 β) = max t max i [n 1] 1 1 ( ˆHt)ii+1 (27) = max t max i [n 1] 1 + ( ˆHt)ii+1 1 ( ˆHt)2 ii+1 (where ( ˆHt)ii+1 = (Ht)ii+1 p (Ht)ii(Ht)i+1i+1) max t max i [n 1] 2(Ht)ii(Ht)i+1i+1 (Ht)ii(Ht)i+1i+1 (Ht)2 ii+1 (since |(Ht)ii+1| p (Ht)ii(Ht)i+1i+1) max t max i [n 1] 2(Ht)ii(Ht)i+1i+1 det (Ht)ii (Ht)ii+1 (Ht)i+1i (Ht)i+1i+1 Note that (Ht)ii (Ht)ii+1 (Ht)i+1i (Ht)i+1i+1 (using line 1 in Algorithm 1), thus det (Ht)ii (Ht)ii+1 (Ht)i+1i (Ht)i+1i+1 det ϵ 1 0 0 1 = ϵ2. The numerator last inequality can be upperbounded by bounding (Ht)ii individually as follows: s=1 (gs)2 i /λs s=1 (gs)2 i /λs s=1 (gs)2 i /(G s) s=1 G2 /(G s) Substituting the above in (28) gives 1/(1 β) max t 8G2 t ˆϵ2G2 T Lemma A.5 (Upperbound of T1). T ˆϵ2η , (30) where D2 = maxt [T ] wt w 2 and G = maxt gt Proof. Since XT is positive definite T1 w1 w 2 X 1 1 2η = (y(1))T X 1 1 y(1) 2η (where y(1) = w1 w ) y(1) 2 2 diag(H1) 2 1 β (Lemma A.3) D2 2(G2 /λ1 + ϵ) 1 β (line 4 in Algorithm 1) 8D2 2(G2 /λ1 + ϵ) ˆϵ2η (Lemma A.4) 8D2 2(G + ˆϵG T) ˆϵ2η (Since λt = G t and ϵ = ˆϵG T ˆϵ2η (ˆϵ < 1) Lemma A.6 (O( T) upperbound on T3). η 2 g T t Xtgt 4n G η where, gt G and parameters ϵ = ˆϵG t in Algorithm 1. Proof. Using Theorem 3.1, nonzero entries of Xt can be written as follows: (Xt)ii = 1 Hii H2 ij Hii Hjj H2 ij (Xt)ii+1 = Hii+1 Hii Hi+1i+1 H2 ii+1 where, EG denote the set of edges of the chain graph G in Theorem 3.1. Also, for brevity, the subscript is dropped for Ht. Let ˆXt = p diag(H)Xt p diag(H), then ˆXt can be written as ˆH2 ij 1 ˆH2 ij ( ˆXt)ii+1 = ˆHii+1 1 ˆH2 ii+1 , where, ˆHij = Hij/ p Hii Hjj. Note that ˆXt ˆXt 2In ˆXt In, using max{|λ1( ˆXt)|, . . . , |λn( ˆXt)|} ˆXt (property of spectral radius). So we upperbound ˆXt = maxi [n]{|( ˆXt)11| + |( ˆXt)12|, . . . , |( ˆXt)ii 1| + |( ˆXt)ii| + |( ˆXt)ii+1|, . . . , |( ˆXt)nn| + |( ˆXt)nn 1|} next. Individual terms |( ˆXt)ii 1| + |( ˆXt)ii| + |( ˆXt)ii+1| can be written as follows: (i,j) EG |( ˆXt)ij| = 1 + X ˆH2 ij 1 ˆH2 ij + | ˆHij| 2 max i [n 1] 1 1 | ˆHii+1| The last inequality is because | ˆHij| 1. Thus, ˆXt 2 maxi [n 1] 1 1 | ˆ Hii+1|. Now g T t Xtgt g T t diag(Ht) 1/2 ˆXt diag(Ht) 1/2gt ˆXt diag(Ht) 1/2gt 2 2 ˆXt 2 ˆXt 2 max i [n 1] 1 1 | ˆHii+1| g T t diag(Ht) 1gt. Using diag(Ht) ϵIn (step 1 in Algorithm 1), where ϵ = ˆϵG T as set in Lemma A.8, gives g T t Xtgt 2 max i [n 1] 1 1 | ˆHii+1| 2 max i [n 1] n G ˆϵ(1 | ˆHii+1|) 2n G ˆϵ(1 β) T (where β = max t [T ] max i [n 1] ( ˆHt)ii+1 ) Summing up over t gives η 2g T t Xtgt X T (Using Lemma A.4) In this section we derive a regret upper bound with a O(T 1/2) growth. For this, we upper bound T2 as well in this section. In (21), T2 = PT t=2(wt w )T (X 1 t X 1 t 1)(wt w ) can be upper bounded to a O(T 1/2) by upperbounding entries of X 1 t X 1 t 1 individually. The following lemmas constructs a telescoping argument to bound (X 1 t X 1 t 1)i,j . Lemma A.7. 
Lemma A.7. Let $H, \tilde{H} \in S_n^{++}$ with $\tilde{H} = H + gg^T/\lambda$ for some $g \in \mathbb{R}^n$, and let $N_{ij} = H_{ij}/\sqrt{H_{ii}H_{jj}}$ and $\tilde{N}_{ij} = \tilde{H}_{ij}/\sqrt{\tilde{H}_{ii}\tilde{H}_{jj}}$. Then $\tilde{N}_{ij} = N_{ij} + \theta_{ij}$, where
$$\theta_{ij} = \frac{g_i g_j}{\lambda\sqrt{\tilde{H}_{ii}\tilde{H}_{jj}}} + N_{ij}\left(\frac{\sqrt{H_{ii}H_{jj}}}{\sqrt{\tilde{H}_{ii}\tilde{H}_{jj}}} - 1\right).$$
(The identity follows by writing $\tilde{H}_{ij} = H_{ij} + g_i g_j/\lambda$ and regrouping.)

The following lemma bounds the change in the inverse of the preconditioner $Y$ under a rank-one perturbation of $H \succ 0$ in the LogDet problem (11):
$$Y = \arg\min_{X \in S_n(G)^{++}} -\log\det(X) + \mathrm{Tr}(XH) = \arg\min_{X \in S_n(G)^{++}} D_{\ell d}\big(X, H^{-1}\big).$$

Lemma A.8 (Rank-1 perturbation of the LogDet problem (11)). Let $H, \tilde{H} \in S_n^{++}$ with $\tilde{H} = H + gg^T/\lambda$, $g \in \mathbb{R}^n$. Let $Y = \arg\min_{X \in S_n(G)^{++}} D_{\ell d}(X, H^{-1})$ and $\tilde{Y} = \arg\min_{X \in S_n(G)^{++}} D_{\ell d}(X, \tilde{H}^{-1})$, where $G$ is a chain graph. Then
$$\big|(\tilde{Y}^{-1} - Y^{-1})_{i,i+k}\big| \le \frac{G_\infty^2\,\kappa\,(k\beta + k + 2)\,\beta^{k-1}}{\lambda}, \qquad i, i+k \le n,$$
where $G_\infty = \|g\|_\infty$, $\max_{i,j}|H_{ij}|/\sqrt{H_{ii}H_{jj}} \le \beta < 1$ (and likewise for $\tilde{H}$), and $\kappa := \max\big(\kappa(\mathrm{diag}(H)), \kappa(\mathrm{diag}(\tilde{H}))\big)$, with $\kappa(\mathrm{diag}(H))$ the condition number of the diagonal part of $H$.

Proof. Using Lemma A.2 for both $Y$ and $\tilde{Y}$,
$$\big|(\tilde{Y}^{-1} - Y^{-1})_{i,i+k}\big| = \left|\frac{\tilde{H}_{i,i+1}\cdots\tilde{H}_{i+k-1,i+k}}{\tilde{H}_{i+1,i+1}\cdots\tilde{H}_{i+k-1,i+k-1}} - \frac{H_{i,i+1}\cdots H_{i+k-1,i+k}}{H_{i+1,i+1}\cdots H_{i+k-1,i+k-1}}\right|$$
$$= \sqrt{\tilde{H}_{ii}\tilde{H}_{i+k,i+k}}\;\left|\tilde{N}_{i,i+1}\cdots\tilde{N}_{i+k-1,i+k} - N_{i,i+1}\cdots N_{i+k-1,i+k}\,\frac{\sqrt{H_{ii}H_{i+k,i+k}}}{\sqrt{\tilde{H}_{ii}\tilde{H}_{i+k,i+k}}}\right|,$$
where $|N_{ij}|, |\tilde{N}_{ij}| < 1$, since the $2\times2$ principal minors of $H$ and $\tilde{H}$ are positive. Expanding $\tilde{N}_{i,i+1} = N_{i,i+1} + \theta_{i,i+1}$, then $\tilde{N}_{i+1,i+2} = N_{i+1,i+2} + \theta_{i+1,i+2}$, and so on (Lemma A.7), and telescoping,
$$\left|\tilde{N}_{i,i+1}\cdots\tilde{N}_{i+k-1,i+k} - N_{i,i+1}\cdots N_{i+k-1,i+k}\frac{\sqrt{H_{ii}H_{i+k,i+k}}}{\sqrt{\tilde{H}_{ii}\tilde{H}_{i+k,i+k}}}\right|$$
$$= \left|\theta_{i,i+1}\,N_{i+1,i+2}\cdots N_{i+k-1,i+k} + \tilde{N}_{i,i+1}\,\theta_{i+1,i+2}\,N_{i+2,i+3}\cdots N_{i+k-1,i+k} + \cdots + \tilde{N}_{i,i+1}\cdots\tilde{N}_{i+k-2,i+k-1}\,\theta_{i+k-1,i+k} + N_{i,i+1}\cdots N_{i+k-1,i+k}\Big(1 - \frac{\sqrt{H_{ii}H_{i+k,i+k}}}{\sqrt{\tilde{H}_{ii}\tilde{H}_{i+k,i+k}}}\Big)\right|$$
$$\le \Big(\sum_{l=0}^{k-1}|\theta_{i+l,i+l+1}|\Big)\beta^{k-1} + \beta^{k-1}\left|1 - \frac{\sqrt{H_{ii}H_{i+k,i+k}}}{\sqrt{\tilde{H}_{ii}\tilde{H}_{i+k,i+k}}}\right|.$$
Multiplying back by $\sqrt{\tilde{H}_{ii}\tilde{H}_{i+k,i+k}}$,
$$\big|(\tilde{Y}^{-1} - Y^{-1})_{i,i+k}\big| \le \sqrt{\tilde{H}_{ii}\tilde{H}_{i+k,i+k}}\Big(\sum_{l=0}^{k-1}|\theta_{i+l,i+l+1}|\Big)\beta^{k-1} + \beta^{k-1}\left|\sqrt{\tilde{H}_{ii}\tilde{H}_{i+k,i+k}} - \sqrt{H_{ii}H_{i+k,i+k}}\right|.$$
Expanding $\theta_{i+l,i+l+1}$ via Lemma A.7,
$$\sqrt{\tilde{H}_{ii}\tilde{H}_{i+k,i+k}}\,|\theta_{i+l,i+l+1}| \le \frac{\sqrt{\tilde{H}_{ii}\tilde{H}_{i+k,i+k}}}{\sqrt{\tilde{H}_{i+l,i+l}\tilde{H}_{i+l+1,i+l+1}}}\cdot\frac{|g_{i+l}\,g_{i+l+1}|}{\lambda} + \sqrt{\tilde{H}_{ii}\tilde{H}_{i+k,i+k}}\,|N_{i+l,i+l+1}|\left|\frac{\sqrt{H_{i+l,i+l}H_{i+l+1,i+l+1}}}{\sqrt{\tilde{H}_{i+l,i+l}\tilde{H}_{i+l+1,i+l+1}}} - 1\right|.$$
Since $\tilde{H}_{i+l,i+l}\tilde{H}_{i+l+1,i+l+1} \ge H_{i+l,i+l}H_{i+l+1,i+l+1}$,
$$\left|\frac{\sqrt{H_{i+l,i+l}H_{i+l+1,i+l+1}}}{\sqrt{\tilde{H}_{i+l,i+l}\tilde{H}_{i+l+1,i+l+1}}} - 1\right| \le \max\left\{1 - \frac{H_{i+l,i+l}}{\tilde{H}_{i+l,i+l}},\ 1 - \frac{H_{i+l+1,i+l+1}}{\tilde{H}_{i+l+1,i+l+1}}\right\} = \max\left\{\frac{g_{i+l}^2}{\lambda\,\tilde{H}_{i+l,i+l}},\ \frac{g_{i+l+1}^2}{\lambda\,\tilde{H}_{i+l+1,i+l+1}}\right\}.$$
Using this, $\tilde{H}_{ii}/\tilde{H}_{jj} \le \kappa$ and $H_{ii}/H_{jj} \le \kappa$ for all $i,j \in [n]$, and $|g_i| \le G_\infty$,
$$\sqrt{\tilde{H}_{ii}\tilde{H}_{i+k,i+k}}\,|\theta_{i+l,i+l+1}| \le \frac{G_\infty^2\,\kappa}{\lambda} + \frac{\beta\, G_\infty^2\,\kappa}{\lambda} = \frac{G_\infty^2\,\kappa\,(1+\beta)}{\lambda}.$$
Thus the first part of the bound satisfies
$$\sqrt{\tilde{H}_{ii}\tilde{H}_{i+k,i+k}}\Big(\sum_{l=0}^{k-1}|\theta_{i+l,i+l+1}|\Big)\beta^{k-1} \le \frac{G_\infty^2\,\kappa\,(1+\beta)\,k\,\beta^{k-1}}{\lambda},$$
while the second part satisfies
$$\beta^{k-1}\left|\sqrt{\tilde{H}_{ii}\tilde{H}_{i+k,i+k}} - \sqrt{H_{ii}H_{i+k,i+k}}\right| \le \frac{\beta^{k-1}\,\kappa\, G_\infty^2}{\lambda},$$
so that
$$\big|(\tilde{Y}^{-1} - Y^{-1})_{i,i+k}\big| \le \frac{G_\infty^2\,\kappa\,\big((1+\beta)k + 1\big)\,\beta^{k-1}}{\lambda} \le \frac{G_\infty^2\,\kappa\,(k\beta + k + 2)\,\beta^{k-1}}{\lambda}. \qquad \Box$$

Lemma A.9 ($O(\sqrt{T})$ upper bound of $T_2$). Suppose $\kappa(\mathrm{diag}(H_t)) \le \kappa$, $\|w_t - w^*\|_2 \le D_2$, and $\max_{i,j}|(H_t)_{ij}|/\sqrt{(H_t)_{ii}(H_t)_{jj}} \le \beta < 1$ for all $t \in [T]$ in Algorithm 1. Then $T_2$ from Appendix A.2.1 satisfies
$$T_2 = O\!\left(\frac{\sqrt{T}}{\eta\,\hat\epsilon^5}\; G_\infty\, D_2^2\right),$$
where $\lambda_t = G_\infty\sqrt{t}$, $\epsilon = \hat\epsilon\, G_\infty\sqrt{T}$ in Algorithm 1, and $\hat\epsilon \le 1$ is a constant.

Proof. Note that
$$T_2 = \frac{1}{2\eta}\sum_{t=1}^{T-1}(w_{t+1}-w^*)^T\big(X_{t+1}^{-1}-X_t^{-1}\big)(w_{t+1}-w^*) \le \frac{1}{2\eta}\sum_{t=1}^{T-1} D_2^2\,\big\|X_{t+1}^{-1}-X_t^{-1}\big\|_2.$$
Using $\|A\|_2 = \rho(A) \le \|A\|_\infty$ for symmetric matrices $A$,
$$\big\|X_{t+1}^{-1}-X_t^{-1}\big\|_2 \le \big\|X_{t+1}^{-1}-X_t^{-1}\big\|_\infty = \max_i\sum_j\big|\big(X_{t+1}^{-1}-X_t^{-1}\big)_{ij}\big| \le \frac{9\,G_\infty\,\kappa}{\sqrt{t+1}\,(1-\beta)^2},$$
where the last inequality follows from Lemma A.8 with $\lambda_{t+1} = G_\infty\sqrt{t+1}$, using $\sum_{k\ge1}(k\beta+k+2)\beta^{k-1} \le 4/(1-\beta)^2$ together with $\big|\big(X_{t+1}^{-1}-X_t^{-1}\big)_{ii}\big| \le G_\infty^2/\lambda_{t+1}$. Now using $\kappa \le 2/\hat\epsilon$ (from Equation (29) and $(H_t)_{ii} \ge \epsilon$), Lemma A.4, and $\sum_{t=1}^{T-1} 1/\sqrt{t+1} \le 2\sqrt{T}$, summing the terms in $T_2$ gives the result. □
Putting together $T_1$, $T_2$ and $T_3$ from Lemma A.5, Lemma A.9 and Lemma A.6 respectively, with $\epsilon$ and $\lambda_t$ defined as in Lemma A.9,
$$R_T \le T_1 + T_2 + T_3 = O\!\left(\frac{G_\infty\, D_2^2\,\sqrt{T}}{\eta\,\hat\epsilon^{5}} + \frac{n\, G_\infty\, \eta\, \sqrt{T}}{\hat\epsilon^{3}}\right). \qquad (31)$$
Setting $\eta = \frac{D_2\,\hat\epsilon}{\sqrt{n}}$ gives
$$R_T \le T_1 + T_2 + T_3 = O\!\big(\sqrt{n}\; G_\infty\, D_2\, \sqrt{T}\big), \qquad (32)$$
where the $O(\cdot)$ hides a polynomial dependence on the constant $1/\hat\epsilon$.

A.2.5 Non-convex guarantees

Minimizing smooth non-convex functions $f$ is a complex yet interesting problem. In Agarwal et al. [1], this problem is reduced to online convex optimization, where a sequence of objectives $f_t(w) = f(w) + c\,\|w - w_t\|_2^2$ is minimized. Using this approach, Agarwal et al. [1] established convergence guarantees for reaching a stationary point via regret minimization. Thus non-convex guarantees can be obtained from regret guarantees, which are the main focus of this paper.

A.3 Numerical stability

In this section we conduct a perturbation analysis to derive an upper bound on the end-to-end componentwise condition number (pg. 135, problem 7.11 in [27]) of the tridiagonal explicit solution in Theorem 3.1. In addition, we devise Algorithm 3 to reduce this condition number upper bound for the tridiagonal sparsity structure, and to be robust to inputs $H_t$ that do not satisfy the non-degeneracy condition, namely that every principal submatrix of $H_t$ corresponding to a complete subgraph of $G$ is nonsingular.

Theorem A.10 (Condition number of the tridiagonal LogDet subproblem (11)). Let $H \in S_n^{++}$ be such that $H_{ii} = 1$ for $i \in [n]$. Let $\Delta H$ be a symmetric perturbation such that $\Delta H_{ii} = 0$ for $i \in [n]$ and $H + \Delta H \in S_n^{++}$. Let $P_G(H)$ be the input to (11), where $G$ is a chain graph. Then
$$\kappa_{\ell d} \le \max_{i\in[n-1]} \frac{2}{1-\beta_i^2} =: \hat\kappa_{\ell d}, \qquad (33)$$
where $\beta_i = H_{i,i+1}$ and $\kappa_{\ell d}$ is the componentwise condition number of (11) under the perturbation $\Delta H$.

The tridiagonal LogDet problem with inputs $H$ as in Theorem A.10 has a high condition number when the Schur complements $1 - \beta_i^2 = H_{ii} - H_{i,i+1}^2/H_{i+1,i+1}$ are small, and as a result the preconditioner $X_t$ in SONew (Algorithm 1) has large componentwise relative errors. We develop Algorithm 3 to be robust to degenerate inputs $H$, given that $H_{ii} > 0$. It finds a subgraph $\tilde{G}$ of $G$ for which the non-degeneracy conditions in Theorem 3.2 are satisfied and (14) is well-defined. This is done by removing edges which cause the inverse $H_{I_j I_j}^{-1}$ to be singular or the Schur complement $\big(H_{jj} - H_{I_j j}^T H_{I_j I_j}^{-1} H_{I_j j}\big)$ to be small. In the following theorem we also show that the condition number upper bound of Theorem A.10 decreases in the tridiagonal case. To test the robustness of this method, we conducted an ablation study in Table 5 on the Autoencoder benchmark (from Section 5) in bfloat16, where we observe a noticeable improvement in performance when Algorithm 3 is used.

Theorem A.11 (Numerically stable algorithm). Algorithm 3 finds a subgraph $\tilde{G}$ of $G$ such that the explicit solution (14) for $\tilde{G}$ is well-defined. Furthermore, when $G$ is a tridiagonal/chain graph, the componentwise condition number upper bound in (33) is reduced by using Algorithm 3: $\hat\kappa_{\ell d}^{\tilde{G}} < \hat\kappa_{\ell d}^{G}$, where $\hat\kappa_{\ell d}^{\tilde{G}}$ and $\hat\kappa_{\ell d}^{G}$ are defined as in Theorem A.10 for the graphs $\tilde{G}$ and $G$ respectively.

The proofs of Theorems A.10 and A.11 are given in the following subsections.

Algorithm 3 Numerically stable banded LogDet solution
1: Input: $G$, a tridiagonal or banded graph; $H$, a symmetric matrix in $\mathbb{R}^{n\times n}$ with sparsity structure $G$ and $H_{ii} > 0$; $\gamma$, a tolerance parameter for small Schur complements.
2: Output: a subgraph $\tilde{G}$ of $G$ avoiding the degenerate cases of Lemma A.13, and the preconditioner $\hat{X}$ corresponding to the subgraph.
3: Let $E_i = \{(i, j) : (i, j) \in E_G\}$ be the edges from vertex $i$ to its neighbours in graph $G$.
4: Let $V_i^+ = \{j : i < j, (i, j) \in E_G\}$ and $V_i^- = \{j : i > j, (i, j) \in E_G\}$ denote the positive and negative neighbourhoods of vertex $i$.
5: Let $K = \big\{i : H_{ii} - H_{I_i i}^T H_{I_i I_i}^{-1} H_{I_i i}\ \text{is undefined or}\ \le \gamma\big\}$.
6: Consider a new subgraph $\tilde{G}$ with edges $E_{\tilde{G}} = E_G \setminus \big(\bigcup_{i\in K} E_i\big)$, where $E_i$ are the edges from $i$ to its neighbourhood $V_i^+ \cup V_i^-$.
7: return $\hat{X} := \mathrm{SPARSIFIED\_INVERSE}(\tilde{H}_t, \tilde{G})$, where $\tilde{H}_t = P_{\tilde{G}}(H_t)$.
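Below is a minimal sketch of Algorithm 3 restricted to the tridiagonal case (our simplification: it prunes only the forward edge of each flagged vertex and inlines the explicit solution (25) in place of SPARSIFIED_INVERSE; it is not the full banded routine).

```python
# Tridiagonal-only sketch of Algorithm 3: prune near-degenerate chain edges,
# then apply the explicit solution of Eq. (25) on the pruned graph.
import numpy as np

def stable_tridiag_preconditioner(H, gamma=1e-6):
    n = H.shape[0]
    d = np.diag(H).astype(float)
    # keep[i] is True if the 2x2 Schur complement of edge (i, i+1) exceeds gamma
    keep = np.array([d[i] - H[i, i + 1] ** 2 / d[i + 1] > gamma
                     for i in range(n - 1)])
    X = np.diag(1.0 / d)
    for i in np.flatnonzero(keep):
        denom = d[i] * d[i + 1] - H[i, i + 1] ** 2
        X[i, i + 1] = X[i + 1, i] = -H[i, i + 1] / denom
        X[i, i] += H[i, i + 1] ** 2 / (d[i] * denom)
        X[i + 1, i + 1] += H[i, i + 1] ** 2 / (d[i + 1] * denom)
    return X

# Degenerate input from Lemma A.13, Case 1: two identical gradient coordinates
# make the corresponding 2x2 Schur complement exactly zero (with epsilon = 0).
grads = np.array([[1.0, 1.0, 0.3],
                  [2.0, 2.0, -0.5]])      # coordinates 0 and 1 are identical
H = grads.T @ grads
X = stable_tridiag_preconditioner(H)
assert X[0, 1] == 0.0 and np.all(np.isfinite(X))   # degenerate edge was pruned
```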
A.3.1 Condition number analysis

Theorem A.12 (Full version of Theorem A.10). Let $H \in S_n^{++}$ with $H_{ii} = 1$ for $i \in [n]$, and let $\Delta H$ be a symmetric perturbation with $\Delta H_{ii} = 0$ for $i \in [n]$ and $H + \Delta H \succ 0$. Let $\hat{X} = \arg\min_{X\in S_n(G)^{++}} D_{\ell d}\big(X, H^{-1}\big)$ and $\hat{X} + \Delta\hat{X} = \arg\min_{X\in S_n(G)^{++}} D_{\ell d}\big(X, (H+\Delta H)^{-1}\big)$, where $G$ is the chain/tridiagonal sparsity graph and $S_n(G)^{++}$ denotes positive definite matrices following the sparsity pattern $G$. Then
$$\kappa_{\ell d} = \lim_{\epsilon\to 0}\;\sup\left\{ \frac{|\Delta\hat{X}_{ij}|}{\epsilon\,|\hat{X}_{ij}|} \;:\; |\Delta H_{k,l}| \le \epsilon\,|H_{k,l}|,\ (k,l)\in E_G \right\} \le \max_{i\in[n-1]} \frac{2}{1-\beta_i^2},$$
where $\kappa_{\ell d}$ is the componentwise condition number of the LogDet subproblem, $\kappa_2(\cdot)$ denotes the condition number of a matrix in the $\ell_2$ norm, and $\beta_i = H_{i,i+1}/\sqrt{H_{ii}H_{i+1,i+1}}$.

Proof. Consider the off-diagonal entries, for which $(\hat{X} + \Delta\hat{X})_{i,i+1} = -(H+\Delta H)_{i,i+1}/\big(1 - (H+\Delta H)_{i,i+1}^2\big) = -f\big((H+\Delta H)_{i,i+1}\big)$, where $f(x) = x/(1 - x^2)$. Let $y = f(x)$, $\hat{y} = f(x + \Delta x)$ with $|\Delta x/x| \le \epsilon$. A Taylor expansion gives
$$\left|\frac{\hat{y} - y}{y}\right| \le \epsilon\,\frac{1 + x^2}{1 - x^2} + O(\epsilon^2).$$
Applying this with $x := H_{i,i+1}$ and $y := \hat{X}_{i,i+1}$,
$$\lim_{\epsilon\to0}\frac{|\Delta \hat{X}_{i,i+1}|}{\epsilon\,|\hat{X}_{i,i+1}|} \le \frac{1 + H_{i,i+1}^2}{1 - H_{i,i+1}^2} \le \frac{2}{1 - H_{i,i+1}^2}. \qquad (34)$$
For the diagonal entries, let $g(x) = x^2/(1-x^2)$, and let $y_1 = g(x_1)$, $y_2 = g(x_2)$, $\hat{y}_1 = g(x_1+\Delta x_1)$, $\hat{y}_2 = g(x_2 + \Delta x_2)$ with $|\Delta x_m/x_m| \le \epsilon$. A Taylor expansion gives
$$\lim_{\epsilon\to0}\frac{\big|(\hat{y}_1 - y_1) + (\hat{y}_2 - y_2)\big|}{\epsilon\,(1 + y_1 + y_2)} \le \max\left\{\frac{2}{1 - x_1^2},\ \frac{2}{1-x_2^2}\right\}.$$
Setting $x_1 := H_{i,i+1}$ and $x_2 := H_{i,i-1}$, so that $\hat{X}_{ii} = 1 + H_{i,i+1}^2/(1-H_{i,i+1}^2) + H_{i,i-1}^2/(1-H_{i,i-1}^2) = 1 + y_1 + y_2$, we obtain
$$\lim_{\epsilon\to0}\frac{|\Delta\hat{X}_{ii}|}{\epsilon\,|\hat{X}_{ii}|} \le \max\left\{\frac{2}{1-H_{i,i+1}^2},\ \frac{2}{1-H_{i,i-1}^2}\right\}. \qquad (35)$$
Putting together Equation (35) and Equation (34), the theorem is proved. □

A.3.2 Degenerate $H_t$

In SONew (Algorithm 1), the statistics $H_t = P_G\big(\sum_{s=1}^{t} g_s g_s^T/\lambda_s\big)$ generated in line 4 can be such that the matrix $\sum_{s=1}^{t} g_s g_s^T/\lambda_s$ is not positive definite, so the Schur complements $H_{ii} - H_{i,i+1}^2/H_{i+1,i+1}$ can be zero, giving an infinite condition number $\kappa_{\ell d}$ by Theorem A.10. The following lemma describes such cases in detail for the more general banded sparsity structure.

Lemma A.13 (Degenerate inputs to the banded LogDet subproblem). Let $H = P_G(GG^T)$, with $\epsilon = 0$ in Algorithm 1, where $G \in \mathbb{R}^{n\times T}$ and $g^{(i)}_{1:T}$ denotes the $i$th row of $G$, i.e., the gradients of parameter $i$ over $T$ rounds; then $H_{ij} = \big\langle g^{(i)}_{1:T}, g^{(j)}_{1:T}\big\rangle$.
Case 1: For the tridiagonal sparsity structure $G$: if $g^{(j)}_{1:T} = g^{(j+1)}_{1:T}$, then $H_{jj} - H_{j,j+1}^2/H_{j+1,j+1} = 0$.
Case 2: For $b > 1$ in (14): if $\mathrm{rank}(H_{J_j J_j}) = \mathrm{rank}(H_{I_j I_j}) = b$, then $H_{jj} - H_{I_j j}^T H_{I_j I_j}^{-1} H_{I_j j} = 0$ and $D_{jj}$ diverges. If $\mathrm{rank}(H_{I_j I_j}) < b$, then the inverse $H_{I_j I_j}^{-1}$ does not exist and $D_{jj}$ is not well-defined.

Proof. For $b = 1$, if $g^{(j)}_{1:T} = g^{(j+1)}_{1:T}$, then $H_{j,j+1} = H_{jj} = H_{j+1,j+1} = \|g^{(j)}_{1:T}\|_2^2$, and thus $H_{jj} - H_{j,j+1}^2/H_{j+1,j+1} = 0$. For $b > 1$, since $\mathrm{rank}(H_{J_j J_j}) = \mathrm{rank}(H_{I_j I_j})$, the Guttman rank additivity formula gives $\mathrm{rank}\big(H_{jj} - H_{I_j j}^T H_{I_j I_j}^{-1} H_{I_j j}\big) = \mathrm{rank}(H_{J_j J_j}) - \mathrm{rank}(H_{I_j I_j}) = 0$, thus $H_{jj} - H_{I_j j}^T H_{I_j I_j}^{-1} H_{I_j j} = 0$. Furthermore, if $\mathrm{rank}(H) \le b$, then all $(b+1)\times(b+1)$ principal submatrices of $H$, and in particular every $H_{J_j J_j}$, have rank at most $b$, so $D_{jj}$ is either zero or undefined for every $j$. □

If $GG^T = \sum_{i=1}^{T} g_i g_i^T$ is a singular matrix, the solution to the LogDet problem might not be well-defined, as shown in Lemma A.13. For instance, Case 1 can occur when preconditioning the input layer of an image-based DNN with flattened image inputs, where the $j$th and $(j+1)$th pixels can be highly correlated throughout the dataset. Case 2 can occur in the first $b$ iterations of Algorithm 1, when the rank of the submatrices $H_{I_j I_j}$ is less than $b$ and $\epsilon = 0$.
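A quick numerical check (NumPy, ours) ties Theorem A.10 to the degenerate limit of Lemma A.13: with unit diagonal the off-diagonal solution entry is $-\beta/(1-\beta^2)$, and the relative amplification of an input perturbation approaches the $2/(1-\beta_i^2)$ bound as $\beta \to 1$.

```python
# Sensitivity of X_{i,i+1} = -beta/(1 - beta^2) to a relative perturbation of beta,
# compared against the componentwise bound 2/(1 - beta^2) of Theorem A.10.
import numpy as np

def offdiag_entry(beta):
    return -beta / (1.0 - beta ** 2)   # closed form from Eq. (25) with H_ii = 1

eps = 1e-8
for beta in (0.5, 0.9, 0.99, 0.999):
    rel_out = abs(offdiag_entry(beta * (1 + eps)) / offdiag_entry(beta) - 1)
    amp = rel_out / eps                # empirical amplification factor
    print(f"beta={beta:6.3f}  amplification~{amp:10.1f}  bound={2/(1-beta**2):10.1f}")
```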
Table 3: float32 experiments on the Autoencoder benchmark using different band sizes. Band size 0 corresponds to diag-SONew and 1 to tridiag-SONew. The training loss improves as the band size increases.

Band size     | 0 (diag-SONew) | 1 (tridiag-SONew) | 4      | 10
Train CE loss | 53.025         | 51.723            | 51.357 | 51.226

A.3.3 Numerically stable SONew: proof

Proof of Theorem A.11. Let $I_i = \{j : i < j, (i,j)\in E_G\}$ and $\tilde{I}_i = \{j : i < j, (i,j)\in E_{\tilde{G}}\}$. Let $K = \{i \in [n] : H_{ii} - H_{I_i i}^T H_{I_i I_i}^{-1}H_{I_i i}\ \text{is undefined or}\ \le 0\}$ denote the vertices removed by the algorithm; then for the new graph $\tilde{G}$, $D_{ii} = 1/H_{ii}$ for $i \in K$, which is well-defined since $H_{ii} > 0$. Let $\bar{K} = \{i \in [n] : H_{ii} - H_{I_i i}^T H_{I_i I_i}^{-1}H_{I_i i} > 0\}$. For some $j \in \bar{K}$, let $l = \arg\min\{i : j < i,\ i \in K \cap I_j\}$ denote the nearest connected vertex higher than $j$ for which $D_{ll}$ is undefined or zero; then, by the definition of $E_{\tilde{G}}$ in Algorithm 3, $\tilde{I}_j = \{j+1, \ldots, l-1\} \subseteq I_j$. Since $D_{jj}$ is well-defined, $H_{I_j I_j}$ is invertible, which makes it positive definite (since $H$ is PSD). Since $H_{jj} - H_{I_j j}^T H_{I_j I_j}^{-1}H_{I_j j} > 0$, the Guttman rank additivity formula gives $H_{J_j J_j} \succ 0$, where $J_j = I_j \cup \{j\}$. Since $H_{\tilde{J}_j \tilde{J}_j}$ is a principal submatrix of $H_{J_j J_j}$, it is positive definite, and hence its Schur complement $H_{jj} - H_{\tilde{I}_j j}^T H_{\tilde{I}_j\tilde{I}_j}^{-1}H_{\tilde{I}_j j} > 0$. Thus for all $j\in[n]$, the corresponding $D_{jj}$ are well-defined in the new graph $\tilde{G}$. Finally, for the tridiagonal graph with $H_{ii} = 1$ and $\beta_i = H_{i,i+1}$, note that $\hat\kappa^{\tilde{G}}_{\ell d} = \max_{i\in[n-1]\setminus K} 2/(1-\beta_i^2) < \max_{i\in[n-1]} 2/(1-\beta_i^2) = \hat\kappa^{G}_{\ell d}$, because $\arg\max_{i\in[n-1]} 1/(1-\beta_i^2) \in K$. □

A.4 Additional experiments, ablations, and details

A.4.1 Ablations

Effect of band size in banded-SONew. Increasing the band size in banded-SONew captures more correlations between parameters, and hence should be expected to yield better preconditioners. We confirm this through experiments on the Autoencoder benchmark with band sizes 0 (diag-SONew), 1 (tridiag-SONew), 4, and 10 in Table 3.

Effect of mini-batch size. To study the effect of mini-batch size, in Table 4 we empirically compare SONew with state-of-the-art first-order methods such as Adam and RMSProp, and with the second-order method Shampoo. SONew's performance does not deteriorate much with smaller or larger batch sizes, whereas first-order methods suffer significantly. We also notice that Shampoo does not perform better than SONew in these regimes.

Table 4: Comparison on the Autoencoder benchmark with different batch sizes
Baseline \ Batch size | 100   | 1000  | 5000  | 10000
RMSProp               | 55.61 | 53.33 | 58.69 | 64.91
Adam                  | 55.67 | 54.39 | 58.93 | 65.37
Shampoo(20)           | 53.91 | 50.70 | 53.52 | 54.90
tds                   | 53.84 | 51.72 | 54.24 | 55.87
bds-4                 | 53.52 | 51.35 | 53.03 | 54.89

Effect of the numerical-stability Algorithm 3. On tridiag-SONew and banded-4-SONew, we observe that using Algorithm 3 improves the training loss. Table 5 presents the settings where we observed significant performance improvements.

Table 5: bfloat16 experiments on the Autoencoder benchmark with and without Algorithm 3. We observe an improvement in training loss when using Algorithm 3.
Optimizer     | Train CE loss - without Algorithm 3 | Train CE loss - with Algorithm 3
tridiag-SONew | 53.150                              | 51.936
band-4-SONew  | 51.950                              | 51.84
Table 6: A rough estimate of the memory requirements of the optimizers tested across benchmarks.
Benchmark          | # model parameters | K-FAC | Shampoo | FishLeg | Eva | Adam | SGD+Momentum | RMSprop | tds-SONew
Autoencoder        | n = 1.4M           | 5.56n | 6.56n   | 4.28n   | n   | 2n   | n            | n       | 3n
Graph Network      | n = 3.5M           | 8.6n  | 10.6n   | 4.8n    | n   | 2n   | n            | n       | 3n
Vision Transformer | n = 22M            | 6.4n  | 7.2n    | 3.7n    | n   | 2n   | n            | n       | n
Language Model     | n = 1.3B           | 5.6n  | 6.6n    | 3.3n    | n   | 2n   | n            | n       | 3n

A.4.2 Memory requirements

We present approximate memory requirements of different optimizers across the benchmarks in Table 6. Note that for K-FAC and Shampoo, because the preconditioner is updated only once every few steps, the latest computed preconditioners must be stored along with the statistics, causing even higher memory overhead.

A.4.3 Hyperparameter search space

We provide the hyperparameter search space for the experiments presented in Section 5. We search over 2k hyperparameter configurations for each Autoencoder experiment using a Bayesian optimization package. The search ranges are: first-order momentum term β1 ∈ [1e-1, 0.999], second-order momentum term β2 ∈ [1e-1, 0.999], learning rate ∈ [1e-7, 1e-1], and ϵ ∈ [1e-10, 1e-1]. We give the optimal hyperparameter values for each experiment in Table 12. For the ViT and Graph Network benchmarks, we search β1, β2 ∈ [0.1, 0.999], learning rate ∈ [1e-5, 1e-1], ϵ ∈ [1e-9, 1e-4], weight decay ∈ [1e-5, 1.0], learning rate warmup ∈ {2%, 5%, 10%} of total_train_steps, dropout ∈ {0.0, 0.1}, and label smoothing over {0.0, 0.1, 0.2}. We use a cosine learning rate schedule. The batch size was kept at 1024 and 512 for the Vision Transformer and Graph Network respectively. We sweep over 200 hyperparameter configurations in the search space for all optimizers. For rfdSON [37], there is no ϵ hyperparameter; in addition to the remaining hyperparameters, we tune α ∈ [1e-5, 1.0] (which plays a similar role to ϵ) and μt ∈ [1e-5, 0.1]. For the LLM [47] benchmark, we only tune the learning rate ∈ {1e-2, 1e-3, 1e-4} while keeping the rest of the hyperparameters constant; this is due to the high cost of running these experiments, hence we tune only the most important hyperparameter. For Adafactor [45], we use factored=False, decay_method=adam, β1 = 0.9, weight decay = 1e-3, decay factor = 0.99, and gradient clipping = 1.0.
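One compact way to encode these sweeps, e.g. for a Vizier/Optuna-style tuner, is sketched below; the dictionary keys and the log/linear/choice scale annotations are our own assumptions, not the paper's actual tuning configuration.

```python
# Illustrative encoding of the A.4.3 search spaces as plain dictionaries.
autoencoder_space = {
    "beta1":         ("linear", 0.1, 0.999),
    "beta2":         ("linear", 0.1, 0.999),
    "learning_rate": ("log",    1e-7, 1e-1),
    "epsilon":       ("log",    1e-10, 1e-1),
}

vit_graphnet_space = {
    "beta1":           ("linear", 0.1, 0.999),
    "beta2":           ("linear", 0.1, 0.999),
    "learning_rate":   ("log",    1e-5, 1e-1),
    "epsilon":         ("log",    1e-9, 1e-4),
    "weight_decay":    ("log",    1e-5, 1.0),
    "warmup_fraction": ("choice", [0.02, 0.05, 0.10]),   # fraction of total_train_steps
    "dropout":         ("choice", [0.0, 0.1]),
    "label_smoothing": ("choice", [0.0, 0.1, 0.2]),
}

llm_space = {"learning_rate": ("choice", [1e-2, 1e-3, 1e-4])}  # rest held fixed
```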
A.4.4 Additional experiments

ViT and Graph Network benchmarks: In Figure 5 we plot the training loss curves of the runs corresponding to the best validation runs in Figure 1. Furthermore, from an optimization point of view, we plot the best training loss runs in Figure 6, obtained by searching over 200 hyperparameter configurations. We find that tridiag-SONew is 9% and 80% relatively better on the ViT and Graph Network benchmarks respectively (Figure 6), compared to Adam (the next best memory-efficient baseline).

Figure 5: Train loss corresponding to the best validation runs in Figure 1. (a) ViT benchmark, (b) Graph Network benchmark. We observe that tridiag-SONew matches or performs better than Adam.

Figure 6: Best train loss achieved during hyperparameter tuning. (a) ViT benchmark, (b) Graph Network benchmark. We observe that tridiag-SONew significantly outperforms Adam, while being comparable to or better than Shampoo.

Autoencoder float32 and bfloat16 experiments: We provide curves for all baselines and SONew in Figure 4(a), with the corresponding numbers in Table 7 for the float32 experiments. To test the numerical stability of SONew and compare it with the other algorithms in the low-precision regime, we also conduct bfloat16 experiments on the Autoencoder benchmark (Table 8). We notice that SONew undergoes the least degradation: tridiag-SONew's CE loss increases by only 0.21 in absolute terms (from 51.72 in float32, Table 7, to 51.93), whereas Shampoo and Adam incur a 0.70 increase. It is worth noting that SONew performs better than all first-order methods while taking similar time and linear memory, whereas Shampoo, while performing marginally better, is 22× slower than tridiag-SONew. The corresponding loss curves are given in Figure 4(b).

Note: In the main paper, our reported numbers for rfdSON on the Autoencoder benchmark in Table 2 for the float32 experiments are erroneous. Please consider the numbers provided in Table 7 and the corresponding curve in Figure 4(a). There is no qualitative change in the results and none of the claims made in the paper are affected; SONew is still significantly better than rfdSON. We also meticulously checked all other experiments, and they do not have any errors.

Table 7: float32 experiments on the Autoencoder benchmark. We observe that diag-SONew performs the best among all first-order methods while taking similar time. tridiag and band-4 perform significantly better than first-order methods while requiring similar linear space and time. Shampoo performs best but takes O(d1³ + d2³) time to compute the preconditioner of a linear layer of size d1 × d2, whereas our methods take O(d1 d2) time, as mentioned in Section 3.3. rfdSON takes similar space to SONew but performs considerably worse.

First-order methods  | SGD    | Nesterov | Adagrad | Momentum | RMSProp | Adam   | diag-SONew
Train CE loss        | 67.654 | 59.087   | 54.393  | 58.651   | 53.330  | 53.591 | 53.025
Time (s)             | 62     | 102      | 62      | 67       | 62      | 62     | 63

Second-order methods | Shampoo(20) | rfdSON(1) | rfdSON(4) | tridiag-SONew | band-4-SONew
Train CE loss        | 50.702      | 53.56     | 52.97     | 51.723        | 51.357
Time (s)             | 371         | 85        | 300       | 70            | 260

Table 8: bfloat16 experiments on the Autoencoder benchmark to test the numerical stability of SONew and the robustness of Algorithm 3. We notice that diag-SONew degrades only marginally (0.26 absolute difference) compared to its float32 performance; tridiag-SONew and band-4-SONew show similar behaviour. Shampoo performs the best but has a considerable drop (0.70) compared to float32 due to its use of matrix inverses, and is slower due to its cubic time complexity for computing preconditioners. The Shampoo implementation uses 16-bit quantization to work in the 16-bit setting, leading to a further slowdown; hence its running time in bfloat16 is even higher than in float32.

First-order methods  | SGD    | Nesterov | Adagrad | Momentum | RMSProp | Adam   | diag-SONew
Train CE loss        | 80.454 | 72.975   | 68.854  | 70.053   | 53.743  | 54.328 | 53.29
Train time (s)       | 36     | 43       | 37      | 36       | 37      | 38     | 44

Second-order methods | Shampoo(20) | rfdSON(1) | rfdSON(4) | tridiag-SONew | band-4-SONew
Train CE loss        | 51.401      | 57.42     | 55.53     | 51.937        | 51.84
Train time (s)       | 1245        | 80        | 284       | 55            | 230

Figure 4: Training curves of all baselines on the Autoencoder benchmark. (a) float32 training, (b) bfloat16 training.

Autoencoder with KFAC, FishLeg, Eva: For completeness, we compare SONew against KFAC [38], FishLeg [20], and Eva [52] on the Autoencoder benchmark as used in their official implementations. The main difference is that their implementations use ReLU activations, compared to the Tanh we used for all our other Autoencoder experiments.
As these baselines do not have JAX implementations, we use their official PyTorch implementations and run tridiag-SONew in PyTorch as well. The hyperparameter search for SONew is conducted as described above, over the learning rate, β1, β2, and ϵ. For KFAC and Eva, rather than β2, the damping factor is tuned over [1e-5, 10] (the specified default is 0.03); kl_clip is tuned over [1e-5, 1.0] as well. Preconditioners are updated once every 15 iterations so as to match the wall-clock time of the other baselines and SONew. For FishLeg, the auxiliary learning rate is tuned over [1e-7, 1e-1] and the damping over [1e-5, 1.0]; all other hyperparameters are tuned as for SONew. Eva is trained for 100 epochs, and for the other methods we change the number of epochs so that each experiment takes the same amount of time. Each optimizer is tuned using 600 hyperparameter trials. The results are shown in Figure 7, where we notice that tridiag-SONew beats all the baselines by a large margin.

Figure 7: Autoencoder benchmark run in PyTorch with KFAC, FishLeg, Eva, and tridiag-SONew. We notice that tridiag-SONew beats all other baselines by a large margin.

Adam vs SONew on a 248M language model: We conduct an additional experiment on a 248M-parameter transformer [50] language model. The model is trained on WikiText-103 [39] for 3 epochs, with 8M tokens per epoch and a batch size of 8k tokens. We search over 48 hyperparameter configurations, tuning the learning rate ∈ {2e-2, 1e-2, 5e-3, 1e-3}, β2 ∈ {0.99, 0.999}, weight decay ∈ {0.0, 0.1}, and ϵ ∈ {1e-10, 1e-8, 1e-6}, while fixing β1 = 0.9. Validation and training losses are given in Figure 8. Our observations indicate mixed results: while tds-SONew exhibits superior validation performance, Adam performs better on the training metrics. We believe that 48 trials might have been insufficient for optimal tuning; it is conceivable that with further trials, SONew's training loss could surpass that of Adam. We leave this line of investigation for future work.

Figure 8: 248M-parameter language model benchmark. (a) Validation CE loss, (b) Train CE loss. We observe mixed results, possibly because 48 trials were insufficient for optimal tuning; we leave thorough tuning and investigation of this observation to future work.

A.4.5 Convex experiments

As our regret bound applies to convex optimization, we compare SONew to rfdSON [37], another recent memory-efficient second-order Newton method. We follow [37] for the experimental setup: each dataset is split randomly into 70%/30% train and test sets, and the mean squared loss is used. For tridiag-SONew, we use a total of 2d space for d parameters; hence, for a fair comparison we show rfdSON with m = 2. Since the code is not open-sourced, we implemented it ourselves; to show reproducibility with respect to the numbers reported in [37], we include results with m = 5 as well. We see in Table 9 that tridiag-SONew consistently matches or outperforms rfdSON across all three benchmarks. Each experiment was run for 20 epochs, and we report the best model's performance on the test set.
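To make the convex setup concrete, here is a self-contained NumPy sketch (ours, not the released SONew code or the rfdSON implementation) of tridiag-SONew on a synthetic least-squares stream. The λ_t and ϵ schedules are simplified relative to Algorithm 1, and the data is synthetic rather than a9a/gisette/mnist; the parameter error should shrink markedly over the run.

```python
# Tridiag-SONew on a streaming least-squares objective sum_t (y_t - w^T x_t)^2.
import numpy as np

def tridiag_solution(H):
    """Explicit chain-graph LogDet solution, Eq. (25)."""
    n = H.shape[0]
    d = np.diag(H)
    X = np.diag(1.0 / d)
    for i in range(n - 1):
        denom = d[i] * d[i + 1] - H[i, i + 1] ** 2
        X[i, i + 1] = X[i + 1, i] = -H[i, i + 1] / denom
        X[i, i] += H[i, i + 1] ** 2 / (d[i] * denom)
        X[i + 1, i + 1] += H[i, i + 1] ** 2 / (d[i + 1] * denom)
    return X

rng = np.random.default_rng(0)
n, T, eta = 20, 2000, 0.1
w_star = rng.normal(size=n)
w = np.zeros(n)
H = 1.0 * np.eye(n)                        # epsilon * I_n initialization (simplified)
for t in range(1, T + 1):
    x = rng.normal(size=n)
    g = 2.0 * (w @ x - w_star @ x) * x     # gradient of (y_t - w^T x_t)^2
    H += np.outer(g, g) / np.sqrt(t)       # lambda_t ~ sqrt(t); only chain entries matter
    w -= eta * tridiag_solution(H) @ g
print("final parameter error:", float(np.linalg.norm(w - w_star)))
```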
Table 9: Comparison of rfdSON and tridiag-SONew in the convex setting on three datasets. We optimize the least-squares loss Σ_t (y_t − w^T x_t)², where w is the learnable parameter and (x_t, y_t) is the t-th training point. Reported numbers are accuracies on the test set.

Table 10: (a) Dataset statistics
Dataset | # total points | dimension
a9a     | 32,561         | 123
gisette | 6,000          | 5,000
mnist   | 11,791         | 780

Table 11: (b) RFD-SON vs tridiag-SONew (test accuracy)
Dataset | RFD-SON, m=2 | RFD-SON, m=5 | tridiag-SONew
a9a     | 83.3         | 83.6         | 84.6
gisette | 96.1         | 96.2         | 96.6
mnist   | 93.2         | 94.5         | 96.5

Table 12: Optimal hyperparameters for the Autoencoder benchmark

Table 13: (a) float32 experiments: optimal hyperparameters
Baseline   | β1    | β2    | ϵ        | lr
SGD        | 0.99  | 0.91  | 8.37e-9  | 1.17e-2
Nesterov   | 0.914 | 0.90  | 3.88e-10 | 5.74e-3
Adagrad    | 0.95  | 0.90  | 9.96e-7  | 1.82e-2
Momentum   | 0.9   | 0.99  | 1e-5     | 6.89e-3
RMSProp    | 0.9   | 0.9   | 1e-10    | 4.61e-4
Adam       | 0.9   | 0.94  | 1.65e-6  | 3.75e-3
Diag-SONew | 0.88  | 0.95  | 4.63e-6  | 1.18e-3
Shampoo    | 0.9   | 0.95  | 9.6e-9   | 3.70e-3
tridiag    | 0.9   | 0.96  | 1.3e-6   | 8.60e-3
band-4     | 0.88  | 0.95  | 1.5e-3   | 5.53e-3

Table 14: (b) bfloat16 experiments: optimal hyperparameters
Baseline   | β1    | β2    | ϵ        | lr
SGD        | 0.96  | 0.98  | 2.80e-2  | 1.35e-2
Nesterov   | 0.914 | 0.945 | 8.48e-9  | 6.19e-3
Adagrad    | 0.95  | 0.93  | 2.44e-5  | 2.53e-2
Momentum   | 0.9   | 0.99  | 0.1      | 7.77e-3
RMSProp    | 0.9   | 0.9   | 2.53e-10 | 4.83e-4
Adam       | 0.9   | 0.94  | 3.03e-10 | 3.45e-3
Diag-SONew | 0.9   | 0.95  | 4.07e-6  | 8.50e-3
Shampoo    | 0.85  | 0.806 | 6.58e-4  | 5.03e-3
tridiag    | 0.83  | 0.954 | 1.78e-6  | 7.83e-3
band-4     | 0.9   | 0.96  | 1.52e-6  | 4.53e-3