# Dimension-Free Adaptive Subgradient Methods with Frequent Directions

Sifan Yang *1,2, Yuanyu Wan *3,4, Peijia Li 1,2, Yibo Wang 1,2, Xiao Zhang 5, Zhewei Wei 5, Lijun Zhang 1,6,2

In this paper, we investigate the acceleration of adaptive subgradient methods through frequent directions (FD), a widely-used matrix sketching technique. The state-of-the-art regret bound exhibits a linear dependence on the dimensionality d, leading to unsatisfactory guarantees for high-dimensional problems. Additionally, it suffers from an O(τ^2 d) time complexity per round, which scales quadratically with the sketching size τ. To overcome these issues, we first propose an algorithm named FTSL, achieving a tighter regret bound that is independent of the dimensionality. The key idea is to integrate FD with adaptive subgradient methods under the primal-dual framework and to add the cumulative discarded information of FD back. To reduce its time complexity, we further utilize fast FD to expedite FTSL, yielding a better complexity of O(τd) while maintaining the same regret bound. Moreover, to mitigate the computational cost for optimization problems involving matrix variables (e.g., training neural networks), we adapt FD to Shampoo, a popular optimization algorithm that accounts for the structure of the decision variable, and give a novel analysis under the primal-dual framework. Our proposed method obtains an improved dimension-free regret bound. Experimental results verify the efficiency and effectiveness of our approaches.
*Equal contribution. 1National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China. 2School of Artificial Intelligence, Nanjing University, Nanjing, China. 3School of Software Technology, Zhejiang University, Ningbo, China. 4Hangzhou High Tech Zone (Binjiang) Institute of Blockchain and Data Security, Hangzhou, China. 5Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China. 6Pazhou Laboratory (Huangpu), Guangzhou, China. Correspondence to: Lijun Zhang . Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

1. Introduction

Adaptive subgradient methods have attracted considerable research interest in past decades, as they simplify the learning rate selection while ensuring regret bounds comparable to those obtained through manual tuning (Duchi et al., 2010a; Hazan & Koren, 2012; Agarwal et al., 2019). The pioneering work of Duchi et al. (2011) introduces adaptive subgradient methods with full matrices (ADA-FULL) within both the primal-dual subgradient framework (Xiao, 2009) and the mirror descent framework (Duchi et al., 2010b). ADA-FULL achieves a regret bound of O(tr(G_T^{1/2})), where G_T is the sum of gradient outer products over T rounds, and this regret bound is better than that of non-adaptive methods when gradients are sparse. However, ADA-FULL requires maintaining a preconditioning matrix to store the past gradient outer products and computing the inverse of this preconditioning matrix, resulting in an O(d^2) space complexity and an O(d^3) time complexity, respectively, where d is the dimensionality. Thus, ADA-FULL is impractical for large-scale machine learning tasks involving high-dimensional data.
To address these limitations, several studies propose adopting frequent directions (FD) (Ghashami et al., 2016) to reduce the computational complexity of ADA-FULL (Wan et al., 2018; Wan & Zhang, 2022; Feinberg et al., 2023). In particular, Wan et al. (2018) first develop an efficient variant of ADA-FULL, namely ADA-FD, by employing FD to approximate the sum of gradient outer products over the past rounds. Let τ ≪ d denote the sketching size and ρ_t denote the discarded eigenvalue of FD in round t. ADA-FD reduces the space and time complexities to O(τd) and O(τ^2 d), while enjoying O(tr(G_T^{1/2}) + ∑_{t=1}^T ρ_t) and O(tr(G_T^{1/2}) + ∑_{t=1}^T τρ_t) regret bounds under the primal-dual subgradient framework and the mirror descent framework, respectively. Moreover, by exploiting an accelerated trick for FD, Wan & Zhang (2022) further propose ADA-FFD with O(τd) space and time complexities, while keeping the same regret bounds. Recently, Feinberg et al. (2023) design Sketchy-ADAGRAD (S-ADA) under the mirror descent framework, which achieves a better O(tr(G_T^{1/2}) + √(d(d−τ)ρ_{1:T})) regret bound, where ρ_{1:T} = ∑_{t=1}^T ρ_t. The key idea is to utilize a variant of FD (Chen et al., 2020), which adds back the cumulative dropped eigenvalues (referred to as the escaped mass) to keep the positive definite monotonicity of the preconditioning matrix. However, its regret bound depends on d, leading to unsatisfactory guarantees for high-dimensional problems, and its time complexity is O(τ^2 d), which is worse than that of ADA-FFD. Thus, it is natural to ask whether the regret bound and the time complexity of Feinberg et al. (2023) can be further improved.

Table 1. Comparison of ADA-FULL and its FD-based variants, where ADA-FD (P) and ADA-FD (M) represent ADA-FD under the primal-dual subgradient framework and the mirror descent framework, respectively. We denote by λ_i the i-th eigenvalue of G_T.

| Algorithms | Regret Bounds | Space | Time |
| --- | --- | --- | --- |
| ADA-FULL (Duchi et al., 2011) | O(tr(G_T^{1/2})) | O(d^2) | O(d^3) |
| ADA-FD (P) (Wan et al., 2018) | O(tr(G_T^{1/2}) + ∑_{t=1}^T ρ_t) | O(τd) | O(τ^2 d) |
| ADA-FD (M) (Wan et al., 2018) | O(tr(G_T^{1/2}) + ∑_{t=1}^T τρ_t) | O(τd) | O(τ^2 d) |
| S-ADA (Feinberg et al., 2023) | O(tr(G_T^{1/2}) + √(d(d−τ)ρ_{1:T})) | O(τd) | O(τ^2 d) |
| FTSL (this work) | O(tr(G_T^{1/2}) + ρ_{1:T}) | O(τd) | O(τ^2 d) |

Table 2. Comparison of FFD-based variants of ADA-FULL, where ADA-FFD (P) and ADA-FFD (M) represent ADA-FFD under the primal-dual subgradient framework and the mirror descent framework, respectively.

| Algorithms | Regret Bounds | Space | Time |
| --- | --- | --- | --- |
| ADA-FFD (P) (Wan & Zhang, 2022) | O(tr(G_T^{1/2}) + ∑_{t=1}^T ρ_t) | O(τd) | O(τd) |
| ADA-FFD (M) (Wan & Zhang, 2022) | O(tr(G_T^{1/2}) + ∑_{t=1}^T τρ_t) | O(τd) | O(τd) |
| Fast S-ADA (this work) | O(tr(G_T^{1/2}) + √(d(d−τ)ρ_{1:T})) | O(τd) | O(τd) |
| FTFSL (this work) | O(tr(G_T^{1/2}) + ρ_{1:T}) | O(τd) | O(τd) |

In this paper, we provide an affirmative answer to this question. Specifically, we first develop an algorithm, namely Follow-the-Sketchy-Leader (FTSL), to enhance the existing regret bound. We integrate FD with ADA-FULL under the primal-dual framework and add the cumulative discarded eigenvalues of FD back. FTSL enjoys a tighter dimension-free O(tr(G_T^{1/2}) + ρ_{1:T}) regret bound, while obtaining space and time complexities of O(τd) and O(τ^2 d). Additionally, we propose an accelerated variant of FTSL, named FTFSL, by doubling the sketching size to reduce the number of time-consuming computations. FTFSL preserves the regret bound and space complexity of FTSL, while simultaneously lowering the time complexity to O(τd) when τ = O(√d). Remarkably, we can also improve the time complexity of S-ADA by using this technique, but its regret bound still remains inferior to that of FTFSL. We summarize our results and comparisons with the previous work in Table 1 and Table 2.
Moreover, we investigate optimization problems with matrix variables X_t ∈ R^{m×n}, a scenario commonly encountered in deep learning tasks. In this case, one can apply the aforementioned methods by flattening the gradient G_t^X ∈ R^{m×n} into a vector g_t ∈ R^{mn}, which, however, incurs a memory usage of O(τmn). To improve the memory efficiency, Feinberg et al. (2023) have adapted FD to Shampoo (Gupta et al., 2018), a popular adaptive preconditioning method that accounts for the structure of the parameters. Although Feinberg et al. (2023) reduce the space complexity to O(τ(m+n)), their regret bound again relies on the dimensionalities m, n. To address this issue, we integrate FD with a primal-dual variant of Shampoo and obtain a dimension-free regret bound via a novel analysis. Our approach, termed FTSL-Shampoo, attains an enhanced theoretical guarantee that is independent of the dimensionalities m, n. We contrast FTSL-Shampoo with previous methods in Table 3. Finally, we conduct experiments on online classification and neural network training to validate the superiority of our methods.

2. Related Work

In this section, we briefly review related work on adaptive subgradient methods, their fast variants based on sketching, and Shampoo.

Table 3. Comparison of adaptive subgradient methods for the case where the decision has a matrix structure X_t ∈ R^{m×n}. We denote the gradient by G_t^X ∈ R^{m×n}; r is the largest rank of G_t^X; L_T = εI_{m×m} + ∑_{t=1}^T G_t^X (G_t^X)^⊤; R_T = εI_{n×n} + ∑_{t=1}^T (G_t^X)^⊤ G_t^X; ε is a hyper-parameter; ρ^L_{1:T} and ρ^R_{1:T} represent the sums of the removed eigenvalues in FD during the approximation of ∑_{t=1}^T G_t^X (G_t^X)^⊤ and ∑_{t=1}^T (G_t^X)^⊤ G_t^X, respectively. For ADA-FULL and FTFSL, we define G_T = ∑_{t=1}^T g_t g_t^⊤ ∈ R^{mn×mn}, where g_t = vec(G_t^X) ∈ R^{mn} and vec(·) denotes the row-major vectorization of a matrix.
| Algorithms | Regret Bounds | Space | Time |
| --- | --- | --- | --- |
| ADA-FULL (Duchi et al., 2011) | O(tr(G_T^{1/2})) | O(m^2 n^2) | O(m^3 n^3) |
| FTFSL (this work) | O(tr(G_T^{1/2}) + ρ_{1:T}) | O(τmn) | O(τmn) |
| Shampoo (Gupta et al., 2018) | O(√r · tr(L_T^{1/4}) tr(R_T^{1/4})) | O(m^2 + n^2) | O(m^3 + n^3) |
| S-Shampoo (Feinberg et al., 2023) | O(√r (tr(L_T^{1/4}) + m(ρ^L_{1:T})^{1/4})(tr(R_T^{1/4}) + n(ρ^R_{1:T})^{1/4})) | O(τ(m+n)) | O(τ^2 mn) |
| FTSL-Shampoo (this work) | O(√r (tr(L_T^{1/4}) + (ρ^L_{1:T})^{1/4})(tr(R_T^{1/4}) + (ρ^R_{1:T})^{1/4})) | O(τ(m+n)) | O(τ^2 mn) |

2.1. OCO and Adaptive Subgradient Methods

Online convex optimization (OCO) is a powerful paradigm for solving sequential decision-making problems (Hazan, 2016; Orabona, 2019; Zhang et al., 2018; 2022). Specifically, it is typically formulated as an iterative game between a player and an adversary. In each round t ∈ [T], the player begins by selecting a decision x_t ∈ R^d. After that, the adversary chooses a convex loss function f_t(·): R^d → R, and the player incurs a loss f_t(x_t). The goal of the player is to minimize the cumulative loss ∑_{t=1}^T f_t(x_t) over T rounds, which is equivalent to minimizing the regret (Zinkevich, 2003)

R(T) = ∑_{t=1}^T f_t(x_t) − ∑_{t=1}^T f_t(x^*),   (1)

defined as the excess loss suffered by the player compared to the loss of the fixed optimal choice x^* ∈ argmin_{x∈R^d} ∑_{t=1}^T f_t(x). Although Zinkevich (2003) establishes the optimal regret bound of O(√T), it is data-independent. In the following, we introduce ADAGRAD (Duchi et al., 2010a; 2011), a widely-used adaptive subgradient method, in both the primal-dual subgradient framework (Xiao, 2009) and the mirror descent framework (Duchi et al., 2010b), which achieves a data-dependent regret bound. ADAGRAD can be categorized into two forms based on how the preconditioner G_t is computed: ADAGRAD with full matrices (ADA-FULL) and ADAGRAD with diagonal matrices (ADA-DIAG). We denote by g_t a particular vector in the subdifferential set ∂f_t(x_t). Since we do not require the loss function to be smooth, we do not explicitly distinguish between subgradients and gradients in the subsequent discussion.
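As a concrete illustration of the two preconditioner choices, here is a minimal NumPy sketch; the hyper-parameter values and the toy gradients are illustrative only, and the update rules themselves are stated in the next paragraphs:

```python
import numpy as np

def full_preconditioner(G_sum, eps):
    # ADA-FULL: eps*I + G_t^{1/2}, built from the sum of gradient outer products
    evals, evecs = np.linalg.eigh(G_sum)
    G_half = evecs @ np.diag(np.sqrt(np.maximum(evals, 0.0))) @ evecs.T
    return eps * np.eye(G_sum.shape[0]) + G_half  # O(d^3) eigendecomposition

def diag_preconditioner(G_sum, eps):
    # ADA-DIAG: keep only the diagonal of the outer-product sum
    return eps * np.eye(G_sum.shape[0]) + np.diag(np.sqrt(np.diag(G_sum)))

d, eps, eta = 3, 1e-3, 0.1
x = np.zeros(d)
G_sum = np.zeros((d, d))
for g in [np.array([1.0, 0.5, 0.0]), np.array([0.2, -1.0, 0.3])]:
    G_sum += np.outer(g, g)                # G_t = sum_i g_i g_i^T
    H = full_preconditioner(G_sum, eps)
    x = x - eta * np.linalg.solve(H, g)    # mirror-descent step: x - eta * H^{-1} g
```

The full preconditioner captures gradient correlations at an O(d^3) per-round cost, which is exactly the bottleneck the sketching methods discussed in Section 2.2 address.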
ADA-FULL first calculates the outer product matrix of the past gradients G_t = ∑_{i=1}^t g_i g_i^⊤, and further defines a symmetric matrix G̃_t = εI_{d×d} + G_t^{1/2}, where ε > 0 is a hyper-parameter introduced to ensure the invertibility of G̃_t. Under the primal-dual framework, the update rule is given by

x_{t+1} = argmin_{x∈R^d} {η⟨ḡ_t, x⟩ + Ψ_t(x)} = −η G̃_t^{−1} ḡ_t,

where η is the learning rate, ḡ_t = ∑_{i=1}^t g_i is the sum of the received gradients, and Ψ_t(x) = ½⟨x, G̃_t x⟩ is the proximal term. The mirror descent version updates the decision as follows:

x_{t+1} = argmin_{x∈R^d} {η⟨g_t, x⟩ + B_{Ψ_t}(x, x_t)} = x_t − η G̃_t^{−1} g_t,

where B_{Ψ_t}(x, y) = Ψ_t(x) − Ψ_t(y) − ⟨∇Ψ_t(y), x − y⟩ is the Bregman divergence associated with Ψ_t(·). ADA-FULL achieves an O(tr(G_T^{1/2})) regret bound within both frameworks. However, ADA-FULL stores the past gradient outer products, i.e., G_t, and computes G_t^{1/2}, resulting in O(d^2) and O(d^3) space and time complexities, respectively. The high computational cost of ADA-FULL renders it unsuitable for out-of-the-box use in scenarios with limited resources (Sagun et al., 2017; Ghorbani et al., 2019; Sankar et al., 2021; Zhou, 2024). Different from ADA-FULL, ADA-DIAG only utilizes the diagonal elements of the gradient outer product matrix, i.e., it redefines G̃_t = εI_{d×d} + diag(G_t)^{1/2}, and is thus computationally more efficient. However, since the preconditioning matrix of ADA-DIAG contains only limited information, the regret bound of ADA-DIAG is worse than that of ADA-FULL when high-dimensional data is dense and has a low-rank structure.

2.2. Adaptive Subgradient Methods with Sketching

To alleviate the computational burden of ADA-FULL, several works have employed sketching techniques to reduce its space and time complexities (Krummenacher et al., 2016; Wan & Zhang, 2018; Wan et al., 2018; Wan & Zhang, 2020; 2022; Feinberg et al., 2023). Krummenacher et al.
(2016) propose ADA-LR to reduce the computational complexity of ADA-FULL by using random projection (Indyk & Motwani, 1998; Achlioptas, 2003). While ADA-LR reduces the time complexity to O(τd^2), its space complexity remains at O(d^2), where τ ≪ d is the sketching size. To further improve the efficiency, they develop RADAGRAD, which incorporates a further randomized approximation, achieving space and time complexities of O(τd) and O(τ^2 d), respectively. However, RADAGRAD is not supported by rigorous theoretical analysis. Later, Wan & Zhang (2018) develop ADA-DP based on random projection, which achieves space and time complexities of O(τd) and O(τ^2 d), while providing theoretical guarantees. Another class of sketching-based adaptive subgradient methods adopts frequent directions (FD) (Ghashami et al., 2016), a stable matrix sketching technique. Wan et al. (2018) first combine FD with ADA-FULL under both the primal-dual subgradient framework and the mirror descent framework by maintaining a matrix B_t ∈ R^{d×τ} such that B_t B_t^⊤ ≈ G_t ∈ R^{d×d}, where G_t represents the gradient covariance matrix. Their approach, ADA-FD, obtains space and time complexities of O(τd) and O(τ^2 d), respectively, and achieves regret bounds of O(tr(G_T^{1/2}) + ∑_{t=1}^T ρ_t) and O(tr(G_T^{1/2}) + ∑_{t=1}^T τρ_t) under the primal-dual subgradient framework and the mirror descent framework, where ρ_t is the removed eigenvalue of FD in round t. Furthermore, Wan & Zhang (2022) introduce a fast variant of ADA-FD, named ADA-FFD, by doubling the sketching size. ADA-FFD improves the time complexity to O(τd) while keeping the same regret bounds. Although ADA-FD and ADA-FFD enjoy better space and time complexities, as noted by Feinberg et al. (2023), their regret bounds are Ω(T^{3/4}) in some cases. For this reason, Feinberg et al. (2023) propose S-ADA, which adds the discarded information of FD back into the FD-based preconditioner instead of utilizing a fixed diagonal regularization.
While S-ADA enjoys an O(tr(G_T^{1/2}) + √(d(d−τ)ρ_{1:T})) regret bound, it suffers from a linear dependence on the dimensionality d. Moreover, it only achieves an unsatisfactory time complexity of O(τ^2 d). Additionally, we note that FD has also been utilized to accelerate the online Newton step (ONS) algorithm (Hazan et al., 2007) for exponentially concave functions, and the LinUCB algorithm (Chu et al., 2011) in the linear contextual bandit setting, both of which also need to maintain a covariance matrix. Luo et al. (2016) first apply FD in ONS to construct a low-rank approximation of the matrix. To reduce the approximation error of FD, Luo et al. (2019) propose a new sketching strategy called robust frequent directions (RFD), which is the first method that compensates the discarded singular values back into the second-order matrix. They utilize RFD to build a hyperparameter-free variant of ONS, which is more robust than FD-SON. In the linear contextual bandit setting, Chen et al. (2020) propose spectral compensation frequent directions (SCFD) to approximate the covariance matrices, which adds up the total mass of the values subtracted during the FD procedure. SCFD can approximate a sequence of incremental covariance matrices while keeping positive definite monotonicity. In fact, S-ADA can be viewed as a combination of ADA-FULL with SCFD.

2.3. Shampoo

Shampoo (Gupta et al., 2018) is an adaptive optimization method that takes the structure of the parameter space into consideration and is thus more efficient than ADA-FULL in scenarios where the decision is a matrix. Specifically, Shampoo maintains a set of preconditioning matrices, each of which operates on one dimension, while aggregating information across the remaining dimensions. For example, for a parameter matrix X_t ∈ R^{m×n} and its gradient G_t^X ∈ R^{m×n}, ADA-FULL treats the matrix-shaped gradient as a vector of size mn and its preconditioner has size mn × mn, which leads to O(m^2 n^2) and O(m^3 n^3) space and time complexities, respectively.
In contrast, Shampoo constructs two smaller matrices, L_t ∈ R^{m×m} and R_t ∈ R^{n×n}, to precondition the rows and columns of G_t^X, respectively, which only requires an O(m^2 + n^2) memory cost and an O(m^3 + n^3) computational complexity. Since parameters in deep learning tasks often have matrix structures, Shampoo delivers strong empirical performance and has received considerable attention (Anil et al., 2020; Liu et al., 2023; Eschenhagen et al., 2024). However, the memory demands of Shampoo may still be prohibitive for large-scale neural networks. Anil et al. (2020) address the memory cost of Shampoo by introducing two variants. The first variant, Blocked Shampoo, partitions the decision variable X_t ∈ R^{m×n} into mn/b^2 blocks, where b is the block size and b ≪ min(m, n). However, Blocked Shampoo depends on the specific ordering of neurons in the hidden layers. The second variant relies on one-sided covariance upper bounds, which cannot effectively handle vector parameters. Feinberg et al. (2023) first incorporate FD into Shampoo and reduce its memory cost to O(τ(m+n)). Their method, named Sketchy-Shampoo (S-Shampoo), uses two low-rank matrices L̂_t ∈ R^{m×τ} and R̂_t ∈ R^{n×τ} to approximate the preconditioning matrices L_t and R_t, respectively. While S-Shampoo improves the space complexity of Shampoo, its regret bound relies on the dimensionalities m, n, leading to unsatisfactory performance when the dimensions are high.

3. Preliminaries

3.1. Assumptions

We adopt two common assumptions of OCO (Hazan, 2016).

Assumption 3.1. All loss functions f_t(·) are convex.

Assumption 3.2. The optimal decision x^* ∈ R^d is bounded by D, i.e., ‖x^*‖_2 ≤ D.

Besides, we introduce two assumptions for the scenario where the decision has a matrix structure. These assumptions have also been used in prior works (Gupta et al., 2018; Feinberg et al., 2023).

Assumption 3.3. The rank of the gradient matrix G_t^X is bounded by r, i.e., max_{t∈[T]} rank(G_t^X) ≤ r.

Assumption 3.4.
The optimal parameter X^* ∈ R^{m×n} is bounded by D_M, i.e., ‖X^*‖_F ≤ D_M.

3.2. Frequent Directions

Frequent directions (FD) (Ghashami et al., 2016) is a deterministic matrix sketching technique that extends a well-known algorithm for approximating item frequencies in online data streams (Misra & Gries, 1982). For a given matrix A ∈ R^{d×t}, FD aims to generate a matrix B ∈ R^{d×τ} such that BB^⊤ ≈ AA^⊤, where τ ≪ min{t, d} is the sketching size. The procedure is summarized in Algorithm 1. In each round t, we denote the low-rank matrix B_{t−1} = [b_1, b_2, ..., b_{τ−1}, 0_d] ∈ R^{d×τ}, where the last column is 0_d. Upon receiving the new gradient g_t ∈ R^d, it is inserted into the last column of B_{t−1}. Next, we perform singular value decomposition (SVD) on B_{t−1} = U_t √(diag(λ^{(t)}_{[1:τ]})) V_t^⊤, and the matrix B_t is computed as B_t = U_t √(diag(λ^{(t)}_{[1:τ]} − λ^{(t)}_τ)), with its last column set to 0_d. The time complexity of FD is O(τ^2 d) per iteration, dominated by computing the SVD of B_{t−1}, which causes a quadratic dependence on the sketching size τ. To further reduce the time complexity of FD, Ghashami et al. (2016) propose fast frequent directions (FFD) by expanding the space of B_t. Specifically, FFD maintains a matrix B_0 = 0_{d×2τ} ∈ R^{d×2τ}. In each round t, we insert the received gradient g_t into the first all-zero column of B_{t−1}. Once B_t no longer contains any all-zero columns, we perform SVD to obtain B_t = U_t √(diag(λ^{(t)}_{[1:2τ]})) V_t^⊤. Then, the matrix B_t is updated as B_t = U_t √(diag(max{λ^{(t)}_{[1:2τ]} − λ^{(t)}_τ, 0})), ensuring that the last τ+1 columns are set to 0_d. As we only need to update the matrix B_t every τ+1 rounds, the time complexity of FFD is O(τd). Since FD removes a singular value per round, the matrix B_t B_t^⊤ does not preserve monotonicity. To resolve this limitation, Chen et al.
(2020) propose spectral compensation frequent directions (SCFD), which adds up the total mass of the subtracted values ∑_{i=1}^t λ^{(i)}_τ during the FD procedure. SCFD is able to approximate a sequence of high-dimensional matrices while preserving positive definite monotonicity, i.e., ∑_{i=1}^t λ^{(i)}_τ I_{d×d} + B_t B_t^⊤ ⪰ ∑_{i=1}^{t−1} λ^{(i)}_τ I_{d×d} + B_{t−1} B_{t−1}^⊤.

Algorithm 1 Frequent Directions (FD)
1: Input: Sketching matrix B_{t−1} ∈ R^{d×τ} (with its last column as 0_d), new gradient vector g_t ∈ R^d
2: Insert the gradient g_t into the last column of B_{t−1}
3: Perform SVD on B_{t−1} = U_t √(diag(λ^{(t)}_{[1:τ]})) V_t^⊤
4: Compute B_t = U_t √(diag(λ^{(t)}_{[1:τ]} − λ^{(t)}_τ))
5: Return: B_t and λ^{(t)}_τ

4. The Proposed Methods

In this section, we first present FTSL, which incorporates FD with ADA-FULL under the primal-dual framework to obtain a better regret bound. Furthermore, we accelerate FTSL by employing an accelerated trick for FD, achieving enhanced computational efficiency. We demonstrate that this technique can also be applied to expedite S-ADA (Feinberg et al., 2023). Additionally, we consider optimization problems involving matrix variables and propose an improved FD-based variant of Shampoo.

4.1. Our Improved Result

Before introducing our algorithms, we first briefly discuss why the regret bound of S-ADA (Feinberg et al., 2023) relies on the dimensionality d, offering motivation for the methods we design. Since S-ADA operates under the mirror descent framework, its regret bound contains the Bregman divergence term

∑_{t=0}^{T−1} (B_{Ψ_{t+1}}(x^*, x_{t+1}) − B_{Ψ_t}(x^*, x_{t+1})) = ½ ∑_{t=0}^{T−1} ‖x_{t+1} − x^*‖²_{G̃_{t+1}^{1/2} − G̃_t^{1/2}},

where Ψ_t(x) = ½⟨x, G̃_t^{1/2} x⟩ is the proximal term, G̃_t is the preconditioning matrix, and B_{Ψ_t}(x, y) = Ψ_t(x) − Ψ_t(y) − ⟨∇Ψ_t(y), x − y⟩. To facilitate summation, they upper bound this term by O(∑_{t=0}^{T−1} tr(G̃_{t+1}^{1/2} − G̃_t^{1/2})) and then exploit the additivity of the trace, which yields a bound of O(tr(G̃_T^{1/2})). Feinberg et al. (2023) add the cumulative removed eigenvalues of FD, ρ_{1:t}, into G̃_t.
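To make this concrete, the FD step of Algorithm 1 together with the SCFD-style escaped-mass bookkeeping can be sketched as follows (a minimal NumPy sketch; the random gradients are illustrative):

```python
import numpy as np

def fd_update(B, g):
    # One FD step (Algorithm 1): insert g into the zero last column of B (d x tau),
    # shrink all squared singular values by the tau-th one, return the removed mass.
    B = B.copy()
    B[:, -1] = g                                     # insert the new gradient
    U, s, _ = np.linalg.svd(B, full_matrices=False)  # O(tau^2 d) SVD
    lam = s ** 2
    rho = lam[-1]                                    # discarded eigenvalue lambda_tau
    B_new = U @ np.diag(np.sqrt(lam - rho))          # last column becomes zero
    return B_new, rho

d, tau = 5, 3
B = np.zeros((d, tau))
rho_total = 0.0
rng = np.random.default_rng(0)
for _ in range(10):
    g = rng.standard_normal(d)
    B, rho = fd_update(B, g)
    rho_total += rho                                 # escaped mass rho_{1:t}
G_tilde = B @ B.T + rho_total * np.eye(d)            # SCFD-style compensated preconditioner
```

Each call costs O(τ^2 d) for the SVD, and adding ρ_{1:t}·I back is exactly what restores the positive definite monotonicity stated above.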
Consequently, the Bregman divergence term is O(tr((B_T B_T^⊤ + ρ_{1:T} I_{d×d})^{1/2})), which is further bounded by O(tr(G_T^{1/2}) + √(d(d−τ)ρ_{1:T})), where B_T ∈ R^{d×τ} is the sketching matrix and G_T = ∑_{t=1}^T g_t g_t^⊤. As a result, the regret bound of S-ADA exhibits a linear dependence on d, resulting in unsatisfactory performance on high-dimensional problems.

Algorithm 2 Follow the Sketchy Leader (FTSL)
1: Input: Learning rate η, sketching size τ
2: Initialize x_1 = 0_d, ḡ_0 = 0_d, G_0 = 0_{d×d}, B_0 = 0_{d×τ}
3: for t = 1 to T do
4:   Play the decision x_t and suffer the loss f_t(x_t)
5:   Query the gradient g_t = ∇f_t(x_t) and calculate ḡ_t = ḡ_{t−1} + g_t
6:   Send B_{t−1} and g_t to Algorithm 1
7:   Receive B_t and set ρ_t = λ^{(t)}_τ
8:   Calculate G̃_t = B_t B_t^⊤ + ρ_{1:t} I_{d×d} and derive G̃_t^{−1/2}
9:   Update x_{t+1} according to (2)
10: end for

To overcome this issue, we propose integrating FD with ADA-FULL under the primal-dual subgradient framework. Our method, which we call FTSL, is outlined in Algorithm 2. Specifically, we employ FD to construct a low-rank approximation of the outer product matrix of gradients G_t, aiming to reduce the computational complexity. To ensure the monotonicity of the preconditioning matrix G̃_t, we also add the cumulative escaped mass ρ_{1:t} back into G̃_t. Under the primal-dual subgradient framework, we update the decision as follows:

x_{t+1} = argmin_{x∈R^d} {η⟨ḡ_t, x⟩ + Ψ_t(x)} = −η G̃_t^{−1/2} ḡ_t,   (2)

where ḡ_t = ∑_{i=1}^t g_i is the sum of the past gradients. According to the analysis under the primal-dual framework, the regret of FTSL is controlled through the term ‖G̃_T^{1/2}‖ = ‖(B_T B_T^⊤ + ρ_{1:T} I_{d×d})^{1/2}‖ ≤ O(tr(G_T^{1/2}) + ρ_{1:T}), thereby avoiding the dependence on d. Formally, we present the theoretical guarantee of FTSL.

Theorem 4.1. Under Assumption 3.1 and Assumption 3.2, by setting the learning rate η = D/√2, FTSL ensures

R(T) ≤ O(tr(G_T^{1/2}) + ρ_{1:T}),

where G_T = ∑_{t=1}^T g_t g_t^⊤.

Remark.
In contrast to the previous regret bound of O(tr(G_T^{1/2}) + √(d(d−τ)ρ_{1:T})) (Feinberg et al., 2023), the regret bound of FTSL is dimension-free, a benefit realized from the primal-dual subgradient framework.

Remark. Since we only maintain a sketching matrix B_t ∈ R^{d×τ}, the space complexity of FTSL is O(τd). Its time complexity is O(τ^2 d) per round, which arises from the SVD of B_t and the calculation of G̃_t^{−1/2} (detailed discussions can be found in Appendix A). While the time complexity of FTSL is linear with respect to the dimensionality d, it still suffers from a quadratic dependence on the sketching size τ. To further alleviate its computational burden, we develop a fast variant of FTSL in the next section.

Algorithm 3 Follow the Fast Sketchy Leader (FTFSL)
1: Input: Learning rate η, sketching size τ
2: Initialize x_1 = 0_d, G_0 = 0_{d×d}, r_0 = 0, M_0 = 0_{2τ×2τ}, V_0 = 0_{d×2τ}, ḡ_0 = 0_d, ρ_1 = 0
3: for t = 1 to T do
4:   Play the decision x_t and suffer the loss f_t(x_t)
5:   Query the gradient g_t = ∇f_t(x_t) and compute g'_t = V_{t−1}(V_{t−1}^⊤ g_t), ḡ_t = ḡ_{t−1} + g_t
6:   if g'_t ≠ g_t then
7:     Set r_{t−1} = r_{t−1} + 1, calculate v^{t−1}_{r_{t−1}} = (g_t − g'_t)/‖g_t − g'_t‖ and set the r_{t−1}-th column of V_{t−1} to v^{t−1}_{r_{t−1}}
8:   end if
9:   Set r_t = r_{t−1}, V_t = V_{t−1}
10:  Compute M_t = M_{t−1} + (V_{t−1}^⊤ g_t)(V_{t−1}^⊤ g_t)^⊤
11:  Perform SVD on M_t, i.e., U_t Σ_t U_t^⊤ = U_t diag(λ^{(t)}_{[1:2τ]}) U_t^⊤ = M_t
12:  Calculate G̃_t = ρ_{1:t} I_{d×d} + V_t U_t Σ_t U_t^⊤ V_t^⊤
13:  Update x_{t+1} according to (2) and set ρ_{t+1} = 0
14:  if r_t = 2τ then
15:    Set ρ_{t+1} = λ^{(t)}_τ, M_t = diag(max{λ^{(t)}_{[1:2τ]} − λ^{(t)}_τ, 0}) and V_t = V_t U_t
16:    Set r_t = τ − 1 and the τ-th to 2τ-th columns of V_t to 0_d
17:  end if
18: end for

4.2. Our Accelerated Variant

The time complexity of FTSL suffers from a quadratic dependence on the sketching size τ, which is introduced by the SVD of B_t ∈ R^{d×τ} in FD. Drawing inspiration from previous work (Chen et al., 2020; Wan & Zhang, 2022), we adopt a more efficient strategy for computing the SVD of the sketching matrix B_t.
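Before turning to the accelerated variant, one round of the plain FTSL update (Algorithm 2) can be sketched as follows; the small eigenvalue floor inside `inv_sqrt` is an illustrative numerical safeguard, not part of the algorithm:

```python
import numpy as np

def inv_sqrt(A, floor=1e-12):
    # A^{-1/2} via eigendecomposition; eigenvalues floored for numerical safety
    evals, evecs = np.linalg.eigh(A)
    return evecs @ np.diag(np.clip(evals, floor, None) ** -0.5) @ evecs.T

def ftsl_round(B, rho_total, g_bar, g, eta):
    # One FTSL round: FD step, escaped-mass compensation, primal-dual update (2)
    B = B.copy()
    B[:, -1] = g                                    # insert gradient (Algorithm 1)
    U, s, _ = np.linalg.svd(B, full_matrices=False)
    lam = s ** 2
    rho_total += lam[-1]                            # accumulate discarded eigenvalue
    B = U @ np.diag(np.sqrt(lam - lam[-1]))         # shrink; last column becomes zero
    g_bar = g_bar + g                               # running gradient sum
    G_tilde = B @ B.T + rho_total * np.eye(len(g))  # B_t B_t^T + rho_{1:t} I
    x_next = -eta * inv_sqrt(G_tilde) @ g_bar
    return B, rho_total, g_bar, x_next

d, tau, eta = 4, 2, 0.5
B, rho_total, g_bar = np.zeros((d, tau)), 0.0, np.zeros(d)
rng = np.random.default_rng(0)
for _ in range(5):
    B, rho_total, g_bar, x = ftsl_round(B, rho_total, g_bar, rng.standard_normal(d), eta)
```

In this naive sketch the d×d eigendecomposition dominates; the O(τ^2 d) per-round cost reported above instead exploits the low-rank-plus-identity structure of G̃_t (see Appendix A of the paper).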
Our method, termed FTFSL, is presented in Algorithm 3. Different from FD, the sketching matrix B_t is expanded to R^{d×2τ} in FFD. Rather than explicitly maintaining B_t, we use two matrices V_t and M_t to represent B_t. Specifically, V_t = [v^t_1, ..., v^t_{2τ}] ∈ R^{d×2τ} consists of r_t orthonormal vectors (r_t ≤ 2τ), with the remaining columns being zero vectors, and M_t ∈ R^{2τ×2τ} is a symmetric matrix. We require that V_t and M_t satisfy the condition V_t M_t^{1/2} = B_t ∈ R^{d×2τ}. In each round t, after receiving the gradient g_t, we first check whether this vector lies within the subspace spanned by V_{t−1}. If the vector is not contained within the subspace, we normalize it and add it to V_{t−1}, thereby enlarging the span of the subspace and ensuring V_{t−1}V_{t−1}^⊤ g_t = g_t (Steps 6-8). Then we have the following equation:

V_{t−1} M_{t−1} V_{t−1}^⊤ + g_t g_t^⊤ = V_{t−1} (M_{t−1} + V_{t−1}^⊤ g_t g_t^⊤ V_{t−1}) V_{t−1}^⊤.

This implies that we only need to perform an SVD on M_{t−1} + (V_{t−1}^⊤ g_t)(V_{t−1}^⊤ g_t)^⊤ ∈ R^{2τ×2τ}, which takes only O(τ^3) time. Next, we incorporate the escaped masses to keep the monotonicity of the preconditioning matrix G̃_t. When r_t = 2τ, we need to set r_t = τ − 1 and set the last τ+1 columns of the sketching matrix B_t to zero (Steps 14-16). Given the decomposition M_t = U_t diag(λ^{(t)}_{[1:2τ]}) U_t^⊤ and the relationship V_t M_t^{1/2} = B_t, this can be efficiently achieved by updating M_t as diag(max{λ^{(t)}_{[1:2τ]} − λ^{(t)}_τ, 0}), calculating V_t = V_t U_t, and setting the last τ+1 columns of V_t to zero. In the following, we provide the theoretical guarantee of FTFSL.

Theorem 4.2. Under Assumption 3.1 and Assumption 3.2, by setting the learning rate η = D/√2, FTFSL ensures

R(T) ≤ O(tr(G_T^{1/2}) + ρ_{1:T}).

Remark. In each round, FTFSL computes g'_t, v^{t−1}_{r_{t−1}}, M_t, the SVD of M_t, and updates x_t, with respective time complexities of O(τd), O(d), O(τd), O(τ^3) and O(τd). Additionally, we only compute V_t = V_t U_t every τ+1 rounds, incurring a time complexity of O(τ^2 d).
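The subspace bookkeeping of Steps 6-10 can be sketched as follows (a hedged NumPy sketch; `tol` is an illustrative threshold for the span test, and the shrinking step triggered when r_t = 2τ is omitted for brevity):

```python
import numpy as np

def subspace_insert(V, M, r, g, tol=1e-10):
    # FTFSL bookkeeping: keep an orthonormal basis V (d x 2*tau, first r columns
    # used) and a small matrix M (2*tau x 2*tau) with V M V^T = B_t B_t^T.
    # Inserting a gradient only touches M, so the SVD later costs O(tau^3).
    c = V.T @ g                   # coordinates of g in the current subspace
    resid = g - V @ c
    nrm = np.linalg.norm(resid)
    if nrm > tol:                 # g leaves the span: enlarge the basis (Steps 6-8)
        V = V.copy()
        V[:, r] = resid / nrm
        r += 1
        c = V.T @ g               # recompute so that V V^T g = g
    M = M + np.outer(c, c)        # rank-one update inside the subspace (Step 10)
    return V, M, r

d, tau = 6, 2
V = np.zeros((d, 2 * tau)); M = np.zeros((2 * tau, 2 * tau)); r = 0
rng = np.random.default_rng(0)
for _ in range(3):
    V, M, r = subspace_insert(V, M, r, rng.standard_normal(d))
# invariant: V M V^T equals the sum of gradient outer products inserted so far
```

Since all heavy linear algebra happens on the 2τ×2τ matrix M_t, the per-round cost drops to O(τd + τ^3), matching the remark above.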
When τ = O(√d), the time complexity of FTFSL is O(τd) per round. Notably, we can also reduce the time complexity of S-ADA (Feinberg et al., 2023) by adopting this technique. We replace the update rule for the decision variable x_t in FTFSL (Step 13) with

x_{t+1} = x_t − η G̃_t^{−1/2} g_t,

and analyze it under the mirror descent framework. We then obtain the following theoretical guarantee for Fast S-ADA.

Theorem 4.3. Under Assumption 3.1 and assuming D_1 = max_{t∈[T]} ‖x_t − x^*‖_2, by setting the learning rate η = D_1/√2, Fast S-ADA ensures

R(T) ≤ O(tr(G_T^{1/2}) + √(d(d−τ)ρ_{1:T})).

Remark. Compared to S-ADA (Feinberg et al., 2023), Fast S-ADA obtains a better O(τd) time complexity when τ = O(√d), while preserving the same regret bound.

Algorithm 4 Frequent Directions in General Form
1: Input: Sketching matrix B_{t−1} ∈ R^{d×τ}, a new symmetric PSD matrix M_t ∈ R^{d×d}
2: Eigendecompose Ū_t diag(λ^{(t)}) Ū_t^⊤ = B_{t−1} B_{t−1}^⊤ + M_t, define U_t ∈ R^{d×τ} to be the first τ columns of Ū_t and λ^{(t)}_{[1:τ]} the corresponding eigenvalues
3: Compute B_t = U_t √(diag(λ^{(t)}_{[1:τ]} − λ^{(t)}_τ))
4: Return: B_t and λ^{(t)}_τ

Algorithm 5 FTSL-Shampoo
Require: Learning rate η, sketching size τ, ε > 0
1: Initialize X_1 = 0_{m×n}, L̂_0 = 0_{m×τ}, R̂_0 = 0_{n×τ}, L_0 = 0_{m×m}, R_0 = 0_{n×n}, Ḡ_0^X = 0_{m×n}
2: for t = 1 to T do
3:   Play the decision X_t and suffer the loss f_t(X_t)
4:   Query the gradient G_t^X = ∇f_t(X_t) ∈ R^{m×n} and calculate Ḡ_t^X = Ḡ_{t−1}^X + G_t^X
5:   Send L̂_{t−1} and G_t^X (G_t^X)^⊤ to Algorithm 4 and receive L̂_t, ρ^L_t
6:   Send R̂_{t−1} and (G_t^X)^⊤ G_t^X to Algorithm 4 and receive R̂_t, ρ^R_t
7:   Update L_t = L̂_t L̂_t^⊤ + (ε + ρ^L_{1:t}) I_{m×m}
8:   Update R_t = R̂_t R̂_t^⊤ + (ε + ρ^R_{1:t}) I_{n×n}
9:   Update X_{t+1} according to (3)
10: end for

4.3. Optimization Problems with Matrix Variables

In this section, we consider a practical scenario where the decision variable is a matrix X_t ∈ R^{m×n}, which is common for parameters in deep learning tasks. In such settings, the loss f(X) is typically a smooth non-convex function, and the objective is to find a point X_T such that ‖∇f(X_T)‖ ≤ ε. As pointed out by Agarwal et al.
(2019), a smooth non-convex problem can be transformed into solving a series of offline convex problems by using the online-to-batch conversion. Therefore, we can derive non-convex optimization guarantees from online regret bounds, with further details provided in Appendix B. To utilize the structural information, Gupta et al. (2018) propose Shampoo, which retains the matrix structure of the gradient and maintains two matrices as preconditioners for the rows and columns of G_t^X, yielding a space complexity of O(m^2 + n^2). While S-Shampoo (Feinberg et al., 2023) improves the space complexity of Shampoo to O(τ(m+n)), its regret bound again relies on the dimensionalities m, n. To further reduce the regret bound, we propose FTSL-Shampoo by integrating FD with Shampoo under the primal-dual framework. Our method, presented in Algorithm 5, achieves a superior dimension-free guarantee while retaining an O(τ(m+n)) space complexity.

Figure 1. Results for the Gisette dataset (training loss, test accuracy, and runtime versus the number of iterations for ADA-DIAG, RADAGRAD, FD-SON, ADA-FFD (M), ADA-FFD (P), S-ADA, Fast S-ADA, and FTFSL).

Figure 2. Results for the Epsilon dataset (same metrics and methods).

Specifically, we utilize FD to approximate the left and right preconditioning matrices of Shampoo. We maintain two matrices L̂_t ∈ R^{m×τ}, R̂_t ∈ R^{n×τ} to ensure that L̂_t L̂_t^⊤ ≈ ∑_{i=1}^t G_i^X (G_i^X)^⊤ and R̂_t R̂_t^⊤ ≈ ∑_{i=1}^t (G_i^X)^⊤ G_i^X. We track the cumulative escaped masses ρ^L_{1:t} and ρ^R_{1:t} of the left and right preconditioning matrices separately, and then add them back into L_t and R_t to uphold monotonicity.
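A minimal sketch of this left/right preconditioner maintenance follows (Algorithm 4 applied to both sides; the dimensions, ε and the random gradients are illustrative, and the decision update itself is given in the next paragraph):

```python
import numpy as np

def fd_general(B, M_new, tau):
    # General-form FD (Algorithm 4): fold a PSD update M_new into the sketch B,
    # keep the top-tau eigendirections, and return the removed eigenvalue.
    evals, evecs = np.linalg.eigh(B @ B.T + M_new)
    idx = np.argsort(evals)[::-1][:tau]           # top-tau eigenpairs
    lam, U = evals[idx], evecs[:, idx]
    rho = lam[-1]                                 # tau-th eigenvalue is discarded
    return U @ np.diag(np.sqrt(np.maximum(lam - rho, 0.0))), rho

# Toy preconditioner maintenance for gradients G_X in R^{m x n}
m, n, tau, eps = 4, 3, 2, 1e-3
L_hat, R_hat = np.zeros((m, tau)), np.zeros((n, tau))
rho_L, rho_R = 0.0, 0.0
rng = np.random.default_rng(0)
for _ in range(5):
    G_X = rng.standard_normal((m, n))
    L_hat, r = fd_general(L_hat, G_X @ G_X.T, tau); rho_L += r   # left sketch
    R_hat, r = fd_general(R_hat, G_X.T @ G_X, tau); rho_R += r   # right sketch
L_t = L_hat @ L_hat.T + (eps + rho_L) * np.eye(m)  # compensated left preconditioner
R_t = R_hat @ R_hat.T + (eps + rho_R) * np.eye(n)  # compensated right preconditioner
```

Both sketches cost O(τ(m+n)) memory, and adding the escaped masses back keeps L_t and R_t monotonically increasing in the positive semidefinite order.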
To achieve a dimension-free regret bound with FD, we update the parameters as follows:
$$X_{t+1} = -\eta\, L_t^{-1/4}\, \bar{G}^X_t\, R_t^{-1/4}, \qquad (3)$$
where $\bar{G}^X_t = \sum_{i=1}^t G^X_i$ is the sum of the past gradients, and conduct the analysis under the primal-dual framework. Then we present the regret bound of FTSL-Shampoo.

Theorem 4.4. Under Assumptions 3.1, 3.3 and 3.4, by setting the learning rate $\eta = D_M/\sqrt{r}$ and further denoting $L_T = \epsilon I_{m \times m} + \sum_{t=1}^T G^X_t (G^X_t)^\top$ and $R_T = \epsilon I_{n \times n} + \sum_{t=1}^T (G^X_t)^\top G^X_t$, FTSL-Shampoo ensures
$$R(T) \le O\Big(\sqrt{r}\,\big(\mathrm{tr}(L_T^{1/4}) + (\rho^L_{1:T})^{1/4}\big)\big(\mathrm{tr}(R_T^{1/4}) + (\rho^R_{1:T})^{1/4}\big)\Big),$$
where $\rho^L_{1:T}$ and $\rho^R_{1:T}$ represent the sums of the eigenvalues removed by FD during the approximation of $\sum_{t=1}^T G^X_t (G^X_t)^\top$ and $\sum_{t=1}^T (G^X_t)^\top G^X_t$, respectively.

Remark. In comparison to the previous $O\big(\sqrt{r}\,(\mathrm{tr}(L_T^{1/4}) + m(\rho^L_{1:T})^{1/4})(\mathrm{tr}(R_T^{1/4}) + n(\rho^R_{1:T})^{1/4})\big)$ regret bound of S-Shampoo (Feinberg et al., 2023), we achieve a dimension-free regret bound while enjoying the same $O(\tau(m+n))$ space complexity.

5. Experiments

In this section, we assess the performance of the proposed methods via numerical experiments on online classification and image classification tasks. Due to limited space, we only present a subset of the experimental outcomes, with the comprehensive set of results available in Appendix C.

Online Classification. First, we perform online classification to evaluate our methods on two real-world datasets from the LIBSVM repository (Chang & Lin, 2011): Gisette and Epsilon, which are high-dimensional and dense. In particular, the Gisette dataset contains 6000 training samples and 1000 testing samples, each with 5000 features. The Epsilon dataset consists of 400,000 training samples and 100,000 testing samples, each with 2000 features. In each round $t \in [T]$, a batch of training examples $\{(w_{t,1}, y_{t,1}), \ldots, (w_{t,n}, y_{t,n})\}$ arrives, where $(w_{t,i}, y_{t,i}) \in [-1, 1]^d \times \{-1, 1\}$, $i = 1, \ldots, n$.
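In this protocol, each round reduces to evaluating an averaged hinge loss over the batch and forming a subgradient with respect to the linear model. A minimal sketch of one such round is below; `hinge_round` and the variable names are illustrative, not the paper's code.

```python
import numpy as np

def hinge_round(x, W, y):
    """One round of the online classification protocol: average hinge loss
    over a batch and a subgradient w.r.t. the linear model x.

    W: n x d batch of examples, y: n labels in {-1, +1}.
    """
    margins = 1.0 - y * (W @ x)              # 1 - y_i <x, w_i> per example
    loss = np.mean(np.maximum(margins, 0.0))
    active = (margins > 0).astype(float)     # examples with positive loss
    grad = -(W.T @ (active * y)) / len(y)    # subgradient of the mean loss
    return loss, grad

rng = np.random.default_rng(0)
W = rng.uniform(-1.0, 1.0, size=(32, 5))
y = rng.choice([-1.0, 1.0], size=32)
loss, grad = hinge_round(np.zeros(5), W, y)  # with x = 0, every margin is 1
```

Any of the compared adaptive methods then consumes `grad` as the round-$t$ subgradient $g_t$.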
The online learner aims to predict a linear model $x_t$ and suffers the hinge loss $f_t(x_t) = \frac{1}{n}\sum_{i=1}^n \max\{0, 1 - y_{t,i}\, x_t^\top w_{t,i}\}$. For the Gisette dataset, we set the batch size $n = 32$, the sketching size $\tau = 50$ (1% of the original dimensionality), and $T = 2000$. For the Epsilon dataset, we set the batch size $n = 128$, $\tau = 20$ and $T = 5000$.

Results. Following Duchi et al. (2011), we adopt accuracy on the testing data to compare different methods. To better demonstrate the improvements of our methods, we additionally plot the training loss and runtime of the various methods. From Figure 1 and Figure 2, we observe that FTFSL outperforms all other methods in both loss and testing accuracy, aligning with its superior regret bound. Moreover, FTFSL and Fast S-ADA exhibit significantly lower runtimes compared to S-ADA, owing to their superior time complexities.

6. Conclusion

In this paper, we investigate adaptive subgradient methods with frequent directions (FD). First, we introduce a novel method, named FTSL, to achieve a tighter dimension-free regret bound of $O(\mathrm{tr}(G_T^{1/2}) + \sqrt{\rho_{1:T}})$. Next, we propose a fast version of FTSL by accelerating the FD step used in it, which improves the time complexity to $O(\tau d)$ while preserving the same regret bound. This technique can also be applied to expedite S-ADA (Feinberg et al., 2023). Additionally, we consider a more complex scenario where the decision is a matrix, and adapt FD to Shampoo under the primal-dual framework to obtain a better dimension-free bound. Finally, we substantiate the effectiveness and efficiency of our methods through experimental validation.

Acknowledgments

This work was partially supported by the National Science and Technology Major Project (2022ZD0114801), NSFC (U23A20382), and the Yongjiang Talent Introduction Programme (2023A-193-G). The authors would like to thank the anonymous reviewers for their constructive suggestions.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning.
There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

Achlioptas, D. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4):671-687, 2003.

Agarwal, N., Bullins, B., Chen, X., Hazan, E., Singh, K., Zhang, C., and Zhang, Y. Efficient full-matrix adaptive regularization. In Proceedings of the 36th International Conference on Machine Learning, pp. 102-110, 2019.

Anil, R., Gupta, V., Koren, T., Regan, K., and Singer, Y. Scalable second order optimization for deep learning. arXiv preprint arXiv:2002.09018, 2020.

Chang, C.-C. and Lin, C.-J. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3):1-27, 2011.

Chen, C., Luo, L., Zhang, W., Yu, Y., and Lian, Y. Efficient and robust high-dimensional linear contextual bandits. In Proceedings of the 29th International Joint Conference on Artificial Intelligence, pp. 4259-4265, 2020.

Chu, W., Li, L., Reyzin, L., and Schapire, R. E. Contextual bandits with linear payoff functions. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, pp. 208-214, 2011.

Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. In Proceedings of the 23rd Annual Conference on Learning Theory, 2010a.

Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(7), 2011.

Duchi, J. C., Shalev-Shwartz, S., Singer, Y., and Tewari, A. Composite objective mirror descent. In Proceedings of the 23rd Annual Conference on Learning Theory, pp. 14-26, 2010b.

Eschenhagen, R., Immer, A., Turner, R., Schneider, F., and Hennig, P. Kronecker-factored approximate curvature for modern neural network architectures. In Advances in Neural Information Processing Systems 37, 2024.
Feinberg, V., Chen, X., Sun, Y. J., Anil, R., and Hazan, E. Sketchy: Memory-efficient adaptive regularization with frequent directions. In Advances in Neural Information Processing Systems 37, 2023.

Ghashami, M., Liberty, E., Phillips, J. M., and Woodruff, D. P. Frequent directions: Simple and deterministic matrix sketching. SIAM Journal on Computing, 45(5):1762-1792, 2016.

Ghorbani, B., Krishnan, S., and Xiao, Y. An investigation into neural net optimization via hessian eigenvalue density. In Proceedings of the 36th International Conference on Machine Learning, pp. 2232-2241, 2019.

Gupta, V., Koren, T., and Singer, Y. Shampoo: Preconditioned stochastic tensor optimization. In Proceedings of the 35th International Conference on Machine Learning, pp. 1842-1850, 2018.

Hager, W. W. Updating the inverse of a matrix. SIAM Review, 31(2):221-239, 1989.

Hazan, E. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157-325, 2016.

Hazan, E. and Koren, T. Online gradient descent with adaptive step size for convex optimization. In Proceedings of the 25th Annual Conference on Learning Theory, pp. 77-90, 2012.

Hazan, E., Agarwal, A., and Kale, S. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169-192, 2007.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.

Indyk, P. and Motwani, R. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, pp. 604-613, 1998.

Krizhevsky, A. Learning multiple layers of features from tiny images. Master's thesis, Department of Computer Science, University of Toronto, 2009.

Krummenacher, G., McWilliams, B., Kilcher, Y., Buhmann, J. M., and Meinshausen, N.
Scalable adaptive stochastic optimization using random projections. In Advances in Neural Information Processing Systems 29, 2016.

Lancaster, P. and Farahat, H. K. Norms on direct sums and tensor products. Mathematics of Computation, 26(118):403-410, 1972.

Liu, H., Li, Z., Hall, D., Liang, P., and Ma, T. Sophia: A scalable stochastic second-order optimizer for language model pre-training. arXiv preprint arXiv:2305.14342, 2023.

Luo, H., Agarwal, A., Cesa-Bianchi, N., and Langford, J. Efficient second order online learning by sketching. In Advances in Neural Information Processing Systems 29, 2016.

Luo, L., Zhang, W., Zhang, Z., Zhu, W., Zhang, T., and Pei, J. Sketched follow-the-regularized-leader for online factorization machine. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1900-1909, 2018.

Luo, L., Chen, C., Zhang, Z., Li, W.-J., and Zhang, T. Robust frequent directions with application in online learning. Journal of Machine Learning Research, 20(45):1-41, 2019.

Merity, S. The wikitext long term dependency language modeling dataset. Salesforce MetaMind, 9, 2016.

Misra, J. and Gries, D. Finding repeated elements. Science of Computer Programming, 2(2):143-152, 1982.

Orabona, F. A modern introduction to online learning. arXiv preprint arXiv:1912.13213, 2019.

Sagun, L., Evci, U., Guney, V. U., Dauphin, Y., and Bottou, L. Empirical analysis of the hessian of over-parametrized neural networks. arXiv preprint arXiv:1706.04454, 2017.

Sankar, A. R., Khasbage, Y., Vigneswaran, R., and Balasubramanian, V. N. A deeper look at the hessian eigenspectrum of deep neural networks and its applications to regularization. In Proceedings of the 35th AAAI Conference on Artificial Intelligence, pp. 9481-9488, 2021.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems 30, pp.
5998-6008, 2017.

Wan, Y. and Zhang, L. Accelerating adaptive online learning by matrix approximation. In Advances in Knowledge Discovery and Data Mining, pp. 405-417, 2018.

Wan, Y. and Zhang, L. Accelerating adaptive online learning by matrix approximation. International Journal of Data Science and Analytics, 9(4):389-400, 2020.

Wan, Y. and Zhang, L. Efficient adaptive online learning via frequent directions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(10):6910-6923, 2022.

Wan, Y., Wei, N., and Zhang, L. Efficient adaptive online learning via frequent directions. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pp. 2748-2754, 2018.

Xiao, L. Dual averaging method for regularized stochastic learning and online optimization. In Advances in Neural Information Processing Systems 22, 2009.

Zhang, L., Lu, S., and Zhou, Z.-H. Adaptive online learning in dynamic environments. In Advances in Neural Information Processing Systems 31, pp. 1323-1333, 2018.

Zhang, L., Wang, G., Yi, J., and Yang, T. A simple yet universal strategy for online convex optimization. In Proceedings of the 39th International Conference on Machine Learning, pp. 26605-26623, 2022.

Zhou, Z.-H. Learnability with time-sharing computational resource concerns. National Science Review, 11(10), 2024.

Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning, pp. 928-936, 2003.

A. Proof of Theorems

Notations. We use $x$ to denote a vector and $X$ to denote a matrix. For any vector $x \in \mathbb{R}^d$ and a positive semi-definite matrix $A \in \mathbb{R}^{d \times d}$, $\|x\|_A^2 = \langle x, Ax \rangle$. For a matrix $A$, $A^{-1}$ is the inverse of $A$ if $A$ is full rank; otherwise, $A^{-1}$ is taken to be the Moore-Penrose pseudoinverse.
For two matrices $A$, $B$, we write $A \preceq B$ if and only if $B - A$ is a positive semi-definite matrix. We define the space complexity and time complexity as the memory and time usage in each round, respectively. For example, the time complexity of performing SVD on a matrix $A \in \mathbb{R}^{m \times n}$ is $O(\min\{m^2 n, n^2 m\})$. For simplicity, we write $\|\cdot\|$ for $\|\cdot\|_2$ by default. $\lambda_{[1:\tau]}$ is a sequence containing $\tau$ elements, $\max\{\lambda_{[1:\tau]}, a\}$ represents a new sequence in which each element is the maximum of $\lambda_i$ and $a$, and $\mathrm{diag}(\lambda_{[1:\tau]})$ is a diagonal matrix whose $i$-th diagonal element is $\lambda_i$. In the proofs, we do not require smoothness of the loss functions and we do not explicitly distinguish subgradients from gradients.

A.1. Calculation of $\tilde{G}_t^{-1/2}$

In FTSL, we need to calculate $\tilde{G}_t^{-1/2}$, which cannot be directly derived by the Woodbury formula (Hager, 1989). We provide the following derivation. Let $U_t^\perp$ be the complementary subspace of $U_t \in \mathbb{R}^{d \times \tau}$, so that $I_{d \times d} = U_t U_t^\top + U_t^\perp (U_t^\perp)^\top$. Denoting $\Sigma_t = \mathrm{diag}(\lambda^{(t)}_{[1:\tau]} - \lambda^{(t)}_\tau) \in \mathbb{R}^{\tau \times \tau}$, we have
$$\rho_{1:t} I_{d \times d} + U_t \Sigma_t U_t^\top = U_t(\Sigma_t + \rho_{1:t} I_{\tau \times \tau})U_t^\top + \rho_{1:t}\, U_t^\perp (U_t^\perp)^\top = [U_t; U_t^\perp]\,\Sigma'_t\,[U_t; U_t^\perp]^\top,$$
where
$$\Sigma'_t = \begin{pmatrix} \Sigma_t + \rho_{1:t} I_{\tau \times \tau} & 0 \\ 0 & \rho_{1:t} I_{(d-\tau) \times (d-\tau)} \end{pmatrix}.$$
Therefore, we have
$$\sqrt{\rho_{1:t} I_{d \times d} + U_t \Sigma_t U_t^\top} = U_t\sqrt{\Sigma_t + \rho_{1:t} I_{\tau \times \tau}}\,U_t^\top + \sqrt{\rho_{1:t}}\,U_t^\perp (U_t^\perp)^\top = \sqrt{\rho_{1:t}}\,I_{d \times d} + U_t\big(\sqrt{\Sigma_t + \rho_{1:t} I_{\tau \times \tau}} - \sqrt{\rho_{1:t}}\,I_{\tau \times \tau}\big)U_t^\top.$$
We can now apply the Woodbury formula (Hager, 1989) to this expression and derive
$$(\rho_{1:t} I_{d \times d} + U_t \Sigma_t U_t^\top)^{-1/2} = \frac{1}{\sqrt{\rho_{1:t}}}\, I_{d \times d} - U_t\Big(\frac{1}{\sqrt{\rho_{1:t}}}\, I_{\tau \times \tau} - (\Sigma_t + \rho_{1:t} I_{\tau \times \tau})^{-1/2}\Big)U_t^\top.$$

A.2. Proof of Theorem 4.1

Before giving the proof of Theorem 4.1, we first introduce some supporting lemmas.

Lemma A.1. (Proposition 2 in Duchi et al. (2011)) Let $\{x_t\}$ be the decisions of Algorithm 2 and $x^* \in \arg\min_{x \in \mathbb{R}^d} \sum_{t=1}^T f_t(x)$. We have
$$\sum_{t=1}^T f_t(x_t) - f_t(x^*) \le \frac{1}{\eta}\Psi_T(x^*) + \frac{\eta}{2}\sum_{t=1}^T \|g_t\|^2_{\tilde{G}_{t-1}^{-1/2}},$$
where $\Psi_T(x^*) = \frac{1}{2}\big\langle x^*, \tilde{G}_T^{1/2} x^*\big\rangle$.

Lemma A.2. (Lemma 10 in Duchi et al.
(2011)) Let $G_t = \sum_{i=1}^t g_i g_i^\top$. We have
$$\sum_{t=1}^T \big\langle g_t, (G_t^{1/2})^{-1} g_t\big\rangle \le 2\sum_{t=1}^T \big\langle g_t, (G_T^{1/2})^{-1} g_t\big\rangle = 2\,\mathrm{tr}\big(G_T^{1/2}\big).$$

Then, we illustrate how FD approximates the original matrix.

Lemma A.3. (Remark 11 in Feinberg et al. (2023)) With $G_t = \sum_{i=1}^t g_i g_i^\top$, for the FD results in FTSL, we have
$$B_t B_t^\top \preceq G_t \preceq \tilde{G}_t = \rho_{1:t} I_{d \times d} + B_t B_t^\top.$$

By using Lemma A.1, we can bound the regret of Algorithm 2 by
$$\sum_{t=1}^T f_t(x_t) - f_t(x^*) \le \underbrace{\frac{1}{\eta}\Psi_T(x^*)}_{R_D} + \underbrace{\frac{\eta}{2}\sum_{t=1}^T \|g_t\|^2_{\tilde{G}_{t-1}^{-1/2}}}_{R_G}. \qquad (4)$$
We then bound the terms $R_D$ and $R_G$, respectively. As for $R_D$, we have
$$\Psi_T(x^*) = \frac{1}{2}\big\langle x^*, \tilde{G}_T^{1/2} x^*\big\rangle = \frac{1}{2}\big\langle x^*, (B_T B_T^\top + \rho_{1:T} I)^{1/2} x^*\big\rangle \le \frac{1}{2}\big\langle x^*, (B_T B_T^\top)^{1/2} x^*\big\rangle + \frac{1}{2}\big\langle x^*, (\rho_{1:T} I)^{1/2} x^*\big\rangle \le \frac{1}{2}\sqrt{\rho_{1:T}}\,\|x^*\|^2 + \frac{1}{2}\big\langle x^*, G_T^{1/2} x^*\big\rangle \le \frac{1}{2} D^2 \sqrt{\rho_{1:T}} + \frac{1}{2} D^2 \lambda_{\max}(G_T^{1/2}) \le \frac{1}{2} D^2 \sqrt{\rho_{1:T}} + \frac{1}{2} D^2\,\mathrm{tr}(G_T^{1/2}),$$
where the first inequality is due to $(A+B)^{1/2} \preceq A^{1/2} + B^{1/2}$ for PSD matrices $A$, $B$, and we assume $\|x^*\| \le D$. Therefore, we can get
$$\frac{1}{\eta}\Psi_T(x^*) \le \frac{1}{2\eta}\big(D^2\sqrt{\rho_{1:T}} + D^2\,\mathrm{tr}(G_T^{1/2})\big). \qquad (5)$$
To bound $R_G$, we first derive a lower bound on $\tilde{G}_{t-1}$, which connects $\tilde{G}_{t-1}$ with $\tilde{G}_t$. Denoting $a_t = \min\big\{\tfrac{\rho_{1:t-1}}{\|g_t\|^2 + \rho_{1:t-1}}, 1\big\} \le 1$, we have
$$\tilde{G}_{t-1} = B_{t-1} B_{t-1}^\top + \rho_{1:t-1} I_{d \times d} \succeq a_t\big(\|g_t\|^2 I_{d \times d} + \rho_{1:t-1} I_{d \times d} + B_{t-1} B_{t-1}^\top\big) \succeq a_t\big(\|g_t\|^2 I_{d \times d} + \tilde{G}_{t-1}\big) \succeq a_t \tilde{G}_t,$$
which means $\tilde{G}_{t-1}^{-1} \preceq \frac{1}{a_t}\tilde{G}_t^{-1}$. Then, letting $C_1 = \max_{t \in [T]} \frac{1}{a_t}$, we have
$$\sum_{t=1}^T \|g_t\|^2_{\tilde{G}_{t-1}^{-1/2}} = \sum_{t=1}^T \big\langle g_t, (\tilde{G}_{t-1}^{1/2})^{-1} g_t\big\rangle \le C_1 \sum_{t=1}^T \big\langle g_t, (\tilde{G}_t^{1/2})^{-1} g_t\big\rangle \le C_1 \sum_{t=1}^T \big\langle g_t, (G_t^{1/2})^{-1} g_t\big\rangle \le 2 C_1\,\mathrm{tr}(G_T^{1/2}),$$
where the last inequality is due to Lemma A.2. Therefore, we can get
$$\frac{\eta}{2}\sum_{t=1}^T \|g_t\|^2_{\tilde{G}_{t-1}^{-1/2}} \le \eta C_1\,\mathrm{tr}(G_T^{1/2}). \qquad (6)$$
By combining inequalities (5) and (6), and setting $\eta = D/\sqrt{2}$, we obtain the regret bound of FTSL:
$$\sum_{t=1}^T f_t(x_t) - f_t(x^*) \le \eta C_1\,\mathrm{tr}(G_T^{1/2}) + \frac{1}{2\eta}\big(D^2\sqrt{\rho_{1:T}} + D^2\,\mathrm{tr}(G_T^{1/2})\big) \le \frac{D}{\sqrt{2}}\sqrt{\rho_{1:T}} + \Big(C_1 + \frac{1}{2}\Big)\sqrt{2}\,D\,\mathrm{tr}(G_T^{1/2}) = O\big(\mathrm{tr}(G_T^{1/2}) + \sqrt{\rho_{1:T}}\big).$$
Notably, the regret bound of Theorem 4.1 can be reformulated, as also presented in Feinberg et al. (2023).

Corollary A.4. Defining $\lambda_{\tau:d}(G_T) = \sum_{i=\tau}^d \lambda_i(G_T)$, where $\lambda_i$ is the $i$-th largest eigenvalue of $G_T$, we can rewrite the regret bound of FTSL as
$$R(T) \le O\big(\mathrm{tr}(G_T^{1/2}) + \sqrt{\lambda_{\tau:d}(G_T)}\big).$$

A.3.
Proof of Corollary A.4

We first give a lemma upper-bounding the cumulative escaped mass $\rho_{1:T}$.

Lemma A.5. (Lemma 1 in Feinberg et al. (2023)) The cumulative escaped mass $\rho_{1:T}$ of FD can be upper bounded as
$$\rho_{1:T} \le \min_{k=0,\ldots,\tau-1} \frac{\sum_{i=k+1}^d \lambda_i(G_T)}{\tau - k} \le \sum_{i=\tau}^d \lambda_i(G_T) \overset{\mathrm{def}}{=} \lambda_{\tau:d}(G_T),$$
where the last inequality follows by setting $k = \tau - 1$.

Combining Theorem 4.1 with Lemma A.5 yields Corollary A.4.

A.4. Proof of Theorem 4.2

We first introduce some guarantees of the fast frequent directions (FFD) technique. In Algorithm 3, we do not perform SVD on the sketching matrix $B_t$ every round. Instead, we maintain two matrices $M_t$ and $V_t$ that represent the sketching matrix $B_t$ of Algorithm 2, that is, $V_t M_t^{1/2} = B_t \in \mathbb{R}^{d \times 2\tau}$. In each round $t$, after receiving a new gradient $g_t$, we first check whether this vector lies within the subspace spanned by $V_{t-1}$. If it does not, we normalize its residual and add it to $V_{t-1}$, thereby enlarging the span of the subspace and ensuring $V_{t-1} V_{t-1}^\top g_t = g_t$. Since we want the matrix $V_t$ to contain only orthonormal vectors, we add $\frac{g_t - V_{t-1}V_{t-1}^\top g_t}{\|g_t - V_{t-1}V_{t-1}^\top g_t\|_2}$ to the first all-zero column. In round $t$, we have
$$V_{t-1} M_{t-1} V_{t-1}^\top + g_t g_t^\top = V_{t-1} M_{t-1} V_{t-1}^\top + V_{t-1} V_{t-1}^\top g_t g_t^\top V_{t-1} V_{t-1}^\top = V_{t-1}\big(M_{t-1} + V_{t-1}^\top g_t g_t^\top V_{t-1}\big) V_{t-1}^\top = V_t\big(M_{t-1} + V_t^\top g_t g_t^\top V_t\big) V_t^\top.$$
According to Lemma A.3, we have $V_t M_t V_t^\top = B_t B_t^\top \preceq G_t$. In the following, we prove $\tilde{G}_t \succeq G_t$, which is important for our analysis. Suppose we delete the eigenvalues of $M_k$ in round $k$; we compute $\tilde{G}_k$ before deleting the eigenvalues of $M_k$. In round $k$, before we update $V_k$, we have $V_k = V_{k-1}$, and we let $V_k$ denote the matrix before its columns are deleted. As we perform SVD on $V_k M_k^{1/2} \in \mathbb{R}^{d \times 2\tau}$, we delete the eigenvalues of $M_k$, update $V_k = V_k U_k$ and set the last $\tau + 1$ columns to zero. In round $k+1$, we have
$$\tilde{G}_{k+1} = V_{k+1}\big(\mathrm{diag}(\max\{\lambda^{(k)}_{[1:2\tau]} - \lambda^{(k)}_\tau, 0\}) + V_{k+1}^\top g_{k+1} (V_{k+1}^\top g_{k+1})^\top\big) V_{k+1}^\top + \rho_{1:k+1} I_{d \times d},$$
and we can derive
$$\begin{aligned}
& V_{k+1}\big(\mathrm{diag}(\max\{\lambda^{(k)}_{[1:2\tau]} - \lambda^{(k)}_\tau, 0\}) + V_{k+1}^\top g_{k+1} (V_{k+1}^\top g_{k+1})^\top\big) V_{k+1}^\top + \rho_{k+1} I_{d \times d} \\
&= V_{k+1}\,\mathrm{diag}(\max\{\lambda^{(k)}_{[1:2\tau]} - \lambda^{(k)}_\tau, 0\})\,V_{k+1}^\top + \lambda^{(k)}_\tau I_{d \times d} + g_{k+1} g_{k+1}^\top \\
&\succeq V_k U_k\,\mathrm{diag}(\max\{\lambda^{(k)}_{[1:2\tau]} - \lambda^{(k)}_\tau, 0\})\,U_k^\top V_k^\top + \lambda^{(k)}_\tau (V_k U_k)(V_k U_k)^\top + g_{k+1} g_{k+1}^\top \\
&= V_k U_k\,\mathrm{diag}(\max\{\lambda^{(k)}_{[1:2\tau]} - \lambda^{(k)}_\tau, 0\} + \lambda^{(k)}_\tau)\,U_k^\top V_k^\top + g_{k+1} g_{k+1}^\top \\
&\succeq V_k U_k\,\mathrm{diag}(\lambda^{(k)}_{[1:2\tau]})\,U_k^\top V_k^\top + g_{k+1} g_{k+1}^\top = V_k M_k V_k^\top + g_{k+1} g_{k+1}^\top,
\end{aligned}$$
where the first inequality holds because only the first $\tau - 1$ entries of $\mathrm{diag}(\max\{\lambda^{(k)}_{[1:2\tau]} - \lambda^{(k)}_\tau, 0\})$ are non-zero, the first $\tau - 1$ columns of $V_{k+1}$ and $V_k U_k$ are the same, and $U_k$ and $V_k$ are orthogonal matrices. If we do not delete the eigenvalues of $M_k$ in round $k$, then $\rho_{k+1} = 0$ and
$$V_{k+1} M_{k+1} V_{k+1}^\top + \rho_{k+1} I_{d \times d} = V_{k+1}\big(M_k + V_{k+1}^\top g_{k+1}(V_{k+1}^\top g_{k+1})^\top\big) V_{k+1}^\top = V_{k+1} M_k V_{k+1}^\top + g_{k+1} g_{k+1}^\top.$$
Assume there are at most $\ell$ eigenvalues in $M_k$, $\ell \le 2\tau - 1$, at most $\ell + 1$ non-zero columns in $V_{k+1}$, at most $\ell$ non-zero columns in $V_k$, and the first $\ell$ columns of $V_{k+1}$ and $V_k$ are the same. We then have
$$V_{k+1} M_{k+1} V_{k+1}^\top + \rho_{k+1} I_{d \times d} = V_{k+1} U_k\,\mathrm{diag}(\lambda^{(k)}_{[1:2\tau]})\,U_k^\top V_{k+1}^\top + g_{k+1} g_{k+1}^\top = V_k U_k\,\mathrm{diag}(\lambda^{(k)}_{[1:2\tau]})\,U_k^\top V_k^\top + g_{k+1} g_{k+1}^\top = V_k M_k V_k^\top + g_{k+1} g_{k+1}^\top,$$
where the second equality holds because only the first $\ell$ elements of $\lambda^{(k)}_{[1:2\tau]}$ are non-zero and the first $\ell$ columns of $V_k$ and $V_{k+1}$ are the same. Therefore, we have
$$\tilde{G}_t = V_t M_t V_t^\top + \rho_{1:t} I_{d \times d} \succeq V_{t-1} M_{t-1} V_{t-1}^\top + \rho_{1:t-1} I_{d \times d} + g_t g_t^\top = \tilde{G}_{t-1} + g_t g_t^\top.$$
By summing up, we can get $\tilde{G}_t \succeq \sum_{i=1}^t g_i g_i^\top = G_t$.

Next, we need to ensure that, with FFD, the preconditioning matrix $\tilde{G}_t$ is monotone. It is straightforward to verify that if we do not remove the eigenvalues of $M_t$, $\tilde{G}_t$ remains monotone. We now prove that $\tilde{G}_t$ remains monotone even if we delete the eigenvalues of $M_t$. Assume we delete the eigenvalues of $M_k$ in round $k$. In round $k$, the matrix $\tilde{G}_k = \rho_{1:k} I_{d \times d} + V_k U_k\,\mathrm{diag}(\lambda^{(k)}_{[1:2\tau]})\,U_k^\top V_k^\top$, and $\tilde{G}_{k+1} = \rho_{1:k+1} I_{d \times d} + V_{k+1} M_{k+1} V_{k+1}^\top$. In round $k+1$, since we do not delete the eigenvalues of $M_{k+1}$, the first $\tau - 1$ columns of $V_{k+1}$ of round $k+1$ and $V_k U_k$ of round $k$ are the same (note that $V_k$ differs between rounds $k$ and $k+1$). We have $M_{k+1} = \mathrm{diag}(\max\{\lambda^{(k)}_{[1:2\tau]} - \lambda^{(k)}_\tau, 0\}) + (V_{k+1}^\top g_{k+1})(V_{k+1}^\top g_{k+1})^\top$. As we set $\rho_{k+1} = \lambda^{(k)}_\tau$, we can ensure
$$\begin{aligned}
\tilde{G}_{k+1} &= \rho_{1:k+1} I_{d \times d} + V_{k+1}\big(\mathrm{diag}(\max\{\lambda^{(k)}_{[1:2\tau]} - \lambda^{(k)}_\tau, 0\}) + (V_{k+1}^\top g_{k+1})(V_{k+1}^\top g_{k+1})^\top\big) V_{k+1}^\top \\
&\succeq \rho_{1:k} I_{d \times d} + \lambda^{(k)}_\tau I_{d \times d} + V_{k+1}\,\mathrm{diag}(\max\{\lambda^{(k)}_{[1:2\tau]} - \lambda^{(k)}_\tau, 0\})\,V_{k+1}^\top \\
&\succeq \rho_{1:k} I_{d \times d} + V_k U_k\,\lambda^{(k)}_\tau I_{d \times d}\,U_k^\top V_k^\top + V_k U_k\,\mathrm{diag}(\max\{\lambda^{(k)}_{[1:2\tau]} - \lambda^{(k)}_\tau, 0\})\,U_k^\top V_k^\top \\
&\succeq \rho_{1:k} I_{d \times d} + V_k U_k\,\mathrm{diag}(\lambda^{(k)}_{[1:2\tau]})\,U_k^\top V_k^\top = \tilde{G}_k,
\end{aligned}$$
where the second inequality holds because the $\tau$-th to $2\tau$-th entries of $\mathrm{diag}(\max\{\lambda^{(k)}_{[1:2\tau]} - \lambda^{(k)}_\tau, 0\})$ are zero and the first $\tau - 1$ columns of $V_k U_k$ of round $k$ and $V_{k+1}$ of round $k+1$ are the same, so we can replace $V_{k+1}$ with $V_k U_k$, and $V_k$, $U_k$ are orthogonal matrices. Therefore, $\tilde{G}_t$ is monotone in FTFSL.

Using Eq. (4), we have
$$\sum_{t=1}^T f_t(x_t) - f_t(x^*) \le \underbrace{\frac{1}{\eta}\Psi_T(x^*)}_{R_D} + \underbrace{\frac{\eta}{2}\sum_{t=1}^T \|g_t\|^2_{\tilde{G}_{t-1}^{-1/2}}}_{R_G}.$$
As for the term $R_D$, we can derive
$$\frac{1}{\eta}\Psi_T(x^*) = \frac{1}{2\eta}\big\langle x^*, \tilde{G}_T^{1/2} x^*\big\rangle = \frac{1}{2\eta}\big\langle x^*, (V_T M_T V_T^\top + \rho_{1:T} I_{d \times d})^{1/2} x^*\big\rangle \le \frac{1}{2\eta}\big\langle x^*, (V_T M_T V_T^\top)^{1/2} x^*\big\rangle + \frac{1}{2\eta}\big\langle x^*, (\rho_{1:T} I_{d \times d})^{1/2} x^*\big\rangle \le \frac{1}{2\eta} D^2\sqrt{\rho_{1:T}} + \frac{1}{2\eta} D^2\,\mathrm{tr}(G_T^{1/2}), \qquad (8)$$
where the second inequality is due to $V_T M_T V_T^\top = B_T B_T^\top \preceq G_T$ and we assume $\|x^*\| \le D$. For the term $R_G$, denoting $b_t = \min\big\{\tfrac{\rho_{1:t-1}}{\|g_t\|^2 + \rho_{1:t-1}}, 1\big\} \le 1$, we have
$$\tilde{G}_{t-1} = V_{t-1} M_{t-1} V_{t-1}^\top + \rho_{1:t-1} I_{d \times d} \succeq b_t\big(\|g_t\|^2 I_{d \times d} + \rho_{1:t-1} I_{d \times d} + B_{t-1} B_{t-1}^\top\big) \succeq b_t\big(\|g_t\|^2 I_{d \times d} + \tilde{G}_{t-1}\big) \succeq b_t \tilde{G}_t.$$
Then, letting $C_2 = \max_{t \in [T]} \frac{1}{b_t}$, we can get
$$\sum_{t=1}^T \|g_t\|^2_{\tilde{G}_{t-1}^{-1/2}} = \sum_{t=1}^T \big\langle g_t, (\tilde{G}_{t-1}^{1/2})^{-1} g_t\big\rangle \le C_2 \sum_{t=1}^T \big\langle g_t, (\tilde{G}_t^{1/2})^{-1} g_t\big\rangle \le C_2 \sum_{t=1}^T \big\langle g_t, (G_t^{1/2})^{-1} g_t\big\rangle \le 2 C_2\,\mathrm{tr}(G_T^{1/2}),$$
where the last inequality is due to Lemma A.2. Therefore, we have
$$\frac{\eta}{2}\sum_{t=1}^T \|g_t\|^2_{\tilde{G}_{t-1}^{-1/2}} \le \eta C_2\,\mathrm{tr}(G_T^{1/2}). \qquad (9)$$
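Both this proof and Section A.1 manipulate the sketched matrix $\tilde{G}_t = \rho_{1:t} I + U_t \Sigma_t U_t^\top$. The closed form for its inverse square root derived in Section A.1 can be verified numerically; the snippet below (all names illustrative) compares it against a direct eigendecomposition.

```python
import numpy as np

rng = np.random.default_rng(1)
d, tau, rho = 8, 3, 0.7

# Random d x tau orthonormal U and positive diagonal Sigma.
U, _ = np.linalg.qr(rng.standard_normal((d, tau)))
sigma = rng.uniform(0.5, 2.0, size=tau)

A = rho * np.eye(d) + (U * sigma) @ U.T      # rho*I + U Sigma U^T

# Closed form from Section A.1:
# A^{-1/2} = (1/sqrt(rho)) I - U (1/sqrt(rho) I - (Sigma + rho I)^{-1/2}) U^T.
inv_sqrt = (np.eye(d) / np.sqrt(rho)
            - (U * (1.0 / np.sqrt(rho) - (sigma + rho) ** -0.5)) @ U.T)

# Reference: A^{-1/2} via a full eigendecomposition.
lam, V = np.linalg.eigh(A)
direct = (V * lam ** -0.5) @ V.T
assert np.allclose(inv_sqrt, direct, atol=1e-10)
```

The closed form only touches the $d \times \tau$ factor, which is what makes the $O(\tau d)$ per-round cost possible in the first place.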
By combining inequalities (8) and (9), and setting $\eta = D/\sqrt{2}$, we can derive Theorem 4.2:
$$\frac{1}{2\eta} D^2\sqrt{\rho_{1:T}} + \frac{1}{2\eta} D^2\,\mathrm{tr}(G_T^{1/2}) + \eta C_2\,\mathrm{tr}(G_T^{1/2}) \le O\big(\mathrm{tr}(G_T^{1/2}) + \sqrt{\rho_{1:T}}\big).$$

A.5. Proof of Theorem 4.3

According to the proof of Theorem 4.2, we have the following properties of FFD: $V_t M_t^{1/2} = B_t$ and $V_t U_t \Sigma_t U_t^\top V_t^\top \preceq G_t \preceq \tilde{G}_t$. Similar to the proof of Theorem 3 in Feinberg et al. (2023), we have the following lemma.

Lemma A.6. Let $\{x_t\}$ be the decisions of Fast S-ADA and $x^* \in \arg\min_{x \in \mathbb{R}^d}\sum_{t=1}^T f_t(x)$. The regret of Fast S-ADA satisfies
$$\sum_{t=1}^T f_t(x_t) - f_t(x^*) \le \frac{1}{2\eta}\sum_{t=1}^T \|x_t - x^*\|^2_{\tilde{G}_t^{1/2} - \tilde{G}_{t-1}^{1/2}} + \frac{\eta}{2}\sum_{t=1}^T \|g_t\|^2_{\tilde{G}_t^{-1/2}}.$$
We first bound the term $\frac{\eta}{2}\sum_{t=1}^T \|g_t\|^2_{\tilde{G}_t^{-1/2}}$, which is easier to handle than its counterpart in FTFSL, as we can directly apply a lemma to it:
$$\sum_{t=1}^T \|g_t\|^2_{\tilde{G}_t^{-1/2}} = \sum_{t=1}^T \big\langle g_t, (\tilde{G}_t^{1/2})^{-1} g_t\big\rangle \le \sum_{t=1}^T \big\langle g_t, (G_t^{1/2})^{-1} g_t\big\rangle \le 2\,\mathrm{tr}(G_T^{1/2}),$$
where the first inequality is due to $\tilde{G}_t^{-1} \preceq G_t^{-1}$ and the last inequality is due to Lemma A.2. Next, we bound the term $\frac{1}{2\eta}\sum_{t=1}^T \|x_t - x^*\|^2_{\tilde{G}_t^{1/2} - \tilde{G}_{t-1}^{1/2}}$, which introduces the dependence on the dimensionality $d$. We have
$$\frac{1}{2\eta}\sum_{t=1}^T \|x_t - x^*\|^2_{\tilde{G}_t^{1/2} - \tilde{G}_{t-1}^{1/2}} \le \frac{D_1^2}{2\eta}\,\mathrm{tr}\big(\tilde{G}_T^{1/2}\big),$$
where this inequality is due to the monotonicity of $\tilde{G}_t$ and $D_1 = \max_{t \in [T]}\|x_t - x^*\|$. Then we need to bound the term $\mathrm{tr}(\tilde{G}_T^{1/2})$. Assume we delete the eigenvalues of $M_k$ in round $k$. In round $k$, before we update $V_k$, $V_k = V_{k-1}$, and we use $V_k$ to denote the matrix before its columns are deleted. As we perform SVD on $V_k M_k^{1/2} \in \mathbb{R}^{d \times 2\tau}$, we denote by $V_k^{1:\tau} \in \mathbb{R}^{d \times \tau}$ the first $\tau$ columns of $V_k U_k$ before the last $\tau + 1$ columns are set to zero, and by $N_k \in \mathbb{R}^{d \times (d-\tau)}$ the complementary subspace of $V_k^{1:\tau}$; the first $\tau - 1$ columns of $V_{k+1}$ and $V_k^{1:\tau}$ are the same, $\rho_{k+1} = \lambda^{(k)}_\tau$, and $[V_k^{1:\tau}; N_k][V_k^{1:\tau}; N_k]^\top = I_{d \times d}$. In round $k+1$, we have
$$\tilde{G}_{k+1} = V_{k+1}\big(\mathrm{diag}(\max\{\lambda^{(k)}_{[1:2\tau]} - \lambda^{(k)}_\tau, 0\}) + V_{k+1}^\top g_{k+1}(V_{k+1}^\top g_{k+1})^\top\big) V_{k+1}^\top + \rho_{1:k+1} I_{d \times d},$$
and we can derive
$$\begin{aligned}
& V_{k+1}\big(\mathrm{diag}(\max\{\lambda^{(k)}_{[1:2\tau]} - \lambda^{(k)}_\tau, 0\}) + V_{k+1}^\top g_{k+1}(V_{k+1}^\top g_{k+1})^\top\big) V_{k+1}^\top + \rho_{k+1} I_{d \times d} \\
&= V_{k+1}\,\mathrm{diag}(\max\{\lambda^{(k)}_{[1:2\tau]} - \lambda^{(k)}_\tau, 0\})\,V_{k+1}^\top + \lambda^{(k)}_\tau I_{d \times d} + g_{k+1} g_{k+1}^\top \\
&= V_k U_k\,\mathrm{diag}(\max\{\lambda^{(k)}_{[1:2\tau]} - \lambda^{(k)}_\tau, 0\})\,U_k^\top V_k^\top + \lambda^{(k)}_\tau\big(V_k^{1:\tau}(V_k^{1:\tau})^\top + N_k N_k^\top\big) + g_{k+1} g_{k+1}^\top \\
&\preceq V_k U_k\,\mathrm{diag}(\max\{\lambda^{(k)}_{[1:2\tau]} - \lambda^{(k)}_\tau, 0\})\,U_k^\top V_k^\top + \lambda^{(k)}_\tau\big(V_k U_k (V_k U_k)^\top + N_k N_k^\top\big) + g_{k+1} g_{k+1}^\top \\
&\preceq V_k U_k\,\mathrm{diag}(\lambda^{(k)}_{[1:2\tau]})\,U_k^\top V_k^\top + \lambda^{(k)}_\tau N_k N_k^\top + g_{k+1} g_{k+1}^\top \\
&= \lambda^{(k)}_\tau N_k N_k^\top + V_k M_k V_k^\top + g_{k+1} g_{k+1}^\top = \rho_{k+1} N_k N_k^\top + V_k M_k V_k^\top + g_{k+1} g_{k+1}^\top,
\end{aligned}$$
where the second equality holds because only the first $\tau - 1$ entries of $\mathrm{diag}(\max\{\lambda^{(k)}_{[1:2\tau]} - \lambda^{(k)}_\tau, 0\})$ are non-zero and the first $\tau - 1$ columns of $V_{k+1}$ and $V_k U_k$ are the same, and the first inequality holds because $V_k^{1:\tau}$ contains at most $\tau$ orthogonal vectors. If we do not delete the eigenvalues of $M_k$ in round $k$, then $\rho_{k+1} = 0$ and we can derive
$$V_{k+1} M_{k+1} V_{k+1}^\top + \rho_{k+1} I_{d \times d} = V_{k+1}\big(M_k + V_{k+1}^\top g_{k+1}(V_{k+1}^\top g_{k+1})^\top\big) V_{k+1}^\top = V_{k+1} M_k V_{k+1}^\top + g_{k+1} g_{k+1}^\top = \rho_{k+1} N_k N_k^\top + V_{k+1} U_k\,\mathrm{diag}(\lambda^{(k)}_{[1:2\tau]})\,U_k^\top V_{k+1}^\top + g_{k+1} g_{k+1}^\top.$$
Assume there are at most $\ell$ eigenvalues in $M_k$, $\ell \le 2\tau - 1$; then there are at most $\ell + 1$ non-zero columns in $V_{k+1}$, at most $\ell$ non-zero columns in $V_k$, and the first $\ell$ columns of $V_{k+1}$ and $V_k$ are the same. We then have
$$V_{k+1} M_{k+1} V_{k+1}^\top + \rho_{k+1} I_{d \times d} = \rho_{k+1} N_k N_k^\top + V_k U_k\,\mathrm{diag}(\lambda^{(k)}_{[1:2\tau]})\,U_k^\top V_k^\top + g_{k+1} g_{k+1}^\top = \rho_{k+1} N_k N_k^\top + V_k M_k V_k^\top + g_{k+1} g_{k+1}^\top,$$
where the first equality holds because there are at most $\ell$ non-zero elements in $\lambda^{(k)}_{[1:2\tau]}$, the first $\ell$ columns of $V_k$ and $V_{k+1}$ are the same, and $\rho_{k+1} = 0$. Therefore, we have
$$V_{k+1} M_{k+1} V_{k+1}^\top + \rho_{k+1} I_{d \times d} \preceq \rho_{k+1} N_k N_k^\top + V_k M_k V_k^\top + g_{k+1} g_{k+1}^\top.$$
By induction, we have
$$\tilde{G}_T = V_T M_T V_T^\top + \rho_{1:T} I_{d \times d} \preceq \rho_{1:T-1} I_{d \times d} + V_{T-1} M_{T-1} V_{T-1}^\top + \rho_T N_T N_T^\top + g_T g_T^\top \preceq \cdots \preceq \sum_{t=1}^T \rho_t N_t N_t^\top + G_T.$$
Then we can bound $\mathrm{tr}(\tilde{G}_T^{1/2})$ by $\mathrm{tr}(G_T^{1/2}) + \mathrm{tr}\big((\sum_{t=1}^T \rho_t N_t N_t^\top)^{1/2}\big)$, so it remains to bound $\mathrm{tr}\big((\sum_{t=1}^T \rho_t N_t N_t^\top)^{1/2}\big)$. According to Corollary 4 in Feinberg et al. (2023), this term is upper bounded as
$$\mathrm{tr}\Big(\big(\textstyle\sum_{t=1}^T \rho_t N_t N_t^\top\big)^{1/2}\Big) \le \sqrt{d(d-\tau)\rho_{1:T}}.$$
Then we can derive the bound on $\mathrm{tr}(\tilde{G}_T^{1/2})$:
$$\mathrm{tr}(\tilde{G}_T^{1/2}) \le \mathrm{tr}(G_T^{1/2}) + \mathrm{tr}\Big(\big(\textstyle\sum_{t=1}^T \rho_t N_t N_t^\top\big)^{1/2}\Big) \le \mathrm{tr}(G_T^{1/2}) + \sqrt{d(d-\tau)\rho_{1:T}}.$$
By setting $\eta = D_1/\sqrt{2}$, we get
$$\frac{1}{2\eta}\sum_{t=1}^T \|x_t - x^*\|^2_{\tilde{G}_t^{1/2} - \tilde{G}_{t-1}^{1/2}} + \frac{\eta}{2}\sum_{t=1}^T \|g_t\|^2_{\tilde{G}_t^{-1/2}} \le \frac{D_1^2}{2\eta}\big(\mathrm{tr}(G_T^{1/2}) + \sqrt{d(d-\tau)\rho_{1:T}}\big) + \eta\,\mathrm{tr}(G_T^{1/2}) \le O\big(\mathrm{tr}(G_T^{1/2}) + \sqrt{d(d-\tau)\rho_{1:T}}\big).$$

A.6. Proof of Theorem 4.4

We now give the proof of Theorem 4.4. Since the update in Algorithm 5 is performed in matrix space, it poses challenges for the analysis; we therefore first introduce an equivalent update in vector form. Recall the update in Algorithm 5, $X_{t+1} = -\eta\, L_t^{-1/4}\, \bar{G}^X_t\, R_t^{-1/4}$. We define $H_t = L_t^{1/4} \otimes R_t^{1/4} \in \mathbb{R}^{mn \times mn}$, $\bar{L}_t = \hat{L}_t \hat{L}_t^\top \in \mathbb{R}^{m \times m}$, $\bar{R}_t = \hat{R}_t \hat{R}_t^\top \in \mathbb{R}^{n \times n}$, $g_t = \mathrm{vec}(G^X_t)$ and $x_t = \mathrm{vec}(X_t)$, where $\mathrm{vec}$ denotes the row-major vectorization of a matrix. The Kronecker product obeys the following properties, as shown in Gupta et al. (2018).

Lemma A.7. (Lemmas 3 and 4 in Gupta et al. (2018)) For matrices $A$, $A'$, $B$, $B'$ of appropriate dimensions and vectors $x$, $y$, $L \in \mathbb{R}^{m \times m}$, $R \in \mathbb{R}^{n \times n}$, $G \in \mathbb{R}^{m \times n}$, the following properties hold:
1. $(A \otimes B)(A' \otimes B') = (AA') \otimes (BB')$.
2. $(A \otimes B)^\top = A^\top \otimes B^\top$.
3. If $A, B \succ 0$, then $(A \otimes B)^{-1} = A^{-1} \otimes B^{-1}$.
4. If $A \succeq A' \succeq 0$ and $B \succeq B' \succeq 0$, then $A \otimes B \succeq A' \otimes B'$.
5. $\mathrm{tr}(A \otimes B) = \mathrm{tr}(A)\,\mathrm{tr}(B)$.
6. $\mathrm{vec}(x y^\top) = x \otimes y$.
7. $(L \otimes R^\top)\,\mathrm{vec}(G) = \mathrm{vec}(LGR)$.

By defining $\bar{g}_t = \sum_{i=1}^t g_i$, we can rewrite the update in Algorithm 5 as $x_{t+1} = -\eta H_t^{-1} \bar{g}_t$, which is equivalent to
$$x_{t+1} = \arg\min_{x \in \mathbb{R}^{mn}}\ \eta\,\langle \bar{g}_t, x\rangle + \frac{1}{2}\|x\|_{H_t}^2.$$
As $L_t$ and $R_t$ are monotonically increasing with $t$, it is not hard to see that $H_t$ is also monotonically increasing with $t$.
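Property 7 of Lemma A.7 is exactly what makes the matrix form of the update and its vectorized rewriting interchangeable. With NumPy's default row-major flatten playing the role of vec, this equivalence can be checked directly; the snippet below is a small illustrative verification, not the paper's code (it uses that $R^{1/4}$ is symmetric, so property 7 applies with $R^{1/4}$ in place of $R^\top$).

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, eta = 3, 4, 0.1

# Symmetric positive definite preconditioners L, R and a gradient sum Gbar.
A = rng.standard_normal((m, m)); L = A @ A.T + np.eye(m)
B = rng.standard_normal((n, n)); R = B @ B.T + np.eye(n)
Gbar = rng.standard_normal((m, n))

def mat_power(S, p):
    """Symmetric matrix power via eigendecomposition."""
    lam, U = np.linalg.eigh(S)
    return (U * lam ** p) @ U.T

# Matrix form: X = -eta * L^{-1/4} Gbar R^{-1/4}.
X = -eta * mat_power(L, -0.25) @ Gbar @ mat_power(R, -0.25)

# Vector form: x = -eta * H^{-1} vec(Gbar) with H = L^{1/4} kron R^{1/4}.
H = np.kron(mat_power(L, 0.25), mat_power(R, 0.25))
x = -eta * np.linalg.solve(H, Gbar.flatten())

assert np.allclose(X.flatten(), x)
```

In practice one never materializes the $mn \times mn$ matrix $H_t$; the matrix form costs only $O(m^3 + n^3 + mn(m+n))$ per update, which is the point of Shampoo's Kronecker structure.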
Thus, by using Lemma A.1, we have the analogous inequality
$$\sum_{t=1}^T f_t(x_t) - f_t(x^*) \le \underbrace{\frac{1}{\eta}\Psi_T(x^*)}_{R_D} + \underbrace{\frac{\eta}{2}\sum_{t=1}^T \|g_t\|^2_{H_{t-1}^{-1}}}_{R_G},$$
where $\Psi_T(x^*) = \frac{1}{2}\langle x^*, H_T x^*\rangle$. We first bound $R_D$:
$$\Psi_T(x^*) = \frac{1}{2}\langle x^*, H_T x^*\rangle \le \frac{1}{2}\|x^*\|^2\,\|H_T\|.$$
Then we introduce a lemma on the norm of a Kronecker product.

Lemma A.8. (Theorem 8 in Lancaster & Farahat (1972)) For two matrices $A$ and $B$, the following equality holds: $\|A \otimes B\| = \|A\|\,\|B\|$.

We have $\|x^*\| = \|X^*\|_F \le D_M$. According to Lemma A.8, $\|H_T\| = \|L_T^{1/4} \otimes R_T^{1/4}\| = \|L_T^{1/4}\|\,\|R_T^{1/4}\|$, so we need to bound $\|L_T^{1/4}\|$ and $\|R_T^{1/4}\|$, respectively. We first define $\mathcal{L}_T = \sum_{t=1}^T G^X_t (G^X_t)^\top + \epsilon I_{m \times m}$ and $\mathcal{R}_T = \sum_{t=1}^T (G^X_t)^\top G^X_t + \epsilon I_{n \times n}$. Using Lemma A.3, it is not hard to verify that $\bar{L}_t + \epsilon I_{m \times m} \preceq \mathcal{L}_t$ and $\bar{R}_t + \epsilon I_{n \times n} \preceq \mathcal{R}_t$. Therefore, we have
$$\|L_T^{1/4}\| = \|(\bar{L}_T + \epsilon I_{m \times m} + \rho^L_{1:T} I_{m \times m})^{1/4}\| \le \|(\mathcal{L}_T + \rho^L_{1:T} I_{m \times m})^{1/4}\| \le \|\mathcal{L}_T^{1/4} + (\rho^L_{1:T})^{1/4} I_{m \times m}\| \le \|\mathcal{L}_T^{1/4}\| + (\rho^L_{1:T})^{1/4} \le \mathrm{tr}(\mathcal{L}_T^{1/4}) + (\rho^L_{1:T})^{1/4},$$
where the second inequality is due to $(A+B)^{1/4} \preceq A^{1/4} + B^{1/4}$ for PSD matrices and the last inequality is due to $\|A\| \le \mathrm{tr}(A)$ for PSD $A$. Similarly, we have
$$\|R_T^{1/4}\| = \|(\bar{R}_T + \epsilon I_{n \times n} + \rho^R_{1:T} I_{n \times n})^{1/4}\| \le \|(\mathcal{R}_T + \rho^R_{1:T} I_{n \times n})^{1/4}\| \le \|\mathcal{R}_T^{1/4}\| + (\rho^R_{1:T})^{1/4} \le \mathrm{tr}(\mathcal{R}_T^{1/4}) + (\rho^R_{1:T})^{1/4}.$$
Therefore, we can get
$$\frac{1}{\eta}\Psi_T(x^*) \le \frac{D_M^2}{2\eta}\big(\mathrm{tr}(\mathcal{L}_T^{1/4}) + (\rho^L_{1:T})^{1/4}\big)\big(\mathrm{tr}(\mathcal{R}_T^{1/4}) + (\rho^R_{1:T})^{1/4}\big). \qquad (12)$$
To bound $R_G$, we first introduce a lemma.

Lemma A.9. (Lemma 8 in Gupta et al. (2018)) If $G^X_t \in \mathbb{R}^{m \times n}$ has rank at most $r$ and $g_t = \mathrm{vec}(G^X_t)$, then for any $\epsilon \ge 0$ and any $t$,
$$\epsilon I_{mn \times mn} + \frac{1}{r}\sum_{i=1}^t g_i g_i^\top \preceq \Big(\epsilon I_{m \times m} + \sum_{i=1}^t G^X_i (G^X_i)^\top\Big)^{1/2} \otimes \Big(\epsilon I_{n \times n} + \sum_{i=1}^t (G^X_i)^\top G^X_i\Big)^{1/2}.$$

Then we utilize a lemma from Feinberg et al. (2023).

Lemma A.10. (Lemma 14 in Feinberg et al. (2023)) Let $V_t \Sigma^L_t V_t^\top = \bar{L}_{t-1} + G^X_t (G^X_t)^\top$ be the eigendecomposition of the un-deflated sketch. We assume $\mathrm{rank}(\Sigma^L_t) = k$, $k \in [\tau - 1, \tau - 1 + r]$. Write $V_t = [V_t^{\uparrow}, V_t^{\downarrow}]$, where $V_t^{\uparrow}$ contains the first $k$ columns of $V_t$. Similarly, for the right preconditioner let $W_t \Sigma^R_t W_t^\top = \bar{R}_{t-1} + (G^X_t)^\top G^X_t$, and write $W_t = [W_t^{\uparrow}, W_t^{\downarrow}]$, where $W_t^{\uparrow}$ contains the first $k$ columns of $W_t$. Define $N^L_t = V_t^{\downarrow} (V_t^{\downarrow})^\top$ and $N^R_t = W_t^{\downarrow} (W_t^{\downarrow})^\top$. Then we have
$$\sum_{i=1}^t G^X_i (G^X_i)^\top + \sum_{i=1}^t \rho^L_i N^L_i + \epsilon I_{m \times m} = M^L_t, \qquad \sum_{i=1}^t (G^X_i)^\top G^X_i + \sum_{i=1}^t \rho^R_i N^R_i + \epsilon I_{n \times n} = M^R_t.$$

According to Lemma A.3 and Lemma A.10, we have $M^L_t \succeq \epsilon I_{m \times m} + \sum_{i=1}^t G^X_i (G^X_i)^\top$ and $M^R_t \succeq \epsilon I_{n \times n} + \sum_{i=1}^t (G^X_i)^\top G^X_i$. Using Lemma A.7 together with Lemma A.9, we can derive
$$\Big(\epsilon I_{mn \times mn} + \frac{1}{r}\sum_{i=1}^t g_i g_i^\top\Big)^{1/2} \preceq (M^L_t)^{1/4} \otimes (M^R_t)^{1/4} \preceq L_t^{1/4} \otimes R_t^{1/4} = H_t.$$
Then we define $\hat{H}_t = \big(r\epsilon I_{mn \times mn} + \sum_{i=1}^t g_i g_i^\top\big)^{1/2}$, and have $\hat{H}_t \preceq \sqrt{r}\, H_t$. We want to give a lower bound of $\hat{H}_{t-1}$. By defining $c_t = \frac{r\epsilon}{\|g_t\|^2 + r\epsilon}$, we have
$$\hat{H}_{t-1}^2 = r\epsilon I_{mn \times mn} + \sum_{i=1}^{t-1} g_i g_i^\top \succeq c_t\Big(\|g_t\|^2 I_{mn \times mn} + \sum_{i=1}^{t-1} g_i g_i^\top + r\epsilon I_{mn \times mn}\Big) \succeq c_t\Big(r\epsilon I_{mn \times mn} + \sum_{i=1}^t g_i g_i^\top\Big) = c_t \hat{H}_t^2.$$
Define $A_1 = \min_{t \in [T]} \sqrt{c_t}$. We have $A_1 \hat{H}_t \preceq \hat{H}_{t-1} \preceq \sqrt{r}\,H_{t-1}$, which means $\frac{1}{\sqrt{r}} H_{t-1}^{-1} \preceq \hat{H}_{t-1}^{-1} \preceq \frac{1}{A_1}\hat{H}_t^{-1}$. We can then derive
$$\sum_{t=1}^T \|g_t\|^2_{H_{t-1}^{-1}} = \sum_{t=1}^T \big\langle g_t, H_{t-1}^{-1} g_t\big\rangle \le \sqrt{r}\sum_{t=1}^T \big\langle g_t, \hat{H}_{t-1}^{-1} g_t\big\rangle \le \frac{\sqrt{r}}{A_1}\sum_{t=1}^T \big\langle g_t, \hat{H}_t^{-1} g_t\big\rangle.$$
Based on the proof of Theorem 5 in Feinberg et al. (2023), we obtain the following inequality:
$$\sum_{t=1}^T \big\langle g_t, \hat{H}_t^{-1} g_t\big\rangle \le 2\,\mathrm{tr}(\hat{H}_T).$$
Therefore, we have
$$\sum_{t=1}^T \|g_t\|^2_{H_{t-1}^{-1}} \le \frac{\sqrt{r}}{A_1}\sum_{t=1}^T \big\langle g_t, \hat{H}_t^{-1} g_t\big\rangle \le \frac{2\sqrt{r}}{A_1}\,\mathrm{tr}(\hat{H}_T).$$
Then, we need to bound the term $\mathrm{tr}(\hat{H}_T)$:
$$\hat{H}_T^2 = r\epsilon I_{mn \times mn} + \sum_{i=1}^T g_i g_i^\top \preceq r\Big(\epsilon I_{m \times m} + \sum_{i=1}^T G^X_i (G^X_i)^\top\Big)^{1/2} \otimes \Big(\epsilon I_{n \times n} + \sum_{i=1}^T (G^X_i)^\top G^X_i\Big)^{1/2} = r\,\mathcal{L}_T^{1/2} \otimes \mathcal{R}_T^{1/2},$$
which means $\hat{H}_T \preceq \sqrt{r}\,\mathcal{L}_T^{1/4} \otimes \mathcal{R}_T^{1/4}$ and hence $\mathrm{tr}(\hat{H}_T) \le \sqrt{r}\,\mathrm{tr}(\mathcal{L}_T^{1/4})\,\mathrm{tr}(\mathcal{R}_T^{1/4})$. By defining $C_3 = \frac{1}{A_1}$, we can derive the bound on $R_G$:
$$\frac{\eta}{2}\sum_{t=1}^T \|g_t\|^2_{H_{t-1}^{-1}} \le \eta C_3 \sqrt{r}\,\mathrm{tr}(\hat{H}_T) \le \eta C_3 r\,\mathrm{tr}(\mathcal{L}_T^{1/4})\,\mathrm{tr}(\mathcal{R}_T^{1/4}).$$
By setting $\eta = D_M/\sqrt{r}$, the final regret bound is
$$\sum_{t=1}^T f_t(x_t) - f_t(x^*) \le \eta C_3 r\,\mathrm{tr}(\mathcal{L}_T^{1/4})\,\mathrm{tr}(\mathcal{R}_T^{1/4}) + \frac{D_M^2}{2\eta}\big(\mathrm{tr}(\mathcal{L}_T^{1/4}) + (\rho^L_{1:T})^{1/4}\big)\big(\mathrm{tr}(\mathcal{R}_T^{1/4}) + (\rho^R_{1:T})^{1/4}\big) \le \sqrt{r}\,D_M C_3\,\mathrm{tr}(\mathcal{L}_T^{1/4})\,\mathrm{tr}(\mathcal{R}_T^{1/4}) + \frac{\sqrt{r}\,D_M}{2}\big(\mathrm{tr}(\mathcal{L}_T^{1/4}) + (\rho^L_{1:T})^{1/4}\big)\big(\mathrm{tr}(\mathcal{R}_T^{1/4}) + (\rho^R_{1:T})^{1/4}\big) = O\Big(\sqrt{r}\,\big(\mathrm{tr}(\mathcal{L}_T^{1/4}) + (\rho^L_{1:T})^{1/4}\big)\big(\mathrm{tr}(\mathcal{R}_T^{1/4}) + (\rho^R_{1:T})^{1/4}\big)\Big).$$

B.
Online-to-Batch Reduction

Algorithm 6 Online-to-Batch Conversion
1: Input: Time horizon $T$, inner rounds $N$, smoothness parameter $L$, OCO method $\mathcal{A}$
2: Initialize $x_1$ to be any point in the domain
3: for $t = 1$ to $T$ do
4: Construct $f_t(x) = f(x) + L\|x - x_t\|^2$
5: Set $x^1_t = x_t$ and pass $x^1_t$ to $\mathcal{A}$
6: for $i = 1$ to $N$ do
7: Play $x^i_t$, derive the stochastic gradient $\nabla f_t(x^i_t; \xi^i_t)$, and construct $g^i_t(x) = \langle \nabla f_t(x^i_t; \xi^i_t), x\rangle$
8: Send $g^i_t(x)$ to $\mathcal{A}$ and receive $x^{i+1}_t$
9: end for
10: Update $x_{t+1} = \frac{1}{N}\sum_{i=1}^N x^i_t$
11: end for
12: Return $\bar{x} = \arg\min_{t \in [T+1]} f(x_t)$

In this section, we give some details of the reduction from non-convex stochastic optimization to online convex optimization, for completeness. We use the framework of Agarwal et al. (2019), which Feinberg et al. (2023) also adopt. Under this framework, we optimize a non-convex loss function $f(x)$ by constructing a new loss function $f_t(x) = f(x) + L\|x - x_t\|^2$, which is strongly convex. In each round, we pass the loss function $f_t(x)$ to any OCO method $\mathcal{A}$ (it can be any algorithm in this paper) and use $\mathcal{A}$ to optimize it for $N$ rounds. When deriving the stochastic gradient, we use a batch $\xi_t$ to compute $\nabla f_t(x_t; \xi_t)$, which satisfies $\mathbb{E}[\nabla f_t(x_t; \xi_t)] = \nabla f_t(x_t)$ and $\mathbb{E}[\|\nabla f_t(x_t; \xi_t) - \nabla f_t(x_t)\|^2] \le \sigma^2$. The procedure is stated in Algorithm 6, and we provide its convergence guarantee below. We first define the adaptive ratio.

Definition B.1. Denoting by $x_{\mathcal{A}}$ the output of an OCO method $\mathcal{A}$ and $x^* \in \arg\min_{x \in \mathbb{R}^d} f(x)$, the adaptive ratio $\mu_{\mathcal{A}}(f)$ of $\mathcal{A}$ measures the suboptimality $f(x_{\mathcal{A}}) - f(x^*)$ achieved by $\mathcal{A}$ relative to that of a baseline method (see Agarwal et al. (2019) for the precise definition).

Then we provide the convergence of this reduction.

Theorem B.2. (Theorem A.2 in Agarwal et al. (2019)) Assume $f(x)$ is $L$-smooth with $\|\nabla^2 f(x)\| \le L$, $\max_{x,y} f(x) - f(y) \le M$, $\mathbb{E}[\|\nabla f_t(x_t; \xi_t) - \nabla f_t(x_t)\|^2] \le \sigma^2$, and let $\mu = \max_t \mu_{\mathcal{A}}(f_t)$. By setting $T = \frac{12ML}{\epsilon^2}$ and $N = \frac{48\mu^2\sigma^2}{\epsilon^2}$, the output of Algorithm 6 satisfies $\mathbb{E}[\|\nabla f(\bar{x})\|] \le \epsilon$. It is evident that the total number of queries to the stochastic gradient oracle is $O(\mu^2\sigma^2/\epsilon^4)$.
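The steps of Algorithm 6 can be sketched as a short Python skeleton. The inner OCO method below is plain gradient descent, standing in for any algorithm $\mathcal{A}$ from the paper; the function names, the step size, and the toy objective are all illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def online_to_batch(f, grad_f, x1, T, N, L, eta=0.1):
    """Sketch of Algorithm 6: reduce smooth non-convex optimization to a
    sequence of strongly convex online problems."""
    x_t = x1
    iterates = [x1]
    for _ in range(T):
        # f_t(x) = f(x) + L * ||x - x_t||^2 is strongly convex around x_t.
        x_i = x_t
        inner = []
        for _ in range(N):
            g = grad_f(x_i) + 2.0 * L * (x_i - x_t)  # gradient of f_t
            inner.append(x_i)
            x_i = x_i - eta * g                      # inner OCO step
        x_t = np.mean(inner, axis=0)                 # average the N iterates
        iterates.append(x_t)
    return min(iterates, key=f)                      # best iterate found

# Toy smooth objective, assumed for illustration only.
f = lambda x: float(np.sum(x ** 2))
grad_f = lambda x: 2.0 * x
x_out = online_to_batch(f, grad_f, np.ones(3), T=20, N=10, L=1.0)
```

Each outer round is one "offline" problem; the averaging in line 10 of Algorithm 6 is what lets the regret of $\mathcal{A}$ on the linearized losses translate into a function-value guarantee for $f_t$.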
By using this framework, we can translate the regret bound of an OCO algorithm into convergence guarantees for stochastic optimization.

C. Full Experiments

In this section, we conduct empirical studies to evaluate our proposed algorithms. In the online classification task, we compare our methods with ADA-DIAG (Duchi et al., 2011), RADAGRAD (Krummenacher et al., 2016), FD-SON (Luo et al., 2018), ADA-FFD under two frameworks (Wan & Zhang, 2022), and S-ADA (Feinberg et al., 2023). In the image classification and language modeling tasks, we compare our methods with ADA-DIAG, ADA-FFD under two frameworks, S-ADA, Shampoo (Gupta et al., 2018), and S-Shampoo (Feinberg et al., 2023). For hyper-parameter tuning, we either set the hyper-parameters as recommended in the original papers or tune them by grid search. For example, for the learning rate $\eta$ and the regularization parameter $\epsilon$, we search both over the set $\{10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 1, 5\}$ and select the best value. All experiments are conducted on 8 NVIDIA 3090 GPUs.

[Figure 3. Results for the Gisette dataset: loss, testing accuracy, and runtime against the number of iterations for ADA-DIAG, RADAGRAD, FD-SON, ADA-FFD (M), ADA-FFD (P), S-ADA, Fast S-ADA, and FTFSL.]

[Figure 4. Results for the Epsilon dataset (same methods and metrics as Figure 3).]

[Figure 5. Results for the CIFAR-10 dataset: testing accuracy, testing loss, training loss (log scale), and runtime against the number of iterations for ADA-DIAG, ADA-FFD (M), ADA-FFD (P), S-ADA, Shampoo, S-Shampoo, FTFSL, and FTSL-Shampoo.]
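The grid search over $\eta$ and $\epsilon$ described above can be sketched as follows. The `evaluate` routine is a hypothetical stand-in for training a model with one hyper-parameter pair and returning its validation score; it is not the paper's code:

```python
import itertools

# Hypothetical stand-in: trains with (eta, eps) and returns a validation score.
def evaluate(eta, eps):
    return -abs(eta - 1e-2) - abs(eps - 1e-4)   # toy score, peaked at (1e-2, 1e-4)

# The search set used for both the learning rate and the regularizer.
grid = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1, 5]

# Exhaustively evaluate all pairs and keep the best one.
best = max(itertools.product(grid, grid), key=lambda p: evaluate(*p))
print(best)   # → (0.01, 0.0001)
```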
C.1. Online Classification

First, we perform online classification to evaluate our methods on two real-world datasets from the LIBSVM repository (Chang & Lin, 2011): Gisette and Epsilon, both of which are high-dimensional and dense. Specifically, Gisette contains 6,000 training samples and 1,000 testing samples with 5,000 features, and Epsilon contains 400,000 training samples and 100,000 testing samples with 2,000 features. Let $T$ denote the total number of rounds. In each round $t \in [T]$, a batch of training examples $\{(w_{t,1}, y_{t,1}), \ldots, (w_{t,n}, y_{t,n})\}$ arrives, where $(w_{t,i}, y_{t,i}) \in [-1, 1]^d \times \{-1, 1\}$ for $i = 1, \ldots, n$. The online learner predicts a linear model $x_t$ and suffers the hinge loss $f_t(x_t) = \frac{1}{n}\sum_{i=1}^{n} \max\{0, 1 - y_{t,i}\, x_t^\top w_{t,i}\}$.

Setup. For Gisette, we set the batch size $n = 32$, the sketching size $\tau = 50$ (1% of the original dimensionality), and $T = 2000$. For Epsilon, we set the batch size $n = 128$, the sketching size $\tau = 20$, and $T = 5000$ so as to pass through all the training data.

[Figure 6. Results for the CIFAR-100 dataset (same methods and metrics as Figure 5).]

Results. Following Duchi et al. (2011), we compare the different methods by their accuracy on the testing data. To better demonstrate the improvements of our methods, we additionally plot the loss and runtime (measured in seconds) of all methods. From Figure 3 and Figure 4, we observe that FTFSL outperforms all other methods in both loss and testing accuracy, aligning with its superior regret bound. Moreover, FTFSL and Fast S-ADA exhibit significantly lower runtimes than S-ADA, owing to their better time complexities.
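The batched hinge loss used in this task can be sketched as follows. This is an illustrative helper (not the paper's code), assuming the batch is stacked as a matrix `W` whose rows are the examples $w_{t,i}$:

```python
import numpy as np

def hinge_loss_and_subgrad(x, W, y):
    """Averaged hinge loss f_t(x) = (1/n) * sum_i max(0, 1 - y_i * <x, w_i>)
    and one subgradient with respect to the linear model x."""
    margins = 1.0 - y * (W @ x)            # shape (n,)
    active = margins > 0                   # examples with a positive hinge
    loss = np.mean(np.maximum(margins, 0.0))
    # subgradient: -(1/n) * sum over active examples of y_i * w_i
    subgrad = -(W * (y * active)[:, None]).sum(axis=0) / len(y)
    return loss, subgrad

rng = np.random.default_rng(1)
W = rng.uniform(-1, 1, size=(32, 5000))    # a batch of n=32 examples, d=5000
y = rng.choice([-1.0, 1.0], size=32)
loss, g = hinge_loss_and_subgrad(np.zeros(5000), W, y)
print(loss)   # at x = 0 every margin equals 1, so the loss is exactly 1.0
```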
C.2. Image Classification

In this section, we conduct numerical experiments on multi-class image classification tasks, comparing FTFSL and FTSL-Shampoo with several baseline methods. The experiments involve training ResNet-18 and ResNet-34 models (He et al., 2016) on the CIFAR-10 and CIFAR-100 datasets (Krizhevsky, 2009), respectively, for 200 iterations with a batch size of 128.

Setup. For ADA-FFD, S-ADA, and FTFSL, the sketching size $\tau$ is determined by the dimensionality of the flattened gradient as $\tau = \min\{\lceil 0.1\, d \rceil, 100\}$, where $d$ is the total number of parameters in each layer; that is, we dynamically set the sketching size based on the dimensionality of each layer, subject to an upper bound. For S-Shampoo and FTSL-Shampoo, owing to their memory efficiency, we set $\tau = \lceil 0.1\, d_i \rceil$, where $d_i$ is the size of the $i$-th dimension of a gradient. For fairness, we do not employ the momentum technique.

Results. We plot the loss and accuracy against the number of iterations on CIFAR-10 and CIFAR-100 in Figure 5 and Figure 6, respectively. For training loss and testing accuracy, our FTSL-Shampoo achieves performance comparable to Shampoo while significantly improving memory efficiency and reducing running time, which aligns with the theoretical guarantees. Additionally, our FTFSL converges faster than the other sketching-based algorithms, indicating the effectiveness of the proposed method. Moreover, we also present the running time of each method: FTFSL demonstrates a significant reduction in running time compared to S-ADA, owing to its improved time complexity.

[Figure 7. Results for the WikiText-2 dataset: testing perplexity, testing loss, training loss, and runtime against the number of iterations for ADA-DIAG, ADA-FFD (M), ADA-FFD (P), S-ADA, Shampoo, S-Shampoo, FTFSL, and FTSL-Shampoo.]
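The per-layer sketching-size rule can be sketched as a small helper. The exact rounding (ceiling) is an assumption of this sketch, as are the example layer sizes:

```python
from math import ceil

def sketch_size(d, cap=100, frac=0.1):
    """Per-layer sketching size: a fraction of the flattened gradient
    dimensionality, capped to keep memory bounded (tau = min{ceil(frac*d), cap})."""
    return min(ceil(frac * d), cap)

# Hypothetical layer sizes of a small network.
for d in [64, 512, 4096, 2_359_296]:
    print(d, sketch_size(d))
```

Small layers get a proportionally small sketch, while large layers are capped at 100, so the preconditioner memory stays bounded regardless of layer width.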
C.3. Language Modeling Task

In this section, we perform experiments on a language modeling task. Concretely, we train a 2-layer Transformer (Vaswani et al., 2017) on the WikiText-2 dataset (Merity, 2016). We use 256-dimensional word embeddings, 256 hidden units, and 2 heads, and we clip the gradients by norm 0.5 to guard against exploding gradients. The batch size is set to 64, and all methods are trained for 40 epochs with a dropout rate of 0.1.

Setup. The experimental setup follows that of the image classification task. For computational efficiency, we do not employ a preconditioning matrix in the embedding layer.

Results. We report the loss, perplexity, and runtime in Figure 7. As can be seen, FTFSL and FTSL-Shampoo attain lower loss and better perplexity than the other sketching-based algorithms, indicating the effectiveness of the proposed methods. Moreover, FTFSL exhibits markedly better efficiency than S-ADA.
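The gradient clipping by norm used in training can be sketched as follows. This is a generic global-norm clipper written for illustration (not the paper's or any framework's implementation), applied to a list of per-parameter gradient arrays:

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=0.5):
    """Rescale a list of gradient arrays so their joint L2 norm is at most max_norm."""
    total = np.sqrt(sum(np.sum(g**2) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))  # no-op when already small enough
    return [g * scale for g in grads], total

grads = [np.ones((2, 2)), np.ones(3)]              # toy per-parameter gradients
clipped, norm = clip_by_global_norm(grads)
print(round(float(norm), 4))                       # norm before clipping: sqrt(7)
print(np.sqrt(sum(np.sum(g**2) for g in clipped)) <= 0.5 + 1e-9)
```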