# Matrix-Free Preconditioning in Online Learning

Ashok Cutkosky¹  Tamas Sarlos¹

¹Google Research, California, USA. Correspondence to: Ashok Cutkosky.

We provide an online convex optimization algorithm with regret that interpolates between the regret of an algorithm using an optimal preconditioning matrix and one using a diagonal preconditioning matrix. Our regret bound is never worse than that obtained by diagonal preconditioning, and in certain settings even surpasses that of algorithms with full-matrix preconditioning. Importantly, our algorithm runs in the same time and space complexity as online gradient descent. Along the way we incorporate new techniques that mildly streamline and improve logarithmic factors in prior regret analyses. We conclude by benchmarking our algorithm on synthetic data and deep learning tasks.

1. Online Learning

This paper considers the online linear optimization (OLO) problem. An OLO algorithm chooses output vectors $w_t \in \mathbb{R}^d$ in response to linear losses $\ell_t(w) = g_t \cdot w$ for some $g_t \in \mathbb{R}^d$. Performance is measured by the regret (Shalev-Shwartz et al., 2012; Zinkevich, 2003):
$$R_T(\mathring{w}) = \sum_{t=1}^T \ell_t(w_t) - \ell_t(\mathring{w}) = \sum_{t=1}^T g_t \cdot (w_t - \mathring{w}).$$
OLO algorithms are important because they also solve online convex optimization problems, in which the losses $\ell_t$ need only be convex, by virtue of taking $g_t$ to be a gradient (or subgradient) of the $t$th convex loss. Even better, these algorithms also solve stochastic convex optimization problems by setting $\ell_t$ to be the $t$th minibatch loss and $\mathring{w}$ to be the global minimizer (Cesa-Bianchi et al., 2004). Due to both the simplicity of the linear setting and the power of the resulting algorithms, OLO has become a successful and popular framework for designing and analyzing many of the algorithms used to train machine learning models today.

Our goal is to obtain adaptive regret bounds so that $R_T(\mathring{w})$ may be much smaller in easier problems while still maintaining optimal worst-case guarantees. One relevant prior result is the diagonal preconditioner approach of Adagrad-style algorithms (Duchi et al., 2011; McMahan & Streeter, 2010):
$$R_T(\mathring{w}) \le O\left(\|\mathring{w}\|_\infty \sum_{i=1}^d \sqrt{\sum_{t=1}^T g_{t,i}^2}\right), \tag{1}$$
where $g_{t,i}$ indicates the $i$th coordinate of $g_t$. This bound can be achieved via gradient descent with learning rates properly tuned to the value of $\|\mathring{w}\|_\infty$, and algorithms of this flavor have found much use in practice. Similar regret bounds that do not require tuning to the value of $\mathring{w}$ can be obtained by making use of a Lipschitz assumption $\|g_t\| \le G$, leading to a bound of the form (Cutkosky & Orabona, 2018):
$$R_T(\mathring{w}) \le O\left(\epsilon + \sum_{i=1}^d |\mathring{w}_i|\sqrt{\sum_{t=1}^T g_{t,i}^2 \log\left(\frac{d\,|\mathring{w}_i|\,T}{\epsilon}\right)}\right), \tag{2}$$
where $\epsilon$ is a free parameter representing the regret at the origin. The extra logarithmic factor is an unavoidable penalty for this extra adaptivity (McMahan & Streeter, 2012). This bound has the advantage that by Cauchy-Schwarz it is at most a logarithmic factor away from $\|\mathring{w}\|_2\sqrt{\sum_{t=1}^T \|g_t\|_2^2}$, while the diagonal Adagrad bound may be a factor of $\sqrt{d}$ worse due to the $\|\mathring{w}\|_\infty$ instead of the $\|\mathring{w}\|_2$.
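To ground bound (1), here is a minimal sketch of the per-coordinate (diagonal) preconditioning update it describes. The function name and the `radius` parameter, which stands in for the oracle-tuned scale of $\mathring{w}$, are our own illustration rather than the exact algorithm from the cited works.

```python
import numpy as np

def diagonal_adagrad_step(w, g, sum_sq, radius=1.0, eps=1e-8):
    """One diagonal-Adagrad-style update: per-coordinate steps ~ radius / sqrt(sum of squared grads).

    `radius` plays the role of the (oracle-tuned) comparator scale; this is a sketch,
    not the implementation of Duchi et al. (2011).
    """
    sum_sq += g ** 2                              # running sum of g_{t,i}^2 per coordinate
    w -= radius * g / (np.sqrt(sum_sq) + eps)     # per-coordinate "preconditioned" step
    return w, sum_sq

# toy usage on random linear losses g_t . w
rng = np.random.default_rng(0)
d = 5
w, sum_sq = np.zeros(d), np.zeros(d)
for t in range(100):
    g = rng.normal(size=d)
    w, sum_sq = diagonal_adagrad_step(w, g, sum_sq)
```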
Another type of bound is the full-matrix Adagrad bound (Duchi et al., 2011)
$$R_T(\mathring{w}) \le O\left(\|\mathring{w}\|_2\, \mathrm{Tr}\left[\left(\sum_{t=1}^T g_t g_t^\top\right)^{1/2}\right]\right), \tag{3}$$
or the more recent improved full-matrix bounds of (Cutkosky & Orabona, 2018; Koren & Livni, 2017):
$$R_T(\mathring{w}) \le \tilde{O}\left(\sqrt{d \sum_{t=1}^T \langle g_t, \mathring{w}\rangle^2}\right). \tag{4}$$
The above bound (4) may be much better than the diagonal bound, but unfortunately the algorithms involve manipulating a $d \times d$ matrix (often called a preconditioning matrix) and so require $O(d^2)$ time per update. Our goal is to design an algorithm that maintains $O(d)$ time per update, but still manages to smoothly interpolate between the diagonal bound (2) and full-matrix bounds (4).

Efficiently approximating the performance of full-matrix algorithms is an active area of research. Prior approaches include approximations based on sketching, low-rank approximations, and exploiting some assumed structure in the gradients (Luo et al., 2016; Gupta et al., 2018; Agarwal et al., 2018; Gonen & Shalev-Shwartz, 2015; Martens & Grosse, 2015). The typical approach is to trade off computational complexity for approximation quality by maintaining some kind of lossy compressed representation of the preconditioning matrix. The properties of these tradeoffs vary: for some strategies one may obtain worse regret than a non-full-matrix algorithm (or even linear regret) if the data is particularly adversarial, while for others one may be unable to see nontrivial gains without significant complexity penalties. Our techniques are rather different, and so we make no complexity tradeoffs, never suffer worse regret than diagonal algorithms, and yet still obtain full-matrix bounds in favorable situations.

This paper is organized as follows: in Section 2 we give some background on the online learning analysis techniques that we will be using. In Sections 3-5 we state and analyze our algorithm that can interpolate between full-matrix and diagonal regret bounds efficiently. In Section 6 we provide an empirical evaluation of our algorithm.

2. Betting Algorithms

A recent technique for designing online algorithms is the wealth-regret duality approach (McMahan & Orabona, 2014) and betting algorithms (Orabona & Pál, 2016). In betting algorithms, one keeps track of the wealth:
$$\text{Wealth}_T = \epsilon - \sum_{t=1}^T g_t \cdot w_t,$$
where $\epsilon > 0$ is some user-defined hyperparameter. The goal is to make the wealth as big as possible, because
$$R_T(\mathring{w}) = \epsilon - \text{Wealth}_T - \sum_{t=1}^T g_t \cdot \mathring{w},$$
and in some sense the wealth is the only part of the above expression that the algorithm actually has control over. Specifically, we want to obtain a statement like
$$\text{Wealth}_T \ge f\left(-\sum_{t=1}^T g_t \cdot \frac{\mathring{w}}{\|\mathring{w}\|}\right)$$
for some function $f$, which exists only for analysis purposes here. Given this inequality, we can write:
$$R_T(\mathring{w}) = \epsilon - \sum_{t=1}^T g_t \cdot \mathring{w} - \text{Wealth}_T \le \epsilon - \sum_{t=1}^T g_t \cdot \mathring{w} - f\left(-\sum_{t=1}^T g_t \cdot \frac{\mathring{w}}{\|\mathring{w}\|}\right) \le \epsilon + \sup_{X \in \mathbb{R}}\ X\|\mathring{w}\| - f(X) = \epsilon + f^\star(\|\mathring{w}\|),$$
where $f^\star$ is the Fenchel conjugate, defined by $f^\star(u) = \sup_x xu - f(x)$. Formally, we have:

Lemma 1. If $\text{Wealth}_T \ge f\left(-\sum_{t=1}^T g_t \cdot \frac{\mathring{w}}{\|\mathring{w}\|}\right)$ for some arbitrary norm $\|\cdot\|$ and function $f$, then $R_T(\mathring{w}) \le \epsilon + f^\star(\|\mathring{w}\|)$.

One way to increase the wealth is to view the vectors $w_t$ as some kind of bet and $g_t$ as some kind of outcome (e.g. imagine that $w_t$ is a portfolio and $g_t$ is a vector of returns). Then the amount of money you win at time $t$ is $-g_t \cdot w_t$, and so $\text{Wealth}_T$ is the total amount of money you have at time $T$, assuming you started out with $\epsilon$ units of currency. In order to leverage this metaphor, we make a critical assumption: $\|g_t\|_\star \le 1$ for all $t$. Here $\|\cdot\|_\star$ is the dual norm, $\|g\|_\star = \sup_{\|w\| \le 1} g \cdot w$ (e.g. when $\|\cdot\|$ is the 2-norm, $\|\cdot\|_\star$ is also the 2-norm, and when $\|\cdot\|$ is the infinity-norm, $\|\cdot\|_\star$ is the 1-norm).
There is nothing special about 1 here; we may choose any constant, but we use 1 for simplicity. Under this assumption, guaranteeing $R_T(0) \le \epsilon$ is equivalent to never going into debt (i.e. never having $\text{Wealth}_T < 0$). We ensure this by never betting more than we have: $\|w_t\| \le \text{Wealth}_{t-1}$. In fact, in order to simplify subsequent calculations, we will ask for a somewhat stronger restriction:
$$\|w_t\| \le \tfrac{1}{2}\text{Wealth}_{t-1}. \tag{5}$$
Given (5), we can also write $w_t = v_t\, \text{Wealth}_{t-1}$, where $v_t$ is a vector with $\|v_t\| \le 1/2$, which we call the betting fraction. $v_t$ is a kind of learning rate analogue. However, in the $d$-dimensional setting $v_t$ is only a $d$-dimensional vector, while previous full-matrix algorithms use a learning rate analogue that is a $d \times d$ matrix.

2.1. Constant $v_t$

To understand the potential of this approach, consider the case of a fixed betting fraction $v_t = v$. Using the inequality $\log(1 - x) \ge -x - x^2$ for all $x \le 1/2$, we proceed:
$$\text{Wealth}_T = \epsilon \prod_{t=1}^T (1 - g_t \cdot v),$$
$$\log(\text{Wealth}_T) = \log(\epsilon) + \sum_{t=1}^T \log(1 - g_t \cdot v) \tag{6}$$
$$\ge \log(\epsilon) + \sum_{t=1}^T \left(-g_t \cdot v - (g_t \cdot v)^2\right). \tag{7}$$
Now if we set
$$v = \frac{-\sum_{t=1}^T g_t \cdot \frac{\mathring{w}}{\|\mathring{w}\|}}{2\left|\sum_{t=1}^T g_t \cdot \frac{\mathring{w}}{\|\mathring{w}\|}\right| + 2\sum_{t=1}^T \left(g_t \cdot \frac{\mathring{w}}{\|\mathring{w}\|}\right)^2}\,\cdot \frac{\mathring{w}}{\|\mathring{w}\|},$$
then we obtain
$$\text{Wealth}_T \ge \epsilon \exp\left(\frac{\left(\sum_{t=1}^T g_t \cdot \frac{\mathring{w}}{\|\mathring{w}\|}\right)^2}{4\left|\sum_{t=1}^T g_t \cdot \frac{\mathring{w}}{\|\mathring{w}\|}\right| + 4\sum_{t=1}^T \left(g_t \cdot \frac{\mathring{w}}{\|\mathring{w}\|}\right)^2}\right) = f\left(-\sum_{t=1}^T g_t \cdot \frac{\mathring{w}}{\|\mathring{w}\|}\right),$$
where $f(x) = \epsilon \exp\left(\frac{x^2}{4|x| + 4\sum_{t=1}^T (g_t \cdot \mathring{w}/\|\mathring{w}\|)^2}\right)$. We can bound $f^\star$ by Lemma 19 of (Cutkosky & Orabona, 2018):
$$R_T(\mathring{w}) \le \epsilon + f^\star(\|\mathring{w}\|) \le \tilde{O}\left(\sqrt{\sum_{t=1}^T (g_t \cdot \mathring{w})^2}\right).$$
This is actually a factor of up to $\sqrt{d}$ better than the full-matrix guarantee (4) and, more importantly, there are no matrices in this algorithm! Instead, the role of the preconditioner is played by the vector $v$, which corresponds to a kind of optimal direction.

2.2. Varying $v_t$

Now that we know that there exists a good fixed betting fraction $v$ given oracle tuning, we turn to the problem of using varying $v_t$. To do this we use the reduction developed by Cutkosky & Orabona (2018) for recasting the problem of choosing $v_t$ as itself an online learning problem. The first step is to calculate the wealth with changing $v_t$:
$$\log(\text{Wealth}_T) = \log(\epsilon) + \sum_{t=1}^T \log(1 - g_t \cdot v_t). \tag{8}$$
Next, denote the wealth of the algorithm that uses a fixed fraction $v$ as $\text{Wealth}_T(v)$ and then subtract (8) from (6):
$$\log(\text{Wealth}_T(v)) - \log(\text{Wealth}_T) = \sum_{t=1}^T \log(1 - g_t \cdot v) - \log(1 - g_t \cdot v_t) = \sum_{t=1}^T \ell_t(v_t) - \ell_t(v),$$
where we define $\ell_t(x) = -\log(1 - g_t \cdot x)$. Notice that $\ell_t$ is convex, so we can try to find $v$ by using an online convex optimization algorithm that outputs $v_t$ in response to the loss $\ell_t$. Let $R^v_T$ be the regret of this algorithm. Then by definition of regret, for any $v$:
$$\log(\text{Wealth}_T(v)) - \log(\text{Wealth}_T) = R^v_T(v).$$
Combining the above with inequality (7) we have
$$\log(\text{Wealth}_T) = \log(\text{Wealth}_T(v)) - R^v_T(v) \ge \log(\epsilon) + \sum_{t=1}^T \left(-g_t \cdot v - (g_t \cdot v)^2\right) - R^v_T(v). \tag{9}$$
So now we need to find a $v$ that maximizes this expression. Our analysis diverges from prior work at this point. Previously, (Cutkosky & Orabona, 2018) observed that $\ell_t$ is exp-concave, and so by using the Online Newton Step (Hazan et al., 2007) algorithm one can make $R^v_T(v) = O(\log(T))$ and obtain regret
$$R_T(\mathring{w}) \le O\left(\sqrt{d\sum_{t=1}^T (g_t \cdot \mathring{w})^2 \log\left(\frac{\|\mathring{w}\|\, T^{4.5}}{\epsilon}\right)}\right). \tag{10}$$
Instead, we take a different strategy by using recursion. The idea is simple: we can apply the exact same reduction we have just outlined to design an inner coin-betting strategy for choosing $v_t$ and minimizing $R^v_T$. The major subtlety that needs to be addressed is the restriction $\|v_t\| \le 1/2$. Fortunately, (Cutkosky & Orabona, 2018) also provides a black-box reduction that converts any unconstrained optimization algorithm into a constrained algorithm without modifying the regret bound, and so we can essentially ignore the constraint on $v_t$ in our analysis.
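Before moving on, here is a small one-dimensional numerical check of the constant-fraction wealth bound from Section 2.1. This is our own toy check, assuming $|g_t| \le 1$; it simply picks the oracle betting fraction from the analysis above and verifies the exponential wealth lower bound.

```python
import numpy as np

rng = np.random.default_rng(1)
eps = 1.0
g = rng.uniform(-1.0, 1.0, size=1000)     # |g_t| <= 1, one-dimensional for simplicity
S, Q = g.sum(), (g ** 2).sum()

# oracle fixed betting fraction from the analysis (|v| <= 1/2 by construction)
v = -S / (2 * abs(S) + 2 * Q)

wealth = eps * np.prod(1 - g * v)          # Wealth_T for the constant-fraction bettor
lower = eps * np.exp(S ** 2 / (4 * abs(S) + 4 * Q))
print(wealth, lower, wealth >= lower)      # the exponential lower bound should hold
```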
3. Recursive Betting Algorithm

The key advantage of using a recursive strategy to choose $v_t$ is that the regret $R^v_T(\mathring{v})$ may depend strongly on $\|\mathring{v}\|$. Since in many cases $\|\mathring{v}\|$ is small, this results in better overall performance than if we were to directly apply the Online Newton Step algorithm. We formalize this strategy and intuition in Algorithm 1 and Theorem 1.

Algorithm 1 Recursive Optimizer
1: procedure RECURSIVEOPTIMIZER($\epsilon$)
2:   $\text{Wealth}_0 \leftarrow \epsilon$.
3:   Initialize INNEROPTIMIZER.
4:   for $t = 1 \dots T$ do
5:     Let $v_t$ be the $t$th output of INNEROPTIMIZER.
6:     $w_t \leftarrow \text{Wealth}_{t-1}\, v_t$.
7:     Output $w_t$, receive $g_t$.
8:     $\text{Wealth}_t \leftarrow \text{Wealth}_{t-1} - g_t \cdot w_t$.
9:     $z_t \leftarrow \frac{g_t}{1 - g_t \cdot v_t} = \frac{d}{dv_t}\left[-\log(1 - g_t \cdot v_t)\right]$.
10:    Send $z_t$ as the $t$th gradient to INNEROPTIMIZER.
11:  end for
12: end procedure

Theorem 1. Suppose $\|g_t\|_\star \le 1$ for all $t$, for some norm $\|\cdot\|$. Further suppose that INNEROPTIMIZER satisfies $\|v_t\| \le 1/2$ and guarantees regret nearly linear in $\|\mathring{v}\|$:
$$R^v_T(\mathring{v}) = \sum_{t=1}^T z_t \cdot v_t - z_t \cdot \mathring{v} \le \epsilon + \|\mathring{v}\|\, G_T(\mathring{v}/\|\mathring{v}\|)$$
for some function $G_T(\mathring{v}/\|\mathring{v}\|)$, for any $\mathring{v}$ with $\|\mathring{v}\| \le 1/2$. Then if $\left|\sum_{t=1}^T g_t \cdot \frac{\mathring{w}}{\|\mathring{w}\|}\right| \ge 2G_T(\mathring{w}/\|\mathring{w}\|)$, Algorithm 1 obtains
$$R_T(\mathring{w}) \le \tilde{O}\left(\sqrt{\sum_{t=1}^T (g_t \cdot \mathring{w})^2}\right),$$
and otherwise $R_T(\mathring{w}) \le \epsilon + 2\|\mathring{w}\|\, G_T(\mathring{w}/\|\mathring{w}\|)$.

Let us unpack the condition $\left|\sum_{t=1}^T g_t \cdot \frac{\mathring{w}}{\|\mathring{w}\|}\right| \ge 2G_T(\mathring{w}/\|\mathring{w}\|)$. First we consider the LHS. Observe that $-\sum_{t=1}^T g_t \cdot \frac{\mathring{w}}{\|\mathring{w}\|}$ is the regret at $\mathring{w}/\|\mathring{w}\|$ of an algorithm that always predicts 0. In a classic adversarial problem we should expect this value to grow as $\Omega(T)$. Even in the case that each $g_t$ is an i.i.d. mean-zero random variable, we should expect growth of at least $\Omega(\sqrt{T})$. For the RHS, observe that so long as INNEROPTIMIZER obtains the optimal $T$-dependence in its regret bound, we should expect $G_T = O(\sqrt{T})$; for example, the algorithm of (Cutkosky & Orabona, 2018) obtains $G_T(\mathring{v}/\|\mathring{v}\|) = O\left(\sqrt{\sum_{t=1}^T \|g_t\|^2 \log(T^{4.5}/\epsilon)}\right)$ for any $\mathring{v}$ with $\|\mathring{v}\| \le 1/2$. Therefore the condition $\left|\sum_{t=1}^T g_t \cdot \frac{\mathring{w}}{\|\mathring{w}\|}\right| \ge 2G_T(\mathring{w}/\|\mathring{w}\|)$ can be viewed as saying that the $g_t$ in some sense violate standard concentration inequalities and so are clearly not mean-zero random variables: intuitively, there is some amount of signal in the gradients.

As a simple concrete example, suppose the $g_t$ are i.i.d. random vectors with covariance $\Sigma$ and mean $\epsilon x$, where $x$ is the eigenvector of $\Sigma$ with smallest eigenvalue. Then $\sum_{t=1}^T g_t \cdot x$ will grow as $\Theta(T\epsilon)$, and so for sufficiently large $T$ we will obtain the full-matrix regret bound, where $R_T(x)$ grows with $\sqrt{\sum_{t=1}^T (g_t \cdot x)^2}$. This sum has expectation $x^\top \Sigma x\, T + \epsilon^2 T = (\lambda_d + \epsilon^2)T$, where $\lambda_d$ is the smallest eigenvalue of $\Sigma$. In contrast, a standard regret bound may depend on $\sum_{t=1}^T \|g_t\|_2^2$. This has expectation $\mathrm{Trace}(\Sigma)T + \epsilon^2 T$, which is a factor of $d$ larger for small $\epsilon$, and even more if $\Sigma$ is poorly conditioned.

Next, let us consider the second case, in which the regret bound is $O(\epsilon + \|\mathring{w}\| G_T)$. This bound is also actually a subtle improvement on prior guarantees. For example, if INNEROPTIMIZER guarantees regret $R^v_T(\mathring{v}) \le \|\mathring{v}\|\sqrt{\log(\|\mathring{v}\| T/\epsilon)\sum_{t=1}^T \|g_t\|^2}$, we can use the fact that $\|\mathring{v}\| \le 1/2$ to bound $G_T$ by $\sqrt{\log(T/\epsilon)\sum_{t=1}^T \|g_t\|^2}$. Thus the bound $\|\mathring{w}\| G_T$ is better than previous regret bounds like (10) due to removing the $\|\mathring{w}\|$ from inside the log.

In summary, we improve prior art in two important ways:
1. When the sum of the gradients is greater than $\Omega(\sqrt{T})$, we obtain the optimal full-matrix regret bound.
2. When the sum of the gradients is smaller, our regret bound grows only linearly with $\|\mathring{w}\|$, without any $\sqrt{\log(\|\mathring{w}\|)}$ factor.
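Before comparing against lower bounds, here is a minimal Python sketch of Algorithm 1. The class and method names are ours, and any online learner whose outputs satisfy $\|v_t\| \le 1/2$ (such as Algorithm 2 below, or SCINOL) can play the role of INNEROPTIMIZER.

```python
import numpy as np

class RecursiveOptimizer:
    """Sketch of Algorithm 1 (RECURSIVEOPTIMIZER); interface names are our own."""

    def __init__(self, inner, eps=1.0):
        self.inner = inner          # inner learner; must output betting fractions with ||v_t|| <= 1/2
        self.wealth = eps           # Wealth_0 = epsilon

    def predict(self):
        self.v = self.inner.predict()             # v_t from INNEROPTIMIZER
        self.w = self.wealth * self.v             # w_t = Wealth_{t-1} v_t
        return self.w

    def update(self, g):
        g = np.asarray(g, dtype=float)
        self.wealth -= float(np.dot(g, self.w))   # Wealth_t = Wealth_{t-1} - g_t . w_t
        z = g / (1.0 - float(np.dot(g, self.v)))  # z_t = grad of -log(1 - g_t . v_t) at v_t
        self.inner.update(z)                      # recurse: z_t is the gradient fed inward
```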
Both of these improvements appear to contradict lower bounds. First, (Luo et al., 2016) suggests that the factor $\sqrt{d}$ is necessary in a full-matrix regret bound, which seems to rule out improvement 1. Second, (McMahan & Orabona, 2014; Orabona, 2013) state that a $\sqrt{\log(\|\mathring{w}\|)}$ factor is required when $\|\mathring{w}\|$ is unknown, appearing to rule out improvement 2. We are consistent with these results because of the condition $\left|\sum_{t=1}^T g_t \cdot \frac{\mathring{w}}{\|\mathring{w}\|}\right| \ge 2G_T(\mathring{w}/\|\mathring{w}\|)$. Both lower bounds use $g_t$ whose coordinates are random $\pm 1$. However, the bound of (Luo et al., 2016) involves a typical sequence, which concentrates appropriately about zero and does not satisfy the condition to have our improved full-matrix bound. In contrast, the bounds of (McMahan & Orabona, 2014; Orabona, 2013) are stated for 1-dimensional problems and rely on anti-concentration, so that the adversarial sequence is very atypical and does satisfy the condition, yielding our full-matrix bound, which does include the log factor.

In Section 4 we propose a diagonal algorithm for use as INNEROPTIMIZER. This will enable Algorithm 1 to interpolate between a diagonal regret bound and the full-matrix guarantee. At first glance, this phenomenon is somewhat curious: how can an algorithm that keeps only per-coordinate state manage to adapt to the covariance between pairs of coordinates? The answer lies in the gradients supplied to the INNEROPTIMIZER: $z_t = \frac{g_t}{1 - g_t \cdot v_t}$. The denominator of this expression actually contains information from all coordinates, and so even when INNEROPTIMIZER is a diagonal algorithm it still has access to interactions between coordinates.

Now we sketch a proof of Theorem 1. We will drop constants, logs and $\epsilon$, and leave full details to Appendix B.

Proof Sketch of Theorem 1. We start from (9) and use our assumption on the regret bound of INNEROPTIMIZER:
$$\log(\text{Wealth}_T) \ge \log(\epsilon) + \sum_{t=1}^T \left(-g_t \cdot v - (g_t \cdot v)^2\right) - \|v\|\, G_T(v/\|v\|)$$
for all $v$. So now we choose $v$ to optimize the bound. Let us suppose that $v$ is of the form $v = x\, \mathring{w}/\|\mathring{w}\|$ for some scalar $x$, with the sign of $x$ chosen so that $-\sum_{t=1}^T g_t \cdot v \ge 0$. We consider two cases: either $\left|\sum_{t=1}^T g_t \cdot \frac{\mathring{w}}{\|\mathring{w}\|}\right| \ge 2G_T(\mathring{w}/\|\mathring{w}\|)$ or not.

Case 1: $\left|\sum_{t=1}^T g_t \cdot \frac{\mathring{w}}{\|\mathring{w}\|}\right| \ge 2G_T(\mathring{w}/\|\mathring{w}\|)$. In this case we have
$$-\sum_{t=1}^T g_t \cdot v - \|v\|\, G_T(v/\|v\|) \ge \frac{1}{2}\left(-\sum_{t=1}^T g_t \cdot v\right).$$
Therefore we have
$$\log(\text{Wealth}_T) \ge \log(\epsilon) + \sum_{t=1}^T \left(-\frac{g_t \cdot v}{2} - (g_t \cdot v)^2\right).$$
So now, using essentially the same argument as in the fixed-$v$ case, we end up with a full-matrix regret bound:
$$R_T(\mathring{w}) = \tilde{O}\left(\sqrt{\sum_{t=1}^T (g_t \cdot \mathring{w})^2}\right).$$

Case 2: $\left|\sum_{t=1}^T g_t \cdot \frac{\mathring{w}}{\|\mathring{w}\|}\right| < 2G_T(\mathring{w}/\|\mathring{w}\|)$. In this case, observe that since we guarantee $\text{Wealth}_T > 0$ no matter what strategy is used to pick $v_t$, we have
$$R_T(\mathring{w}) = \epsilon - \text{Wealth}_T - \sum_{t=1}^T g_t \cdot \mathring{w} \le \epsilon + \|\mathring{w}\|\left|\sum_{t=1}^T g_t \cdot \frac{\mathring{w}}{\|\mathring{w}\|}\right| \le \epsilon + 2\|\mathring{w}\|\, G_T(\mathring{w}/\|\mathring{w}\|),$$
and so we are done.

4. Diagonal INNEROPTIMIZER

As a specific example of an algorithm that can be used as INNEROPTIMIZER, we provide Algorithm 2. This algorithm will achieve a regret bound similar to (2). Here we use $\mathrm{clip}(x, a, b)$ to indicate truncating $x$ to the interval $[a, b]$. Algorithm 2 works by simply applying a separate 1-dimensional optimizer on each coordinate of the problem. Each 1-dimensional optimizer is itself a coin-betting algorithm that uses Follow-the-Regularized-Leader (Hazan et al., 2016) to choose the betting fractions $v_t$. There are also two important modifications at lines 7 and 12-15 that implement the unconstrained-to-constrained reduction.

Algorithm 2 Diagonal Betting Algorithm
1: procedure DIAGOPTIMIZER($\epsilon$, $\eta$)
2:   $\text{Wealth}_{0,i} \leftarrow \epsilon$ for $i \in \{1, \dots, d\}$.
3:   $A_{0,i} \leftarrow 5$ and $v_{1,i} \leftarrow 0$ for $i \in \{1, \dots, d\}$.
4:   for $t = 1 \dots T$ do
5:     for $i = 1 \dots d$ do
6:       $x_{t,i} \leftarrow v_{t,i}\, \text{Wealth}_{t-1,i}$.
7:       Set $w_{t,i} = \mathrm{clip}(x_{t,i}, -1/2, 1/2)$.
8:     end for
9:     Output $w_t = (w_{t,1}, \dots, w_{t,d})$.
10:    Receive $g_t$ with $g_{t,i} \in [-1, 1]$ for all $i$.
11:    for $i = 1 \dots d$ do
12:      $\tilde{g}_{t,i} \leftarrow g_{t,i}$.
13:      if $g_{t,i}(x_{t,i} - w_{t,i}) < 0$ then
14:        $\tilde{g}_{t,i} \leftarrow 0$.
15:      end if
16:      $\text{Wealth}_{t,i} \leftarrow \text{Wealth}_{t-1,i} - x_{t,i}\, \tilde{g}_{t,i}$.
17:      $z_{t,i} \leftarrow \frac{d}{dv_{t,i}}\left[-\log(1 - \tilde{g}_{t,i} v_{t,i})\right] = \frac{\tilde{g}_{t,i}}{1 - \tilde{g}_{t,i} v_{t,i}}$.
18:      $A_{t,i} \leftarrow A_{t-1,i} + z_{t,i}^2$.
19:      $v_{t+1,i} \leftarrow \mathrm{clip}\left(\frac{-2\eta \sum_{t'=1}^t z_{t',i}}{A_{t,i}}, -1/2, 1/2\right)$.
20:    end for
21:  end for
22: end procedure
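A compact Python sketch of Algorithm 2 follows. The interface and names are ours; the clipping and gradient-zeroing steps mirror lines 7 and 12-15 of the pseudocode above.

```python
import numpy as np

class DiagOptimizer:
    """Sketch of Algorithm 2 (DIAGOPTIMIZER): d independent 1-d coin-betting learners."""

    def __init__(self, d, eps=1.0, eta=0.5):
        self.eta = eta
        self.wealth = np.full(d, eps)     # Wealth_{0,i} = epsilon
        self.A = np.full(d, 5.0)          # A_{0,i} = 5
        self.z_sum = np.zeros(d)          # running sum of z_{t,i}
        self.v = np.zeros(d)              # betting fractions, |v_i| <= 1/2

    def predict(self):
        self.x = self.v * self.wealth                  # unconstrained bet x_{t,i}
        self.w = np.clip(self.x, -0.5, 0.5)            # constrained output w_{t,i}
        return self.w

    def update(self, g):
        g = np.asarray(g, dtype=float)                 # assumes g_{t,i} in [-1, 1]
        g_tilde = np.where(g * (self.x - self.w) < 0, 0.0, g)   # zero the gradient when it points outward
        self.wealth -= self.x * g_tilde
        z = g_tilde / (1.0 - g_tilde * self.v)         # z_{t,i} = grad of -log(1 - g~ v) at v
        self.A += z ** 2
        self.z_sum += z
        self.v = np.clip(-2 * self.eta * self.z_sum / self.A, -0.5, 0.5)   # FTRL betting fraction
```

Written this way, it is clear that the algorithm keeps only a constant number of scalars per coordinate, matching the claimed $O(d)$ time and space per update.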
Before analyzing DIAGOPTIMIZER, we perform a second analysis of RECURSIVEOPTIMIZER that makes no restrictions on INNEROPTIMIZER. We will eventually see that Algorithm 2 is essentially an instance of RECURSIVEOPTIMIZER, and so this Lemma will be key in our analysis:

Lemma 2. Suppose $\|g_t\|_\star \le 1$ for all $t$. Suppose INNEROPTIMIZER satisfies $\|v_t\| \le 1/2$ and has regret $R^v_T(\mathring{v})$. Then RECURSIVEOPTIMIZER obtains regret
$$R_T(\mathring{w}) \le \inf_{c \in (0, 1/2]}\ \epsilon + \frac{\|\mathring{w}\|}{c}\left(\log\left(\frac{\|\mathring{w}\|}{c\,\epsilon}\right) - 1\right) + \|\mathring{w}\|\, c\, Z + \frac{\|\mathring{w}\|}{c}\, R^v_T\!\left(c\,\frac{\mathring{w}}{\|\mathring{w}\|}\right),$$
where $Z = \sum_{t=1}^T \left(g_t \cdot \frac{\mathring{w}}{\|\mathring{w}\|}\right)^2$.

In words, we have written the regret of RECURSIVEOPTIMIZER as a kind of tradeoff between $Z$, which is proportional to the quantity inside the square root of a full-matrix bound, and the regret of the INNEROPTIMIZER. This makes it easier to compute the regret when INNEROPTIMIZER's regret bound does not satisfy the conditions of Theorem 1.

Theorem 2. Suppose $\|g_t\|_\infty \le 1$ for all $t$ and let $G_i = \sum_{t=1}^T g_{t,i}^2$. Then Algorithm 2 guarantees regret $R_T(\mathring{w})$ at most
$$d\epsilon + O\left(\sum_{i=1}^d |\mathring{w}_i|\, \max\left\{\sqrt{G_i \log\left(\frac{|\mathring{w}_i|\, G_i^{\eta + 1/2}}{\epsilon\, \eta}\right)},\ \log\left(\frac{|\mathring{w}_i|\, G_i^{\eta + 1/2}}{\epsilon\, \eta}\right)\right\}\right)$$
for all $\mathring{w}$ with $\|\mathring{w}\|_\infty \le 1/2$. Further, by using $\epsilon/d$ instead of $\epsilon$ and setting $\eta = 1/2$, we can also re-write this as $R_T(\mathring{w}) \le \epsilon + \|\mathring{w}\|_\infty\, G_T(\mathring{w}/\|\mathring{w}\|_\infty)$, where
$$G_T(x) = O\left(\sum_{i=1}^d |x_i|\, \max\left\{\sqrt{G_i \log\left(\frac{d\, G_i}{\epsilon}\right)},\ \log\left(\frac{d\, G_i}{\epsilon}\right)\right\}\right).$$

Let us briefly unpack this bound. Ignoring log factors to gain intuition, the bound is $\sum_{i=1}^d |\mathring{w}_i|\sqrt{\sum_{t=1}^T g_{t,i}^2}$. Note that this improves upon the diagonal Adagrad bound (1) by virtue of depending on each $|\mathring{w}_i|$ rather than the norm $\|\mathring{w}\|_\infty$, and by Cauchy-Schwarz it is bounded by $\|\mathring{w}\|_2\sqrt{\sum_{t=1}^T \|g_t\|_2^2}$, which matches classic dimension-free bounds. Note however that this bound is not strictly dimension-free due to the $d\epsilon$ term. Even setting $\epsilon = 1/d$ will incur a $\log(d)$ penalty due to the $\log(1/\epsilon)$ factor. Most importantly, however, DIAGOPTIMIZER satisfies the conditions on INNEROPTIMIZER in Theorem 1.

Theorem 2 is also notable for its logarithmic factor, which can be made $O\left(\log\left(|\mathring{w}_i|\, G_i^{\eta + 1/2}/(\epsilon\eta)\right)\right)$ for any $\eta$. This is an improvement over prior bounds such as (Cutkosky & Orabona, 2018) in that the power of the $O(T)$ term $G_i$ inside the logarithm is smaller. However, the optimal value for this exponent is $1/2$ (McMahan & Orabona, 2014), which this bound cannot obtain. Instead, we show in Appendix C.1 that a simple doubling-trick scheme does allow us to obtain the optimal rate. To our knowledge this is the first time such a rate has been achieved: prior works achieve the optimal log factor, but have worse adaptivity to the values of $g_t$, depending on $T$ or $\sum_{t=1}^T |g_t|$ instead of $\sum_{t=1}^T g_t^2$ (McMahan & Orabona, 2014; Orabona, 2014).

Proof Sketch of Theorem 2. First, we observe that Algorithm 2 is running $d$ copies of a 1-dimensional optimizer. Because we have
$$\sum_{t=1}^T g_t \cdot (w_t - \mathring{w}) = \sum_{i=1}^d \sum_{t=1}^T g_{t,i}(w_{t,i} - \mathring{w}_i),$$
we may analyze each dimension individually and then sum the regrets. So let us focus on a single 1-dimensional optimizer, and drop all $i$ subscripts for simplicity.

Next, we address the truncation of $w_t$ and the modifications to $g_t$. This is a 1-dimensional specialization of the unconstrained-to-constrained reduction of (Cutkosky & Orabona, 2018). Let $g_t$ be the (original, unmodified) gradient, and let $\tilde{g}_t$ be the modified gradient (so $\tilde{g}_t = g_t$ or $\tilde{g}_t = 0$). A little calculation shows that $\tilde{g}_t(x_t - \mathring{w}) \ge g_t(w_t - \mathring{w})$ for any $\mathring{w} \in [-1/2, 1/2]$. Therefore, the regret $\sum_{t=1}^T g_t(w_t - \mathring{w})$ is upper-bounded by $\sum_{t=1}^T \tilde{g}_t(x_t - \mathring{w})$. This quantity is simply the regret of an algorithm that uses gradients $\tilde{g}_t$ and outputs $x_t$.
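As a quick numerical sanity check of this inequality (our own check, not part of the paper's analysis), the following snippet verifies it on random instances:

```python
import numpy as np

rng = np.random.default_rng(2)
for _ in range(10000):
    x = rng.uniform(-3, 3)            # unconstrained prediction x_t
    g = rng.uniform(-1, 1)            # gradient g_t
    u = rng.uniform(-0.5, 0.5)        # any comparator in [-1/2, 1/2]
    w = np.clip(x, -0.5, 0.5)         # constrained prediction w_t
    g_tilde = 0.0 if g * (x - w) < 0 else g
    assert g_tilde * (x - u) >= g * (w - u) - 1e-12   # the reduction inequality
print("reduction inequality verified on random instances")
```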
Now we interpret $x_t$ as the predictions of a coin-betting algorithm that uses betting fractions $v_t$ in response to the gradients $\tilde{g}_t$. Thus we may analyze the regret of the $x_t$ with respect to the $\tilde{g}_t$ using coin-betting machinery. To this end, observe that $A_t = 5 + \sum_{t'=1}^t z_{t'}^2$, so that
$$v_{t+1} = \operatorname*{argmin}_{v \in [-1/2, 1/2]}\ \sum_{t'=1}^t z_{t'} v + \frac{A_t}{4\eta} v^2.$$
Since $z_t$ is the derivative of $-\log(1 - \tilde{g}_t v)$ evaluated at $v_t$, we see that the $v_t$ are the outputs of a Follow-the-Regularized-Leader (FTRL) algorithm with regularizers $\frac{v^2}{4\eta}\left(5 + \sum_{t'=1}^t z_{t'}^2\right)$. That is, we are actually using Algorithm 1 with INNEROPTIMIZER equal to this FTRL algorithm. Using the FTRL analysis of (McMahan, 2017), we then have
$$R^v_T(\mathring{v}) \le \frac{\mathring{v}^2}{4\eta}\left(5 + \sum_{t=1}^T z_t^2\right) + \sum_{t=1}^T \frac{\eta\, z_t^2}{5 + \sum_{t'=1}^{t-1} z_{t'}^2}.$$
Next, by concavity we have $\log(a) + b/(a + 4) \le \log(a + b)$ for all $a > 0$, $0 < b < 4$. Since $|v_t| \le 1/2$ and $|\tilde{g}_t| \le 1$, we have $z_t^2 \le 4$. Therefore by induction we can show
$$\sum_{t=1}^T \frac{z_t^2}{5 + \sum_{t'=1}^{t-1} z_{t'}^2} \le \log\left(\frac{5 + \sum_{t=1}^T z_t^2}{5}\right),$$
so that, since $|z_t| \le 2|\tilde{g}_t| \le 2|g_t|$,
$$R^v_T(\mathring{v}) \le \frac{\mathring{v}^2}{4\eta}\left(5 + 4\sum_{t=1}^T g_t^2\right) + \eta \log\left(1 + \frac{4\sum_{t=1}^T g_t^2}{5}\right).$$
Therefore by Lemma 2 we have
$$R_T(\mathring{w}) \le \epsilon + \frac{|\mathring{w}|}{c}\left(\log\left(\frac{|\mathring{w}|}{c\,\epsilon}\right) - 1\right) + |\mathring{w}|\, c\, Z + \frac{|\mathring{w}|}{c} R^v_T(c) = \epsilon + O\left(\frac{|\mathring{w}|}{c}\log\left(\frac{|\mathring{w}|\, G^\eta}{\epsilon}\right) + |\mathring{w}|\, c\, G\right)$$
for all $c \in (0, 1/2]$, where we have observed that $Z = \sum_{t=1}^T \tilde{g}_t^2 \le G$ in one dimension, and dropped various constants for simplicity. Optimizing for $c$ we have
$$R_T(\mathring{w}) \le \epsilon + O\left(|\mathring{w}|\sqrt{G \log\left(\frac{|\mathring{w}|\, G^{\eta + 1/2}}{\epsilon\, \eta}\right)}\right).$$
The statement for $G_T$ comes from simply observing that $|\mathring{w}| \le 1/2$.

5. Full Regret Bound

Now we combine the DIAGOPTIMIZER of the previous section with RECURSIVEOPTIMIZER. There are only a few details to address. First, since the analysis of DIAGOPTIMIZER is specific to the infinity-norm, we set $\|\cdot\|$ to be the infinity-norm in RECURSIVEOPTIMIZER and Theorem 1, which has the 1-norm as its dual. Second, note that the gradients provided to INNEROPTIMIZER satisfy $\|z_t\|_\infty = \left\|\frac{g_t}{1 - g_t \cdot v_t}\right\|_\infty \le 2\|g_t\|_\infty \le 2$, since $1 - g_t \cdot v_t \ge 1 - \|g_t\|_1\|v_t\|_\infty \ge 1/2$. Since the analysis of DIAGOPTIMIZER requires gradients bounded by 1, we rescale the gradients by a factor of 2, which scales up the regret by the same constant factor of 2. Therefore, we see that DIAGOPTIMIZER satisfies the hypotheses of Theorem 1 with
$$G_T(v/\|v\|_\infty) = O\left(\sum_{i=1}^d \frac{|v_i|}{\|v\|_\infty}\sqrt{G_i \log\left(\frac{d\, G_i}{\epsilon}\right)}\right).$$
Thus by Theorem 1, in all cases we have the diagonal bound
$$R_T(\mathring{w}) \le \epsilon + O\left(\sum_{i=1}^d |\mathring{w}_i|\sqrt{G_i \log\left(\frac{d\, G_i}{\epsilon}\right)}\right), \tag{11}$$
and whenever $\left|\sum_{t=1}^T g_t \cdot \frac{\mathring{w}}{\|\mathring{w}\|_\infty}\right| \ge 2G_T(\mathring{w}/\|\mathring{w}\|_\infty) = O(\sqrt{T})$, we have
$$R_T(\mathring{w}) \le \tilde{O}\left(\sqrt{\sum_{t=1}^T (g_t \cdot \mathring{w})^2}\right).$$
Note that this may be even a factor of $d$ better than the standard full-matrix regret bounds (3), (4). We recall that by Cauchy-Schwarz, the bound (11) also implies
$$R_T(\mathring{w}) \le \epsilon + \tilde{O}\left(\|\mathring{w}\|_2\sqrt{\sum_{t=1}^T \|g_t\|_2^2}\right),$$
which is the standard non-diagonal adaptive regret bound. Note that this may be a factor of $\sqrt{d}$ better than the diagonal Adagrad bound (1) when both $\mathring{w}$ and the gradients are dense.

6. Experiments

We implemented RECURSIVEOPTIMIZER in TensorFlow (Abadi et al., 2016) and ran benchmarks on both synthetic data as well as several deep learning tasks (see Appendix D for full details).¹ We found that using the recent algorithm SCINOL of (Kempka et al., 2019) as the inner optimizer instead of Algorithm 2 provided qualitatively similar results on synthetic data but better empirical performance on the deep learning tasks, so we report results using SCINOL as the inner optimizer. We stress that since SCINOL has essentially the same regret bound as in Theorem 2 (with slightly worse log factors), this substitution maintains our theoretical guarantees while allowing us to inherit the scale-invariance properties of SCINOL. This actually highlights another advantage of our reduction: we can take advantage of orthogonal advances in optimization.

¹Code available at: https://github.com/google-research/google-research/tree/master/recursive_optimizer
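To make the reduction structure concrete, the sketches from Sections 3 and 4 compose as follows. This is toy glue code of ours, not the released TensorFlow implementation; following Section 5, we normalize gradients so that $\|g_t\|_1 \le 1$ and halve the inner gradients $z_t$ before handing them to the diagonal learner.

```python
# Assumes the RecursiveOptimizer and DiagOptimizer sketches above are in scope.
import numpy as np

class ScaledDiag(DiagOptimizer):
    """Diagonal inner learner fed half-scale gradients so that z_t/2 lies in [-1, 1]."""
    def update(self, z):
        super().update(np.asarray(z) / 2.0)

rng = np.random.default_rng(3)
d = 10
opt = RecursiveOptimizer(inner=ScaledDiag(d), eps=1.0)
for t in range(1000):
    w = opt.predict()
    g = rng.normal(size=d) + 0.05          # slightly biased gradients ("signal")
    g = g / max(1.0, np.abs(g).sum())      # enforce ||g_t||_1 <= 1
    opt.update(g)
print("final wealth:", opt.wealth)
```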
Our synthetic data takes the form $\ell_t(w) = |x_t \cdot w - x_t \cdot w^\star|$, where $x_t$ is randomly generated from a 100-dimensional $N(0, \Sigma)$ and $w^\star$ is some pre-selected optimal point. We generate $\Sigma$ to be a random matrix with exponentially decaying eigenvalues and condition number 750. We consider two cases: either $w^\star$ is the eigenvector of $\Sigma$ with smallest eigenvalue, or the eigenvector with largest eigenvalue. We compared RECURSIVEOPTIMIZER to diagonal ADAGRAD, both of which come with good theoretical guarantees. For full implementation details, see Appendix D. The performance on a holdout set is shown in Figure 1. RECURSIVEOPTIMIZER enjoys an advantage in the poorly-conditioned regime while maintaining performance in the well-conditioned problem.

Figure 1. Test Error in Synthetic Experiments. (a): $w^\star$ is the eigenvector with maximum eigenvalue (well-conditioned). (b): $w^\star$ is the eigenvector with minimum eigenvalue (poorly-conditioned).
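For reference, here is a minimal sketch of this synthetic setup; the exact eigenvalue decay and the use of a QR-based random rotation are our assumptions, and Appendix D contains the authors' actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 100

# Random covariance with geometrically decaying eigenvalues and condition number 750.
eigvals = np.logspace(0, -np.log10(750), d)
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))    # random orthogonal basis
Sigma = (Q * eigvals) @ Q.T
L = np.linalg.cholesky(Sigma)

w_star = Q[:, -1]           # eigenvector with the smallest eigenvalue (the hard case)

def sample_loss_grad(w):
    """Subgradient of |x.w - x.w*| for one random example x ~ N(0, Sigma)."""
    x = L @ rng.normal(size=d)
    return np.sign(x @ w - x @ w_star) * x

g = sample_loss_grad(np.zeros(d))   # example usage
```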
The dynamics of RECURSIVEOPTIMIZER in the poorly-conditioned problem bear some discussion. Recall that our full-matrix regret bounds actually do not appear until the sum of the gradients grows to a certain degree. It appears that this may provide a period of slow convergence during which the inner optimizer is presumably finding the optimal $v$, which is hard on poorly-conditioned problems. Once this is located, the algorithm makes progress very quickly.

We also test RECURSIVEOPTIMIZER on benchmark deep learning models. Specifically, we test performance with the ResNet-32 (He et al., 2016) model on the CIFAR-10 image recognition dataset (Krizhevsky & Hinton, 2009) and the Transformer model (Vaswani et al., 2017; 2018) on LM1B (Chelba et al., 2013) and other textual datasets. We record train and test error, both as a function of the number of iterations as well as a function of wall time. We compare performance to the commonly used Adam (Kingma & Ba, 2014) and Adagrad optimizers (see Figure 2, and Appendix D).

Figure 2. Deep Learning Experiments. (a): Transformer test accuracy on LM1B. (b): ResNet-32 test accuracy on CIFAR-10. See Appendix D.2 for details on the momentum heuristic.

Even though our analysis relies heavily on duality and global properties of convexity, RECURSIVEOPTIMIZER still performs well on these non-convex tasks: we are competitive in all benchmarks, and marginally the best in the Transformer tasks. This suggests an interesting line of future research: most popular optimizers used in deep learning operate in a proximal manner, producing each iterate as an offset from the previous iterate. This seems more appropriate to the non-convex setting, as gradients provide only local information. It may therefore be valuable to develop a proximal version of RECURSIVEOPTIMIZER that performs even better in the non-convex setting.

7. Conclusion

We have presented an algorithm that successfully obtains full-matrix style regret guarantees in certain settings without sacrificing runtime, space or regret in less favorable settings. The favorable settings we require for full-matrix performance are those in which the sum of the gradients exceeds the regret of some base algorithm, which should be $O(\sqrt{T})$. This suggests that any gradients with some systematic bias will satisfy our condition and exhibit full-matrix regret bounds for sufficiently large $T$. Our algorithmic design is based on techniques for unconstrained online learning, which necessitates an extra log factor in our regret over a mirror-descent algorithm tuned to the value of $\|\mathring{w}\|$. As such, it is an interesting question whether a mirror-descent style analysis can achieve similar results with oracle tuning.

Acknowledgements

We thank all the reviewers for their thoughtful comments and Ryan Sepassi for answering our questions about TensorFlow and Tensor2Tensor.

References

Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. TensorFlow: a system for large-scale machine learning. In OSDI, 2016.

Agarwal, N., Bullins, B., Chen, X., Hazan, E., Singh, K., Zhang, C., and Zhang, Y. The case for full-matrix adaptive regularization. arXiv preprint arXiv:1806.02958, 2018.

Cesa-Bianchi, N., Conconi, A., and Gentile, C. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050–2057, 2004.

Chelba, C., Mikolov, T., Schuster, M., Ge, Q., Brants, T., Koehn, P., and Robinson, T. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005, 2013.

Cutkosky, A. and Boahen, K. Online learning without prior information. In Conference on Learning Theory, pp. 643–677, 2017a.

Cutkosky, A. and Boahen, K. A. Stochastic and adversarial online learning without hyperparameters. In Advances in Neural Information Processing Systems, pp. 5059–5067, 2017b.

Cutkosky, A. and Orabona, F. Black-box reductions for parameter-free online learning in Banach spaces. In COLT, 2018. URL https://arxiv.org/abs/1802.06293.

Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.

Gonen, A. and Shalev-Shwartz, S. Faster SGD using sketched conditioning. arXiv preprint arXiv:1506.02649, 2015.

Gupta, V., Koren, T., and Singer, Y. Shampoo: Preconditioned stochastic tensor optimization. arXiv preprint arXiv:1802.09568, 2018.

Hazan, E., Agarwal, A., and Kale, S. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169–192, 2007.

Hazan, E. et al. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157–325, 2016.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

Kempka, M., Kotłowski, W., and Warmuth, M. K. Adaptive scale-invariant online algorithms for learning linear models. arXiv preprint arXiv:1902.07528, 2019.

Kingma, D. P. and Ba, J. L. Adam: A method for stochastic optimization. In Proc. 3rd Int. Conf. Learn. Representations, 2014.

Koren, T. and Livni, R. Affine-invariant online optimization and the low-rank experts problem. In Advances in Neural Information Processing Systems, pp. 4747–4755, 2017.

Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

Luo, H., Agarwal, A., Cesa-Bianchi, N., and Langford, J. Efficient second order online learning by sketching. In Advances in Neural Information Processing Systems, pp. 902–910, 2016.

Martens, J. and Grosse, R. Optimizing neural networks with Kronecker-factored approximate curvature. In International Conference on Machine Learning, pp. 2408–2417, 2015.
McMahan, B. and Streeter, M. No-regret algorithms for unconstrained online convex optimization. In Advances in Neural Information Processing Systems, pp. 2402–2410, 2012.

McMahan, H. B. A survey of algorithms and analysis for adaptive online learning. The Journal of Machine Learning Research, 18(1):3117–3166, 2017.

McMahan, H. B. and Orabona, F. Unconstrained online linear learning in Hilbert spaces: Minimax algorithms and normal approximations. In Conference on Learning Theory, pp. 1020–1039, 2014.

McMahan, H. B. and Streeter, M. Adaptive bound optimization for online convex optimization. arXiv preprint arXiv:1002.4908, 2010.

Orabona, F. Dimension-free exponentiated gradient. In Advances in Neural Information Processing Systems, pp. 1806–1814, 2013.

Orabona, F. Simultaneous model selection and optimization through parameter-free stochastic learning. In Advances in Neural Information Processing Systems, pp. 1116–1124, 2014.

Orabona, F. and Pál, D. Coin betting and parameter-free online learning. In Advances in Neural Information Processing Systems, pp. 577–585, 2016.

Orabona, F. and Tommasi, T. Training deep networks without learning rates through coin betting. In Advances in Neural Information Processing Systems, pp. 2160–2170, 2017.

Shalev-Shwartz, S. et al. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2012.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.

Vaswani, A., Bengio, S., Brevdo, E., Chollet, F., Gomez, A. N., Gouws, S., Jones, L., Kaiser, L., Kalchbrenner, N., Parmar, N., Sepassi, R., Shazeer, N., and Uszkoreit, J. Tensor2Tensor for neural machine translation. CoRR, abs/1803.07416, 2018. URL http://arxiv.org/abs/1803.07416.

Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pp. 928–936, 2003.