Better Full-Matrix Regret via Parameter-Free Online Learning

Ashok Cutkosky
Department of Electrical and Computer Engineering
Boston University
Boston, Massachusetts, USA
ashok@cutkosky.com

Abstract

We provide online convex optimization algorithms that guarantee improved full-matrix regret bounds. These algorithms extend prior work in several ways. First, we seamlessly allow for the incorporation of constraints without requiring unknown oracle-tuning for any learning rate parameters. Second, we improve the regret analysis of the full-matrix AdaGrad algorithm by suggesting a better learning rate value and showing how to tune the learning rate to this value on-the-fly. Third, all our bounds are obtained via a general framework for constructing regret bounds that depend on an arbitrary sequence of norms.

1 Introduction

This paper provides new algorithms for online learning, which is a popular problem formulation for modeling streaming and stochastic optimization [Zinkevich, 2003, Cesa-Bianchi and Lugosi, 2006, Shalev-Shwartz, 2011, McMahan, 2014, Hazan, 2019]. Online learning is a game of $T$ rounds between an algorithm and the environment. In each round, the algorithm first chooses a point $w_t$ in some domain $W \subseteq \mathbb{R}^d$, after which the environment presents the learner with a loss function $\ell_t : W \to \mathbb{R}$. Performance is measured by the regret, which is a function of some benchmark point $\bar w$: $R_T(\bar w) = \sum_{t=1}^T \ell_t(w_t) - \ell_t(\bar w)$. In order to make the problem tractable, we will assume that each $\ell_t$ is convex and $W \subseteq \mathbb{R}^d$ is a convex domain; this setting is often called online convex optimization. Now, if we let $g_t$ be an arbitrary subgradient of $\ell_t$ at $w_t$, we have:

$$R_T(\bar w) \le \sum_{t=1}^T \langle g_t, w_t - \bar w\rangle$$

Because of this fact, for the rest of this paper we consider exclusively the case of linear losses and take $\sum_{t=1}^T \langle g_t, w_t - \bar w\rangle$ as the definition of $R_T(\bar w)$.

Well-known lower bounds [Abernethy et al., 2008] tell us that even if the environment is restricted to $\|g_t\|_{\mathbf{2}} \le 1$ and $\|\bar w\|_{\mathbf{2}} \le 1$, no algorithm can guarantee regret better than $O(\sqrt{T})$ in all scenarios, and this bound is in fact obtained by online gradient descent [Zinkevich, 2003]. In order to go beyond this minimax result, there is a large body of work on designing adaptive algorithms [Auer et al., 2002, Hazan et al., 2008, Duchi et al., 2010, McMahan and Streeter, 2010, 2012, Foster et al., 2015, Orabona, 2014, Orabona and Pál, 2016, Foster et al., 2018, Jun and Orabona, 2019, Kempka et al., 2019, van der Hoeven, 2019, Mhammedi and Koolen, 2020]. Many of these prior algorithms obtain bounds like:

$$R_T(\bar w) \le \tilde O\left(\|\bar w\|_{\mathbf{2}}\sqrt{\sum_{t=1}^T \|g_t\|_{\mathbf{2}}^2}\right) \qquad (1)$$

where $\|\cdot\|_{\mathbf{2}}$ is the standard 2-norm. This type of bound is appealing: in the worst case we never do worse than the minimax optimal rate, but in many cases we can do much better. For example, if $\|\bar w\|_{\mathbf{2}}$ is small (intuitively, the benchmark point is "simple"), or if the $\|g_t\|_{\mathbf{2}}$ values are small (intuitively, the losses are "simple"), then we obtain low regret. The challenge in obtaining these kinds of bounds lies in the fact that the values that appear in the regret guarantee are unknown to the algorithm, and so intuitively the algorithm must somehow learn them on-the-fly.

For a more nuanced form of adaptive bound, one can look to the full-matrix bounds. A prototypical such bound takes the form:

$$R_T(\bar w) \le \tilde O\left(\sqrt{r\sum_{t=1}^T \langle g_t, \bar w\rangle^2}\right) \qquad (2)$$

where here $r$ is the rank of the subspace spanned by the $g_t$.
Such a bound may be desirable because it in some sense ignores irrelevant directions in the $g_t$ by projecting them all along $\bar w$. Such bounds are available in the prior work of Mhammedi and Koolen [2020], Kotłowski [2019], Cutkosky and Orabona [2018] for the unconstrained setting where $W = \mathbb{R}^d$. For the constrained setting, to the best of our knowledge such a bound is only available in Koren and Livni [2017], subject to tuning a learning rate optimally using prior knowledge of $\sum_{t=1}^T \langle g_t, \bar w\rangle^2$, which is unlikely to be available. In this paper we provide the first algorithm to achieve (2) for general convex domains $W$ rather than entire vector spaces.

Next, we provide a refined analysis of the regret of the full-matrix AdaGrad algorithm [Duchi et al., 2010]. Prior analysis of full-matrix AdaGrad yields a regret bound that is never better than (1) (note that the prior bound for the diagonal AdaGrad algorithm is different and can improve over (1)). Nevertheless, full-matrix AdaGrad is empirically successful, suggesting that something is missing from the analysis. We posit that the missing ingredient is a suboptimal tuning of the learning rate, and show that with oracle tuning (which is unavailable a priori) one can obtain the regret bound:

$$R_T(\bar w) \le O\left(\sqrt{\left\langle \bar w, \Big(\sum_{t=1}^T g_tg_t^\top\Big)^{1/2}\bar w\right\rangle\,\operatorname{tr}\left(\Big(\sum_{t=1}^T g_tg_t^\top\Big)^{1/2}\right)}\right) \qquad (3)$$

We provide an interpretation of this bound suggesting that it allows for small regret when $\sum_{t=1}^T g_tg_t^\top$ is approximately low-rank, and develop an algorithm that obtains this bound without requiring any tuning. Intriguingly, the three regret bounds (1), (2), and (3) are all incomparable: there are sequences of $g_t$ such that any one of them might be significantly better than the others.

In order to obtain these results, we develop a general algorithm that takes a sequence of increasing norms $\|\cdot\|_0, \dots, \|\cdot\|_{T-1}$ and obtains regret

$$R_T(\bar w) \le \tilde O\left(\|\bar w\|_{T-1}\sqrt{\sum_{t=1}^T \|g_t\|_{t-1,*}^2}\right) \qquad (4)$$

Here $\|\cdot\|_{t,*}$ is the dual norm to $\|\cdot\|_t$. The norms $\|\cdot\|_t$ may be generated on-the-fly (e.g. $\|\cdot\|_t$ can depend on $g_t$). Further, our algorithm can incorporate arbitrary convex domains $W$. Prior adaptive algorithms have typically required specific forms of $W$, such as being an entire vector space or having bounded diameter, and have often focused on a single norm.

This paper is organized as follows: in Section 2, we lay out our setting and introduce some background from the literature. In Section 3, we describe our intermediate result achieving the bound (4). Then, in Sections 4 and 5, we show how to use our approach to achieve bounds (2) and (3).

2 Preliminaries

2.1 Notation and Setup

We use $\|\cdot\|_0, \dots, \|\cdot\|_{T-1}$ to indicate an arbitrary sequence of $T$ potentially different norms. To avoid confusion between the $L_p$ norm and the $p$th element of a sequence of norms, we denote the $L_p$ norm using a bold subscript: $\|\cdot\|_{\mathbf{p}}$. For a symmetric positive-definite matrix $M$, the norm defined by $M$ is denoted by $\|x\|_M = \sqrt{x^\top M x}$; we recall for convenience that its dual satisfies $\|g\|_{M,*}^2 = \langle g, M^{-1}g\rangle$. We will use the notation $G_t = \sum_{i=1}^t g_ig_i^\top$ as a shorthand for the sum of the outer products of the loss vectors. Finally, the dual of a norm $\|\cdot\|$ is $\|x\|_* = \sup_{\|y\|\le 1}\langle y, x\rangle$.

We restrict our attention to those norms such that the function $\frac12\|\cdot\|^2$ is $\sigma$-strongly-convex with respect to the same norm for some $\sigma$. A function $f : W \to \mathbb{R}$ is $\sigma$-strongly-convex with respect to $\|\cdot\|$ if for all $x, y \in W$ and all $g \in \partial f(x)$ we have

$$f(y) \ge f(x) + \langle g, y - x\rangle + \frac{\sigma}{2}\|y - x\|^2$$

We will assume $W$ is a convex set for which it is possible to compute the projection operation $\Pi(x) = \operatorname{argmin}_{w\in W} \|w - x\|$ for any norm we are interested in. We will also usually require $\|g_t\|_{t-1,*} \le 1$ for the norms we consider.
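Since these matrix norms and their duals recur throughout the paper, a quick numeric illustration may help. The following is a minimal NumPy sketch (our own illustration; the helper names are not from the paper) checking the identity $\|g\|_{M,*}^2 = \langle g, M^{-1}g\rangle$ against the supremum definition of the dual norm:

```python
import numpy as np

def mat_norm(x, M):
    """Norm induced by an SPD matrix: ||x||_M = sqrt(x' M x)."""
    return np.sqrt(x @ M @ x)

def dual_mat_norm(g, M):
    """Dual norm: ||g||_{M,*} = sqrt(g' M^{-1} g)."""
    return np.sqrt(g @ np.linalg.solve(M, g))

rng = np.random.default_rng(0)
d = 5
A = rng.normal(size=(d, d))
M = A @ A.T + np.eye(d)                  # symmetric positive-definite
x, g = rng.normal(size=d), rng.normal(size=d)

# Generalized Cauchy-Schwarz: <g, x> <= ||x||_M * ||g||_{M,*}.
assert abs(g @ x) <= mat_norm(x, M) * dual_mat_norm(g, M) + 1e-9

# The supremum defining ||g||_{M,*} is attained at M^{-1} g, rescaled to the unit ball.
y = np.linalg.solve(M, g)
y /= mat_norm(y, M)                      # now ||y||_M = 1
assert np.isclose(g @ y, dual_mat_norm(g, M))
```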
Finally, in order to ease exposition, we have suppressed many constants and occasionally a logarithmic factor in our main presentation. We provide full characterizations of all our results, including constant factors, in the Appendix, along with any proofs not present in the main text. In the next subsections, we describe some material from the prior literature that we will use to construct our algorithms.

2.2 Follow-the-Regularized-Leader

Follow-the-Regularized-Leader (FTRL) [Shalev-Shwartz, 2007] is one of the most successful abstractions for designing online convex optimization algorithms (see McMahan [2014] for a detailed survey). FTRL algorithms produce $w_1, \dots, w_T$ through the use of regularizer functions $\psi_0, \dots, \psi_{T-1}$. Specifically, $w_{t+1}$ is given by:

$$w_{t+1} = \operatorname{argmin}_{w\in W}\ \psi_t(w) + \sum_{i=1}^t \langle g_i, w\rangle \qquad (5)$$

The following result characterizes the regret of FTRL:

Theorem 1 (Adapted from McMahan [2014], Theorem 1). Suppose each $\psi_t$ is $\sigma_t$-strongly-convex with respect to a norm $\|\cdot\|_t$ for some $\sigma_t$, and $\psi_{t+1}(w) \ge \psi_t(w)$ for all $t$ and all $w \in W$. Further suppose $\inf_{w\in W}\psi_0(w) = 0$. Then the regret of FTRL is bounded by:

$$R_T(\bar w) \le \psi_{T-1}(\bar w) + \sum_{t=1}^T \frac{\|g_t\|_{t-1,*}^2}{\sigma_{t-1}}$$

where recall we define $\|g\|_* = \sup_{\|x\|\le 1}\langle g, x\rangle$ for any seminorm $\|\cdot\|$.

The FTRL algorithm template has been used to great effect through clever choices of regularizer functions $\psi_t$. However, most prior adaptive algorithms based on FTRL (e.g. [Duchi et al., 2010, McMahan and Streeter, 2010]) require tuning some learning rate parameter $\eta$ to the value of $\|\bar w\|$. Unfortunately, the optimal value of $\eta$ is unknown a priori (and maybe even a posteriori) because we do not know what $\|\bar w\|$ is.

2.3 Parameter-Free Algorithms

In an effort to remove the need to tune learning rates, much work has gone into designing parameter-free algorithms that can adapt to the unknown value of $\|\bar w\|$ [McMahan and Streeter, 2012, Orabona, 2013, Orabona and Pál, 2016, Foster et al., 2017a, Cutkosky and Boahen, 2017, Foster et al., 2018, Cutkosky and Orabona, 2018, Kempka et al., 2019]. These algorithms make use of a known bound on the norm of $g_t$ in order to achieve adaptivity to $\|\bar w\|$. We will make use of the following recent bound (which is optimal up to constants and quantities inside logarithms):

Theorem 2 (Adapted from Cutkosky and Sarlos [2019], Theorem 2). For any user-specified values $\epsilon > 0$ and $0 < Z \le 1$, there exists an online convex optimization algorithm with domain $W = \mathbb{R}$ that runs in time $O(1)$ per update such that if $|g_t| \le 1$ for all $t$, the regret is bounded by:

$$\sum_{t=1}^T g_t(w_t - \bar w) \le O\left(\epsilon + |\bar w|\max\left(\sqrt{\left(1 + \frac{\sum_{t=1}^T g_t^2}{Z}\right)\log\left(\frac{|\bar w|\left(1 + \sum_{t=1}^T g_t^2/Z\right)}{\epsilon} + 1\right)},\ \log\left(\frac{|\bar w|\left(1 + \sum_{t=1}^T g_t^2/Z\right)}{\epsilon} + 1\right)\right)\right)$$

In order to ease notation in our results, we will simply set $Z = 1$ and drop the $Z$ dependency in Theorem 2 from all bounds in the paper. For completeness, we provide a proof of this result in Appendix F.
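The specific one-dimensional learner behind Theorem 2 is the algorithm of Cutkosky and Sarlos [2019], which we do not reproduce here. As a much simpler stand-in that conveys the flavor of such parameter-free guarantees, the sketch below implements the classic Krichevsky-Trofimov coin-betting learner of Orabona and Pál [2016]; this is our illustration only, not the algorithm that Theorem 2 refers to:

```python
class KTBettor:
    """1-d parameter-free learner via coin betting (Orabona and Pal, 2016).

    With |g_t| <= 1, its regret against any comparator w scales roughly as
    |w| sqrt(T log(|w| T / eps)) -- the same flavor as Theorem 2's bound.
    """

    def __init__(self, eps=1.0):
        self.wealth = eps   # initial wealth epsilon
        self.theta = 0.0    # running sum of -g_i
        self.t = 0

    def predict(self):
        """Return w_t: bet a signed fraction (the KT estimator) of wealth."""
        self.t += 1
        return (self.theta / self.t) * self.wealth

    def update(self, g, w):
        """Observe the loss g_t after having played w_t."""
        self.wealth -= g * w   # betting interpretation of the linear loss
        self.theta -= g
```

Each round one calls predict(), plays the returned scalar, and then calls update(g, w); since $|\theta| < t$ and $|g| \le 1$, the betting fraction stays in $(-1, 1)$ and the wealth remains positive.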
3 Adapting to Varying Norms

In this section, we show how to achieve the regret bound (4) in arbitrary convex domains $W$. We decompose the problem into three stages: first, we use FTRL to obtain a regret bound with a very poor dependence on $\|\bar w\|_{T-1}$. Then, we show how to combine this with a one-dimensional parameter-free algorithm to obtain the desired bound in the case that $W$ is an entire vector space. Finally, we show how to constrain our algorithm to arbitrary convex $W$. Although each of our individual steps is pleasingly straightforward, the final result is surprisingly powerful, as it will allow us to easily obtain our new full-matrix bounds.

Our FTRL algorithm is reminiscent of prior adaptive methods, but we enforce a special time-varying constraint. This makes the algorithm much worse on its own, but allows for an overall improvement later. Specifically, suppose we have a sequence of norms $\|\cdot\|_0, \dots, \|\cdot\|_{T-1}$ such that $\|x\|_t \ge \|x\|_{t-1}$ for all $x$, and $\frac12\|\cdot\|_t^2$ is $\sigma$-strongly-convex with respect to $\|\cdot\|_t$ for all $t$. Consider FTRL with regularizers:

$$\psi_t(w) = \begin{cases}\frac{1}{\sigma}\|w\|_t^2\sqrt{1 + \sum_{i=1}^t \|g_i\|_{i-1,*}^2} & \text{if } \|w\|_t \le 1\\ \infty & \text{if } \|w\|_t > 1\end{cases} \qquad (6)$$

Then we have the following corollary of Theorem 1:

Lemma 3. Let $W$ be a real vector space and $\|\cdot\|_0, \dots, \|\cdot\|_{T-1}$ an increasing sequence of norms on $W$ such that $\frac12\|\cdot\|_t^2$ is $\sigma$-strongly-convex with respect to $\|\cdot\|_t$. Suppose we run FTRL with regularizers given by (6), and with $g_t$ satisfying $\|g_t\|_{t-1,*} \le 1$ for all $t$. Then $\|w_t\|_{t-1} \le 1$ for all $t$, and for all $\bar w$ with $\|\bar w\|_{T-1} \le 1$, the regret of FTRL is bounded by

$$R_T(\bar w) \le \frac{1}{\sigma}\sqrt{1 + \sum_{t=1}^T \|g_t\|_{t-1,*}^2} + \sqrt{\sum_{t=1}^T \|g_t\|_{t-1,*}^2}$$

Note that this bound is actually much worse than one might typically expect, as the outputs are constrained to smaller and smaller balls over the course of the algorithm. Counterintuitively, this property is crucial for improved results in the next section.

3.1 Unconstrained Domains

Now, with Lemma 3 in hand, we will proceed to build an algorithm that achieves the bound (4) in the unconstrained setting. Our technique is based on the dimension-free-to-one-dimensional reduction proposed by Cutkosky and Orabona [2018], taking into account the particular dynamics of our FTRL algorithm. Pseudocode for this technique is presented in Algorithm 1 below.

Algorithm 1 Unconstrained Varying Norms Adaptivity
  Input: sequence of norms $\|\cdot\|_0, \dots, \|\cdot\|_{T-1}$, real vector space $W$, strong-convexity parameter $\sigma$.
  Instantiate a one-dimensional parameter-free online learning algorithm $A$ from Theorem 2.
  Set $\psi_0(x) = \frac{1}{2\sigma}\|x\|_0^2$. Set $x_1 = \operatorname{argmin}_{w\in W}\psi_0(w)$.
  for $t = 1 \dots T$ do
    Get $y_t \in \mathbb{R}$ from $A$.
    Output $w_t = y_tx_t$ and get $g_t$.
    Set $\psi_t(x) = \frac{1}{\sigma}\|x\|_t^2\sqrt{1 + \sum_{i=1}^t \|g_i\|_{i-1,*}^2}$ if $\|x\|_t \le 1$, and $\psi_t(x) = \infty$ if $\|x\|_t > 1$.
    Set $x_{t+1} = \operatorname{argmin}_{w\in W}\ \psi_t(w) + \sum_{i=1}^t \langle g_i, w\rangle$.
    Send $s_t = \langle g_t, x_t\rangle$ to $A$ as the $t$th loss.
  end for

Lemma 4. Under the assumptions of Lemma 3, for any $\bar w \in W$ (recall we assume $W$ is an entire vector space in Lemma 3), the regret of Algorithm 1 is bounded by:

$$R_T(\bar w) \le \epsilon + \frac{2\|\bar w\|_{T-1}}{\min(1,\sigma)}\max\left(\sqrt{\sum_{t=1}^T \|g_t\|_{t-1,*}^2\ \log\left(\frac{\|\bar w\|_{T-1}\sum_{t=1}^T \|g_t\|_{t-1,*}^2}{\epsilon}\right)},\ \log\left(\frac{\|\bar w\|_{T-1}\sum_{t=1}^T \|g_t\|_{t-1,*}^2}{\epsilon}\right)\right)$$
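To make the moving pieces concrete, here is a minimal sketch of Algorithm 1 specialized to Mahalanobis norms $\|x\|_t = \sqrt{x^\top M_t x}$ (so $\sigma = 1$ and $\|g\|_{t-1,*}^2 = g^\top M_{t-1}^{-1}g$), using the KTBettor sketched above in place of the learner from Theorem 2. The class names and calling protocol are ours, and the closed form exploits the fact that for these norms the constrained FTRL minimizer in (6) is a rescaling of the unconstrained one; this is a sketch under those assumptions, not the paper's implementation:

```python
import numpy as np

class VaryingNormFTRL:
    """FTRL of Lemma 3 with regularizer (6), for norms ||x||_t^2 = x' M_t x.

    Caller protocol at round t: x_t = step(M_{t-1}); after observing g_t,
    call update(g_t, M_{t-1}).  M_t must be nondecreasing in the PSD order.
    """

    def __init__(self, d):
        self.theta = np.zeros(d)   # sum of past gradients
        self.V = 0.0               # sum of squared dual norms ||g_i||_{i-1,*}^2

    def step(self, M):
        # Unconstrained minimizer of ||w||_M^2 sqrt(1 + V) + <theta, w> ...
        x = -np.linalg.solve(M, self.theta) / (2.0 * np.sqrt(1.0 + self.V))
        # ... rescaled into the unit ||.||_M ball; for this quadratic objective
        # the rescaling coincides with the constrained minimizer.
        return x / max(1.0, np.sqrt(x @ M @ x))

    def update(self, g, M_prev):
        self.V += g @ np.linalg.solve(M_prev, g)
        self.theta += g

class UnconstrainedVaryingNorms:
    """Sketch of Algorithm 1: magnitude y_t from the 1-d parameter-free
    learner, direction x_t from the constrained-ball FTRL above."""

    def __init__(self, d, eps=1.0):
        self.ftrl = VaryingNormFTRL(d)
        self.bettor = KTBettor(eps)       # stand-in for Theorem 2's learner

    def predict(self, M_prev):
        self.x = self.ftrl.step(M_prev)   # direction, ||x_t||_{t-1} <= 1
        self.y = self.bettor.predict()    # magnitude
        return self.y * self.x            # w_t = y_t x_t

    def update(self, g, M_prev):
        # s_t = <g_t, x_t> satisfies |s_t| <= ||g_t||_{t-1,*} ||x_t||_{t-1} <= 1.
        self.bettor.update(g @ self.x, self.y)
        self.ftrl.update(g, M_prev)
```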
3.2 Adding Constraints

Algorithm 1 provides a method for obtaining the bound (4) when $W$ is an entire vector space, so in this section we show how to modify the algorithm so that $W$ may be an arbitrary convex domain. We do this by again appealing to a technique from Cutkosky and Orabona [2018]. This time, we use their Theorem 3, which provides a way to produce constrained algorithms from unconstrained algorithms. The original result considers only the case of a fixed norm and is applied to achieve bounds like (1). We modify the technique to consider varying norms as well. The algorithm is presented in Algorithm 2 below, and the analysis achieving (4) is in Theorem 5.

Algorithm 2 Varying Norms Adaptivity
  Input: convex domain $W$ in a real vector space $V$.
  Define $\Pi_t(v) = \operatorname{argmin}_{w\in W}\|v - w\|_{t-1}$.
  Define $S_t(v) = \|v - \Pi_t(v)\|_{t-1}$.
  Initialize Algorithm 1 with domain $V$, using the algorithm of Theorem 2 as the base learner.
  for $t = 1 \dots T$ do
    Get the $t$th output $v_t \in V$ from Algorithm 1.
    Output $w_t = \Pi_t(v_t)$, and get loss $g_t$.
    Define $\ell_t(v) = \frac12\left(\langle g_t, v\rangle + \|g_t\|_{t-1,*}S_t(v)\right)$.
    Let $\hat g_t \in \partial\ell_t(v_t)$, and send $\hat g_t$ to Algorithm 1 as the $t$th loss.
  end for

Theorem 5. Each output $w_t$ of Algorithm 2 lies in $W$, and the regret for any $\bar w \in W$ is at most:

$$\epsilon + \frac{\|\bar w\|_{T-1}}{\min(1,\sigma)}\max\left(\sqrt{\sum_{t=1}^T \|g_t\|_{t-1,*}^2\ \log\left(\frac{\|\bar w\|_{T-1}\sum_{t=1}^T \|g_t\|_{t-1,*}^2}{\epsilon}\right)},\ \log\left(\frac{\|\bar w\|_{T-1}\sum_{t=1}^T \|g_t\|_{t-1,*}^2}{\epsilon}\right)\right)$$
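Continuing the sketch from Section 3.1, the reduction in Algorithm 2 is short when the constrained projection has a closed form. Below we take $W$ to be a halfspace $\{w : \langle a, w\rangle \le b\}$, for which the $\|\cdot\|_M$-projection is explicit, and use the fact that outside $W$ the gradient of $S_t$ is $M(v - \Pi_t(v))/\|v - \Pi_t(v)\|_M$. The domain, names, and protocol are our illustrative choices, not the paper's:

```python
import numpy as np

def project_halfspace(v, M, a, b):
    """argmin_{<a,w> <= b} ||v - w||_M for the halfspace W = {w : <a,w> <= b}."""
    slack = a @ v - b
    if slack <= 0:
        return v                       # already feasible
    Minv_a = np.linalg.solve(M, a)
    return v - Minv_a * (slack / (a @ Minv_a))

class ConstrainedVaryingNorms:
    """Sketch of Algorithm 2 for Mahalanobis norms and a halfspace domain."""

    def __init__(self, d, a, b, eps=1.0):
        self.inner = UnconstrainedVaryingNorms(d, eps)   # Algorithm 1 sketch
        self.a, self.b = a, b

    def predict(self, M_prev):
        self.v = self.inner.predict(M_prev)              # v_t from Algorithm 1
        self.w = project_halfspace(self.v, M_prev, self.a, self.b)
        return self.w                                    # w_t = Pi_t(v_t), in W

    def update(self, g, M_prev):
        # Surrogate loss l_t(v) = (1/2)(<g_t, v> + ||g_t||_{t-1,*} S_t(v)).
        g_dual = np.sqrt(g @ np.linalg.solve(M_prev, g))
        diff = self.v - self.w
        dist = np.sqrt(diff @ M_prev @ diff)             # S_t(v_t)
        grad_S = (M_prev @ diff) / dist if dist > 1e-12 else np.zeros_like(g)
        g_hat = 0.5 * (g + g_dual * grad_S)              # ||g_hat||_{t-1,*} <= ||g_t||_{t-1,*}
        self.inner.update(g_hat, M_prev)
```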
4 Full-Matrix Bounds

The results of the previous section operate with arbitrary norms and in potentially infinite-dimensional spaces. In this section and the next, we will specialize to the case $W \subseteq \mathbb{R}^d$, and show how to obtain so-called full-matrix or preconditioned regret bounds.

In this section, we will consider the full-matrix regret bound given by (2). Up to a factor of $\sqrt{\log(T)}$, this bound is achieved in the case where $W$ is an entire vector space by Mhammedi and Koolen [2020], and similar bounds utilizing various extra assumptions or worse log factors are obtained by Cutkosky and Orabona [2018], Kotłowski [2019], Cesa-Bianchi et al. [2005]. When $W$ is not an entire vector space, it seems harder to achieve this bound. However, some progress has been made in certain settings. For example, when $W$ is the probability simplex, Foster et al. [2017b] achieve a bound of order $\sqrt{rT}$, which adapts automatically to $r$. For more general $W$, Koren and Livni [2017] achieve the desired result if their algorithm is tuned with oracle knowledge of $\sum_{t=1}^T \langle g_t, \bar w\rangle^2$.

To appreciate some of the subtleties of incorporating arbitrary constraints $W$, let us consider one straw-man solution. We might be tempted to start with the algorithm of Mhammedi and Koolen [2020] that achieves (2) in the unconstrained setting, and then try to add constraints through direct application of the unconstrained-to-constrained reduction proposed in Cutkosky and Orabona [2018], which is also the key component of our Algorithm 2. Unfortunately, this might fail because the alterations to the gradients necessary to incorporate the constraints may destroy the original regret bound. If the original losses are $g_t$, and the losses supplied to the unconstrained algorithm after performing the unconstrained-to-constrained transformation are $\tilde g_t$, the final regret will be $O\big(\sqrt{\tilde r\sum_{t=1}^T \langle \tilde g_t, \bar w\rangle^2}\big)$, where $\tilde r$ is the rank of the $\tilde g_t$. It is not clear what the relationship is between this quantity and the desired bound $O\big(\sqrt{r\sum_{t=1}^T \langle g_t, \bar w\rangle^2}\big)$. The reduction allows us to guarantee $\|\tilde g_t\| \le 2\|g_t\|$ for any given norm $\|\cdot\|$, but this does not sufficiently elucidate the problem. For example, even if we knew the norm $\|\cdot\|_{G_T}$ in advance and ensured $\|\tilde g_t\|_{G_T^{-1}} \le 2\|g_t\|_{G_T^{-1}}$, applying Cauchy-Schwarz would yield a regret of $O\big(\|\bar w\|_{G_T}\sqrt{r\sum_{t=1}^T \|\tilde g_t\|_{G_T^{-1}}^2}\big) = O(r\|\bar w\|_{G_T})$, which has the wrong dependence on the rank $r$.

Perhaps surprisingly given this difficulty, a straightforward application of Theorem 5 allows us to obtain (2), up to a factor of $\log(T)$. Note that this is $\sqrt{\log(T)}$ worse than Cutkosky and Orabona [2018], but we are able to handle any convex domain. Intuitively, one can view our Algorithm 1 as providing a specially-constructed unconstrained regret bound that is designed specifically to work around the difficulties in naively applying the unconstrained-to-constrained reduction to an arbitrary unconstrained algorithm.

The key idea in our approach is that the norms $\|\cdot\|_t$ used by Algorithm 2 need not be specified ahead of time: so long as $\|\cdot\|_t$ depends only on $g_1, \dots, g_t$, it is still possible to run the algorithm. Next, observe that $\sum_{t=1}^T \langle g_t, \bar w\rangle^2$ can be viewed as $\|\bar w\|_{G_T}^2$, where we recall that $\|\cdot\|_{G_T}$ is the norm induced by $G_T$: $\|x\|_{G_T}^2 = x^\top G_Tx$. Inspired by these observations, our approach is to run Algorithm 2 using norms built from $G_t$. The algorithm is analyzed in Theorem 6 below.

Theorem 6. Suppose $g_t$ satisfies $\|g_t\| \le 1$ for all $t$, where $\|\cdot\|$ is any norm such that $\frac12\|\cdot\|^2$ is $\sigma$-strongly convex with respect to $\|\cdot\|$. Let $G_t = \sum_{i=1}^t g_ig_i^\top$ and let $r$ be the rank of $G_T$. Suppose we run Algorithm 2 with $\|x\|_t^2 = \|x\|^2 + x^\top(I + G_t)x$, where $I$ is the identity matrix. Then, writing $X = \sqrt{\|\bar w\|^2 + \|\bar w\|_{\mathbf{2}}^2 + \sum_{t=1}^T \langle g_t, \bar w\rangle^2}$, we obtain regret $R_T(\bar w)$ bounded by:

$$R_T(\bar w) \le O\left(1 + \frac{X}{\min(\sigma,1)}\max\left(\log\big(1 + r\log(T)\,X\big),\ \sqrt{r\log(T)\,\log\big(1 + r\log(T)\,X\big)}\right)\right)$$

Proof. We have $\|x\|_t^2 = \|x\|^2 + x^\top(I + G_t)x = \|x\|^2 + \|x\|_{\mathbf{2}}^2 + \sum_{i=1}^t \langle g_i, x\rangle^2$, so that $\|\cdot\|_t$ is increasing in $t$. Further, since $\|x\|_{t-1} \ge \|x\|$, we must have $\|g_t\|_{t-1,*} \le 1$ for all $t$. Next, observe that since $\|g_t\| \le 1$, we have $\langle g_t, x\rangle^2 \le \|x\|^2$ for all $x$, so that

$$\|x\|_{t-1}^2 = \|x\|^2 + \|x\|_{\mathbf{2}}^2 + \sum_{i=1}^{t-1}\langle g_i, x\rangle^2 \ge \|x\|_{\mathbf{2}}^2 + \sum_{i=1}^t \langle g_i, x\rangle^2 = x^\top(I + G_t)x$$

Therefore, we have $\|g_t\|_{t-1,*}^2 \le g_t^\top(I + G_t)^{-1}g_t$. Now recall that for any symmetric positive-definite matrix $M$, $\frac12 x^\top Mx$ is 1-strongly convex with respect to the norm $\sqrt{x^\top Mx}$. Therefore, by Lemma 8 (provided in the supplement), $\frac12\|x\|_t^2$ is $\min(\sigma,1)$-strongly convex with respect to $\|\cdot\|_t$, so that we have satisfied all the hypotheses of Theorem 5. Finally, before we apply Theorem 5, we need to analyze

$$\sum_{t=1}^T \|g_t\|_{t-1,*}^2 \le \sum_{t=1}^T g_t^\top(I + G_t)^{-1}g_t \le \log\frac{\det\big(I + \sum_{t=1}^T g_tg_t^\top\big)}{\det(I)} \le \operatorname{rank}(G_T)\log(T + 1)$$

where we have applied Lemma 11 of Hazan et al. [2007]. The result now follows from Theorem 5.

Note that for concreteness, if we set $\|\cdot\| = \|\cdot\|_{\mathbf{2}}$ in the above bound, then the norms $\|\cdot\|_t$ become the familiar matrix-based norm $\|x\|_t = \sqrt{x^\top(2I + G_t)x}$. We have opted to leave the more general formulation in place to allow for $g_t$ that are not bounded in the $L_2$ norm.
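The key step in the proof is the log-det potential inequality of Hazan et al. [2007] (Lemma 11). A quick numeric sanity check of that inequality, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
d, T = 8, 200
G = np.zeros((d, d))
lhs = 0.0
for _ in range(T):
    g = rng.normal(size=d)
    g /= max(1.0, np.linalg.norm(g))                  # enforce ||g_t||_2 <= 1
    G += np.outer(g, g)                               # G_t = G_{t-1} + g g'
    lhs += g @ np.linalg.solve(np.eye(d) + G, g)      # g' (I + G_t)^{-1} g

rhs = np.linalg.slogdet(np.eye(d) + G)[1]             # log det(I + G_T)
assert lhs <= rhs + 1e-9
print(f"sum of quadratic terms {lhs:.3f} <= log-det potential {rhs:.3f}")
```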
5 Full-Matrix AdaGrad with Oracle Tuning

In this section we consider a different kind of full-matrix bound, inspired by the full-matrix AdaGrad algorithm [Duchi et al., 2010]. Full-matrix AdaGrad can be described as FTRL using the regularizers

$$\psi_t(x) = \frac{1}{2\eta}\langle x, (I + G_t)^{1/2}x\rangle$$

where $\eta$ is a scalar learning rate parameter that must be set by the user (in Duchi et al. [2010], this version of AdaGrad is called the primal-dual update version). Here $(I + G_t)^{1/2}$ indicates the symmetric positive-definite matrix square root of $I + G_t$, which exists since $I + G_t$ is a symmetric positive-definite matrix. This algorithm is empirically very successful, in spite of the computational overhead coming from manipulating the $d \times d$ matrix $G_t$. Indeed, much work has gone into providing approximate versions of this algorithm that reduce the computational load while still retaining some performance benefits [Gupta et al., 2018, Agarwal et al., 2019, Chen et al., 2019].

Prior analyses of full-matrix AdaGrad consider domains $W$ with finite diameter $D = \sup_{x,y\in W}\|x - y\|_{\mathbf{2}}$, and suggest setting $\eta = O(D)$ to obtain a regret bound of:

$$R_T(\bar w) \le O\big(D\operatorname{tr}(G_T^{1/2})\big)$$

However, by linearity of the trace and concavity of the square root, we have:

$$D\operatorname{tr}(G_T^{1/2}) \ge D\sqrt{\operatorname{tr}(G_T)} = D\sqrt{\sum_{t=1}^T \|g_t\|_{\mathbf{2}}^2}$$

The bound $R_T(\bar w) \le D\sqrt{\sum_{t=1}^T \|g_t\|_{\mathbf{2}}^2}$ can be achieved by simple (and fast) online gradient descent with a scalar learning rate, $w_{t+1} = w_t - \frac{Dg_t}{\sqrt{\sum_{i=1}^t \|g_i\|_{\mathbf{2}}^2}}$, so the prior regret bound of full-matrix AdaGrad does not appear to show any benefit gained by the extra matrix computations. This poses a mystery: since the actual algorithm is so effective, it seems we are missing something in the analysis.

We propose a possible explanation for this quandary. The main idea is that, in practice, the theoretical guidance to set $\eta = O(D)$ is rarely used. Instead, $\eta$ is tuned by manually checking different values to find which is empirically best. Thus, if we could show that full-matrix AdaGrad achieves gains with oracle tuning of $\eta$, this might explain the improved performance in practice.

To this end, recall that from Theorem 1 we can write the regret of full-matrix AdaGrad as:

$$R_T(\bar w) \le \frac{\bar w^\top(I + G_T)^{1/2}\bar w}{\eta} + \eta\sum_{t=1}^T g_t^\top(I + G_{t-1})^{-1/2}g_t \le \frac{\bar w^\top G_T^{1/2}\bar w}{\eta} + \eta\operatorname{tr}(G_T^{1/2})$$

where the second inequality is due to Lemma 10 of Duchi et al. [2010], and we have ignored the dependence on $I$ for simpler exposition. It is then clear that with the optimal tuning $\eta = \sqrt{\bar w^\top G_T^{1/2}\bar w / \operatorname{tr}(G_T^{1/2})}$, we obtain the regret bound (3).

In order to appreciate the potential of this bound, let us construct a particular sequence of $g_t$ and evaluate the bound. We will compare the bound (3) to both (2) and (1). Our example will illustrate that (3) can in some sense adapt to the case that $G_T$ is full-rank but approximately low-rank, while the analysis of the full-matrix algorithm in Section 4 does not obviously allow for such behavior. Let $v_1, \dots, v_d$ be an orthonormal basis for the $d$-dimensional vector space containing $W$. Assume $d$ is a perfect square and $T = 2d + 2k\sqrt{d}$ for some integer $k$. For the first $d$ rounds, $g_t = v_t$, and for the second $d$ rounds, $g_{d+t} = -v_t$. For the remaining rounds, we write $t = i + j\sqrt{d} + 2d$ with $i \in \{1, \dots, \sqrt{d}\}$ and $j \in \mathbb{Z}_{\ge 0}$, and set $g_t = \frac{1}{\sqrt{d}}v_d + (-1)^j\sqrt{1 - \tfrac1d}\,v_i$. Intuitively, the losses are cycling with alternating signs through the first $\sqrt{d}$ basis vectors, but always maintain a small positive component in the direction of $v_d$. Notice that since $T - 2d$ is a multiple of $2\sqrt{d}$, the alternating signs imply that $\sum_{t=1}^T g_t$ is a positive scalar multiple of $v_d$.

Consider $\bar w = v_d$. Then both $\sqrt{\sum_{t=1}^T \|g_t\|_{\mathbf{2}}^2}$ and $\sqrt{\operatorname{rank}(G_T)\sum_{t=1}^T \langle g_t, \bar w\rangle^2}$ are of order $\sqrt{T}$, while $\sqrt{\langle \bar w, G_T^{1/2}\bar w\rangle\operatorname{tr}(G_T^{1/2})}$ is smaller by a polynomial factor in $d$. In this case, the trace of $G_T^{1/2}$ captures the fact that even though the $g_t$ span $d$ dimensions, they are approximately contained in only $\sqrt{d}$ dimensions. This allows bound (3) to perform much better than either of the other bounds. In contrast, if the example is modified so that the first $2d$ rounds only cycle between the first $\sqrt{d}$ basis vectors, we would have $\operatorname{rank}(G_T) = \sqrt{d}$, and so the full-matrix bound (2) is the best. Finally, if we increase the component on $v_d$ in each round to, for example, $\frac12$, then the bound (1) is the smallest. Therefore none of the bounds uniformly dominates the others.

To gain a little more intuition for what the bound (3) means, let us investigate the worst-case performance of the bounds (1), (2), and (3) over all $\bar w$ with $\|\bar w\|_{\mathbf{2}} \le 1$. To this end, write $T_{\mathrm{eff}} = \sum_{t=1}^T \|g_t\|_{\mathbf{2}}^2$ and let $\lambda_{\max} = \sup_{\|\bar w\|_{\mathbf{2}}\le 1}\sum_{t=1}^T \langle g_t, \bar w\rangle^2$. Then we clearly have that (1) is $O(\sqrt{T_{\mathrm{eff}}})$, while the bound (2) is at most $O(\sqrt{r\lambda_{\max}})$. On the other hand, by the Cauchy-Schwarz inequality we have $\operatorname{tr}(G_T^{1/2}) \le \sqrt{r_{\mathrm{eff}}T_{\mathrm{eff}}}$, where $r_{\mathrm{eff}} \le r$ is some effective rank that might be much lower than the true rank $r$. With this notation, we have that the bound (3) is at most $(\lambda_{\max}r_{\mathrm{eff}}T_{\mathrm{eff}})^{1/4}$. Thus, we see that the new bound is at most the geometric mean of the bounds (1) and (2), but could potentially be much lower if the effective rank $r_{\mathrm{eff}}$ is smaller than $r$.
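For concreteness, here is a minimal sketch of the full-matrix AdaGrad (primal-dual) update described above on $W = \mathbb{R}^d$, with the learning rate $\eta$ passed in by hand; the oracle value $\eta = \sqrt{\bar w^\top G_T^{1/2}\bar w / \operatorname{tr}(G_T^{1/2})}$ of course depends on quantities unknown during the run. Function names are ours:

```python
import numpy as np

def psd_sqrt(M):
    """Symmetric PSD square root via eigendecomposition."""
    lam, U = np.linalg.eigh(M)
    return (U * np.sqrt(np.clip(lam, 0.0, None))) @ U.T

def full_matrix_adagrad(grads, eta):
    """FTRL with psi_t(x) = (1/(2 eta)) x'(I + G_t)^{1/2} x on W = R^d,
    i.e. w_{t+1} = -eta (I + G_t)^{-1/2} sum_{i<=t} g_i."""
    d = len(grads[0])
    G, theta = np.zeros((d, d)), np.zeros(d)
    w, ws = np.zeros(d), []
    for g in grads:                  # g_t is revealed after w_t is played
        ws.append(w)
        theta += g
        G += np.outer(g, g)
        w = -eta * np.linalg.solve(psd_sqrt(np.eye(d) + G), theta)
    return ws
```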
5.1 Achieving the Optimal Full-Matrix AdaGrad Bound

Now that we see there is some potential advantage to a bound like (3), we will show how to obtain the bound without manually tuning $\eta$, using our framework. The approach is very similar to how we obtained the bound (2): we run Algorithm 2, and in round $t$ we set $\|\cdot\|_t = \|\cdot\|_{(I+G_t)^{1/2}}$. With this setting, the desired bound is an almost immediate consequence of Theorem 5:

Theorem 7. Suppose $W \subseteq \mathbb{R}^d$ and $g_t$ satisfies $\|g_t\|_{\mathbf{2}} \le 1$ for all $t$. Let $G_t = \sum_{i=1}^t g_ig_i^\top$. Define $\|\cdot\|_t$ by $\|x\|_t^2 = x^\top(I + G_t)^{1/2}x$. Then the regret of Algorithm 2 using these norms is bounded by:

$$R_T(\bar w) \le \tilde O\left(\sqrt{\big(\|\bar w\|_{\mathbf{2}}^2 + \bar w^\top G_T^{1/2}\bar w\big)\operatorname{tr}\big(G_T^{1/2}\big)}\right)$$

where the $\tilde O$ notation hides a logarithmic dependency on $\operatorname{tr}\big(G_T^{1/2}\big)\sqrt{\|\bar w\|_{\mathbf{2}}^2 + \bar w^\top G_T^{1/2}\bar w}$.

This theorem recovers the desired bound (3) up to log factors. Moreover, it is possible to interpret the operation of the algorithm as, in some rough sense, learning the optimal learning rate required for the original AdaGrad algorithm to achieve this bound.

Proof. Observe that since $\|g_t\|_{\mathbf{2}} \le 1$, we have $\|g_t\|_{t-1,*} = \|g_t\|_{(I+G_{t-1})^{-1/2}} \le \|g_t\|_{\mathbf{2}} \le 1$, so that the hypotheses of Theorem 5 are satisfied. In order to complete the analysis we need only calculate:

$$\sum_{t=1}^T \|g_t\|_{t-1,*}^2 = \sum_{t=1}^T g_t^\top(I + G_{t-1})^{-1/2}g_t \le \sum_{t=1}^T g_t^\top G_t^{-1/2}g_t \le 2\operatorname{tr}\big(G_T^{1/2}\big)$$

Here, in the first inequality, we mildly abuse notation and use $G_t^{-1/2}$ to indicate the pseudo-inverse of $G_t^{1/2}$. The inequalities then follow from Lemmas 9 and 10 of Duchi et al. [2010]. Finally, observe that $(I + G_T)^{1/2} \preceq I + G_T^{1/2}$, and apply Theorem 5 to obtain the result.
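Putting the sketches together, the following hypothetical driver feeds the norms of Theorem 7, $M_t = (I + G_t)^{1/2}$, to the Algorithm 2 sketch from Section 3.2 (for Theorem 6 with $\|\cdot\| = \|\cdot\|_{\mathbf{2}}$ one would instead use $M_t = 2I + G_t$). The toy domain and all class and function names come from our earlier sketches, not from the paper:

```python
import numpy as np

d, T = 5, 100
rng = np.random.default_rng(2)
a, b = np.ones(d), 1.0                       # toy domain W = {w : sum(w) <= 1}
learner = ConstrainedVaryingNorms(d, a, b)   # Algorithm 2 sketch

G = np.zeros((d, d))
M_prev = np.eye(d)                           # M_0 = (I + G_0)^{1/2} = I
for t in range(T):
    w = learner.predict(M_prev)              # w_t lies in W
    g = rng.normal(size=d)
    g /= max(1.0, np.linalg.norm(g))         # ensure ||g_t||_2 <= 1
    learner.update(g, M_prev)
    G += np.outer(g, g)
    M_prev = psd_sqrt(np.eye(d) + G)         # Theorem 7 norms; PSD-increasing in t
```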
6 Conclusion

We have introduced online linear optimization algorithms that achieve new full-matrix bounds. We generalize prior work to arbitrary domains, and we provide an improvement on the full-matrix AdaGrad bound. Our results are consequences of a general method for constructing regret bounds in terms of arbitrary sequences of norms. Our results raise several interesting open questions. First, our full-matrix regret bound seems to be a factor of $\sqrt{\log(T)}$ worse than the best rate in the unconstrained case, suggesting that there is some room to improve our algorithm or analysis. Second, our present results seem restricted to using FTRL algorithms with regularizers centered at the origin. If instead it were possible for the center to vary over the course of the algorithm, one might hope to achieve bounds similar to those obtained by van Erven and Koolen [2016] in a more general or more efficient manner. Finally, note that in the unconstrained setting, Cutkosky and Sarlos [2019] achieve a bound that appears to dispense with the factor of $r$ in (2). Their bound holds only if the gradients $g_t$ satisfy a certain non-randomness condition. Although our bounds hold unconditionally, this raises the interesting question of whether their conditional bound can be achieved in the constrained setting.

Broader Impact

Our work introduces new algorithms and analysis for generic optimization problems. Our contribution is entirely mathematical, and we do not anticipate it engendering negative ethical concerns.

Acknowledgments and Disclosure of Funding

Much of this work was performed while employed at Google Research.

References

Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 928-936, 2003.

Nicolo Cesa-Bianchi and Gabor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

Shai Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107-194, 2011.

H. Brendan McMahan. A survey of algorithms and analysis for adaptive online learning. arXiv preprint arXiv:1403.3465, 2014.

Elad Hazan. Introduction to online convex optimization. arXiv preprint arXiv:1909.05207, 2019.

Jacob Abernethy, Peter L. Bartlett, Alexander Rakhlin, and Ambuj Tewari. Optimal strategies and minimax lower bounds for online convex games. In Proceedings of the Nineteenth Annual Conference on Computational Learning Theory, 2008.

Peter Auer, Nicolo Cesa-Bianchi, and Claudio Gentile. Adaptive and self-confident on-line learning algorithms. Journal of Computer and System Sciences, 64(1):48-75, 2002.

Elad Hazan, Alexander Rakhlin, and Peter L. Bartlett. Adaptive online gradient descent. In Advances in Neural Information Processing Systems, pages 65-72, 2008.

J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. In Conference on Learning Theory (COLT), 2010.

H. Brendan McMahan and Matthew Streeter. Adaptive bound optimization for online convex optimization. In Proceedings of the 23rd Annual Conference on Learning Theory (COLT), 2010.

Brendan McMahan and Matthew Streeter. No-regret algorithms for unconstrained online convex optimization. In Advances in Neural Information Processing Systems, pages 2402-2410, 2012.

Dylan J. Foster, Alexander Rakhlin, and Karthik Sridharan. Adaptive online learning. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 3375-3383. Curran Associates, Inc., 2015.

Francesco Orabona. Simultaneous model selection and optimization through parameter-free stochastic learning. In Advances in Neural Information Processing Systems, pages 1116-1124, 2014.

Francesco Orabona and Dávid Pál. Coin betting and parameter-free online learning. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 577-585. Curran Associates, Inc., 2016.

Dylan J. Foster, Alexander Rakhlin, and Karthik Sridharan. Online learning: Sufficient statistics and the Burkholder method. In Conference on Learning Theory (COLT), 2018.

Kwang-Sung Jun and Francesco Orabona. Parameter-free online convex optimization with sub-exponential noise. In Conference on Learning Theory, pages 1802-1823, 2019.

Michal Kempka, Wojciech Kotlowski, and Manfred K. Warmuth. Adaptive scale-invariant online algorithms for learning linear models. In International Conference on Machine Learning, pages 3321-3330, 2019.

Dirk van der Hoeven. User-specified local differential privacy in unconstrained adaptive online learning. In Advances in Neural Information Processing Systems, pages 14080-14089, 2019.

Zakaria Mhammedi and Wouter M. Koolen. Lipschitz and comparator-norm adaptivity in online learning. In Conference on Learning Theory, 2020.

Wojciech Kotłowski. Scale-invariant unconstrained online learning. Theoretical Computer Science, 2019.

Ashok Cutkosky and Francesco Orabona. Black-box reductions for parameter-free online learning in Banach spaces. In Conference On Learning Theory, pages 1493-1529, 2018.

Tomer Koren and Roi Livni. Affine-invariant online optimization and the low-rank experts problem. In Advances in Neural Information Processing Systems, pages 4747-4755, 2017.

S. Shalev-Shwartz. Online Learning: Theory, Algorithms, and Applications. PhD thesis, The Hebrew University of Jerusalem, 2007.

Francesco Orabona. Dimension-free exponentiated gradient. In Advances in Neural Information Processing Systems, pages 1806-1814, 2013.

Dylan J. Foster, Satyen Kale, Mehryar Mohri, and Karthik Sridharan. Parameter-free online learning via model selection. In Advances in Neural Information Processing Systems, pages 6020-6030, 2017a.

Ashok Cutkosky and Kwabena Boahen. Online learning without prior information. In Conference on Learning Theory, pages 643-677, 2017.

Ashok Cutkosky and Tamas Sarlos. Matrix-free preconditioning in online learning. In International Conference on Machine Learning, pages 1455-1464, 2019.
Nicolo Cesa-Bianchi, Alex Conconi, and Claudio Gentile. A second-order perceptron algorithm. SIAM Journal on Computing, 34(3):640-668, 2005.

Dylan J. Foster, Alexander Rakhlin, and Karthik Sridharan. ZigZag: A new approach to adaptive online learning. In Conference on Learning Theory, pages 876-924, 2017b.

Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169-192, 2007.

Vineet Gupta, Tomer Koren, and Yoram Singer. Shampoo: Preconditioned stochastic tensor optimization. In International Conference on Machine Learning, pages 1837-1845, 2018.

Naman Agarwal, Brian Bullins, Xinyi Chen, Elad Hazan, Karan Singh, Cyril Zhang, and Yi Zhang. Efficient full-matrix adaptive regularization. In International Conference on Machine Learning, pages 102-110, 2019.

Xinyi Chen, Naman Agarwal, Elad Hazan, Cyril Zhang, and Yi Zhang. Extreme tensoring for low-memory preconditioning. arXiv preprint arXiv:1902.04620, 2019.

Tim van Erven and Wouter M. Koolen. MetaGrad: Multiple learning rates in online learning. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 3666-3674. Curran Associates, Inc., 2016.