# Parameter-free, Dynamic, and Strongly-Adaptive Online Learning

Ashok Cutkosky (Google Research; Boston University, Boston, Massachusetts, USA)

**Abstract.** We provide a new online learning algorithm that for the first time combines several disparate notions of adaptivity. First, our algorithm obtains a parameter-free regret bound that adapts to the norm of the comparator and the squared norm of the size of the gradients it observes. Second, it obtains a strongly-adaptive regret bound, so that for any given interval of length $N$, the regret over the interval is $\tilde{O}(\sqrt{N})$. Finally, our algorithm obtains an optimal dynamic regret bound: for any sequence of comparators with path-length $P$, our algorithm obtains regret $\tilde{O}(\sqrt{PN})$ over intervals of length $N$. Our primary technique for achieving these goals is a new method of combining constrained online learning regret bounds that does not rely on an expert meta-algorithm to aggregate learners.

## 1. Online Learning and Strong-Adaptivity

Online learning is a popular way to analyze iterative optimization algorithms (Shalev-Shwartz, 2011; Zinkevich, 2003). Briefly, online learning is a game played between the learning algorithm and an environment over $T$ rounds. In each round, the learning algorithm outputs a vector $w_t$ in some convex space $W$, and then the environment reveals a loss function $\ell_t$ and the learner suffers loss $\ell_t(w_t)$. The goal is typically to obtain small regret:

$$R_T(\mathring{w}) = \sum_{t=1}^T \ell_t(w_t) - \ell_t(\mathring{w}).$$

We can use this game to model the progress of an iterative machine learning training procedure: let $w_t$ be the $t$th set of model parameters output by the training algorithm and $\ell_t$ the loss on the $t$th minibatch. Then the regret simply measures how fast the training algorithm converges to the optimal parameters $\mathring{w}$. More generally, online learning can be used to model many sequential decision-making processes in which the characteristics of the $\ell_t$ may change from round to round. For example, $\ell_t$ may represent outcomes varying from the effects of a medical treatment to the score obtained by a game-playing robot.

In order to achieve tractable theoretical guarantees, in this paper we make the common assumption that each $\ell_t$ is a convex function. With this assumption, we can set $g_t$ to be a subgradient of $\ell_t$ at $w_t$ and then obtain

$$R_T(\mathring{w}) \le \sum_{t=1}^T \langle g_t, w_t - \mathring{w} \rangle. \tag{1}$$

All our results will come from upper bounding this linearized regret, so for simplicity in the rest of this paper we assume that each loss $\ell_t$ is linear and use the right-hand side of (1) as the definition of the regret. This setting is often called online linear optimization (OLO), and many popular and successful algorithms have been designed and analyzed for it (Duchi et al., 2010; McMahan & Streeter, 2010; Zinkevich, 2003).

It is well known that in online linear optimization one cannot guarantee worst-case regret better than $O(DG\sqrt{T})$, where $G = \max_t \|g_t\|$ and $D = \sup_{x,y \in W} \|x - y\|$ is the diameter of the domain $W$; this bound is achieved by online gradient descent (Abernethy et al., 2008).
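Since OLO is the basic primitive throughout this paper, a minimal sketch of the protocol may help fix ideas. The code below implements projected online (sub)gradient descent on linear losses over a Euclidean ball; with step size $\eta = D/(G\sqrt{T})$ it attains the $O(DG\sqrt{T})$ worst-case bound quoted above. The function names are illustrative, not from the paper.

```python
import numpy as np

def project_l2_ball(w, radius):
    """Euclidean projection onto the ball {w : ||w|| <= radius}."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

def online_gradient_descent(grads, radius, eta):
    """Projected OGD on linear losses <g_t, w>.

    `grads` is the sequence of loss vectors g_t revealed by the environment.
    Returns the plays w_1, ..., w_T; the regret against any fixed comparator
    in the ball is O(D G sqrt(T)) for eta = D / (G sqrt(T)), D = 2 * radius.
    """
    w = np.zeros_like(grads[0], dtype=float)
    plays = []
    for g in grads:
        plays.append(w)                           # play w_t, then observe g_t
        w = project_l2_ball(w - eta * g, radius)  # gradient step + projection
    return plays
```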
As a result, much of the work in online linear optimization is in designing adaptive algorithms (Hazan et al., 2008; Duchi et al., 2010; McMahan & Streeter, 2010). These algorithms should perform better in non-worst-case settings without sacrificing worst-case guarantees, and typically obtain bounds of the form

$$R_T(\mathring{w}) = O\left(D\sqrt{\sum_{t=1}^T \|g_t\|^2}\right).$$

The above bound achieves adaptivity to non-worst-case sequences of losses $g_t$, but does not adapt to a non-worst-case comparison point $\mathring{w}$. In order to fix this, so-called parameter-free algorithms (Cutkosky & Orabona, 2018; Kempka et al., 2019; Cutkosky & Sarlos, 2019; van der Hoeven, 2019; Mhammedi & Koolen, 2020) obtain smaller regret when $\|\mathring{w}\|$ is small:

$$R_T(\mathring{w}) = \tilde{O}\left(\|\mathring{w}\|\sqrt{\sum_{t=1}^T \|g_t\|^2}\right).$$

In a parallel direction, other works go beyond the minimax optimality of online gradient descent by considering harder notions of regret. In particular, one attractive goal is to obtain a so-called strongly-adaptive guarantee. Strongly-adaptive algorithms obtain

$$\sum_{t=a}^b \langle g_t, w_t - \mathring{w} \rangle = \tilde{O}\left(DG\sqrt{b-a}\right)$$

for all intervals $[a, b] \subseteq [1, T]$ simultaneously. This guarantee was first obtained by Daniely et al. (2015), and later Jun et al. (2017) improved the logarithmic dependencies to a regret bound of $O\big(DG\sqrt{(b-a)\log(T)}\big)$. Intuitively, these algorithms provide robustness to environments that display some kind of shifting behavior, with losses on different intervals having very different statistics.

Finally, the notion of dynamic regret allows the comparison point $\mathring{w}$ to change:

$$R_T(\mathring{w}_1, \dots, \mathring{w}_T) = \sum_{t=1}^T \langle g_t, w_t - \mathring{w}_t \rangle.$$

This setting has been studied in several previous works (often under the names shifting or tracking regret) (Herbster & Warmuth, 1998; György et al., 2012; György & Szepesvári, 2016; Zhang et al., 2018a), and there are several notions of performance in this setting (see Zhang et al. (2018b) for a description of many of them), but in this paper we focus on obtaining a regret bound in terms of the path-length $P = \sum_{t=1}^{T-1} \|\mathring{w}_{t+1} - \mathring{w}_t\|$. Online gradient descent already obtains regret $O(D\sqrt{T})$, but this can be improved to $O(\sqrt{DPT})$ (and no further) without any knowledge of $P$, as shown by Zhang et al. (2018a).

Given these disparate threads of research, it is natural to look for an approach that can obtain all three goals in a single algorithm. This is the main result of our paper: given any norm $\|\cdot\|$, we provide an algorithm such that for any interval $[a, b]$ the dynamic regret over that interval is

$$\sum_{t=a}^b \langle g_t, w_t - \mathring{w}_t \rangle = O\left(\sqrt{D\left(D + \sum_{t=a}^{b-1} \|\mathring{w}_{t+1} - \mathring{w}_t\|\right)\sum_{t=a}^b \|g_t\|^2}\;\log(T)\right).$$

Moreover, for the interval $[1, T]$ our algorithm maintains a non-dynamic parameter-free regret bound:

$$R_T(\mathring{w}) = \tilde{O}\left(\|\mathring{w}\|\sqrt{\sum_{t=1}^T \|g_t\|^2 \log(\|\mathring{w}\|T)}\right).$$

Note that in all cases, our bounds achieve adaptivity to the value of $\sum_t \|g_t\|^2$, which we call a second-order bound, and so automatically perform better when the losses are less adversarial. Previous algorithms obtain parts of this goal. For example, the algorithm of Zhang et al. (2018a) obtains optimal dynamic regret with a second-order bound, but only for the interval $[a, b] = [1, T]$ (it is not strongly-adaptive). On the other hand, Zhang et al. (2019a) does not consider the dynamic-regret setting, but obtains a strongly-adaptive algorithm with regret $O\big(D\sqrt{\max_t \|g_t\| \sum_{t=a}^b \|g_t\|}\big)$ for any interval $[a, b]$ (their algorithm technically considers the case of non-negative smooth losses, but with a little effort one can see that they achieve this stated bound in our setting), which does not quite achieve a second-order bound. To the best of our knowledge, the only other algorithm to achieve strongly-adaptive regret that also obtains a second-order bound is that of Zhang et al. (2019b). However, their algorithm runs in $O(\log^2(T))$ time per update, while ours runs in $O(\log(T))$ time, which matches the best-known rate for any strongly-adaptive algorithm. We are not aware of any algorithms that achieve the parameter-free regret bound while also achieving either strongly-adaptive or dynamic regret bounds.
Further, we believe our result is the first to combine strongly-adaptive regret with the optimal dependence on path-length for dynamic regret. Interestingly, it turns out that this latter result is due almost entirely to an improvement in analysis rather than a novel property of our algorithm. We suspect that our same proof shows that essentially all known strongly-adaptive algorithms to date also achieve the optimal dependence on the path-length in dynamic regret.

In order to obtain these improved results, our techniques are qualitatively somewhat different from previous approaches. The standard paradigm for designing strongly-adaptive methods, as outlined initially by Daniely et al. (2015), is to instantiate $O(\log(T))$ different online learning algorithms, each dedicated to a different interval in $[1, T]$. The intervals are chosen cleverly so that any $[a, b]$ can be decomposed into a disjoint union of $O(\log(b-a))$ intervals. Then, a meta-algorithm is used to aggregate the $O(\log(T))$ outputs of the base algorithms in order to create a single method that never performs much worse than any of the base algorithms.

Our approach eschews the use of a meta-algorithm. Instead, we provide a generic technique that takes an online learner and any interval $[a, b]$ and produces a new learner whose regret on any interval $I \subseteq [a, b]$ is at most an additive constant worse than the regret of the original learner on that interval, and whose regret on the interval $[a, b]$ is at most $O\big(\sqrt{\sum_{t=a}^b \|g_t\|^2}\big)$. Intuitively, on the interval $[a, b]$, we run a second online learner whose predictions are added to those of the first learner on that interval. The second learner's objective is in some sense to minimize the residual in the loss obtained by the first learner. This new approach is what allows our algorithm to maintain the parameter-free regret bound.

## 2. Setting, Notation, and General Strategy

We consider the online linear optimization game in which the learner outputs $w_t \in W$ for some convex domain $W$ and then suffers loss $\langle g_t, w_t \rangle$ for some loss vector $g_t$. We define the dynamic regret over an interval $I = [a, b]$ as

$$R_I(\mathring{\mathbf{w}}) = \sum_{t \in I} \langle g_t, w_t - \mathring{w}_t \rangle,$$

where $\mathring{\mathbf{w}}$ denotes the vector of comparison points $\mathring{w}_1, \dots, \mathring{w}_T$. We will occasionally refer to $R_I(\mathring{w})$ without a bold-font $\mathring{\mathbf{w}}$; this indicates a non-dynamic regret in which $\mathring{\mathbf{w}} = (\mathring{w}, \dots, \mathring{w})$. Given an interval $I = [a, b]$, we denote the comparator path-length over the interval as $P_I = \sum_{t=a}^{b-1} \|\mathring{w}_{t+1} - \mathring{w}_t\|$. We also use a compressed-sum notation to reduce clutter in formulas: $\|g\|^2_{a:b}$ indicates $\sum_{t=a}^b \|g_t\|^2$.

We assume that the domain $W$ is a convex subset of $\mathbb{R}^d$, and let $\|\cdot\|$ indicate the standard Euclidean norm. Note that all of our presentation is essentially unchanged if $W$ is a subset of a Hilbert space and $\|\cdot\|$ is the Hilbert space norm, but we restrict to $\mathbb{R}^d$ for ease of exposition. We will constrain $g_t$ to satisfy $\|g_t\| \le 1$ for all $t$. We assume that our domain $W$ is bounded, and use $D$ to indicate the diameter of $W$ with respect to this norm: $D = \sup_{x,y \in W} \|x - y\|$. Finally, we assume that the origin is in $W$. This assumption is without loss of generality: if the origin is not in $W$, we can simply choose some arbitrary point $w \in W$ and translate our coordinates to make $w$ the origin.
A key component in our analysis is the existence of so-called parameter-free algorithms (McMahan & Streeter, 2012; Orabona, 2014; Orabona & Pál, 2016; Foster et al., 2018; Cutkosky & Orabona, 2018). These are algorithms that obtain non-dynamic regret

$$R_{[1,T]}(\mathring{w}) \le O\left(\epsilon + \|\mathring{w}\|\sqrt{T \log(T\|\mathring{w}\|/\epsilon)}\right),$$

where $\epsilon > 0$ is an arbitrary user-specified constant. The key advantage of these algorithms is that their regret automatically adjusts to the value of $\|\mathring{w}\|$. In particular, if $\mathring{w} = 0$, the regret is bounded by the constant $\epsilon$. Put another way, the total loss of the algorithm, $\sum_{t=1}^T \langle g_t, w_t \rangle$, cannot be larger than $\epsilon$. In this paper, we will use algorithms that are not only parameter-free, but also achieve a second-order regret bound. Specifically, we have the following consequence of results in (Cutkosky & Sarlos, 2019; Cutkosky & Orabona, 2018):

**Theorem 1.** For any convex set $W$ and any $\psi > 0$, there exists an algorithm that obtains (non-dynamic) regret

$$R_{[1,T]}(\mathring{w}) = \sum_{t=1}^T \langle g_t, w_t - \mathring{w} \rangle \le \psi + C\|\mathring{w}\|\sqrt{(1 + \|g\|^2_{1:T})\log\left(\frac{\|\mathring{w}\|(1 + \|g\|^2_{1:T})}{\psi} + 1\right)} + C\|\mathring{w}\|\log\left(\frac{\|\mathring{w}\|(1 + \|g\|^2_{1:T})}{\psi} + 1\right)$$

for some absolute constant $C$, where $\|g\|^2_{1:T}$ is notation for $\sum_{t=1}^T \|g_t\|^2$, so long as $\|g_t\| \le 1$ for all $t$.
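The algorithm of Theorem 1 is built from the cited reductions, but the parameter-free phenomenon itself is easy to exhibit in one dimension. Below is a sketch of the classic Krichevsky-Trofimov coin-betting learner of Orabona & Pál (2016); it guarantees that its total loss never exceeds the initial wealth $\epsilon$, mirroring the "constant regret against the origin" property used throughout this paper. This is an illustration only, not the algorithm of Theorem 1.

```python
def kt_coin_betting(grads, eps=1.0):
    """1-D coin-betting learner with the Krichevsky-Trofimov bettor.

    Plays w_t = beta_t * wealth_t, with betting fraction
    beta_t = (sum of past coins) / t, where coin c_t = -g_t in [-1, 1].
    Since |beta_t| < 1, wealth stays positive, so the cumulative loss
    sum_t g_t * w_t = eps - wealth is at most eps.  Assumes |g_t| <= 1.
    """
    wealth, coin_sum, plays = eps, 0.0, []
    for t, g in enumerate(grads, start=1):
        beta = coin_sum / t          # KT betting fraction
        w = beta * wealth            # bet a fraction of current wealth
        plays.append(w)
        wealth -= g * w              # wealth_{t+1} = wealth_t + c_t * w_t
        coin_sum += -g
    return plays
```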
### 2.1. Overview of Approach

Our general approach bears many similarities to the pioneering work of Daniely et al. (2015). We will construct a set of representative intervals $S$ such that any interval $[a, b]$ can be written as a disjoint union of at most $O(\log_2(b-a))$ intervals in $S$. Then we will construct an algorithm such that, for each interval $I \in S$, the regret over the interval is $\tilde{O}(\sqrt{|I|})$, where $|I|$ is the length of the interval. Since each general interval $[a, b]$ can be constructed using a small number of intervals in $S$, this will be sufficient to show strongly-adaptive regret. The run-time of our algorithm will be proportional to the number of intervals of $S$ active in any given round, and so it is important that this number be reasonably small. As outlined in Section 5, we use the geometric covering intervals suggested by Daniely et al. (2015), which lead to $O(\log(T))$ active intervals per round.

In order to build an algorithm that has low regret for each interval $I \in S$, we provide a generic procedure that takes an arbitrary algorithm and an interval $I$ and produces a new algorithm whose regret on $I$ is $\tilde{O}(\sqrt{|I|})$ without changing the regret on any other interval. By applying this construction for each interval in $S$, we develop our desired algorithm. We outline how to achieve this in Section 3. Our main result is then presented in Theorem 6. Finally, we show that this approach to strongly-adaptive regret also implies optimal dynamic regret in Theorem 7.

## 3. Residual Learning

In this section we show how to take an algorithm $A$ and an interval $[a, b]$ and produce a new algorithm that never has worse regret than $A$, but is also guaranteed to have low regret on the given interval. To gain some intuition for how our technique works, let us for the moment ignore the constraint set $W$ and allow our algorithm to be improper, in the sense that its outputs $w_t$ may lie outside $W$. Then, one first idea is to have a second learner $B$ run only on the interval $[a, b]$ on the losses $\ell_t(w) = \langle g_t, w - x_t \rangle$, where $x_t$ is the $t$th output of $A$. In some sense, $B$ is attempting to minimize the residual error left by the first learner. If $y_t$ is the output of $B$, we might think to play $w_t = x_t + y_t$. Intuitively, if $x_t$ is doing poorly, $y_t$ can pick up the slack and vice versa.

Unfortunately, this technique does not quite work, because now $B$'s comparison points are essentially $\mathring{w}_t - x_t$, which can have a much larger path-length than the original $\mathring{w}_t$. Note that in the case of non-dynamic regret and the fixed interval $[1, T]$, Cutkosky (2019) avoided this problem by assuming that $A$ has a reasonable regret bound, which implies that the $x_t$ have some nice properties. However, in our case we do not wish to assume anything about the loss of $A$ on the interval $[a, b]$, and so we need a more delicate approach.

Our key idea to fix the problem with adding iterates is to introduce an extra dimension. $B$ will output points in $W \times [0, 1] \subseteq \mathbb{R}^{d+1}$, and its loss at time $t$ is given by:

$$\ell_t(y, z) = \langle g_t, y \rangle - z\langle g_t, x_t \rangle.$$

Notice that $\ell_t(y, z)$ is convex (in fact, linear) in $(y, z)$. Then at time $t$, if $(y_t, z_t)$ is the output of the second learner, we play the point $w_t = x_t + y_t - z_t x_t$. Using this method, $\ell_t$ has the following powerful properties:

$$\langle g_t, w_t - \mathring{w}_t \rangle = \ell_t(y_t, z_t) - \ell_t(\mathring{w}_t, 1), \tag{2}$$
$$\langle g_t, w_t - \mathring{w}_t \rangle = \langle g_t, x_t - \mathring{w}_t \rangle + \ell_t(y_t, z_t) - \ell_t(0, 0). \tag{3}$$

The first fact implies that the total regret over the interval $[a, b]$ is equal to the regret of $B$ on its losses $\ell_t$, which is under control as long as $B$ obtains a good regret guarantee. The second fact implies that the regret over the interval $[a, b]$ is equal to the regret of $A$ on the same interval plus the regret of $B$ to the constant comparison point $(0, 0)$. This is where we invoke the existence of parameter-free algorithms. By setting $B$ to be the algorithm specified by Theorem 1, which obtains constant regret with respect to the origin, we control this extra regret and so experience very little extra loss over the regret of $A$. Note that this intuition does not deal with the issue of enforcing $w_t \in W$. To address this issue, we take a brief detour to discuss a general method for applying constraints to online learning algorithms.

## 4. Adding Constraints

In order to remove the improperness from the approach outlined in the previous section, we need some way to enforce the constraint $w_t \in W$. Our strategy for this is a generalization of the reduction between unconstrained and constrained online learning proposed by Cutkosky & Orabona (2018). Their method takes the prediction $w_t$ of an unconstrained algorithm and outputs the constrained prediction $\Pi(w_t) = \operatorname{argmin}_{w \in W} \|w - w_t\|$. Then the original algorithm is provided with the surrogate loss function $\hat{\ell}_t(w) = \langle g_t, w \rangle + \|g_t\| S(w)$, where $S(x) = \|x - \Pi(x)\|$. The additional term $\|g_t\| S(w)$ serves as a penalty for attempting to violate the constraints. This penalty is carefully designed in such a way that the regret of the unconstrained algorithm on the penalized losses is an upper bound on the regret of the constrained points on the true losses.

We make two mild improvements upon their reduction. First, we design a better penalty term that allows us to remove an extra factor of two from the original regret analysis. Second, we observe that it is possible to change the constraint set at every round, so long as the comparison points $\mathring{w}_t$ always lie within the $t$th constraint set. The algorithm and analysis are presented in Algorithm 1 and Theorem 2.

**Algorithm 1: Varying Constraints**

```
Input: online learning algorithm A; sequence of domains V_1, ..., V_T contained in R^d.
for t = 1 ... T do
    Get w_t from A.
    Define the function Pi_t(x) = argmin_{w in V_t} ||w - x||.
    Output w^_t = Pi_t(w_t).
    Get loss g_t.
    Let v_t = (w_t - w^_t) / ||w_t - w^_t||.
    Define the function S_t(w) = ||w - Pi_t(w)||.
    Define l^_t(w) by:
        l^_t(w) = <g_t, w>                          if <g_t, w_t> >= <g_t, w^_t>
        l^_t(w) = <g_t, w> - <g_t, v_t> S_t(w)      if <g_t, w_t> <  <g_t, w^_t>
    Compute g^_t in the subdifferential of l^_t at w_t.
    Send g^_t to A as the t-th loss.
end for
```
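To make the reduction concrete, here is a minimal Python sketch of one round of Algorithm 1, under the assumption that a routine `project` computing the Euclidean projection onto $V_t$ is available; the variable names are illustrative.

```python
import numpy as np

def constrained_round(w, g, project):
    """One round of Algorithm 1 (sketch).

    `w` is the unconstrained learner's prediction w_t, `g` the loss vector
    g_t, and `project` the Euclidean projection onto V_t.  Returns the
    constrained play w_hat_t and the surrogate subgradient g_hat_t to feed
    back to the unconstrained learner A.
    """
    w_hat = project(w)                 # \hat{w}_t = Pi_t(w_t)
    if g @ w >= g @ w_hat:
        return w_hat, g                # surrogate loss is just <g_t, .>
    v = (w - w_hat) / np.linalg.norm(w - w_hat)   # unit vector toward w_t
    # Subgradient of <g_t, w> - <g_t, v_t> S_t(w) at w_t, using the fact
    # that the subdifferential of S_t at w_t is {v_t}:
    g_hat = g - (g @ v) * v
    return w_hat, g_hat
```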
**Theorem 2.** The functions $\hat{\ell}_t$ defined in Algorithm 1 are convex functions defined on all of $\mathbb{R}^d$, and the subgradients $\hat{g}_t$ sent to $A$ by Algorithm 1 satisfy $\|\hat{g}_t\| \le \|g_t\|$. Also, for all $t$ and all $\mathring{w} \in V_t$ we have

$$\langle g_t, \hat{w}_t - \mathring{w} \rangle \le \hat{\ell}_t(w_t) - \hat{\ell}_t(\mathring{w}) \le \langle \hat{g}_t, w_t - \mathring{w} \rangle.$$

*Proof.* First, note that depending on the values of $w_t$, $\hat{w}_t$, and $g_t$, $\hat{\ell}_t$ is either $\langle g_t, w \rangle$ for all $w$, or $\langle g_t, w \rangle - \langle g_t, v_t \rangle S_t(w)$ for all $w$. In the former case, we have $\hat{g}_t = g_t$ and so all claims in the statement are immediate. Let us focus on the latter case.

We have from Proposition 1 of (Cutkosky & Orabona, 2018) that $S_t(w)$ is a 1-Lipschitz convex function, so to show that $\hat{\ell}_t$ is convex it suffices to show that $\langle g_t, v_t \rangle \le 0$. To this end, observe that we can write $w_t = \hat{w}_t + \alpha v_t$, where $\alpha = \|w_t - \hat{w}_t\| = S_t(w_t) \ge 0$. Therefore, if $\langle g_t, w_t \rangle < \langle g_t, \hat{w}_t \rangle$, we must have $\langle g_t, v_t \rangle < 0$, so that $\hat{\ell}_t$ is convex. Further, we have $\partial S_t(w_t) = \{v_t\}$ by Theorem 4 of (Cutkosky & Orabona, 2018). Therefore $\hat{g}_t = g_t - \langle g_t, v_t \rangle v_t$. Since $\|v_t\| = 1$, by the triangle inequality we have $\|\hat{g}_t\| \le 2\|g_t\|$; in fact, $\hat{g}_t$ is the orthogonal projection of $g_t$ onto the subspace perpendicular to $v_t$, so that $\|\hat{g}_t\| \le \|g_t\|$.

Next, notice that if $\langle g_t, w_t \rangle < \langle g_t, \hat{w}_t \rangle$, we have

$$\hat{\ell}_t(w_t) = \langle g_t, w_t \rangle - \alpha\langle g_t, v_t \rangle = \langle g_t, w_t \rangle + \langle g_t, \hat{w}_t - w_t \rangle = \langle g_t, \hat{w}_t \rangle,$$

while $\hat{\ell}_t(\mathring{w}) = \langle g_t, \mathring{w} \rangle$ for any $\mathring{w} \in V_t$ since $S_t(\mathring{w}) = 0$. This implies

$$\langle g_t, \hat{w}_t - \mathring{w} \rangle = \hat{\ell}_t(w_t) - \hat{\ell}_t(\mathring{w}) \le \langle \hat{g}_t, w_t - \mathring{w} \rangle,$$

where the final inequality is convexity of $\hat{\ell}_t$, as desired. ∎

Theorem 2 allows us to add arbitrary time-varying constraints to an online learning algorithm without damaging regret bounds, even if those bounds happen to depend on the values of $\|g_t\|$. Using this, we can fix the approach outlined in Section 3. Recall that given $x_t$, we have a second algorithm $B$ that outputs $(y_t, z_t)$ in response to losses $\ell_t(y, z) = \langle g_t, y \rangle - z\langle g_t, x_t \rangle$, and our final output is $w_t = x_t + y_t - z_t x_t$. In order to guarantee that $w_t \in W$, we use the fact that $x_t \in W$ and define the set

$$V_t = \{(y, z) : x_t + y - z x_t \in W,\ z \in [0, 1],\ y \in W\}.$$

Notice that $V_t$ is a convex set, and $(0, 0)$ and $(w, 1)$ are both in $V_t$ for all values of $w \in W$. Therefore we can simply constrain $B$ to play within the set $V_t$, and by definition we will have $w_t = x_t + y_t - z_t x_t \in W$. Moreover, since $(0, 0)$ and $(\mathring{w}_t, 1)$ are both in $V_t$, we can still make use of the important properties (2) and (3). The full algorithm is described in Algorithm 2 and the analysis is in Theorem 3 below.

**Algorithm 2: Residual Algorithm**

```
Input: intervals I_1, ..., I_K; online learner A; number eps > 0.
Initialize algorithm B from Algorithm 1, using the algorithm of Theorem 1
  as the base unconstrained algorithm, with psi = eps / K.
for t = 1 ... T do
    Get x_t in W from A.
    if t is not in I_k for any k then
        Output w_t = x_t.
        Get g_t and send it to A.
    else
        If t is at the beginning of some I_k, reinitialize B.
        Let k be such that t in I_k, with I_k = [a, b].  Set t' = t - a + 1.
        Set V_t = {(y, z) in R^d x R : x_t + y - z x_t in W and z in [0, 1]}.
        Send V_t to B as the t'-th constraint set.
        Set (y_t, z_t) to the t'-th output of B.
        Output w_t = x_t + y_t - z_t x_t in W.
        Get loss g_t.
        Send g_t to A as the t-th loss.
        Send (g_t, -<g_t, x_t>) to B as the t'-th loss.
    end if
end for
```

**Theorem 3.** Let $A$ be an online learner that outputs $x_t \in W$ in response to losses $g_t$. Let $I_1, \dots, I_K$ be some disjoint intervals in $[1, T]$. Let $J$ be any interval such that for all $k$, $J$ either contains $I_k$ or is disjoint from $I_k$. Then given $\epsilon > 0$, Algorithm 2 guarantees regret

$$\sum_{t \in J} \langle g_t, w_t - \mathring{w}_t \rangle \le R^A_J(\mathring{\mathbf{w}}) + \epsilon,$$

where $R^A_J(\mathring{\mathbf{w}}) = \sum_{t \in J} \langle g_t, x_t - \mathring{w}_t \rangle$ is the regret of $A$ on the interval $J$. Further, for each interval $I_k$, writing $G_{I_k} = 1 + \sum_{t \in I_k} \|g_t\|^2$, we have $R_{I_k}(\mathring{\mathbf{w}})$ bounded by

$$O\left((D + P_{I_k})\sqrt{G_{I_k} \log\left(\frac{K D |I_k|}{\epsilon} + 1\right)} + (D + P_{I_k})\log\left(\frac{K D |I_k|}{\epsilon} + 1\right)\right),$$

where $D = \sup_{x,y \in W} \|x - y\|$ and $P_{[a,b]} = \sum_{t=a}^{b-1} \|\mathring{w}_{t+1} - \mathring{w}_t\|$.
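Before the proof, a sketch of one in-interval round of Algorithm 2 may clarify how the pieces fit together. The learner object `B` and its `predict`/`update` methods are a hypothetical interface, not from the paper; `B` is assumed to be constrained to $V_t$ via Algorithm 1, and for brevity the loss vector is passed in directly.

```python
def residual_round(x, g, B):
    """One round of Algorithm 2 inside an active interval (sketch).

    `x` is A's output x_t, `g` the loss vector g_t, and `B` the residual
    learner over pairs (y, z) in V_t (hypothetical interface).
    """
    y, z = B.predict()           # (y_t, z_t) in V_t, so w_t lands in W
    w = x + y - z * x            # combined play w_t = x_t + y_t - z_t x_t
    # B's linear loss on (y, z) is <g_t, y> - z <g_t, x_t>, i.e. the
    # (d+1)-dimensional loss vector (g_t, -<g_t, x_t>):
    B.update((g, -(g @ x)))
    return w
```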
*Proof (of Theorem 3).* For the first statement, we break up the interval $J$ into disjoint components that are either equal to some $I_k$ or have no intersection with any $I_k$. Observe that the total regret over $J$ is just the sum of the regrets over these components. For any component $C$ that has no intersection with any $I_k$, we have $w_t = x_t$, where $x_t$ is the output of $A$. Therefore the regret obtained over the rounds that do not lie in any $I_k$ is identical to the regret of $A$ over these rounds.

Next, we compute the regret over any interval $I_k$. Let us write $I_k = [a, b]$. Observe that in $I_k$, $w_t = x_t + y_t - z_t x_t$, where $x_t$ is the output of $A$ and $(y_t, z_t)$ is the $(t - a + 1)$st output of $B$ when running over $I_k$. Therefore we have

$$R_{I_k}(\mathring{\mathbf{w}}) = \sum_{t \in I_k} \langle g_t, x_t + y_t - z_t x_t - \mathring{w}_t \rangle = \sum_{t \in I_k} \langle g_t, x_t - \mathring{w}_t \rangle + \sum_{t \in I_k} \langle g_t, y_t - z_t x_t \rangle = R^A_{I_k}(\mathring{\mathbf{w}}) + \sum_{t=a}^b \ell_t(y_t, z_t) - \ell_t(0, 0).$$

Now, notice that $\sum_{t=a}^b \ell_t(y_t, z_t) - \ell_t(0, 0)$ is the (non-dynamic) regret of $B$ with respect to $(0, 0)$. For all $t$ we have $(0, 0) \in V_t$, so by Theorem 1 and Theorem 2 this regret is at most $\epsilon/K$. Since there are only $K$ total intervals $I_k$, summing up these regret bounds proves the first statement of the Theorem.

For the second statement, we again write $I_k = [a, b]$ and compute:

$$R_{I_k}(\mathring{\mathbf{w}}) = \sum_{t=a}^b \langle g_t, x_t + y_t - z_t x_t - \mathring{w}_t \rangle = \sum_{t=a}^b \ell_t(y_t, z_t) - \ell_t(\mathring{w}_t, 1).$$

Therefore the regret is the dynamic regret of $B$ with respect to the sequence of comparators $(\mathring{w}_a, 1), \dots, (\mathring{w}_b, 1)$. We analyze this dynamic regret separately in Theorem 4 below, from which the result follows. ∎

**Theorem 4.** Suppose $V_1, \dots, V_T$ is a sequence of convex domains, all of which contain the origin and have diameter bounded by $D$, and suppose $\mathring{w}_1, \dots, \mathring{w}_T$ is a sequence of points such that $\mathring{w}_t \in V_t$ for all $t$. Then if we run Algorithm 1 with base algorithm $A$ equal to the algorithm of Theorem 1 (with $\psi = \epsilon$), the combined method achieves dynamic regret

$$\sum_{t=1}^T \langle g_t, \hat{w}_t - \mathring{w}_t \rangle \le \epsilon + C\|\mathring{w}_T\|\sqrt{G\log\left(\frac{\|\mathring{w}_T\|G}{\epsilon} + 1\right)} + C\|\mathring{w}_T\|\log\left(\frac{\|\mathring{w}_T\|G}{\epsilon} + 1\right) + P_T + C P_T\sqrt{G\log\big((T D\epsilon^{-1} + 1)G\big)} + C P_T\log\big((T D\epsilon^{-1} + 1)G\big)$$
$$= O\left[(\|\mathring{w}_T\| + P_T)\sqrt{G\log\big((T D\epsilon^{-1} + 1)G\big)} + (\|\mathring{w}_T\| + P_T)\log\big((T D\epsilon^{-1} + 1)G\big)\right],$$

where $G = 1 + \sum_{t=1}^T \|g_t\|^2$, $P_T = \sum_{t=1}^{T-1} \|\mathring{w}_{t+1} - \mathring{w}_t\|$, and we follow the notation in the pseudocode in using $\hat{w}_t$ to indicate the outputs of the algorithm.

*Proof.* Following the notation of Algorithm 1, let $w_t$ be the outputs of the unconstrained algorithm $A$, and define $R^A_{[1,T]}$ to be the regret of $A$. Then by Theorem 2 we have

$$\sum_{t=1}^T \langle g_t, \hat{w}_t - \mathring{w}_t \rangle \le \sum_{t=1}^T \langle \hat{g}_t, w_t - \mathring{w}_t \rangle = \sum_{t=1}^T \langle \hat{g}_t, w_t - \mathring{w}_T \rangle + \sum_{t=1}^T \langle \hat{g}_t, \mathring{w}_T - \mathring{w}_t \rangle \le R^A_{[1,T]}(\mathring{w}_T) + P_T \max_t \|\hat{g}_{1:t}\|,$$

where $\hat{g}_{1:t} = \sum_{i=1}^t \hat{g}_i$, and the last inequality follows by writing $\mathring{w}_T - \mathring{w}_t = \sum_{i=t}^{T-1} (\mathring{w}_{i+1} - \mathring{w}_i)$ and exchanging the order of summation. The regret $R^A_{[1,T]}(\mathring{w}_T)$ is bounded by Theorem 1 (using $\|\hat{g}_t\| \le \|g_t\|$), so it remains to control $\max_t \|\hat{g}_{1:t}\|$.

Next, we show that $\langle \hat{g}_t, w_t \rangle \ge \langle \hat{g}_t, \hat{w}_t \rangle$. There are two cases: either $\langle g_t, w_t \rangle \ge \langle g_t, \hat{w}_t \rangle$ or not. In the former case, we have $\hat{g}_t = g_t$ and the statement is exactly the case assumption. In the latter case, convexity of $\hat{\ell}_t$ gives

$$\langle \hat{g}_t, w_t - \hat{w}_t \rangle \ge \hat{\ell}_t(w_t) - \hat{\ell}_t(\hat{w}_t) = \langle g_t, w_t - \hat{w}_t \rangle - \langle g_t, v_t \rangle \|w_t - \hat{w}_t\| = 0,$$

so that the statement still follows. Now, applying the regret bound of Theorem 1 to the comparator $-X\hat{g}_{1:t}/\|\hat{g}_{1:t}\|$, we have for any $X \ge 0$:

$$X\|\hat{g}_{1:t}\| + \sum_{i=1}^t \langle \hat{g}_i, w_i \rangle \le \epsilon + CX\sqrt{G\log(XG\epsilon^{-1} + 1)} + CX\log(XG\epsilon^{-1} + 1).$$

Moreover, $\sum_{i=1}^t \langle \hat{g}_i, w_i \rangle \ge \sum_{i=1}^t \langle \hat{g}_i, \hat{w}_i \rangle \ge -Dt$, since $\|\hat{g}_i\| \le 1$ and $\|\hat{w}_i\| \le D$. Therefore

$$\|\hat{g}_{1:t}\| \le \frac{\epsilon + Dt}{X} + C\sqrt{G\log(XG\epsilon^{-1} + 1)} + C\log(XG\epsilon^{-1} + 1).$$

Set $X = \epsilon + Dt$ to obtain

$$\|\hat{g}_{1:t}\| \le 1 + C\sqrt{G\log\big((t D\epsilon^{-1} + 1)G\big)} + C\log\big((t D\epsilon^{-1} + 1)G\big).$$

Now put all these calculations together to prove the Theorem. ∎

## 5. Main Result

In this section we prove our main result by showing how to apply Theorem 3 to build a strongly-adaptive algorithm. Our approach uses the geometric covering intervals suggested by Daniely et al. (2015). Let $N$ be the smallest integer such that $T \le 2^N$; note that $N = O(\log(T))$. For each $i = 0, 1, \dots, N$, we maintain a set of disjoint intervals $S_i$ such that $\bigcup_{I \in S_i} I = [1, T]$. The set $S_i$ consists of all intervals of length $2^i$ starting at a multiple of $2^i$.
Notice that $S_i$ contains at most $O(T/2^i)$ intervals. Also, observe that any given index $t$ is contained within at most one interval in each $S_i$. We write $S = \bigcup_{i=0}^N S_i$. The key property of these intervals is the following, proved in Lemma 5 of (Daniely et al., 2015):

**Proposition 5.** Let $[a, b]$ be any interval contained in $[1, T]$. Then $[a, b]$ can be written as a disjoint union of at most $O(\log_2(b-a))$ intervals such that each interval is in some $S_i$ and no $S_i$ contributes more than 2 intervals to the disjoint union.

Using this Proposition, we build our algorithm in stages. Specifically, we will construct a sequence of algorithms $A_N, A_{N-1}, \dots, A_1$ such that $A_N$ is the algorithm provided by Theorem 4, and $A_1$ is an algorithm that obtains the desired regret guarantees. Formally, we have the following Theorem:

**Theorem 6.** Let $\tau$ be the time required to project to sets of the form $V_t$ as defined in Algorithm 2. Then there exists an algorithm that runs in $O(d\log(T)\tau)$ time per round such that the dynamic regret $R_I(\mathring{\mathbf{w}})$ over any interval $I$ is

$$R_I(\mathring{\mathbf{w}}) = O\left((D + P_I)\sqrt{1 + \sum_{t \in I} \|g_t\|^2}\;\log(T)\right).$$

Moreover, the same algorithm achieves non-dynamic regret

$$R_I(\mathring{w}) = O\left[D\left(\log^2(T) + \sqrt{|I|\log(T)}\right)\right],$$

where $|I|$ is the length of the interval $I$. Finally, the same algorithm also achieves non-dynamic regret bounding $R_{[1,T]}(\mathring{w})$ by

$$O\left[\log(T) + \|\mathring{w}\|\sqrt{\left(1 + \sum_{t=1}^T \|g_t\|^2\right)\log\big(\|\mathring{w}\|T + 1\big)} + \|\mathring{w}\|\log\big(\|\mathring{w}\|T + 1\big)\right].$$

*Proof.* Our first goal is to build an algorithm such that for any interval $J \in S$, we have the slightly better regret bound

$$R_J(\mathring{\mathbf{w}}) = O\left((D + P_J)\sqrt{\left(1 + \sum_{t \in J} \|g_t\|^2\right)\log(T|J|)} + (D + P_J)\log(T|J|)\right). \tag{4}$$

Suppose we have this result. Then recall that any interval $I$ can be written as a disjoint union of $O(\log(|I|))$ intervals in $S$. Further, the regret over the entire interval $I$ is obtained by summing the regret obtained on each interval in the disjoint union. Then applying the Cauchy-Schwarz inequality yields the first statement of the Theorem.

For the second statement, notice that since $\|g_t\| \le 1$ and the comparator is constant, we must also have for any interval $J \in S$:

$$R_J(\mathring{w}) = O\left[D\log(T|J|) + D\sqrt{|J|\log(T|J|)}\right].$$

Given any interval $I$, we write $I$ as the disjoint union of intervals specified by Proposition 5 and sum the regret over the intervals, just as we did to obtain the first statement. However, now that we have abandoned dependence on $\|g_t\|$, we can bound the sum more efficiently than with Cauchy-Schwarz. Specifically, since no $S_i$ contributes more than 2 intervals to the disjoint union, the total regret is bounded by

$$\sum_{k=1}^{1 + \log_2(|I|)} O\left(D\log(T 2^k) + D\sqrt{2^k \log(T 2^k)}\right),$$

which yields the desired expression, since the square-root terms form a geometric series dominated by the last one.

Therefore, it suffices to design an algorithm that achieves (4) for any $J \in S$. Our construction builds a sequence of algorithms $A_N, \dots, A_1$ in such a way that $A_i$ satisfies (4) for any interval $J \in S_j$ with $j \ge i$. Thus $A_1$ will satisfy the regret bound for any $J \in S$. To start, we set $A_N$ to be the algorithm posited by Theorem 4, with each $V_t$ set to $W$ and $\epsilon = 1$. Now we build $A_i$ from $A_{i+1}$ inductively. Observe that every interval in $S_i$ is either disjoint from or completely contained within any interval in $S_j$ for $j \ge i$. Therefore we apply the construction of Theorem 3 to $A_{i+1}$ with the intervals in $S_i$ and $\epsilon = 1$ to obtain $A_i$. Theorem 3 then implies the bound (4), which we have seen implies the first two claims of the Theorem. Further, since by Theorem 4 $A_N$ obtains the bound

$$R_{[1,T]}(\mathring{w}) = O\left[\|\mathring{w}\|\sqrt{\left(1 + \sum_{t=1}^T \|g_t\|^2\right)\log\big(\|\mathring{w}\|T + 1\big)} + \|\mathring{w}\|\log\big(\|\mathring{w}\|T + 1\big)\right]$$

and since $N = O(\log(T))$, Theorem 3 also implies that $A_1$ satisfies the last claim of the Theorem as well. ∎
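For concreteness, here is a sketch of the geometric covering intervals used in this section: a direct transcription of the construction of Daniely et al. (2015), with 1-indexed rounds so that each interval of length $2^i$ begins at an offset that is a multiple of $2^i$.

```python
import math

def geometric_cover(T):
    """Geometric covering intervals S_0, ..., S_N for rounds 1..T.

    S_i partitions [1, T] into intervals of length 2^i (the last may be
    clipped).  There are N + 1 = O(log T) levels, and any round t lies in
    exactly one interval per level, which is what gives the O(log T)
    per-round cost of the construction in Theorem 6.
    """
    N = max(0, math.ceil(math.log2(T)))     # smallest N with T <= 2^N
    S = []
    for i in range(N + 1):
        length = 2 ** i
        S.append([(a, min(a + length - 1, T))
                  for a in range(1, T + 1, length)])
    return S
```

By Proposition 5, any interval $[a, b]$ then decomposes into at most two intervals per level.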
## 6. Optimal Dynamic Regret

The bound of Theorem 6 achieves regret linear in the path-length of the comparator $P_I$, but our goal is to obtain $O(\sqrt{P_I})$. This is the optimal rate, as shown by Zhang et al. (2018a). In this section, we show that in fact a strongly-adaptive guarantee that is linear in $P_I$ actually implies a strongly-adaptive guarantee that depends instead on $\sqrt{P_I}$:

**Theorem 7.** The algorithm described by Theorem 6 also achieves, for any interval $I$,

$$R_I(\mathring{\mathbf{w}}) = O\left(\log(T)\sqrt{D(P_I + D)\sum_{t \in I} \|g_t\|^2}\right).$$

Further, since $\sum_{t \in I} \|g_t\|^2 \le |I|$, we have $R_I(\mathring{\mathbf{w}}) = O\big(\log(T)\sqrt{D(P_I + D)|I|}\big)$.

*Proof.* First, we show that it is possible to break the interval $I$ up into disjoint subintervals $I = J_1 \cup \dots \cup J_K$ such that for each $i$, $P_{J_i} \le 2D$, and $K \le \frac{P_I + D}{D}$. We do this by explicitly building these subintervals via an iterative greedy construction. Let $I = [a, b]$. Our intervals will satisfy $J_i = [t_{i-1}, t_i]$, where $a = t_0 < t_1 < \dots < t_K = b$. Let $J_1 = [a, t_1]$, where $t_1$ is the smallest index such that $P_{J_1} \ge D$. This implies $P_{[a, t_1 - 1]} < D$, and therefore $P_{J_1} \le 2D$ since the diameter of $W$ is $D$. Now, given $t_{i-1}$, let $t_i$ be the smallest index such that $P_{[t_{i-1}, t_i]} \ge D$, and set $J_i = [t_{i-1}, t_i]$. Then again we will have $P_{J_i} \le 2D$. If no such $t_i$ exists, set $i = K$ and $t_i = b$.

By this construction, $P_{J_i} \le 2D$ for all $i$ (including $i = K$, for which $P_{J_K} \le D$). Further, we have $P_{J_i} \ge D$ for all $i < K$ and $\sum_{i=1}^K P_{J_i} \le P_I$. Therefore

$$(K - 1)D \le \sum_{i=1}^{K-1} P_{J_i} \le P_I, \quad\text{so that}\quad K \le \frac{P_I + D}{D}.$$

Therefore we have a set of intervals $J_i$ satisfying the desired properties. Now, on each of these intervals $J_i$, Theorem 6 supplies the regret bound

$$R_{J_i}(\mathring{\mathbf{w}}) = O\left((D + P_{J_i})\sqrt{\sum_{t \in J_i} \|g_t\|^2}\;\log(T)\right) = O\left(D\sqrt{\sum_{t \in J_i} \|g_t\|^2}\;\log(T)\right),$$

using $P_{J_i} \le 2D$. So we sum over all $i$ and use Cauchy-Schwarz:

$$R_I(\mathring{\mathbf{w}}) = \sum_{i=1}^K R_{J_i}(\mathring{\mathbf{w}}) \le O\left(\log(T)\,D\sqrt{K \sum_{t \in I} \|g_t\|^2}\right) \le O\left(\log(T)\sqrt{D(P_I + D)\sum_{t \in I} \|g_t\|^2}\right),$$

where in the last line we used $K \le \frac{P_I + D}{D}$. Finally, observe that $\sum_{t \in I} \|g_t\|^2 \le |I|$ to obtain $R_I(\mathring{\mathbf{w}}) = O\big(\log(T)\sqrt{D(P_I + D)|I|}\big)$. ∎

This optimal dependence on the path-length $P_I$ was previously achieved by Zhang et al. (2018a). Their algorithm only achieves this regret for the entire interval $[1, T]$, while we obtain it for any sub-interval. Both our algorithm and that of Zhang et al. (2018a) require $O(\log(T))$ time per update. However, we note that our proof of Theorem 7 is rather general, and we believe that in fact the same technique may be used to show that all known strongly-adaptive algorithms to date also achieve the optimal dynamic regret. Interestingly, however, our Theorem 7 is not a generic reduction showing that all strongly-adaptive algorithms must also achieve optimal dynamic regret. Although Zhang et al. (2018b) show many interesting reductions between strongly-adaptive algorithms and various other restricted formulations of dynamic regret, to the best of our knowledge the question of whether strongly-adaptive algorithms must optimally adapt to the path-length is still unanswered.
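The greedy construction in the proof of Theorem 7 is easy to make concrete. Below is a sketch (variable names illustrative) that splits a comparator sequence into consecutive blocks of path-length at most $2D$, matching the subintervals $J_i$ above.

```python
import numpy as np

def split_by_path_length(comparators, D):
    """Greedy split of indices 0..n-1 into blocks, as in Theorem 7's proof.

    Closes a block at the first index whose step pushes its path-length to
    D or more, so each block has path-length at most 2D (successive
    comparators move by at most the diameter D), and the number of blocks
    is at most (P + D) / D for total path-length P.  Adjacent blocks share
    an endpoint, matching J_i = [t_{i-1}, t_i] in the proof.
    """
    blocks, start, path = [], 0, 0.0
    for t in range(1, len(comparators)):
        step = float(np.linalg.norm(comparators[t] - comparators[t - 1]))
        if path + step >= D:
            blocks.append((start, t))  # t is the smallest index with path >= D
            start, path = t, 0.0
        else:
            path += step
    blocks.append((start, len(comparators) - 1))
    return blocks
```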
## 7. Conclusion

We have introduced a new algorithm that combines several desirable notions of adaptivity. First, our algorithm obtains regret

$$R_{[1,T]}(\mathring{w}) = \tilde{O}\left(\|\mathring{w}\|\sqrt{\sum_{t=1}^T \|g_t\|^2}\right), \tag{5}$$

which is the optimal adaptivity to the norm of $\mathring{w}$ and the gradients $g_t$. Intuitively, this is the same bound that would be obtained by an optimally tuned online gradient descent, but we do not require any manual tuning. Secondly, our algorithm obtains strongly-adaptive regret: for any interval $[a, b]$, we have regret

$$\sum_{t=a}^b \langle g_t, w_t - \mathring{w} \rangle = \tilde{O}\left(D\sqrt{\sum_{t=a}^b \|g_t\|^2}\right),$$

where $D$ is the diameter of $W$. The only other algorithm we are aware of that achieves strong-adaptivity while also obtaining a bound that depends on $\sum_{t=1}^T \|g_t\|^2$ is that of Zhang et al. (2019b), which incurs an extra factor of $O(\log(T))$ in the run-time due to running an instance of MetaGrad (van Erven & Koolen, 2016) or Maler (Wang et al., 2019) as a subroutine. Finally, our algorithm obtains optimal dynamic regret over any interval:

$$\sum_{t=a}^b \langle g_t, w_t - \mathring{w}_t \rangle = \tilde{O}\left(\sqrt{(D^2 + DP)\sum_{t=a}^b \|g_t\|^2}\right),$$

where $P = \sum_{t=a}^{b-1} \|\mathring{w}_t - \mathring{w}_{t+1}\|$. In this case, we believe that our analysis can actually show that all known strongly-adaptive algorithms achieve this optimal dependence on the path-length $P$.

We believe our techniques exhibit a qualitative difference from prior approaches: instead of building our algorithm by using a sleeping-expert meta-algorithm to combine many base algorithms, we show a way to iteratively build up an algorithm with the desired regret bound. One might be able to view this new approach as in some sense bundling the meta-algorithm's operation into each of the individual base algorithms. We feel that our approach has tangible benefits: for example, it is not obvious how to use any prior sleeping-experts-based algorithm to maintain (5) while also obtaining strong-adaptivity.

Several intriguing open questions remain. First, can we replace the dependencies on $D$ in the strongly-adaptive and dynamic regret bounds with some dependence on $\|\mathring{w}\|$ instead? Second, our regret bounds depend on $\log(T)$, even when the interval $[a, b]$ in question is much smaller than $T$. Is it possible to have a bound that depends only on $\log(b-a)$? Finally, is it possible to improve the $O(\log(T))$ runtime per update currently required by all known strongly-adaptive algorithms?

## References

Abernethy, J., Bartlett, P. L., Rakhlin, A., and Tewari, A. Optimal strategies and minimax lower bounds for online convex games. In Proceedings of the Nineteenth Annual Conference on Computational Learning Theory, 2008.

Cutkosky, A. Combining online learning guarantees. In Proceedings of the Thirty-Second Conference on Learning Theory, pp. 895-913, 2019.

Cutkosky, A. and Orabona, F. Black-box reductions for parameter-free online learning in Banach spaces. In Conference on Learning Theory, pp. 1493-1529, 2018.

Cutkosky, A. and Sarlos, T. Matrix-free preconditioning in online learning. In International Conference on Machine Learning, pp. 1455-1464, 2019.

Daniely, A., Gonen, A., and Shalev-Shwartz, S. Strongly adaptive online learning. In International Conference on Machine Learning, pp. 1405-1411, 2015.

Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. In Conference on Learning Theory (COLT), 2010.

Foster, D. J., Rakhlin, A., and Sridharan, K. Online learning: Sufficient statistics and the Burkholder method. In Conference on Learning Theory (COLT), 2018.

György, A. and Szepesvári, C. Shifting regret, mirror descent, and matrices. In International Conference on Machine Learning, pp. 2943-2951, 2016.

György, A., Linder, T., and Lugosi, G. Efficient tracking of large classes of experts. IEEE Transactions on Information Theory, 58(11):6709-6725, 2012.

Hazan, E., Rakhlin, A., and Bartlett, P. L. Adaptive online gradient descent. In Advances in Neural Information Processing Systems, pp. 65-72, 2008.

Herbster, M. and Warmuth, M. K. Tracking the best expert. Machine Learning, 32(2):151-178, 1998.

Jun, K.-S., Orabona, F., Wright, S., and Willett, R. Improved strongly adaptive online learning using coin betting. In Artificial Intelligence and Statistics, pp. 943-951, 2017.

Kempka, M., Kotlowski, W., and Warmuth, M. K. Adaptive scale-invariant online algorithms for learning linear models. In International Conference on Machine Learning, pp. 3321-3330, 2019.

McMahan, B. and Streeter, M. No-regret algorithms for unconstrained online convex optimization. In Advances in Neural Information Processing Systems, pp. 2402-2410, 2012.

McMahan, H. B. and Streeter, M. Adaptive bound optimization for online convex optimization. In Proceedings of the 23rd Annual Conference on Learning Theory (COLT), 2010.
Mhammedi, Z. and Koolen, W. M. Lipschitz and comparator-norm adaptivity in online learning. In Conference on Learning Theory, 2020.

Orabona, F. Simultaneous model selection and optimization through parameter-free stochastic learning. In Advances in Neural Information Processing Systems, pp. 1116-1124, 2014.

Orabona, F. and Pál, D. Coin betting and parameter-free online learning. In Advances in Neural Information Processing Systems 29, pp. 577-585. Curran Associates, Inc., 2016.

Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107-194, 2011.

van der Hoeven, D. User-specified local differential privacy in unconstrained adaptive online learning. In Advances in Neural Information Processing Systems, pp. 14080-14089, 2019.

van Erven, T. and Koolen, W. M. MetaGrad: Multiple learning rates in online learning. In Advances in Neural Information Processing Systems 29, pp. 3666-3674. Curran Associates, Inc., 2016.

Wang, G., Lu, S., and Zhang, L. Adaptivity and optimality: A universal algorithm for online convex optimization. arXiv preprint arXiv:1905.05917, 2019.

Zhang, L., Lu, S., and Zhou, Z.-H. Adaptive online learning in dynamic environments. In Advances in Neural Information Processing Systems, pp. 1323-1333, 2018a.

Zhang, L., Yang, T., Zhou, Z.-H., et al. Dynamic regret of strongly adaptive methods. In International Conference on Machine Learning, pp. 5877-5886, 2018b.

Zhang, L., Liu, T.-Y., and Zhou, Z.-H. Adaptive regret of convex and smooth functions. In International Conference on Machine Learning, pp. 7414-7423, 2019a.

Zhang, L., Wang, G., Tu, W.-W., and Zhou, Z.-H. Dual adaptivity: A universal algorithm for minimizing the adaptive regret of convex functions. arXiv preprint arXiv:1906.10851, 2019b.

Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pp. 928-936, 2003.