# Adaptive Online Learning in Dynamic Environments

Lijun Zhang, Shiyin Lu, Zhi-Hua Zhou
National Key Laboratory for Novel Software Technology
Nanjing University, Nanjing 210023, China
{zhanglj, lusy, zhouzh}@lamda.nju.edu.cn

## Abstract

In this paper, we study online convex optimization in dynamic environments, and aim to bound the dynamic regret with respect to any sequence of comparators. Existing work has shown that online gradient descent enjoys an $O(\sqrt{T}(1 + P_T))$ dynamic regret, where $T$ is the number of iterations and $P_T$ is the path-length of the comparator sequence. However, this result is unsatisfactory, as there exists a large gap from the $\Omega(\sqrt{T(1 + P_T)})$ lower bound established in our paper. To address this limitation, we develop a novel online method, namely adaptive learning for dynamic environment (Ader), which achieves an optimal $O(\sqrt{T(1 + P_T)})$ dynamic regret. The basic idea is to maintain a set of experts, each attaining an optimal dynamic regret for a specific path-length, and to combine them with an expert-tracking algorithm. Furthermore, we propose an improved Ader based on the surrogate loss, which reduces the number of gradient evaluations per round from $O(\log T)$ to 1. Finally, we extend Ader to the setting where a sequence of dynamical models is available to characterize the comparators.

## 1 Introduction

Online convex optimization (OCO) has become a popular learning framework for modeling various real-world problems, such as online routing, ad selection for search engines, and spam filtering [Hazan, 2016]. The protocol of OCO is as follows: At iteration $t$, the online learner chooses $x_t$ from a convex set $X$. After the learner has committed to this choice, a convex cost function $f_t : X \mapsto \mathbb{R}$ is revealed. Then, the learner suffers an instantaneous loss $f_t(x_t)$, and the goal is to minimize the cumulative loss over $T$ iterations. The standard performance measure of OCO is regret:

$$\sum_{t=1}^T f_t(x_t) - \min_{x \in X} \sum_{t=1}^T f_t(x) \tag{1}$$

which is the cumulative loss of the learner minus that of the best constant point chosen in hindsight. The notion of regret has been extensively studied, and there exist plenty of algorithms and theories for minimizing regret [Shalev-Shwartz et al., 2007, Hazan et al., 2007, Srebro et al., 2010, Duchi et al., 2011, Shalev-Shwartz, 2011, Zhang et al., 2013]. However, when the environment is changing, the traditional regret is no longer a suitable measure, since it compares the learner against a static point. To address this limitation, recent advances in online learning have introduced an enhanced measure, the dynamic regret, which has received considerable research interest over the years [Hall and Willett, 2013, Jadbabaie et al., 2015, Mokhtari et al., 2016, Yang et al., 2016, Zhang et al., 2017].

In the literature, there are two different forms of dynamic regret. The general one is introduced by Zinkevich [2003], who proposes to compare the cumulative loss of the learner against any sequence of comparators

$$R(u_1, \dots, u_T) = \sum_{t=1}^T f_t(x_t) - \sum_{t=1}^T f_t(u_t) \tag{2}$$

where $u_1, \dots, u_T \in X$. Instead of following the definition in (2), most existing studies on dynamic regret consider a restricted form, in which the sequence of comparators consists of local minimizers of the online functions [Besbes et al., 2015], i.e.,

$$R(x_1^*, \dots, x_T^*) = \sum_{t=1}^T f_t(x_t) - \sum_{t=1}^T f_t(x_t^*) = \sum_{t=1}^T f_t(x_t) - \sum_{t=1}^T \min_{x \in X} f_t(x) \tag{3}$$

where $x_t^* \in \operatorname{argmin}_{x \in X} f_t(x)$ is a minimizer of $f_t(\cdot)$ over the domain $X$. Note that although $R(u_1, \dots, u_T) \le R(x_1^*, \dots, x_T^*)$, this does not mean the notion of $R(x_1^*, \dots, x_T^*)$ is stronger, because an upper bound for $R(x_1^*, \dots, x_T^*)$ could be very loose as a bound for $R(u_1, \dots, u_T)$.

The general dynamic regret in (2) includes the static regret in (1) and the restricted dynamic regret in (3) as special cases. Thus, minimizing the general dynamic regret can automatically adapt to the nature of the environment, stationary or dynamic. In contrast, the restricted dynamic regret is too pessimistic, and unsuitable for stationary problems. For example, it is meaningless in statistical machine learning, where the $f_t$'s are sampled independently from the same distribution: due to the random perturbation caused by sampling, the local minimizers could differ significantly from the global minimizer of the expected loss, and minimizing (3) will lead to overfitting. Because of its flexibility, we focus on the general dynamic regret in this paper.
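To make the three notions concrete, here is a toy sketch (ours, not from the paper) that evaluates (1), (2), and (3) for a learner that always plays $x_t = 0$ against drifting one-dimensional quadratics; the losses, comparators, and learner are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 100
centers = np.clip(np.cumsum(rng.normal(0, 0.05, T)), -1.0, 1.0)  # drifting minima
plays = np.zeros(T)  # a lazy learner that always plays x_t = 0

def f(x, t):  # f_t(x) = (x - c_t)^2 over X = [-1, 1]
    return (x - centers[t]) ** 2

learner_loss = sum(f(plays[t], t) for t in range(T))

# (1) static regret: subtract the loss of the best fixed point in hindsight
grid = np.linspace(-1, 1, 2001)
static_regret = learner_loss - min(sum(f(x, t) for t in range(T)) for x in grid)

# (3) restricted dynamic regret: compare against the per-round minimizers x_t^*
restricted = learner_loss - sum(f(centers[t], t) for t in range(T))

# (2) general dynamic regret is defined for ANY comparator sequence u_1..u_T,
# e.g. a coarsely quantized version of the minimizers with a smaller path-length
u = np.round(centers, 1)
general = learner_loss - sum(f(u[t], t) for t in range(T))
P_T = np.abs(np.diff(u)).sum()  # path-length of the comparators, cf. eq. (5)
print(static_regret, restricted, general, P_T)
```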
Bounding the general dynamic regret is very challenging, because we need to establish a universal guarantee that holds for any sequence of comparators. By comparison, when bounding the restricted dynamic regret, we only need to focus on the local minimizers. To date, we have very limited knowledge of the general dynamic regret. One result is given by Zinkevich [2003], who demonstrates that online gradient descent (OGD) achieves the following dynamic regret bound

$$R(u_1, \dots, u_T) = O\big(\sqrt{T}(1 + P_T)\big) \tag{4}$$

where $P_T$, defined in (5), is the path-length of $u_1, \dots, u_T$. However, the linear dependence on $P_T$ in (4) is too loose, and there is a large gap between this upper bound and the $\Omega(\sqrt{T(1 + P_T)})$ lower bound established in our paper.
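To see how large the gap can be, consider a comparator sequence with path-length $P_T = \Theta(\sqrt{T})$ (our arithmetic, not an example from the paper):

$$\sqrt{T}\,(1 + P_T) = \Theta(T) \qquad \text{versus} \qquad \sqrt{T(1 + P_T)} = \Theta(T^{3/4}),$$

so the upper bound in (4) is already vacuous (linear in $T$), while the lower bound still leaves room for a sublinear guarantee.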
To address this limitation, we propose a novel online method, namely adaptive learning for dynamic environment (Ader), which attains an $O(\sqrt{T(1 + P_T)})$ dynamic regret. Ader follows the framework of learning with expert advice [Cesa-Bianchi and Lugosi, 2006], and is inspired by the strategy of maintaining multiple learning rates in MetaGrad [van Erven and Koolen, 2016]. The basic idea is to run multiple OGD algorithms in parallel, each with a different step size that is optimal for a specific path-length, and to combine them with an expert-tracking algorithm. While the basic version of Ader needs to query the gradient $O(\log T)$ times in each round, we develop an improved version based on the surrogate loss, which reduces the number of gradient evaluations per round to 1. Finally, we provide extensions of Ader to the case where a sequence of dynamical models is given, and obtain tighter bounds when the comparator sequence follows the dynamical models closely. The contributions of this paper are summarized below.

- We establish the first lower bound for the general dynamic regret in (2), namely $\Omega(\sqrt{T(1 + P_T)})$.
- We develop a series of novel methods for minimizing the general dynamic regret, and prove an optimal $O(\sqrt{T(1 + P_T)})$ upper bound.
- Compared to existing work for the restricted dynamic regret in (3), our result is universal in the sense that the regret bound holds for any sequence of comparators. Our result is also adaptive, because the upper bound depends on the path-length of the comparator sequence, so it automatically becomes small when the comparators change slowly.

## 2 Related Work

In this section, we provide a brief review of related work in online convex optimization.

### 2.1 Static Regret

In the static setting, online gradient descent (OGD) achieves an $O(\sqrt{T})$ regret bound for general convex functions. If the online functions have additional curvature properties, then faster rates are attainable. For strongly convex functions, the regret bound of OGD becomes $O(\log T)$ [Shalev-Shwartz et al., 2007]. The $O(\sqrt{T})$ and $O(\log T)$ regret bounds, for convex and strongly convex functions respectively, are known to be minimax optimal [Abernethy et al., 2008]. For exponentially concave functions, Online Newton Step (ONS) enjoys an $O(d \log T)$ regret, where $d$ is the dimensionality [Hazan et al., 2007]. When the online functions are both smooth and convex, the regret bound can also be improved if the cumulative loss of the optimal prediction is small [Srebro et al., 2010].

### 2.2 Dynamic Regret

To the best of our knowledge, there are only two studies that investigate the general dynamic regret [Zinkevich, 2003, Hall and Willett, 2013]. While it is impossible to achieve a sublinear dynamic regret in general, we can bound the dynamic regret in terms of certain regularities of the comparator sequence or the function sequence. Zinkevich [2003] introduces the path-length

$$P_T(u_1, \dots, u_T) = \sum_{t=2}^T \|u_t - u_{t-1}\|_2 \tag{5}$$

and provides the upper bound for OGD in (4). In a subsequent work, Hall and Willett [2013] propose a variant of the path-length

$$P_T'(u_1, \dots, u_T) = \sum_{t=1}^T \|u_{t+1} - \Phi_t(u_t)\|_2 \tag{6}$$

in which a sequence of dynamical models $\Phi_t(\cdot) : X \mapsto X$ is incorporated. Then, they develop a new method, dynamic mirror descent, which achieves an $O(\sqrt{T}(1 + P_T'))$ dynamic regret. When the comparator sequence follows the dynamical models closely, $P_T'$ could be much smaller than $P_T$, and thus the upper bound of Hall and Willett [2013] could be tighter than that of Zinkevich [2003].

For the restricted dynamic regret, a powerful baseline, which simply plays the minimizer of the previous round, i.e., $x_{t+1} = \operatorname{argmin}_{x \in X} f_t(x)$, attains an $O(P_T^*)$ dynamic regret [Yang et al., 2016], where

$$P_T^* := P_T(x_1^*, \dots, x_T^*) = \sum_{t=2}^T \|x_t^* - x_{t-1}^*\|_2.$$

OGD also achieves the $O(P_T^*)$ dynamic regret when the online functions are strongly convex and smooth [Mokhtari et al., 2016], or when they are convex and smooth and all the minimizers lie in the interior of $X$ [Yang et al., 2016]. Another regularity of the comparator sequence is the squared path-length

$$S_T^* := S_T(x_1^*, \dots, x_T^*) = \sum_{t=2}^T \|x_t^* - x_{t-1}^*\|_2^2$$

which could be smaller than the path-length $P_T^*$ when the local minimizers move slowly. Zhang et al. [2017] propose online multiple gradient descent, and establish an $O(\min(P_T^*, S_T^*))$ regret bound for (semi-)strongly convex and smooth functions.

In a recent work, Besbes et al. [2015] introduce the functional variation

$$F_T := F(f_1, \dots, f_T) = \sum_{t=2}^T \max_{x \in X} |f_t(x) - f_{t-1}(x)|$$

to measure the complexity of the function sequence. Under the assumption that an upper bound $V_T \ge F_T$ is known beforehand, Besbes et al. [2015] develop a restarted online gradient descent, and prove that its dynamic regret is upper bounded by $O(T^{2/3}(V_T + 1)^{1/3})$ and $O(\log T \sqrt{T(V_T + 1)})$ for convex functions and strongly convex functions, respectively. One limitation of this work is that the bounds are not adaptive, because they depend on the upper bound $V_T$; even when the actual functional variation $F_T$ is small, the regret bounds do not become better. One regularity that involves the gradients of the functions is

$$D_T = \sum_{t=1}^T \|\nabla f_t(x_t) - m_t\|_2^2$$

where $m_1, \dots, m_T$ is a predictable sequence computable by the learner [Chiang et al., 2012, Rakhlin and Sridharan, 2013]. From the above discussions, we observe that there are different types of regularities; a small sketch computing two of them appears below.
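For concreteness, here is a small sketch (ours, not from the paper) that computes the path-length in (5) and its squared variant for an arbitrary comparator sequence:

```python
import numpy as np

def path_length(U):
    """P_T = sum_{t=2}^T ||u_t - u_{t-1}||_2, eq. (5); rows of U are u_t."""
    return np.linalg.norm(np.diff(U, axis=0), axis=1).sum()

def squared_path_length(U):
    """S_T = sum_{t=2}^T ||u_t - u_{t-1}||_2^2 (smaller when steps are short)."""
    return (np.linalg.norm(np.diff(U, axis=0), axis=1) ** 2).sum()

# Toy comparator sequence: four points in R^2
U = np.array([[0.0, 0.0], [0.1, 0.0], [0.1, 0.2], [0.3, 0.2]])
print(path_length(U), squared_path_length(U))  # 0.5 and 0.09
```

Since every step is at most the diameter $D$, the squared path-length never exceeds $D$ times the path-length.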
As shown by Jadbabaie et al. [2015], these regularities reflect distinct aspects of the online problem, and are not comparable in general. To take advantage of the smaller regularity, Jadbabaie et al. [2015] develop an adaptive method whose dynamic regret is on the order of

$$\sqrt{D_T + 1} + \min\Big\{\sqrt{(D_T + 1)P_T^*},\ (D_T + 1)^{1/3} T^{1/3} F_T^{1/3}\Big\}.$$

However, it relies on the assumption that the learner can calculate each regularity online.

### 2.3 Adaptive Regret

Another way to deal with changing environments is to minimize the adaptive regret, which is defined as the maximum static regret over any contiguous time interval [Hazan and Seshadhri, 2007]. For convex functions and exponentially concave functions, Hazan and Seshadhri [2007] have developed efficient algorithms that achieve $O(\sqrt{T \log^3 T})$ and $O(d \log^2 T)$ adaptive regrets, respectively. Later, the adaptive regret of convex functions was improved [Daniely et al., 2015, Jun et al., 2017]. The relation between the adaptive regret and the restricted dynamic regret is investigated by Zhang et al. [2018b].

## 3 Our Methods

We first state assumptions about the online problem, then provide our motivations, including a lower bound on the general dynamic regret, and finally present the proposed methods as well as their theoretical guarantees. All the proofs can be found in the full paper [Zhang et al., 2018a].

### 3.1 Assumptions

Similar to previous studies in online learning, we introduce the following common assumptions.

**Assumption 1** On the domain $X$, the values of all functions belong to the range $[a, a + c]$, i.e.,

$$a \le f_t(x) \le a + c, \quad \forall x \in X, \ t \in [T].$$

**Assumption 2** The gradients of all functions are bounded by $G$, i.e.,

$$\max_{x \in X} \|\nabla f_t(x)\|_2 \le G, \quad \forall t \in [T]. \tag{7}$$

**Assumption 3** The domain $X$ contains the origin $\mathbf{0}$, and its diameter is bounded by $D$, i.e.,

$$\max_{x, x' \in X} \|x - x'\|_2 \le D. \tag{8}$$

Note that Assumptions 2 and 3 imply Assumption 1 with any $c \ge GD$. In the following, we assume the values of $G$ and $D$ are known to the learner.

### 3.2 Motivations

According to Theorem 2 of Zinkevich [2003], we have the following dynamic regret bound for online gradient descent (OGD) with a constant step size.

**Theorem 1** Consider online gradient descent (OGD) with $x_1 \in X$ and

$$x_{t+1} = \Pi_X\big(x_t - \eta \nabla f_t(x_t)\big), \quad t \ge 1$$

where $\Pi_X[\cdot]$ denotes the projection onto the nearest point in $X$. Under Assumptions 2 and 3, OGD satisfies

$$\sum_{t=1}^T f_t(x_t) - \sum_{t=1}^T f_t(u_t) \le \frac{7D^2}{4\eta} + \frac{D}{\eta}\sum_{t=2}^T \|u_{t-1} - u_t\|_2 + \frac{\eta T G^2}{2}$$

for any comparator sequence $u_1, \dots, u_T \in X$.

Thus, by choosing $\eta = O(1/\sqrt{T})$, OGD achieves an $O(\sqrt{T}(1 + P_T))$ dynamic regret, which is universal. However, this upper bound is far from the $\Omega(\sqrt{T(1 + P_T)})$ lower bound indicated by the theorem below.

**Theorem 2** For any online algorithm and any $\tau \in [0, TD]$, there exists a sequence of comparators $u_1, \dots, u_T$ satisfying Assumption 3 and a sequence of functions $f_1, \dots, f_T$ satisfying Assumption 2, such that $P_T(u_1, \dots, u_T) \le \tau$ and

$$R(u_1, \dots, u_T) = \Omega\big(G\sqrt{T(D^2 + D\tau)}\big).$$

Although there exist lower bounds for the restricted dynamic regret [Besbes et al., 2015, Yang et al., 2016], to the best of our knowledge, this is the first lower bound for the general dynamic regret.

Let's drop the universal property for the moment, and suppose we only want to compete against a specific sequence $u_1, \dots, u_T \in X$ whose path-length $P_T = \sum_{t=2}^T \|u_t - u_{t-1}\|_2$ is known beforehand. In this simple setting, we can tune the step size optimally as $\eta = O(\sqrt{(1 + P_T)/T})$ and obtain an improved $O(\sqrt{T(1 + P_T)})$ dynamic regret bound, which matches the lower bound in Theorem 2.
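For reference, here is a minimal sketch (ours, not the authors' code) of the projected OGD update in Theorem 1; it assumes $X$ is a Euclidean ball so that the projection $\Pi_X$ has a closed form, and `grad(t, x)` is a caller-supplied placeholder returning $\nabla f_t(x)$.

```python
import numpy as np

def project_ball(x, radius):
    """Euclidean projection onto X = {x : ||x||_2 <= radius}."""
    n = np.linalg.norm(x)
    return x if n <= radius else x * (radius / n)

def ogd(grad, T, d, radius, eta):
    """Projected OGD with a constant step size, as in Theorem 1.

    With eta = O(1/sqrt(T)) the dynamic regret is O(sqrt(T)(1 + P_T));
    the improved tuning eta = O(sqrt((1 + P_T)/T)) requires knowing the
    path-length P_T in advance.
    """
    x = np.zeros(d)
    iterates = []
    for t in range(T):
        iterates.append(x)
        x = project_ball(x - eta * grad(t, x), radius)
    return iterates

# Example: track drifting quadratics f_t(x) = ||x - c_t||^2 / 2
T, d = 1000, 2
cs = np.cumsum(np.random.default_rng(1).normal(0, 0.01, (T, d)), axis=0)
xs = ogd(lambda t, x: x - cs[t], T, d, radius=1.0, eta=1.0 / np.sqrt(T))
```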
Thus, when bounding the general dynamic regret, we face the following challenge: on one hand, we want the regret bound to hold for any sequence of comparators; on the other hand, to get a tighter bound, we need to tune the step size for a specific path-length. In the next section, we address this dilemma by running multiple OGD algorithms with different step sizes, and combining them through a meta-algorithm.

### 3.3 The Basic Approach

Our proposed method, named adaptive learning for dynamic environment (Ader), is inspired by MetaGrad, a recent method for online learning with multiple types of functions [van Erven and Koolen, 2016]. Ader maintains a set of experts, each attaining an optimal dynamic regret for a different path-length, and chooses the best one using an expert-tracking algorithm.

**Meta-algorithm** Tracking the best expert is a well-studied problem [Herbster and Warmuth, 1998], and our meta-algorithm, summarized in Algorithm 1, is built upon the exponentially weighted average forecaster [Cesa-Bianchi and Lugosi, 2006]. The inputs of the meta-algorithm are its own step size $\alpha$, and a set $H$ of step sizes for the experts. In Step 1, we activate a set of experts $\{E^\eta \mid \eta \in H\}$ by invoking the expert-algorithm for each $\eta \in H$. In Step 2, we set the initial weight of each expert. Let $\eta_i$ be the $i$-th smallest step size in $H$. The weight of $E^{\eta_i}$ is chosen as

$$w_1^{\eta_i} = \frac{C}{i(i+1)}, \quad \text{where } C = 1 + \frac{1}{N}.$$

In each round, the meta-algorithm receives a set of predictions $\{x_t^\eta \mid \eta \in H\}$ from all experts (Step 4), and outputs the weighted average (Step 5)

$$x_t = \sum_{\eta \in H} w_t^\eta x_t^\eta$$

where $w_t^\eta$ is the weight assigned to expert $E^\eta$. After observing the loss function, the weights of the experts are updated according to the exponential weighting scheme (Step 7)

$$w_{t+1}^\eta = \frac{w_t^\eta e^{-\alpha f_t(x_t^\eta)}}{\sum_{\mu \in H} w_t^\mu e^{-\alpha f_t(x_t^\mu)}}.$$

In the last step, we send the gradient $\nabla f_t(x_t^\eta)$ to each expert $E^\eta$ so that they can update their own predictions.

**Expert-algorithm** Experts are themselves algorithms, and our expert-algorithm, presented in Algorithm 2, is the standard online gradient descent (OGD). Each expert is an instance of OGD, and takes the step size $\eta$ as its input. In Step 3 of Algorithm 2, each expert submits its prediction $x_t^\eta$ to the meta-algorithm, and receives the gradient $\nabla f_t(x_t^\eta)$ in Step 4. Then, in Step 5 it performs gradient descent

$$x_{t+1}^\eta = \Pi_X\big(x_t^\eta - \eta \nabla f_t(x_t^\eta)\big)$$

to get the prediction for the next round.

**Algorithm 1** Ader: Meta-algorithm
Require: a step size $\alpha$, and a set $H$ containing step sizes for experts
1: Activate a set of experts $\{E^\eta \mid \eta \in H\}$ by invoking Algorithm 2 for each step size $\eta \in H$
2: Sort the step sizes in ascending order $\eta_1 \le \eta_2 \le \cdots \le \eta_N$, and set $w_1^{\eta_i} = \frac{C}{i(i+1)}$
3: for $t = 1, \dots, T$ do
4: Receive $x_t^\eta$ from each expert $E^\eta$
5: Output $x_t = \sum_{\eta \in H} w_t^\eta x_t^\eta$
6: Observe the loss function $f_t(\cdot)$
7: Update the weight of each expert by $w_{t+1}^\eta = \frac{w_t^\eta e^{-\alpha f_t(x_t^\eta)}}{\sum_{\mu \in H} w_t^\mu e^{-\alpha f_t(x_t^\mu)}}$
8: Send the gradient $\nabla f_t(x_t^\eta)$ to each expert $E^\eta$
9: end for

**Algorithm 2** Ader: Expert-algorithm
Require: the step size $\eta$
1: Let $x_1^\eta$ be any point in $X$
2: for $t = 1, \dots, T$ do
3: Submit $x_t^\eta$ to the meta-algorithm
4: Receive the gradient $\nabla f_t(x_t^\eta)$ from the meta-algorithm
5: $x_{t+1}^\eta = \Pi_X\big(x_t^\eta - \eta \nabla f_t(x_t^\eta)\big)$
6: end for

Next, we specify the parameter setting and our dynamic regret. The set $H$ is constructed in such a way that for any possible sequence of comparators, there exists a step size that is nearly optimal. To control the size of $H$, we use a geometric series with ratio 2. The value of $\alpha$ is tuned such that the upper bound is minimized; a minimal implementation sketch of the resulting procedure follows.
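The sketch below is our illustration of basic Ader (Algorithms 1 and 2), not the authors' code; it again assumes $X$ is a Euclidean ball of diameter $D$, and the step-size grid follows the geometric construction with ratio 2 described above.

```python
import numpy as np

def project_ball(x, radius):
    n = np.linalg.norm(x)
    return x if n <= radius else x * (radius / n)

def ader(loss, grad, T, d, D, G, c):
    """loss(t, x) and grad(t, x) evaluate f_t and its gradient; returns x_1..x_T."""
    radius = D / 2.0                                   # X taken to be a ball of diameter D
    N = int(np.ceil(0.5 * np.log2(1 + 4 * T / 7))) + 1
    etas = [2.0 ** i * np.sqrt(7 * D**2 / (2 * T * G**2)) for i in range(N)]
    alpha = np.sqrt(8.0 / (T * c**2))
    C = 1.0 + 1.0 / N
    w = np.array([C / ((i + 1) * (i + 2)) for i in range(N)])  # Step 2: w_1^{eta_i}
    X = np.zeros((N, d))                               # one OGD expert per step size
    out = []
    for t in range(T):
        x = w @ X                                      # Step 5: weighted average
        out.append(x)
        f_vals = np.array([loss(t, X[i]) for i in range(N)])   # N value queries
        w = w * np.exp(-alpha * f_vals)                # Step 7: exponential weighting
        w = w / w.sum()
        for i in range(N):                             # Step 8 + Algorithm 2, Step 5
            X[i] = project_ball(X[i] - etas[i] * grad(t, X[i]), radius)
    return out
```

Note that each round issues $N$ value queries and $N$ gradient queries, one per expert; this $O(\log T)$ per-round cost is exactly what Section 3.4 removes.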
Specifically, we have the following theorem.

**Theorem 3** Set

$$H = \Big\{\eta_i = 2^{i-1}\sqrt{\tfrac{7D^2}{2TG^2}} \,\Big|\, i = 1, \dots, N\Big\} \tag{10}$$

where $N = \big\lceil \tfrac{1}{2}\log_2(1 + 4T/7) \big\rceil + 1$, and $\alpha = \sqrt{8/(Tc^2)}$ in Algorithm 1. Under Assumptions 1, 2 and 3, for any comparator sequence $u_1, \dots, u_T \in X$, our proposed Ader method satisfies

$$\sum_{t=1}^T f_t(x_t) - \sum_{t=1}^T f_t(u_t) \le \frac{3G}{4}\sqrt{2T(7D^2 + 4DP_T)} + \frac{c\sqrt{2T}}{4}\big[1 + 2\ln(k+1)\big]$$

where

$$k = \Big\lfloor \tfrac{1}{2}\log_2\big(1 + \tfrac{4P_T}{7D}\big) \Big\rfloor + 1. \tag{11}$$

The order of the upper bound matches the $\Omega(\sqrt{T(1 + P_T)})$ lower bound in Theorem 2 exactly.

### 3.4 An Improved Approach

The basic approach in Section 3.3 is simple, but it has an obvious limitation: from Steps 7 and 8 of Algorithm 1, we observe that the meta-algorithm needs to query the value and gradient of $f_t(\cdot)$ $N$ times in each round, where $N = O(\log T)$. In contrast, existing algorithms for minimizing static regret, such as OGD, only query the gradient once per iteration. When the function is complex, the evaluation of gradients or values could be expensive, so it is appealing to reduce the number of queries in each round.

**Surrogate Loss** We introduce a surrogate loss [van Erven and Koolen, 2016] to replace the original loss function. From the first-order condition of convexity [Boyd and Vandenberghe, 2004], we have

$$f_t(x) \ge f_t(x_t) + \langle \nabla f_t(x_t), x - x_t \rangle, \quad \forall x \in X.$$

Then, we define the surrogate loss in the $t$-th iteration as

$$\ell_t(x) = \langle \nabla f_t(x_t), x - x_t \rangle \tag{12}$$

and use it to update the predictions. Because

$$f_t(x_t) - f_t(u_t) \le \ell_t(x_t) - \ell_t(u_t), \tag{13}$$

we conclude that the regret w.r.t. the true losses $f_t$ is smaller than that w.r.t. the surrogate losses $\ell_t$. Thus, it is safe to replace $f_t$ with $\ell_t$. The new method, named improved Ader, is summarized in Algorithms 3 and 4.

**Meta-algorithm** The new meta-algorithm in Algorithm 3 differs from the old one in Algorithm 1 from Step 6 onward. The new algorithm queries the gradient of $f_t(\cdot)$ at $x_t$, and then constructs the surrogate loss $\ell_t(\cdot)$ in (12), which is used in the subsequent steps. In Step 8, the weights of the experts are updated based on $\ell_t(\cdot)$, i.e.,

$$w_{t+1}^\eta = \frac{w_t^\eta e^{-\alpha \ell_t(x_t^\eta)}}{\sum_{\mu \in H} w_t^\mu e^{-\alpha \ell_t(x_t^\mu)}}.$$

In Step 9, the gradient of $\ell_t(\cdot)$ at $x_t^\eta$ is sent to each expert $E^\eta$. Because the surrogate loss is linear,

$$\nabla \ell_t(x_t^\eta) = \nabla f_t(x_t), \quad \forall \eta \in H.$$

As a result, we only need to send the same $\nabla f_t(x_t)$ to all experts. From the above descriptions, it is clear that the new algorithm only queries the gradient once in each iteration.

**Expert-algorithm** The new expert-algorithm in Algorithm 4 is almost the same as the previous one in Algorithm 2. The only difference is that in Step 4, the expert receives the gradient $\nabla f_t(x_t)$, and uses it to perform gradient descent

$$x_{t+1}^\eta = \Pi_X\big(x_t^\eta - \eta \nabla f_t(x_t)\big).$$

We have the following theorem on the dynamic regret of the improved Ader.

**Theorem 4** Use the construction of $H$ in (10), and set $\alpha = \sqrt{2/(TG^2D^2)}$ in Algorithm 3. Under Assumptions 2 and 3, for any comparator sequence $u_1, \dots, u_T \in X$, our improved Ader method satisfies

$$\sum_{t=1}^T f_t(x_t) - \sum_{t=1}^T f_t(u_t) \le \frac{3G}{4}\sqrt{2T(7D^2 + 4DP_T)} + \frac{GD\sqrt{2T}}{2}\big[1 + 2\ln(k+1)\big]$$

where $k$ is defined in (11).

Similar to the basic approach, the improved Ader also achieves an $O(\sqrt{T(1 + P_T)})$ dynamic regret, which is universal and adaptive. The main advantage is that the improved Ader only needs to query the gradient of the online function once in each iteration.

**Algorithm 3** Improved Ader: Meta-algorithm
Require: a step size $\alpha$, and a set $H$ containing step sizes for experts
1: Activate a set of experts $\{E^\eta \mid \eta \in H\}$ by invoking Algorithm 4 for each step size $\eta \in H$
2: Sort the step sizes in ascending order $\eta_1 \le \eta_2 \le \cdots \le \eta_N$, and set $w_1^{\eta_i} = \frac{C}{i(i+1)}$
3: for $t = 1, \dots, T$ do
4: Receive $x_t^\eta$ from each expert $E^\eta$
5: Output $x_t = \sum_{\eta \in H} w_t^\eta x_t^\eta$
6: Query the gradient of $f_t(\cdot)$ at $x_t$
7: Construct the surrogate loss $\ell_t(\cdot)$ in (12)
8: Update the weight of each expert by $w_{t+1}^\eta = \frac{w_t^\eta e^{-\alpha \ell_t(x_t^\eta)}}{\sum_{\mu \in H} w_t^\mu e^{-\alpha \ell_t(x_t^\mu)}}$
9: Send the gradient $\nabla f_t(x_t)$ to each expert $E^\eta$
10: end for

**Algorithm 4** Improved Ader: Expert-algorithm
Require: the step size $\eta$
1: Let $x_1^\eta$ be any point in $X$
2: for $t = 1, \dots, T$ do
3: Submit $x_t^\eta$ to the meta-algorithm
4: Receive the gradient $\nabla f_t(x_t)$ from the meta-algorithm
5: $x_{t+1}^\eta = \Pi_X\big(x_t^\eta - \eta \nabla f_t(x_t)\big)$
6: end for
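A matching sketch of the surrogate-loss variant (Algorithms 3 and 4), again our illustration under the same ball-domain assumption; note the single gradient query per round, reused by the weight update and by every expert.

```python
import numpy as np

def project_ball(x, radius):
    n = np.linalg.norm(x)
    return x if n <= radius else x * (radius / n)

def improved_ader(grad, T, d, D, G):
    """Like ader() above, but driven by the surrogate losses in (12)."""
    radius = D / 2.0
    N = int(np.ceil(0.5 * np.log2(1 + 4 * T / 7))) + 1
    etas = [2.0 ** i * np.sqrt(7 * D**2 / (2 * T * G**2)) for i in range(N)]
    alpha = np.sqrt(2.0 / (T * G**2 * D**2))
    C = 1.0 + 1.0 / N
    w = np.array([C / ((i + 1) * (i + 2)) for i in range(N)])
    X = np.zeros((N, d))
    out = []
    for t in range(T):
        x = w @ X                         # Step 5: weighted average
        out.append(x)
        g = grad(t, x)                    # Step 6: the ONLY gradient query this round
        ell = X @ g - x @ g               # ell_t(x^eta) = <g, x^eta - x_t>, eq. (12)
        w = w * np.exp(-alpha * ell)      # Step 8: update with surrogate losses
        w = w / w.sum()
        for i in range(N):                # Step 9 + Algorithm 4: same g for all experts
            X[i] = project_ball(X[i] - etas[i] * g, radius)
    return out
```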
### 3.5 Extensions

Following Hall and Willett [2013], we consider the case where the learner is given a sequence of dynamical models $\Phi_t(\cdot) : X \mapsto X$, which can be used to characterize the comparators we are interested in. Similar to Hall and Willett [2013], we assume each $\Phi_t(\cdot)$ is a contraction mapping.

**Assumption 4** All the dynamical models are contraction mappings, i.e.,

$$\|\Phi_t(x) - \Phi_t(x')\|_2 \le \|x - x'\|_2 \tag{14}$$

for all $t \in [T]$ and $x, x' \in X$.

Then, we choose $P_T'$ in (6) as the regularity of a comparator sequence, which measures how much it deviates from the given dynamics.

**Algorithms** For brevity, we only discuss how to incorporate the dynamical models into the basic Ader of Section 3.3; the extension to the improved version can be done in the same way. In fact, we only need to modify the expert-algorithm, and the updated one is provided in Algorithm 5. To utilize the dynamical model, after performing gradient descent in Step 5, i.e.,

$$\tilde{x}_{t+1}^\eta = \Pi_X\big(x_t^\eta - \eta \nabla f_t(x_t^\eta)\big),$$

we apply the dynamical model to the intermediate solution $\tilde{x}_{t+1}^\eta$, i.e., $x_{t+1}^\eta = \Phi_t(\tilde{x}_{t+1}^\eta)$, and obtain the prediction for the next round. In the meta-algorithm (Algorithm 1), we only need to replace Algorithm 2 in Step 1 with Algorithm 5, and the rest remains the same. The dynamic regret of the new algorithm is given below.

**Algorithm 5** Ader: Expert-algorithm with dynamical models
Require: the step size $\eta$, and a sequence of dynamical models $\Phi_t(\cdot)$
1: Let $x_1^\eta$ be any point in $X$
2: for $t = 1, \dots, T$ do
3: Submit $x_t^\eta$ to the meta-algorithm
4: Receive the gradient $\nabla f_t(x_t^\eta)$ from the meta-algorithm
5: $\tilde{x}_{t+1}^\eta = \Pi_X\big(x_t^\eta - \eta \nabla f_t(x_t^\eta)\big)$
6: $x_{t+1}^\eta = \Phi_t(\tilde{x}_{t+1}^\eta)$
7: end for

**Theorem 5** Set

$$H = \Big\{\eta_i = 2^{i-1}\sqrt{\tfrac{D^2}{2TG^2}} \,\Big|\, i = 1, \dots, N\Big\}$$

where $N = \big\lceil \tfrac{1}{2}\log_2(1 + 2T) \big\rceil + 1$, set $\alpha = \sqrt{8/(Tc^2)}$, and use Algorithm 5 as the expert-algorithm in Algorithm 1. Under Assumptions 1, 2, 3 and 4, for any comparator sequence $u_1, \dots, u_T \in X$, our proposed Ader method satisfies

$$\sum_{t=1}^T f_t(x_t) - \sum_{t=1}^T f_t(u_t) \le \frac{3G}{2}\sqrt{T(D^2 + 2DP_T')} + \frac{c\sqrt{2T}}{4}\big[1 + 2\ln(k+1)\big] = O\big(\sqrt{T(1 + P_T')}\big)$$

where $k$ is defined analogously to (11).

Theorem 5 indicates that our method achieves an $O(\sqrt{T(1 + P_T')})$ dynamic regret, improving significantly upon the $O(\sqrt{T}(1 + P_T'))$ dynamic regret of Hall and Willett [2013]. Note that when each $\Phi_t(\cdot)$ is the identity map, we recover the result in Theorem 3. Thus, the upper bound in Theorem 5 is also optimal.
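As a rough illustration of the modified expert update (ours, not the authors' code), the sketch below performs the gradient step, the projection, and then applies a caller-supplied contraction $\Phi_t$; the linear shrinkage used as the example $\Phi$ is made up, but it satisfies Assumption 4.

```python
import numpy as np

def project_ball(x, radius):
    n = np.linalg.norm(x)
    return x if n <= radius else x * (radius / n)

def expert_step_with_dynamics(x, g, eta, radius, Phi, t):
    """One round of Algorithm 5: OGD step, projection, then the dynamical model."""
    x_tilde = project_ball(x - eta * g, radius)  # Step 5: intermediate solution
    return Phi(t, x_tilde)                       # Step 6: x_{t+1} = Phi_t(x_tilde)

# Example contraction (Assumption 4): pull 10% of the way toward a fixed point in X.
anchor = np.array([0.2, -0.1])
Phi = lambda t, x: 0.9 * x + 0.1 * anchor
x_next = expert_step_with_dynamics(np.zeros(2), np.array([0.3, 0.0]),
                                   eta=0.1, radius=1.0, Phi=Phi, t=0)
```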
## 4 Conclusion and Future Work

In this paper, we study the general form of the dynamic regret, which compares the cumulative loss of the online learner against an arbitrary sequence of comparators. To this end, we develop a novel method, named adaptive learning for dynamic environment (Ader). Theoretical analysis shows that Ader achieves an optimal $O(\sqrt{T(1 + P_T)})$ dynamic regret. When a sequence of dynamical models is available, we extend Ader to incorporate this additional information, and obtain an $O(\sqrt{T(1 + P_T')})$ dynamic regret.

In the future, we will investigate whether the curvature of functions, such as strong convexity and smoothness, can be utilized to improve the dynamic regret bound. We note that in the setting of the restricted dynamic regret, the curvature of functions indeed makes the upper bound tighter [Mokhtari et al., 2016, Zhang et al., 2017]. But whether it improves the general dynamic regret remains an open problem.

## Acknowledgments

This work was partially supported by the National Key R&D Program of China (2018YFB1004300), NSFC (61603177), Jiangsu SF (BK20160658), YESS (2017QNRC001), and Microsoft Research Asia.

## References

J. Abernethy, P. L. Bartlett, A. Rakhlin, and A. Tewari. Optimal strategies and minimax lower bounds for online convex games. In Proceedings of the 21st Annual Conference on Learning Theory, pages 415-423, 2008.

O. Besbes, Y. Gur, and A. Zeevi. Non-stationary stochastic optimization. Operations Research, 63(5):1227-1244, 2015.

S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

C.-K. Chiang, T. Yang, C.-J. Lee, M. Mahdavi, C.-J. Lu, R. Jin, and S. Zhu. Online optimization with gradual variations. In Proceedings of the 25th Annual Conference on Learning Theory, 2012.

A. Daniely, A. Gonen, and S. Shalev-Shwartz. Strongly adaptive online learning. In Proceedings of the 32nd International Conference on Machine Learning, pages 1405-1411, 2015.

J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121-2159, 2011.

E. C. Hall and R. M. Willett. Dynamical models and tracking regret in online convex programming. In Proceedings of the 30th International Conference on Machine Learning, pages 579-587, 2013.

E. Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157-325, 2016.

E. Hazan and C. Seshadhri. Adaptive algorithms for online decision problems. Electronic Colloquium on Computational Complexity, 88, 2007.

E. Hazan, A. Agarwal, and S. Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169-192, 2007.

M. Herbster and M. K. Warmuth. Tracking the best expert. Machine Learning, 32(2):151-178, 1998.

A. Jadbabaie, A. Rakhlin, S. Shahrampour, and K. Sridharan. Online optimization: Competing with dynamic comparators. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, pages 398-406, 2015.

K.-S. Jun, F. Orabona, S. Wright, and R. Willett. Improved strongly adaptive online learning using coin betting. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, pages 943-951, 2017.

A. Mokhtari, S. Shahrampour, A. Jadbabaie, and A. Ribeiro. Online optimization in dynamic environments: Improved regret rates for strongly convex problems. In IEEE 55th Conference on Decision and Control, pages 7195-7201, 2016.

A. Rakhlin and K. Sridharan. Online learning with predictable sequences. In Proceedings of the 26th Conference on Learning Theory, pages 993-1019, 2013.

S. Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107-194, 2011.

S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: primal estimated sub-gradient solver for SVM. In Proceedings of the 24th International Conference on Machine Learning, pages 807-814, 2007.

N. Srebro, K. Sridharan, and A. Tewari. Smoothness, low-noise and fast rates. In Advances in Neural Information Processing Systems 23, pages 2199-2207, 2010.
T. van Erven and W. M. Koolen. MetaGrad: Multiple learning rates in online learning. In Advances in Neural Information Processing Systems 29, pages 3666-3674, 2016.

T. Yang, L. Zhang, R. Jin, and J. Yi. Tracking slowly moving clairvoyant: Optimal dynamic regret of online learning with true and noisy gradient. In Proceedings of the 33rd International Conference on Machine Learning, pages 449-457, 2016.

L. Zhang, J. Yi, R. Jin, M. Lin, and X. He. Online kernel learning with a near optimal sparsity bound. In Proceedings of the 30th International Conference on Machine Learning, 2013.

L. Zhang, T. Yang, J. Yi, R. Jin, and Z.-H. Zhou. Improved dynamic regret for non-degenerate functions. In Advances in Neural Information Processing Systems 30, pages 732-741, 2017.

L. Zhang, S. Lu, and Z.-H. Zhou. Adaptive online learning in dynamic environments. arXiv e-prints, arXiv:1810.10815, 2018a.

L. Zhang, T. Yang, R. Jin, and Z.-H. Zhou. Dynamic regret of strongly adaptive methods. In Proceedings of the 35th International Conference on Machine Learning, 2018b.

M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning, pages 928-936, 2003.