# Dynamic Regret of Convex and Smooth Functions

Peng Zhao, Yu-Jie Zhang, Lijun Zhang, Zhi-Hua Zhou
National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China
{zhaop, zhangyj, zhanglj, zhouzh}@lamda.nju.edu.cn

We investigate online convex optimization in non-stationary environments and choose the dynamic regret as the performance measure, defined as the difference between the cumulative loss incurred by the online algorithm and that of any feasible comparator sequence. Let $T$ be the time horizon and $P_T$ be the path-length that essentially reflects the non-stationarity of environments; the state-of-the-art dynamic regret is $O(\sqrt{T(1 + P_T)})$. Although this bound is proved to be minimax optimal for convex functions, in this paper we demonstrate that it is possible to further enhance the dynamic regret by exploiting the smoothness condition. Specifically, we propose novel online algorithms that are capable of leveraging smoothness and replace the dependence on $T$ in the dynamic regret by problem-dependent quantities: the variation in gradients of loss functions, the cumulative loss of the comparator sequence, and the minimum of the previous two terms. These quantities are at most $O(T)$ while could be much smaller in benign environments. Therefore, our results are adaptive to the intrinsic difficulty of the problem, since the bounds are tighter than existing results for easy problems and meanwhile guarantee the same rate in the worst case.

1 Introduction

In many real-world applications, data are inherently accumulated over time, and thus it is of great importance to develop a learning system that updates in an online fashion. Online Convex Optimization (OCO) is a powerful paradigm for learning in such a circumstance, which can be regarded as an iterative game between a player and an adversary. At iteration $t$, the player selects a decision $x_t$ from a convex set $\mathcal{X}$ and the adversary reveals a convex function $f_t : \mathcal{X} \to \mathbb{R}$. The player subsequently suffers an instantaneous loss $f_t(x_t)$. The performance measure is the (static) regret [1],
\[
\text{S-Regret}_T = \sum_{t=1}^{T} f_t(x_t) - \min_{x \in \mathcal{X}} \sum_{t=1}^{T} f_t(x), \tag{1}
\]
which is the difference between the cumulative loss incurred by the online algorithm and that of the best decision in hindsight. The rationale behind such a metric is that the best fixed decision in hindsight is reasonably good over all the iterations. However, this is too optimistic and may not hold in changing environments, where data are evolving and the optimal decision is drifting over time. To address this limitation, dynamic regret is proposed to compete with changing comparators $u_1, \ldots, u_T \in \mathcal{X}$,
\[
\text{D-Regret}_T(u_1, \ldots, u_T) = \sum_{t=1}^{T} f_t(x_t) - \sum_{t=1}^{T} f_t(u_t), \tag{2}
\]
which has drawn considerable attention recently [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]. The measure is also called the universal dynamic regret (or general dynamic regret), in the sense that it gives a universal guarantee that holds against any comparator sequence. Note that the static regret (1) can be viewed as its special form by setting the comparators to the fixed best decision in hindsight. Moreover, a variant that appears frequently in the literature is the worst-case dynamic regret, defined as
\[
\text{D-Regret}_T(x_1^*, \ldots, x_T^*) = \sum_{t=1}^{T} f_t(x_t) - \sum_{t=1}^{T} f_t(x_t^*), \tag{3}
\]
which specializes the general form (2) by setting $u_t = x_t^* \in \operatorname*{arg\,min}_{x \in \mathcal{X}} f_t(x)$.
However, the worst-case dynamic regret is often too pessimistic, whereas the universal one is more adaptive to non-stationary environments. We refer the readers to [8] for more detailed explanations.

There are many studies on the worst-case dynamic regret [2, 3, 4, 5, 6, 7, 10, 13], but only few results are known for the universal dynamic regret. Zinkevich [1] shows that online gradient descent (OGD) achieves an $O(\sqrt{T}(1 + P_T))$ universal dynamic regret, where $P_T = \sum_{t=2}^{T} \|u_{t-1} - u_t\|_2$ is the path-length of the comparators $u_1, \ldots, u_T$ and thus reflects the non-stationarity of the environments. Nevertheless, there exists a large gap between this upper bound and the $\Omega(\sqrt{T(1 + P_T)})$ minimax lower bound established recently by Zhang et al. [8], who further propose a novel online algorithm attaining an $O(\sqrt{T(1 + P_T)})$ universal dynamic regret and thereby close the gap. Although the rate is minimax optimal for convex functions, we would like to design algorithms with more adaptive bounds, replacing the dependence on $T$ by certain problem-dependent quantities that are $O(T)$ in the worst case while could be much smaller in benign environments (i.e., easy problems). In the study of static regret, we can attain such bounds when additional curvature like smoothness is present, including small-loss bounds [14] and variation bounds [15]. Thus, a natural question arises: is it possible to leverage smoothness to achieve a more adaptive universal dynamic regret?

Our results. In this paper, we provide an affirmative answer by designing algorithms with problem-dependent dynamic regret bounds. Specifically, we focus on the following two adaptive quantities: the gradient variation of the online functions $V_T$ and the cumulative loss of the comparator sequence $F_T$,
\[
V_T = \sum_{t=2}^{T} \sup_{x \in \mathcal{X}} \|\nabla f_{t-1}(x) - \nabla f_t(x)\|_2^2, \quad \text{and} \quad F_T = \sum_{t=1}^{T} f_t(u_t). \tag{4}
\]
We propose a novel online approach for convex and smooth functions, named Smoothness-aware online learning with dynamic regret (abbreviated as Sword). There are three versions, Sword_var, Sword_small, and Sword_best, and all of them enjoy problem-dependent dynamic regret bounds:

- Sword_var enjoys a gradient-variation bound of $O(\sqrt{(1 + P_T + V_T)(1 + P_T)})$;
- Sword_small enjoys a small-loss bound of $O(\sqrt{(1 + P_T + F_T)(1 + P_T)})$;
- Sword_best enjoys a best-of-both-worlds bound of $O(\sqrt{(1 + P_T + \min\{V_T, F_T\})(1 + P_T)})$.

Compared to the minimax rate of $O(\sqrt{T(1 + P_T)})$, our bounds replace the dependence on $T$ by the problem-dependent quantity $P_T + \min\{V_T, F_T\}$. Since this quantity is at most $O(T)$, our bounds become much tighter when the problem is easy (for example, when $P_T$ and $V_T$/$F_T$ are sublinear in $T$), and meanwhile safeguard the same guarantee in the worst case. Therefore, our results are adaptive to the intrinsic difficulty of the problem and the non-stationarity of the environments.

Technical contributions. We highlight the challenges and technical contributions of this paper. First, we note that there exist studies showing that the worst-case dynamic regret can benefit from smoothness [5, 6, 13]. However, their analyses do not apply to our case, since we cannot exploit the optimality condition of the comparators $u_1, \ldots, u_T$, in stark contrast with the worst-case dynamic regret analysis. Therefore, we adopt the meta-expert framework to hedge the non-stationarity while keeping the adaptivity. We can use variants of OGD as the expert-algorithm to exploit the smoothness, but it is difficult to design an appropriate meta-algorithm.
Existing meta-algorithms and their variants either lead to problem-independent regret bounds or introduce terms that are incompatible with the desired problem-dependent quantity. To address the difficulty, we adopt the technique of optimistic online learning [16, 17], in particular Optimistic Hedge, to design novel meta-algorithms. For Sword_var, we apply Optimistic Hedge with carefully designed optimism, which allows us to exploit the negative term in the regret analysis of Optimistic Hedge [17]. In this way, the meta-regret only depends on the gradient variation. The construction of the special optimism is the most challenging part of our paper. For Sword_small, the design of the meta-algorithm is simple, and we directly use the vanilla Hedge, which can be treated as Optimistic Hedge with null optimism. Finally, for Sword_best, we still employ Optimistic Hedge as the meta-algorithm, but introduce a parallel meta-algorithm to learn the best optimism to ensure a best-of-both-worlds dynamic regret guarantee.

2 Gradient-Variation and Small-Loss Bounds

We first list the assumptions used in the paper, and then propose online algorithms with gradient-variation and small-loss dynamic regret bounds, respectively. All the proofs can be found in the full paper [18].

2.1 Assumptions

We introduce the following common assumptions that might be used in the theorems.

Assumption 1. The norm of the gradients of the online functions over the domain $\mathcal{X}$ is bounded by $G$, i.e., $\|\nabla f_t(x)\|_2 \le G$ for all $x \in \mathcal{X}$ and $t \in [T]$.

Assumption 2. The domain $\mathcal{X} \subseteq \mathbb{R}^d$ contains the origin $\mathbf{0}$, and the diameter of the domain $\mathcal{X}$ is at most $D$, i.e., $\|x - x'\|_2 \le D$ for any $x, x' \in \mathcal{X}$.

Assumption 3. All the online functions are $L$-smooth, i.e., for any $x, x' \in \mathcal{X}$ and $t \in [T]$,
\[
\|\nabla f_t(x) - \nabla f_t(x')\|_2 \le L \|x - x'\|_2. \tag{5}
\]

Assumption 4. All the online functions are non-negative.

Note that in Assumption 4 we require the online functions to be non-negative outside the domain $\mathcal{X}$, which is a precondition for establishing the self-bounding property of smooth functions [14]. Meanwhile, we treat double logarithmic factors in $T$ as a constant, following previous studies [19, 20].

2.2 Gradient-Variation Bound

We design an approach in the meta-expert framework, and prove its gradient-variation dynamic regret.

2.2.1 Expert-Algorithm

In the study of static regret, Chiang et al. [15] propose the following online extra-gradient descent (OEGD) algorithm, and show that it enjoys a gradient-variation static regret bound. The OEGD algorithm performs the following update:
\[
\hat{x}_{t+1} = \Pi_{\mathcal{X}}\big[\hat{x}_t - \eta \nabla f_t(x_t)\big], \qquad x_{t+1} = \Pi_{\mathcal{X}}\big[\hat{x}_{t+1} - \eta \nabla f_t(\hat{x}_{t+1})\big], \tag{6}
\]
where $\hat{x}_1, x_1 \in \mathcal{X}$, $\eta > 0$ is the step size, and $\Pi_{\mathcal{X}}[\cdot]$ denotes the projection onto the nearest point in $\mathcal{X}$. For convex and smooth functions, Chiang et al. [15] prove that OEGD achieves an $O(\sqrt{V_T})$ static regret. We further demonstrate that OEGD also enjoys a gradient-variation type dynamic regret.

Theorem 1. Under Assumptions 1, 2, and 3, by choosing $\eta \le \frac{1}{4L}$, OEGD (6) satisfies
\[
\sum_{t=1}^{T} f_t(x_t) - \sum_{t=1}^{T} f_t(u_t) \le \frac{D^2 + 2DP_T}{2\eta} + \eta V_T + GD = O\Big(\frac{1 + P_T}{\eta} + \eta V_T\Big)
\]
for any comparator sequence $u_1, \ldots, u_T \in \mathcal{X}$.

Theorem 1 shows that it is crucial to tune the step size to balance non-stationarity (path-length $P_T$) and adaptivity (gradient variation $V_T$). Notice that the optimal tuning $\eta^* = \sqrt{(D^2 + 2DP_T)/(2V_T)}$ requires prior information of $P_T$ and $V_T$ that is generally unavailable. We emphasize that $V_T$ is empirically computable, while $P_T$ remains unknown even after all iterations, due to the fact that the comparator sequence is unknown and can be chosen arbitrarily as long as it is feasible in the domain.
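To make the update (6) concrete, here is a minimal Python sketch of OEGD, assuming a Euclidean-ball domain for the projection and a `grad_fns[t]` gradient-oracle callback; both of these names are illustrative placeholders rather than part of the paper's setup.

```python
import numpy as np

def project_ball(x, radius):
    """Euclidean projection onto the ball {x : ||x||_2 <= radius} (illustrative domain)."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

def oegd(grad_fns, dim, T, eta, radius=1.0):
    """Online extra-gradient descent, following update (6).

    grad_fns[t](x) should return the gradient of f_t at x (revealed after playing x_t).
    Returns the sequence of played decisions x_1, ..., x_T.
    """
    x_hat = np.zeros(dim)   # auxiliary sequence \hat{x}_t
    x = np.zeros(dim)       # played sequence x_t
    played = []
    for t in range(T):
        played.append(x.copy())
        g = grad_fns[t](x)                              # gradient at the played point
        x_hat = project_ball(x_hat - eta * g, radius)   # \hat{x}_{t+1}
        g_hat = grad_fns[t](x_hat)                      # gradient of the same f_t at \hat{x}_{t+1}
        x = project_ball(x_hat - eta * g_hat, radius)   # x_{t+1}
    return played
```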
Therefore, the doubling trick [21] can only remove the dependence on the unknown $V_T$ but not $P_T$. To handle the uncertainty, we adopt the meta-expert framework to hedge the non-stationarity while keeping the adaptivity, inspired by the recent advances in learning with multiple learning rates [22, 23, 8]. Concretely, we first construct a pool of candidate step sizes to discretize the value range of the optimal step size, and then initialize multiple experts simultaneously, denoted by $E_1, \ldots, E_N$. Each expert $E_i$ returns its prediction $x_{t,i}$ by running OEGD (6) with a step size $\eta_i$ from the pool. Finally, the predictions of all the experts are combined by a meta-algorithm into the final output $x_t$ so as to track the best expert. From the procedure, we observe that the dynamic regret can be decomposed as
\[
\text{D-Regret}_T = \sum_{t=1}^{T} f_t(x_t) - \sum_{t=1}^{T} f_t(u_t) = \underbrace{\sum_{t=1}^{T} f_t(x_t) - \sum_{t=1}^{T} f_t(x_{t,i})}_{\text{meta-regret}} + \underbrace{\sum_{t=1}^{T} f_t(x_{t,i}) - \sum_{t=1}^{T} f_t(u_t)}_{\text{expert-regret}},
\]
where $\{x_t\}_{t=1,\ldots,T}$ denotes the final output sequence, and $\{x_{t,i}\}_{t=1,\ldots,T}$ is the prediction sequence of expert $E_i$. The first part is the difference between the cumulative loss of the final output sequence and that of the prediction sequence of expert $E_i$, which is introduced by the meta-algorithm and is thus named the meta-regret; the second part is the dynamic regret of expert $E_i$ and is therefore named the expert-regret. The expert-algorithm is set as OEGD (6), and Theorem 1 upper bounds the expert-regret. The main difficulty lies in the design and analysis of an appropriate meta-algorithm.

2.2.2 Meta-Algorithm

Formally, there are $N$ experts and expert $E_i$ predicts $x_{t,i}$ at iteration $t$, and the meta-algorithm is required to produce $x_t = \sum_{i=1}^{N} p_{t,i} x_{t,i}$, a weighted combination of the expert predictions, where $p_t \in \Delta_N$ is the weight vector. It is natural to use Hedge [24] for the weight update in order to track the best expert. To be compatible with the gradient-variation expert-regret, the meta-algorithm is required to incur a problem-dependent meta-regret of order $O(\sqrt{V_T \ln N})$. However, the meta-algorithms used in existing studies [23, 8] cannot satisfy this requirement. For example, vanilla Hedge (multiplicative weights update) suffers an $O(\sqrt{T \ln N})$ meta-regret, which is problem-independent and thus not suitable for us. To this end, we design a novel variant of Hedge by leveraging the technique of optimistic online learning with carefully designed optimism, specifically for our problem.

Optimistic online learning is developed by Rakhlin and Sridharan [16] and further expanded by Syrgkanis et al. [17]. For the prediction with expert advice setting, they consider that at the beginning of iteration $(t+1)$, in addition to the loss vector $\ell_t \in \mathbb{R}^N$ returned by the experts, the learner can receive a vector $m_{t+1} \in \mathbb{R}^N$ called the optimism. The authors propose the Optimistic Hedge algorithm [16, 17], which updates the weight vector $p_{t+1} \in \Delta_N$ by
\[
p_{t+1,i} \propto \exp\Bigg(-\varepsilon \Big(\sum_{s=1}^{t} \ell_{s,i} + m_{t+1,i}\Big)\Bigg), \quad i \in [N]. \tag{7}
\]
Syrgkanis et al. [17] prove the following regret guarantee for Optimistic Hedge.

Lemma 1 ([17, Theorem 19]). For any expert $i \in [N]$, the meta-regret of Optimistic Hedge is upper bounded by
\[
\sum_{t=1}^{T} \langle p_t, \ell_t \rangle - \sum_{t=1}^{T} \ell_{t,i} \le \frac{2 + \ln N}{\varepsilon} + \varepsilon \sum_{t=1}^{T} \|\ell_t - m_t\|_\infty^2 - \frac{1}{4\varepsilon} \sum_{t=2}^{T} \|p_t - p_{t-1}\|_1^2. \tag{8}
\]

Denote by $\mathcal{D} = \sum_{t=1}^{T} \|\ell_t - m_t\|_\infty^2$ the quantity that measures the adaptivity. With proper learning rate tuning, Optimistic Hedge enjoys an $O(\sqrt{\mathcal{D} \ln N})$ meta-regret. Optimistic online learning is very powerful for designing adaptive methods, in that the adaptivity $\mathcal{D}$ in Lemma 1 is very general and can be specialized flexibly with different configurations of the feedback loss $\ell_t$ and optimism $m_t$.
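As a concrete illustration of the update (7), the following minimal Python sketch computes the Optimistic Hedge weights from cumulative losses and an optimism vector; the toy loop and its choice of optimism (simply repeating the last observed loss) are illustrative assumptions and not the construction used later in the paper.

```python
import numpy as np

def optimistic_hedge_weights(cum_loss, optimism, eps):
    """Optimistic Hedge weight update (7).

    cum_loss[i] is the cumulative loss of expert i so far; optimism[i] = m_{t+1,i} is a
    guess of the next loss. With optimism = 0 this reduces to vanilla Hedge.
    """
    logits = -eps * (cum_loss + optimism)
    logits -= logits.max()                 # numerical stabilization before exponentiation
    w = np.exp(logits)
    return w / w.sum()                     # p_{t+1} on the simplex

# toy usage: three experts, the optimism repeats the last observed loss (illustrative only)
rng = np.random.default_rng(0)
N, eps = 3, 0.5
cum_loss, last_loss = np.zeros(N), np.zeros(N)
for t in range(5):
    p = optimistic_hedge_weights(cum_loss, optimism=last_loss, eps=eps)
    loss = rng.random(N)                   # loss vector revealed by the environment
    cum_loss += loss
    last_loss = loss
```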
Based on Optimistic Hedge, we propose Variation Hedge, the meta-algorithm for Sword_var, by specializing Optimistic Hedge as follows: the feedback loss $\ell_t$ is set as the linearized surrogate loss, namely, $\ell_{t,i} = \langle \nabla f_t(x_t), x_{t,i} \rangle$; the optimism $m_t$ is set with a careful design: for each $i \in [N]$,
\[
m_{t,i} = \langle \nabla f_{t-1}(\bar{x}_t), x_{t,i} \rangle, \quad \text{where } \bar{x}_t = \sum_{i=1}^{N} p_{t-1,i} x_{t,i}. \tag{9}
\]

Algorithm 1 Sword_var: Meta (Variation Hedge)
Input: step size pool $\mathcal{H}_{\text{var}}$; learning rate $\varepsilon$
1: Initialization: $\forall i \in [N]$, $p_{0,i} = 1/N$
2: for $t = 1$ to $T$ do
3:   Receive $x_{t+1,i}$ from expert $E_i(\eta_i)$
4:   Update weight $p_{t+1,i}$ by (10)
5:   Predict $x_{t+1} = \sum_{i=1}^{N} p_{t+1,i} x_{t+1,i}$
6: end for

Algorithm 2 Sword_var: Expert (OEGD)
Input: step size $\eta_i$
1: Let $\hat{x}_{1,i}$, $x_{1,i}$ be any point in $\mathcal{X}$
2: for $t = 1$ to $T$ do
3:   $\hat{x}_{t+1,i} = \Pi_{\mathcal{X}}\big[\hat{x}_{t,i} - \eta_i \nabla f_t(x_{t,i})\big]$
4:   $x_{t+1,i} = \Pi_{\mathcal{X}}\big[\hat{x}_{t+1,i} - \eta_i \nabla f_t(\hat{x}_{t+1,i})\big]$
5:   Send $x_{t+1,i}$ to the meta-algorithm
6: end for

So the meta-algorithm of Sword_var (namely, Variation Hedge) updates the weight by
\[
p_{t+1,i} \propto \exp\Bigg(-\varepsilon \Big(\sum_{s=1}^{t} \langle \nabla f_s(x_s), x_{s,i} \rangle + \langle \nabla f_t(\bar{x}_{t+1}), x_{t+1,i} \rangle\Big)\Bigg), \quad i \in [N]. \tag{10}
\]
Algorithm 1 summarizes the detailed procedure of the meta-algorithm, which in conjunction with the expert-algorithm of Algorithm 2 yields the Sword_var algorithm.

Remark 1. The design of the optimism in (9) (in particular, $\bar{x}_t$) is crucial, and is the most challenging part of this work. The key idea is to exploit the negative term in the regret of Optimistic Hedge, as shown in (8), to convert the adaptive quantity $\mathcal{D}$ into the desired gradient variation $V_T$. Indeed,
\[
\begin{aligned}
\|\ell_t - m_t\|_\infty^2 &\overset{(9)}{=} \max_{i \in [N]} \langle \nabla f_t(x_t) - \nabla f_{t-1}(\bar{x}_t), x_{t,i} \rangle^2 \le D^2 \|\nabla f_t(x_t) - \nabla f_{t-1}(\bar{x}_t)\|_2^2 \\
&\le 2D^2 \big(\|\nabla f_t(x_t) - \nabla f_{t-1}(x_t)\|_2^2 + \|\nabla f_{t-1}(x_t) - \nabla f_{t-1}(\bar{x}_t)\|_2^2\big) \\
&\le 2D^2 \sup_{x \in \mathcal{X}} \|\nabla f_t(x) - \nabla f_{t-1}(x)\|_2^2 + 2D^2 L^2 \|x_t - \bar{x}_t\|_2^2,
\end{aligned}
\]
where the last step makes use of smoothness. Therefore, $\mathcal{D}$ can be upper bounded by the gradient variation $V_T$ and the summation of $\|x_t - \bar{x}_t\|_2^2$. The latter can be further expanded as
\[
\|x_t - \bar{x}_t\|_2^2 = \Big\|\sum_{i=1}^{N} (p_{t,i} - p_{t-1,i}) x_{t,i}\Big\|_2^2 \le \Big(\sum_{i=1}^{N} |p_{t,i} - p_{t-1,i}| \, \|x_{t,i}\|_2\Big)^2 \le D^2 \|p_t - p_{t-1}\|_1^2,
\]
which can be eliminated by the negative term in (8), with a suitable setting of the learning rate $\varepsilon$.

2.2.3 Regret Guarantees

We prove that the meta-regret of Variation Hedge is $O(\sqrt{V_T \ln N})$, compatible with the expert-regret.

Theorem 2. Under Assumptions 1, 2, and 3, by setting the learning rate optimally as $\varepsilon = \min\{\sqrt{1/(8D^4L^2)}, \sqrt{(2 + \ln N)/(2D^2 V_T)}\}$, the meta-regret of Variation Hedge is at most
\[
\text{meta-regret} \le 2D\sqrt{2 V_T (2 + \ln N)} + 4\sqrt{2}\, D^2 L (2 + \ln N) = O\big(\sqrt{V_T \ln N}\big).
\]

Note that the dependence on $V_T$ in the optimal learning rate tuning can be removed by the doubling trick. Furthermore, we can actually set the optimal learning rate of the meta-algorithm with $\hat{V}_T = \sum_{t=2}^{T} \|\nabla f_t(x_t) - \nabla f_{t-1}(x_t)\|_2^2$ instead of the original gradient variation $V_T$ via a more refined analysis. The quantity $\hat{V}_T$ can be regarded as an empirical approximation of $V_T$, and it can be calculated directly without solving the inner problem $\sup_{x \in \mathcal{X}} \|\nabla f_t(x) - \nabla f_{t-1}(x)\|_2^2$. Thereby, we can perform the doubling trick by monitoring $\hat{V}_T$ with much less computational effort. Combining Theorem 1 (expert-regret) and Theorem 2 (meta-regret), we obtain the following dynamic regret bound.

Theorem 3. Under Assumptions 1, 2, and 3, set the pool of candidate step sizes $\mathcal{H}_{\text{var}}$ as
\[
\mathcal{H}_{\text{var}} = \Big\{\eta_i = 2^{i-1} \sqrt{\tfrac{D^2}{2GT}} \;\Big|\; i \in [N_1]\Big\}, \quad \text{where } N_1 = \Big\lceil \tfrac{1}{2} \log_2\big(GT/(8D^2L^2)\big) \Big\rceil + 1 \tag{11}
\]
(the number of candidate step sizes is denoted by $N_1$ instead of $N$ to distinguish it from that of Sword_small). Then Sword_var (Algorithms 1 and 2) satisfies
\[
\sum_{t=1}^{T} f_t(x_t) - \sum_{t=1}^{T} f_t(u_t) \le O\Big(\sqrt{(1 + P_T + V_T)(1 + P_T)}\Big)
\]
for any comparator sequence $u_1, \ldots, u_T \in \mathcal{X}$.

Remark 2. Compared with the existing $O(\sqrt{T(1 + P_T)})$ dynamic regret [8], our result is more adaptive in the sense that it replaces $T$ by the problem-dependent quantity $P_T + V_T$. Therefore, the bound will be much tighter in easy problems, for example when both $V_T$ and $P_T$ are $o(T)$. Meanwhile, it safeguards the same minimax rate, since both quantities are at most $O(T)$.
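To make the overall meta-expert procedure concrete, here is a minimal Python sketch of Sword_var, combining OEGD experts (Algorithm 2) with the Variation Hedge update (10); the Euclidean-ball projection, the `grad(t, x)` gradient interface, and the fixed learning rate are simplifying assumptions (no doubling trick or refined tuning).

```python
import numpy as np

def proj(x, radius=1.0):
    """Illustrative domain: Euclidean ball of the given radius."""
    n = np.linalg.norm(x)
    return x if n <= radius else x * (radius / n)

def sword_var(grad, dim, T, etas, eps):
    """Sketch of Sword_var: OEGD experts combined by Variation Hedge.

    grad(t, x) returns the gradient of f_t at x; etas is the candidate step-size pool.
    """
    N = len(etas)
    p = np.full(N, 1.0 / N)        # meta weights p_t
    x_hat = np.zeros((N, dim))     # experts' auxiliary iterates \hat{x}_{t,i}
    x_exp = np.zeros((N, dim))     # experts' predictions x_{t,i}
    cum = np.zeros(N)              # cumulative linearized surrogate losses
    plays = []
    for t in range(1, T + 1):
        x = p @ x_exp                          # combined prediction x_t
        plays.append(x)
        g = grad(t, x)                         # gradient at the played point
        cum += x_exp @ g                       # <grad f_t(x_t), x_{t,i}> for each expert
        # experts: OEGD update (6) with their own step sizes
        new_exp = np.empty_like(x_exp)
        for i, eta in enumerate(etas):
            x_hat[i] = proj(x_hat[i] - eta * grad(t, x_exp[i]))
            new_exp[i] = proj(x_hat[i] - eta * grad(t, x_hat[i]))
        x_exp = new_exp
        # optimism (9): gradient of f_t at the combination of new predictions under old weights
        x_bar = p @ x_exp
        m = x_exp @ grad(t, x_bar)
        # Variation Hedge update (10)
        logits = -eps * (cum + m)
        logits -= logits.max()
        p = np.exp(logits)
        p /= p.sum()
    return plays
```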
Remark 3. Because the universal dynamic regret studied in this paper holds against any comparator sequence, it specializes to the static regret by setting all comparators to the best fixed decision in hindsight, i.e., $u_1 = \cdots = u_T = x^* \in \operatorname*{arg\,min}_{x \in \mathcal{X}} \sum_{t=1}^{T} f_t(x)$. In this case, the path-length $P_T = \sum_{t=2}^{T} \|u_{t-1} - u_t\|_2$ is zero, so the regret bound in Theorem 3 actually implies an $O(\sqrt{V_T})$ variation static regret bound, which recovers the result of Chiang et al. [15].

2.3 Small-Loss Bound

In this part, we turn to another problem-dependent quantity, the cumulative loss of the comparator sequence, and prove the small-loss dynamic regret. We start from online gradient descent (OGD),
\[
x_{t+1} = \Pi_{\mathcal{X}}\big[x_t - \eta \nabla f_t(x_t)\big]. \tag{12}
\]
Srebro et al. [14] prove that OGD achieves an $O(\sqrt{F_T^*})$ static regret, where $F_T^* = \sum_{t=1}^{T} f_t(x^*)$ is the cumulative loss of the comparator benchmark $x^*$. For the dynamic regret, since the benchmark is changing, a natural replacement is the cumulative loss of the comparator sequence $u_1, \ldots, u_T$, namely $F_T = \sum_{t=1}^{T} f_t(u_t)$. We show that OGD indeed enjoys such a small-loss dynamic regret.

Theorem 4. Under Assumptions 2, 3, and 4, by choosing any step size $\eta \le \frac{1}{4L}$, OGD satisfies
\[
\sum_{t=1}^{T} f_t(x_t) - \sum_{t=1}^{T} f_t(u_t) \le \frac{D^2 + 2DP_T}{2\eta(1 - 2\eta L)} + \frac{2\eta L}{1 - 2\eta L} \sum_{t=1}^{T} f_t(u_t) = O\Big(\frac{1 + P_T}{\eta} + \eta F_T\Big)
\]
for any comparator sequence $u_1, \ldots, u_T \in \mathcal{X}$.

Similar to Sword_var, the step size needs to balance non-stationarity ($P_T$) and adaptivity ($F_T$, this time). Notice that the optimal tuning depends on $P_T$ and $F_T$, both of which are unknown even after all $T$ iterations. Therefore, we again compensate for the lack of this information via the meta-expert framework to hedge the non-stationarity while keeping the adaptivity. The expert-algorithm is set as OGD. The meta-algorithm is required to suffer a small-loss meta-regret of order $O(\sqrt{F_T \ln N})$. We discover that vanilla Hedge with the linearized surrogate loss is qualified, which updates the weight by
\[
p_{t+1,i} \propto \exp\Bigg(-\varepsilon \sum_{s=1}^{t} \langle \nabla f_s(x_s), x_{s,i} \rangle\Bigg), \quad i \in [N]. \tag{13}
\]
Notice that vanilla Hedge can be treated as Optimistic Hedge with null optimism, i.e., $m_{t+1} = 0$. Therefore, by Lemma 1 we know that its meta-regret is of order $O(\sqrt{\mathcal{D} \ln N})$ and
\[
\mathcal{D} = \sum_{t=1}^{T} \max_{i \in [N]} \langle \nabla f_t(x_t), x_{t,i} \rangle^2 \le D^2 \sum_{t=1}^{T} \|\nabla f_t(x_t)\|_2^2 \le 4D^2 L \sum_{t=1}^{T} f_t(x_t), \tag{14}
\]
where the last inequality follows from the self-bounding property of smooth functions [14, Lemma 3.1]. As a result, the meta-regret is now $O(\sqrt{F_T^x \ln N})$, where $F_T^x = \sum_{t=1}^{T} f_t(x_t)$ is the cumulative loss of the decisions. Note that the term $F_T^x$ can be further processed into the desired small-loss quantity $F_T = \sum_{t=1}^{T} f_t(u_t)$, the cumulative loss of the comparators. We present the details in the proof.

To summarize, Sword_small chooses OGD (12) as the expert-algorithm, and uses vanilla Hedge with the linearized surrogate loss as the meta-algorithm, shown in the update form (13). The theorem below shows that the proposed algorithm enjoys the small-loss dynamic regret bound.

Theorem 5. Under Assumptions 1, 2, 3, and 4, set the pool of candidate step sizes $\mathcal{H}_{\text{small}}$ as
\[
\mathcal{H}_{\text{small}} = \Big\{\eta_i = 2^{i-1} \sqrt{\tfrac{D}{16LGT}} \;\Big|\; i \in [N_2]\Big\}, \quad \text{where } N_2 = \Big\lceil \tfrac{1}{2} \log_2\big(GT/(DL)\big) \Big\rceil + 1, \tag{15}
\]
and set the learning rate of the meta-algorithm optimally as $\varepsilon = \sqrt{(2 + \ln N_2)/(D^2 F_T^x)}$. Then Sword_small satisfies
\[
\sum_{t=1}^{T} f_t(x_t) - \sum_{t=1}^{T} f_t(u_t) \le O\Big(\sqrt{(1 + P_T + F_T)(1 + P_T)}\Big)
\]
for any comparator sequence $u_1, \ldots, u_T \in \mathcal{X}$.
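Analogously, the following minimal Python sketch instantiates Sword_small: OGD experts (12) combined by vanilla Hedge (13), with the pool built as in (15). The Euclidean-ball domain, the `grad(t, x)` interface, and the fixed learning rate are simplifying assumptions; the tuning issue is discussed next.

```python
import numpy as np

def proj(x, radius=1.0):
    """Illustrative domain: Euclidean ball of the given radius."""
    n = np.linalg.norm(x)
    return x if n <= radius else x * (radius / n)

def small_loss_pool(G, L, D, T):
    """Candidate step sizes H_small of (15): eta_i = 2^{i-1} sqrt(D/(16*L*G*T))."""
    N2 = int(np.ceil(0.5 * np.log2(G * T / (D * L)))) + 1
    return [2 ** (i - 1) * np.sqrt(D / (16 * L * G * T)) for i in range(1, N2 + 1)]

def sword_small(grad, dim, T, etas, eps):
    """Sketch of Sword_small: OGD experts (12) combined by vanilla Hedge (13)."""
    N = len(etas)
    p = np.full(N, 1.0 / N)
    x_exp = np.zeros((N, dim))        # experts' predictions x_{t,i}
    cum = np.zeros(N)                 # cumulative linearized surrogate losses
    plays = []
    for t in range(1, T + 1):
        x = p @ x_exp                 # combined prediction x_t
        plays.append(x)
        g = grad(t, x)
        cum += x_exp @ g              # <grad f_t(x_t), x_{t,i}> for each expert
        # experts: OGD update (12) with their own step sizes
        x_exp = np.array([proj(x_exp[i] - etas[i] * grad(t, x_exp[i])) for i in range(N)])
        # vanilla Hedge = Optimistic Hedge with null optimism
        logits = -eps * cum
        logits -= logits.max()
        p = np.exp(logits)
        p /= p.sum()
    return plays
```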
Note that the optimal learning rate tuning requires the knowledge of $F_T^x$, which can be easily removed by the doubling trick or self-confident tuning [25], since $F_T^x$ is empirically evaluable at each iteration. Moreover, the $O(\sqrt{(1 + P_T + F_T)(1 + P_T)})$ universal dynamic regret in Theorem 5 specializes to the $O(\sqrt{F_T})$ static regret [14] when setting the comparators as the fixed best decision in hindsight.

3 Best-of-Both-Worlds Bound

In the last section, we proposed Sword_var and Sword_small, which achieve variation and small-loss bounds, respectively. Since different problem-dependent quantities are involved, these two bounds are generally incomparable and are favored in different scenarios. Therefore, it is natural to ask for a best-of-both-worlds guarantee: a regret of the minimum of the variation and small-loss bounds.

Table 1: Summary of the expert-algorithms and meta-algorithms as well as the different optimism used in the proposed algorithms (including the three variants of Sword).

| Method | Expert | Meta | Optimism |
| --- | --- | --- | --- |
| Sword_var | OEGD | Variation Hedge | by (9) |
| Sword_small | OGD | vanilla Hedge | $m_{t+1} = 0$ |
| Sword_best | OEGD & OGD | Optimistic Hedge | by (18), (21) |

To this end, we require a meta-algorithm that enjoys both kinds of adaptivity when combining all the experts, with an $O(\sqrt{\min\{V_T, F_T\} \ln N})$ meta-regret. Based on the observation that both Variation Hedge and vanilla Hedge are essentially special cases of Optimistic Hedge with different configurations of the optimism, we adopt Optimistic Hedge as the meta-algorithm for Sword_best, where a parallel meta-algorithm is introduced to learn the best optimism for Optimistic Hedge so as to ensure a best-of-both-worlds meta-regret. In the following we describe the expert-algorithm and meta-algorithm of Sword_best.

Expert-algorithm. We aggregate the experts of Sword_var and Sword_small, so there are $N = N_1 + N_2$ experts in total and the step size of each expert is set according to the pool $\mathcal{H} = \mathcal{H}_{\text{var}} \cup \mathcal{H}_{\text{small}}$ (cf. (11) and (15) for the definitions). The first $N_1$ experts run OEGD (6) with step sizes chosen from $\mathcal{H}_{\text{var}}$, and the other $N_2$ experts perform OGD (12) with step sizes specified by $\mathcal{H}_{\text{small}}$. At iteration $t$, the final output is a weighted combination of the predictions returned by the expert-algorithms, namely,
\[
x_t = \sum_{i=1}^{N_1+N_2} p_{t,i} x_{t,i} = \sum_{i=1}^{N_1} p_{t,i} x^{v}_{t,i} + \sum_{i=N_1+1}^{N_1+N_2} p_{t,i} x^{s}_{t,i}, \tag{16}
\]
where $p_t \in \Delta_{N_1+N_2}$ is the weight vector, $x_{t,i} = x^{v}_{t,i}$ for $i = 1, \ldots, N_1$ are the predictions returned by the expert-algorithms (OEGD) of Sword_var, and $x_{t,i} = x^{s}_{t,i}$ for $i = N_1 + 1, \ldots, N_1 + N_2$ are the predictions returned by the expert-algorithms (OGD) of Sword_small. It remains to specify the meta-algorithm.

Meta-algorithm. We adopt the Optimistic Hedge algorithm along with the linearized surrogate loss as the meta-algorithm, where the weight vector $p_{t+1} \in \Delta_{N_1+N_2}$ is updated according to
\[
p_{t+1,i} \propto \exp\Bigg(-\varepsilon \Big(\sum_{s=1}^{t} \langle \nabla f_s(x_s), x_{s,i} \rangle + m_{t+1,i}\Big)\Bigg), \tag{17}
\]
where the optimism is $m_{t+1} \in \mathbb{R}^{N_1+N_2}$. In order to equip the meta-algorithm with both kinds of adaptivity ($V_T$ and $F_T$), it is crucial to design a best-of-both-worlds optimism.

Algorithm 3 Sword_best: Meta (Optimistic Hedge)
Input: step size pool $\mathcal{H}$; learning rate $\varepsilon$
1: Let $N = N_1 + N_2$; $\forall i \in [N]$, $p_{0,i} = 1/N$
2: for $t = 1$ to $T$ do
3:   Receive prediction $x_{t+1,i}$ from expert $E_i$
4:   Set $M^{v}_{t+1}$ and $M^{s}_{t+1}$ by (19) and (20)
5:   Update the weight $\beta_{t+1}$ by (22)
6:   Obtain the optimism $M_{t+1}$ by (21)
7:   Update the weight $p_{t+1,i}$ by (17) and (18)
8:   Output the prediction $x_{t+1} = \sum_{i=1}^{N} p_{t+1,i} x_{t+1,i}$
9: end for

Algorithm 4 Sword_best: Expert (OEGD & OGD)
Input: step size $\eta_i$
1: Let $\hat{x}_{1,i}$, $x_{1,i}$ be any point in $\mathcal{X}$
2: for $t = 1$ to $T$ do
3:   if $i \in \{1, \ldots, N_1\}$ then
4:     $\hat{x}_{t+1,i} = \Pi_{\mathcal{X}}\big[\hat{x}_{t,i} - \eta_i \nabla f_t(x_{t,i})\big]$
5:     $x_{t+1,i} = \Pi_{\mathcal{X}}\big[\hat{x}_{t+1,i} - \eta_i \nabla f_t(\hat{x}_{t+1,i})\big]$
6:   else
7:     $x_{t+1,i} = \Pi_{\mathcal{X}}\big[x_{t,i} - \eta_i \nabla f_t(x_{t,i})\big]$
8:   end if
9:   Send prediction $x_{t+1,i}$ to the meta-algorithm
10: end for

We set the optimism $m_{t+1}$ in the following way: for each $i \in [N_1 + N_2]$,
\[
m_{t+1,i} = \langle M_{t+1}, x_{t+1,i} \rangle, \tag{18}
\]
where $M_{t+1} \in \mathbb{R}^d$ is called the optimistic vector. So we are left with the task of determining $M_{t+1}$ in (18). Inspired by the seminal work of Rakhlin and Sridharan [16], we treat the problem of selecting the sequence of optimistic vectors as another online learning problem. The idea is to build a parallel meta-algorithm for learning the optimistic vector $M_{t+1}$, which is then fed to the Optimistic Hedge update (17) for combining the multiple experts, so as to achieve a best-of-both-worlds meta-regret.

Specifically, consider the following learning scenario of prediction with two experts' advice. At the beginning of iteration $(t+1)$, we receive two optimistic vectors $M^{v}_{t+1}, M^{s}_{t+1} \in \mathbb{R}^d$, based on which the algorithm determines the optimistic vector $M_{t+1} \in \mathbb{R}^d$ for Sword_best. Then the online function $f_{t+1}$ is revealed, and we subsequently observe the losses $d_{t+1}(M^{v}_{t+1})$ and $d_{t+1}(M^{s}_{t+1})$, where $d_{t+1}(M) = \|\nabla f_{t+1}(x_{t+1}) - M\|_2^2$. In the above, the vectors $M^{v}_{t+1}$ and $M^{s}_{t+1}$ are
\[
M^{v}_{t+1} = \nabla f_t(\bar{x}_{t+1}), \quad \text{and} \quad M^{s}_{t+1} = 0, \tag{19}
\]
where $\bar{x}_{t+1}$ is the instrumental output. Similar to the construction in (9), it is designed as
\[
\bar{x}_{t+1} = \sum_{i=1}^{N_1} p_{t,i} x^{v}_{t+1,i} + \sum_{i=N_1+1}^{N_1+N_2} p_{t,i} x^{s}_{t+1,i}. \tag{20}
\]
Notice that the function $d_t : \mathbb{R}^d \to \mathbb{R}$ is $2$-strongly convex with respect to the $\ell_2$-norm; we thus choose Hedge for strongly convex functions [26, Chapter 3.3] as the parallel meta-algorithm for the update
\[
M_{t+1} = \beta_{t+1} M^{v}_{t+1} + (1 - \beta_{t+1}) M^{s}_{t+1}, \tag{21}
\]
where the weight $\beta_{t+1} \in [0, 1]$ for learning the optimistic vectors is updated by
\[
\beta_{t+1} = \frac{\exp(-2 D^{v}_{t})}{\exp(-2 D^{v}_{t}) + \exp(-2 D^{s}_{t})} \tag{22}
\]
with $D^{v}_{t} = \sum_{\tau=1}^{t} d_\tau(M^{v}_\tau)$ and $D^{s}_{t} = \sum_{\tau=1}^{t} d_\tau(M^{s}_\tau)$. Algorithm 3 summarizes the meta-algorithm of Sword_best. In the last two columns of Table 1, we present comparisons of the meta-algorithms and the optimism designed for the different methods.

Regret Analysis. Recall that the meta-regret of Optimistic Hedge is of order $O(\sqrt{\mathcal{D} \ln N})$. From the setting of the surrogate loss (17) and optimism (18), we have
\[
\mathcal{D} = \sum_{t=1}^{T} \max_{i \in [N]} \big\langle \nabla f_t(x_t) - M_t, x_{t,i} \big\rangle^2 \le D^2 \sum_{t=1}^{T} \|\nabla f_t(x_t) - M_t\|_2^2.
\]
Besides, the regret analysis of Hedge for strongly convex functions [26, Proposition 3.1] implies
\[
\sum_{t=1}^{T} \|\nabla f_t(x_t) - M_t\|_2^2 = \sum_{t=1}^{T} d_t(M_t) \le \min\big\{\bar{V}_T, \bar{F}_T\big\} + \ln 2,
\]
where $\bar{V}_T = \sum_{t=2}^{T} \|\nabla f_t(x_t) - \nabla f_{t-1}(\bar{x}_t)\|_2^2$ and $\bar{F}_T = \sum_{t=1}^{T} \|\nabla f_t(x_t)\|_2^2$. The two terms can be further converted to the desired gradient variation $V_T$ and small loss $F_T$ by exploiting the smoothness and the expert-regret analysis. We can thus ensure the following meta-regret bound.

Theorem 6. Under Assumptions 1, 2, 3, and 4, by setting the learning rate optimally as $\varepsilon = \min\{\sqrt{1/(8D^4L^2)}, \varepsilon^*\}$, the meta-algorithm of Sword_best satisfies
\[
\text{meta-regret} \le 2D\sqrt{(2 + \ln N)\big(\min\{2V_T, \bar{F}_T\} + \ln 2\big)} + 4\sqrt{2}\, D^2 L (2 + \ln N),
\]
where $\varepsilon^* = \sqrt{(2 + \ln N)/\big(D^2 \min\{2V_T, \bar{F}_T\} + D^2 \ln 2\big)}$.

Because $V_T$ and $\bar{F}_T$ are both empirically observable, we can easily get rid of their dependence in the optimal learning rate tuning. Also see the discussion below Theorem 2 about replacing the original gradient variation $V_T$ by its empirical approximation $\hat{V}_T = \sum_{t=2}^{T} \|\nabla f_t(x_t) - \nabla f_{t-1}(x_t)\|_2^2$ to save computational cost. Besides, the $\bar{F}_T$ term of the meta-regret will be converted to the desired small-loss quantity $F_T$ in the final regret bound. Combining the above meta-regret analysis with the expert-regret analysis of the OEGD and OGD algorithms, we can finally achieve the best of both worlds.
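Before stating the combined guarantee, here is a minimal Python sketch of the parallel meta-algorithm (19)-(22) that learns the optimistic vector by mixing the two candidates $M^{v}$ and $M^{s}$; the class name and the toy loop (which simply reuses the last observed gradient as $M^{v}$ instead of the gradient at the instrumental output $\bar{x}_{t+1}$) are illustrative assumptions.

```python
import numpy as np

class OptimismLearner:
    """Sketch of the parallel meta-algorithm (19)-(22) mixing the candidate optimistic
    vectors M^v (gradient-based guess) and M^s (zero vector)."""

    def __init__(self, dim):
        self.cum_v = 0.0   # D^v_t: cumulative loss of the gradient-variation candidate
        self.cum_s = 0.0   # D^s_t: cumulative loss of the zero (small-loss) candidate
        self.dim = dim

    def propose(self, m_v):
        """Combine M^v_{t+1} = m_v and M^s_{t+1} = 0 via the weight beta_{t+1} of (22)."""
        logit_v, logit_s = -2.0 * self.cum_v, -2.0 * self.cum_s
        shift = max(logit_v, logit_s)                     # numerical stabilization
        w_v, w_s = np.exp(logit_v - shift), np.exp(logit_s - shift)
        beta = w_v / (w_v + w_s)
        return beta * m_v                                 # (21), using M^s = 0

    def update(self, grad_at_play, m_v):
        """After f_{t+1} is revealed, record d_{t+1}(M^v) and d_{t+1}(M^s)."""
        self.cum_v += float(np.sum((grad_at_play - m_v) ** 2))
        self.cum_s += float(np.sum(grad_at_play ** 2))

# toy usage with random "gradients": the learner tracks whichever candidate predicts better
rng = np.random.default_rng(1)
learner = OptimismLearner(dim=4)
m_v = np.zeros(4)
for t in range(10):
    M = learner.propose(m_v)       # optimistic vector fed to Optimistic Hedge via (17)-(18)
    g = rng.normal(size=4)         # stands in for grad f_{t+1}(x_{t+1})
    learner.update(g, m_v)
    m_v = g                        # toy stand-in for the next round's M^v
```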
Theorem 7. Under Assumptions 1, 2, 3, and 4, set the pool of candidate step sizes as
\[
\mathcal{H} = \mathcal{H}_{\text{var}} \cup \mathcal{H}_{\text{small}}, \tag{23}
\]
where $\mathcal{H}_{\text{var}}$ and $\mathcal{H}_{\text{small}}$ are defined in (11) and (15). Then Sword_best (Algorithms 3 and 4) satisfies
\[
\sum_{t=1}^{T} f_t(x_t) - \sum_{t=1}^{T} f_t(u_t) \le O\Big(\sqrt{(1 + P_T + \min\{V_T, F_T\})(1 + P_T)}\Big)
\]
for any comparator sequence $u_1, \ldots, u_T \in \mathcal{X}$.

Remark 4. The dynamic regret bound in Theorem 7 achieves the minimum of the gradient-variation and small-loss bounds, and therefore combines their advantages and enjoys both kinds of adaptivity. Moreover, the result also implies an $O(\sqrt{\min\{V_T, F_T\}})$ static regret by setting the sequence of comparators to be the best fixed decision in hindsight, where we note that $F_T$ then coincides with the quantity $F_T^*$ introduced below (12), the cumulative loss of the comparator benchmark.

4 Conclusion

In this paper, we exploit smoothness to enhance the universal dynamic regret, with the aim of replacing the time horizon $T$ in the state-of-the-art $O(\sqrt{T(1 + P_T)})$ bound by problem-dependent quantities that are at most $O(T)$ but can be much smaller in easy problems. We achieve this goal by proposing two meta-expert algorithms: Sword_var, which attains a gradient-variation dynamic regret bound of order $O(\sqrt{(1 + P_T + V_T)(1 + P_T)})$, and Sword_small, which enjoys a small-loss dynamic regret bound of order $O(\sqrt{(1 + P_T + F_T)(1 + P_T)})$. Here, $V_T$ measures the variation in gradients and $F_T$ is the cumulative loss of the comparator sequence. They are at most $O(T)$ yet could be very small when the problem is easy, and thus reflect the difficulty of the problem instance. As a result, our bounds improve the minimax rate of the universal dynamic regret by exploiting smoothness. Finally, we design Sword_best to combine the advantages of both the gradient-variation and small-loss algorithms and achieve a best-of-both-worlds dynamic regret bound of order $O(\sqrt{(1 + P_T + \min\{V_T, F_T\})(1 + P_T)})$. We note that all of the attained dynamic regret bounds are universal in the sense that they hold against any feasible comparator sequence, and thus the algorithms are more adaptive to non-stationary environments. In the future, we will investigate the possibility of incorporating other function curvature, such as strong convexity or exp-concavity, into the analysis of the universal dynamic regret.

Acknowledgment

This research was supported by the National Key R&D Program of China (2018YFB1004300), NSFC (61921006, 61976112), the Collaborative Innovation Center of Novel Software Technology and Industrialization, and the Baidu Scholarship. The authors would like to thank Mengxiao Zhang for helpful discussions. We are also grateful to the anonymous reviewers for their valuable comments.

Broader Impact

The work is mostly theoretical and the broader impact discussion is not applicable.

References

[1] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML), pages 928-936, 2003.

[2] Omar Besbes, Yonatan Gur, and Assaf J. Zeevi. Non-stationary stochastic optimization. Operations Research, 63(5):1227-1244, 2015.

[3] Ali Jadbabaie, Alexander Rakhlin, Shahin Shahrampour, and Karthik Sridharan. Online optimization: Competing with dynamic comparators. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 398-406, 2015.

[4] Aryan Mokhtari, Shahin Shahrampour, Ali Jadbabaie, and Alejandro Ribeiro. Online optimization in dynamic environments: Improved regret rates for strongly convex problems.
In Proceedings of the 55th IEEE Conference on Decision and Control (CDC), pages 7195-7201, 2016.

[5] Tianbao Yang, Lijun Zhang, Rong Jin, and Jinfeng Yi. Tracking slowly moving clairvoyant: Optimal dynamic regret of online learning with true and noisy gradient. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pages 449-457, 2016.

[6] Lijun Zhang, Tianbao Yang, Jinfeng Yi, Rong Jin, and Zhi-Hua Zhou. Improved dynamic regret for non-degenerate functions. In Advances in Neural Information Processing Systems 30 (NIPS), pages 732-741, 2017.

[7] Lijun Zhang, Tianbao Yang, Rong Jin, and Zhi-Hua Zhou. Dynamic regret of strongly adaptive methods. In Proceedings of the 35th International Conference on Machine Learning (ICML), pages 5877-5886, 2018.

[8] Lijun Zhang, Shiyin Lu, and Zhi-Hua Zhou. Adaptive online learning in dynamic environments. In Advances in Neural Information Processing Systems 31 (NeurIPS), pages 1330-1340, 2018.

[9] Peter Auer, Yifang Chen, Pratik Gajane, Chung-Wei Lee, Haipeng Luo, Ronald Ortner, and Chen-Yu Wei. Achieving optimal dynamic regret for non-stationary bandits without prior information. In Proceedings of the 32nd Conference on Learning Theory (COLT), pages 159-163, 2019.

[10] Dheeraj Baby and Yu-Xiang Wang. Online forecasting of total-variation-bounded sequences. In Advances in Neural Information Processing Systems 32 (NeurIPS), pages 11071-11081, 2019.

[11] Jianjun Yuan and Andrew G. Lamperski. Trading-off static and dynamic regret in online least-squares and beyond. In Proceedings of the 34th AAAI Conference on Artificial Intelligence (AAAI), pages 6712-6719, 2020.

[12] Peng Zhao, Guanghui Wang, Lijun Zhang, and Zhi-Hua Zhou. Bandit convex optimization in non-stationary environments. In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1508-1518, 2020.

[13] Peng Zhao and Lijun Zhang. Improved analysis for dynamic regret of strongly convex and smooth functions. arXiv preprint, arXiv:2006.05876, 2020.

[14] Nathan Srebro, Karthik Sridharan, and Ambuj Tewari. Smoothness, low noise and fast rates. In Advances in Neural Information Processing Systems 23 (NIPS), pages 2199-2207, 2010.

[15] Chao-Kai Chiang, Tianbao Yang, Chia-Jung Lee, Mehrdad Mahdavi, Chi-Jen Lu, Rong Jin, and Shenghuo Zhu. Online optimization with gradual variations. In Proceedings of the 25th Conference on Learning Theory (COLT), pages 6.1-6.20, 2012.

[16] Alexander Rakhlin and Karthik Sridharan. Online learning with predictable sequences. In Proceedings of the 26th Conference on Learning Theory (COLT), pages 993-1019, 2013.

[17] Vasilis Syrgkanis, Alekh Agarwal, Haipeng Luo, and Robert E. Schapire. Fast convergence of regularized learning in games. In Advances in Neural Information Processing Systems 28 (NIPS), pages 2989-2997, 2015.

[18] Peng Zhao, Yu-Jie Zhang, Lijun Zhang, and Zhi-Hua Zhou. Dynamic regret of convex and smooth functions. arXiv preprint, arXiv:2007.03479, 2020.

[19] Dmitry Adamskiy, Wouter M. Koolen, Alexey V. Chernov, and Vladimir Vovk. A closer look at adaptive regret. In Proceedings of the 23rd International Conference on Algorithmic Learning Theory (ALT), pages 290-304, 2012.

[20] Haipeng Luo and Robert E. Schapire. Achieving all with no parameters: AdaNormalHedge. In Proceedings of the 28th Conference on Learning Theory (COLT), pages 1286-1304, 2015.

[21] Nicolò Cesa-Bianchi, Yoav Freund, David Haussler, David P. Helmbold, Robert E. Schapire, and Manfred K. Warmuth.
How to use expert advice. Journal of the ACM, 44(3):427-485, 1997.

[22] Pierre Gaillard, Gilles Stoltz, and Tim van Erven. A second-order bound with excess losses. In Proceedings of the 27th Conference on Learning Theory (COLT), pages 176-196, 2014.

[23] Tim van Erven and Wouter M. Koolen. MetaGrad: Multiple learning rates in online learning. In Advances in Neural Information Processing Systems 29 (NIPS), pages 3666-3674, 2016.

[24] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, 1997.

[25] Peter Auer, Nicolò Cesa-Bianchi, and Claudio Gentile. Adaptive and self-confident on-line learning algorithms. Journal of Computer and System Sciences, 64(1):48-75, 2002.

[26] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.